Gabriel Martín Blázquez
ML Engineer @hf.co 🤗 Building tools like Argilla and distilabel to help you take care of your datasets!
- SmolLM2 paper is out! We wrote a paper detailing the steps we took to train one of the best smol LMs 🤏 out there: pre-training and post-training data, training ablations, and some interesting findings 💡 Go check it out and don't hesitate to write your thoughts/questions in the comments section!
- distilabel ⚗️ reached 2k ⭐️ on GitHub!
- Reposted by Gabriel Martín Blázquez: We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret, we can do it together in the open! Follow along: github.com/huggingface/...
- Reposted by Gabriel Martín Blázquez: Introducing 📐FineMath: the best open math pre-training dataset with 50B+ tokens! Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH. 🤗 huggingface.co/datasets/Hug... Here's a breakdown 🧵
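To poke at the data, a minimal sketch with the `datasets` library is below; the repo id and config name are assumptions (the post's URL is truncated), so check the dataset card for the exact identifiers.

```python
# Hedged sketch: streaming a few FineMath examples with the `datasets` library.
# The repo id "HuggingFaceTB/finemath" and config "finemath-4plus" are assumptions.
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/finemath", "finemath-4plus", split="train", streaming=True)
for example in ds.take(3):  # stream a small sample instead of downloading 50B+ tokens
    print(example["text"][:200])
```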
- Reposted by Gabriel Martín Blázquez: 🚀 Argilla v2.6.0 is here! 🎉 Let me show you how EASY it is to export your annotated datasets from Argilla to the Hugging Face Hub. 🤩 Take a look at this quick demo 👇 💁‍♂️ More info about the release at github.com/argilla-io/a... #AI #MachineLearning #OpenSource #DataScience #HuggingFace #Argilla
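In code, the export flow from the demo looks roughly like this hedged sketch, assuming the Argilla 2.x SDK's `to_hub` helper; the server URL, API key, and dataset/repo names are illustrative placeholders.

```python
# Hedged sketch of exporting an annotated Argilla dataset to the Hugging Face Hub
# with the Argilla 2.x SDK; the endpoint, key, and names below are placeholders.
import argilla as rg

client = rg.Argilla(api_url="https://my-argilla.example", api_key="...")  # hypothetical server
dataset = client.datasets(name="my-annotated-dataset")  # fetch the annotated dataset
dataset.to_hub(repo_id="my-org/my-annotated-dataset")   # push records + settings to the Hub
```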
- How many regular expressions have you written without the help of an LLM since ChatGPT appeared?
- Reposted by Gabriel Martín Blázquez: For anyone interested in fine-tuning or aligning LLMs, I'm running this free and open course called smol course. It's not a big deal, it's just smol. 🧵>>
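For a taste of what such a course covers, here is a hedged, minimal supervised fine-tuning sketch with TRL's SFTTrainer; the model id, dataset id, and hyperparameters are illustrative assumptions, not the course's exact recipe.

```python
# Hedged SFT sketch with TRL's SFTTrainer; the dataset id is a hypothetical
# placeholder for any chat-formatted dataset, and hyperparameters are illustrative.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("my-org/my-chat-dataset", split="train")  # hypothetical dataset
trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-135M",  # a smol base model that fits on a single GPU
    train_dataset=dataset,
    args=SFTConfig(output_dir="smollm2-sft", max_steps=100),
)
trainer.train()
```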
- Is it just me, or is the latest Claude 3.5 Sonnet too prone to generating code when asked technical questions not directly related to coding?
- Reposted by Gabriel Martín Blázquez: Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput. SmolVLM can be fine-tuned on a Google Colab and run on a laptop! Or process millions of documents with a consumer GPU!
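Inference looks roughly like the sketch below with transformers; the checkpoint name is an assumption based on the announcement, and the image path is a placeholder.

```python
# Hedged SmolVLM inference sketch with transformers; the checkpoint name is an
# assumption and the image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("document.png")  # placeholder local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this document."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```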
- As part of the SmolTalk release, the dataset mixture used for @huggingface.bsky.social's SmolLM2 model, we built a new version of the MagPie Ultra dataset using Llama 405B Instruct. It contains 1M rows of multi-turn conversations with diverse instructions! huggingface.co/datasets/arg...
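For context, a Magpie-style generation pipeline in distilabel looks roughly like this hedged sketch; the model id, turn count, and row count are illustrative assumptions, not the exact MagPie Ultra v1.0 configuration.

```python
# Hedged sketch of a Magpie-style pipeline with distilabel's MagpieGenerator;
# model id, n_turns, and num_rows are placeholders, not the production config.
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import MagpieGenerator

with Pipeline(name="magpie-ultra-sketch") as pipeline:
    MagpieGenerator(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",  # assumed model id
            magpie_pre_query_template="llama3",  # elicits user turns from the chat template
        ),
        n_turns=3,       # multi-turn conversations
        num_rows=1_000,  # scale towards 1M rows for the full dataset
    )

if __name__ == "__main__":
    distiset = pipeline.run()
```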
- Reposted by Gabriel Martín Blázquez: Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/... Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos. Apache 2.0. V2 data mix coming soon! Which tools should we add next?
- Excited to announce the SFT dataset used for @huggingface.bsky.social's SmolLM2! It was created by combining multiple existing datasets with new synthetic datasets generated using distilabel, including MagPie Ultra v1.0. Check out the dataset: huggingface.co/datasets/Hug...
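To inspect the mixture, something like this hedged snippet should work; the repo id and config name are assumptions since the post's URL is truncated.

```python
# Hedged sketch: loading the SmolTalk SFT mixture; repo id and config are assumptions.
from datasets import load_dataset

smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
print(smoltalk[0]["messages"])  # chat-formatted, multi-turn examples
```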