Gabriel Martín Blázquez
ML Engineer @hf.co 🤗 Building tools for you to take care of your datasets like Argilla or distilabel!
- The SmolLM2 paper is out! We wrote a paper detailing the steps we took to train one of the best smol LMs 🤏 out there: pre-training and post-training data, training ablations and some interesting findings 💡 Go check it out and don't hesitate to share your thoughts/questions in the comments section!
- Link to Hugging Face paper page: huggingface.co/papers/2502.02737
- distilabel ⚗️ reached 2k ⭐️ on GitHub!
- How many regular expressions have you written without the help of an LLM since ChatGPT appeared?
- That's 100% true. To be honest, all the regular expressions that I've used in the last few months have been written by an LLM... Most of the time they work on the first try, but when they don't, it's a pain.
- Is it just me, or is the latest Claude 3.5 Sonnet too prone to generating code when asked technical questions not directly related to coding?
- As part of the SmolTalk release, the dataset mixture used for the @huggingface.bsky.social SmolLM2 model, we built a new version of the MagPie Ultra dataset using Llama 405B Instruct. It contains 1M rows of multi-turn conversations with diverse instructions! huggingface.co/datasets/arg...
- Thank you Marco!
- Excited to announce the SFT dataset used for @huggingface.bsky.social SmolLM2! The dataset for SmolLM2 was created by combining multiple existing datasets and generating new synthetic datasets, including MagPie Ultra v1.0, using distilabel. Check out the dataset: huggingface.co/datasets/Hug...
- The dataset allowed us to improve the instruction following and reasoning of SmolLM2 compared with the previous version. It also includes instructions for rewriting, summarization and function calling.
- We will soon release all the distilabel code used to generate the datasets. As a sneak peek, you can already check the code used for MagPie Ultra v1.0 here: github.com/huggingface/...
- The great exile! For those who don’t know me, I’m Gabriel, ML Engineer at @huggingface.bsky.social, where I work on developing tools like distilabel or Argilla for you to take care of your data 🤗 My posts here will mainly be about synthetic data and LLM post-training.