
David Berenstein

davidberenstein.hf.co


ML & DevRel (synthetic) data quality @ Hugging Face 🤗

Joined November 2024


David Berenstein davidberenstein.hf.co · Apr 10
🔥 Bespoke curator: Synthetic Data Curation for Post-Training & Structured Data Extraction Create synthetic data pipelines with ease! - Retries and caching included - Inference via LiteLLM, vLLM, and popular batch APIs - Asynchronous operations 🔗 URL: buff.ly/ajPRT1l


David Berenstein davidberenstein.hf.co · Apr 3
🔥One > token > at > a > time < a < at < token < One 🔥 token-explorer is a simple tool that lets you explore different possible paths that an LLM might sample! - Arrow keys to navigate, pop and append tokens - View the token probabilities and entropies. GitHub: buff.ly/FQgsczM
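To make "token probabilities and entropies" concrete, here is a minimal sketch in plain Python. It is not token-explorer's own code; the logits and candidate tokens are made-up values for illustration. It turns raw scores into probabilities with a softmax and reports the Shannon entropy of the resulting distribution.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for four candidate next tokens (illustrative values only).
logits = {"cat": 2.0, "dog": 2.0, "car": 0.5, "the": -1.0}
probs = dict(zip(logits, softmax(list(logits.values()))))

for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{token!r}: {p:.3f}")
print(f"entropy: {entropy(probs.values()):.3f} bits")
```

Low entropy means the model is nearly certain about the next token; high entropy means several continuations are plausible, which is where exploring alternative paths is most interesting.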


David Berenstein davidberenstein.hf.co · Mar 7
🍽️ Let’s dissect the Synthetic Dataset Generator 💬 Natural language prompt to data 🦙 Ollama ensures secure local LLM inference ✍🏼 Argilla’s data curation capabilities complete the workflow 🔗 GitHub: buff.ly/5pX49Xc
GitHub - argilla-io/synthetic-data-generator: Build datasets using natural language

Build datasets using natural language. Contribute to argilla-io/synthetic-data-generator development by creating an account on GitHub.


David Berenstein davidberenstein.hf.co · Mar 5
🔥 Text2SQL: explore and share any data analysis! 🤗 Hugging Face Dataset Studio is an amazing new feature. 🚀 Try it yourself: buff.ly/pjpOKav


David Berenstein davidberenstein.hf.co · Mar 4
🔥 Vicinity: SEVEN semantic search BACK-ENDS, ONE single INTERFACE! 🫸 New release to push vector search to the Hub and work with any serialisable objects. 🧑‍🏫 KNN, HNSW, USEARCH, ANNOY, PYNNDESCENT, FAISS, and VOYAGER. 🔗 Library:
GitHub - MinishLab/vicinity: Lightweight Nearest Neighbors with Flexible Backends

Lightweight Nearest Neighbors with Flexible Backends - MinishLab/vicinity


David Berenstein davidberenstein.hf.co · Feb 27
🔥 NEW cool NO-CODE solution for clicking together AI WEB APPS! 🎨 Gradio released "gradio sketch" 🚼 Really easy way to create web apps with minimal code. ⚙️ Start with `pip install gradio` & `gradio sketch` 📒 Release: buff.ly/41aeLoA


David Berenstein davidberenstein.hf.co · Feb 27
Vector Search - let's keep it clean and lightweight! ⚡️ - <100K records, no problem! - >100K, some scaling issues - ANN DuckDB index, sub-second response times Notebook:
vector_search_with_hub_as_backend.ipynb

Run, share, and edit Python notebooks
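As a rough illustration of why small collections need no index at all, the sketch below does exact nearest-neighbour search by scanning every record and ranking by cosine similarity. It is not the notebook's code and does not use DuckDB; the record ids and vectors are hypothetical. A linear scan like this costs time proportional to the number of records, which is the scaling issue an approximate (ANN) index is meant to address.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query, records, k=3):
    """Exact nearest neighbours by cosine similarity using a linear scan."""
    q = normalize(query)
    scored = []
    for record_id, vec in records.items():
        v = normalize(vec)
        score = sum(a * b for a, b in zip(q, v))
        scored.append((score, record_id))
    scored.sort(reverse=True)  # highest similarity first
    return [(record_id, score) for score, record_id in scored[:k]]

# Hypothetical two-dimensional embeddings keyed by record id.
records = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(top_k([1.0, 0.0], records, k=2))
```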


David Berenstein davidberenstein.hf.co · Feb 25
🔥 The smolagents module has arrived in the agents course! 💻 Code agents optimised for software development 🔧 Tool calling agents that create modular, function-driven workflows 🔍 Retrieval agents designed to access and synthesise information Course: buff.ly/4kcj6Ai


David Berenstein davidberenstein.hf.co · Feb 25
🧑‍🏫 Awesome. My talk for PyCon Italy 2025 got accepted! Got data problems? Relax. Synthetic data is here to help. Talk: buff.ly/3QzoZKj


David Berenstein davidberenstein.hf.co · Feb 21
🐳 Announcing Docker support to quickly set up your Synthetic Data Generator (Gradio + Ollama + Argilla)! 🔥 Build genuinely useful datasets using natural language! ⚖️ Scale however you need. 🔐 Use them privately or share them with the world! 🧑‍💻 GitHub: buff.ly/49IDSmd


David Berenstein davidberenstein.hf.co · Feb 20
With 80K agent builders joining the agents course, it is time to make agents explorable on the Hub! You can now search and find the perfect agents and tools for your needs! Powered by @Gradio! Start searching:
smolagents and tools gallery - a Hugging Face Space by davidberenstein1957

Discover amazing ML apps made by the community


David Berenstein davidberenstein.hf.co · Feb 19
Image Generation has landed in Arena form 🎨🤖! 1. Describe your desired image🎨 2. Two anonymous models output images 3. Vote for the winner! Images have been sourced from our Open Image Preference dataset! Dataset: buff.ly/4il0du9 Arena: buff.ly/4142NwH


David Berenstein davidberenstein.hf.co · Feb 18
Are you top of the Agents class?! We just released a bonus unit on function calling (FC). You will learn: ⑴ What is FC? ⑵ Thought → Act → Observe Cycle in FC ⑶ Lightweight and efficient fine-tuning Course: buff.ly/3Qn1DHB


David Berenstein davidberenstein.hf.co · Feb 14
📹 In case you've missed the hype around smolagents, here is a presentation I gave yesterday at an MLOps community event! library: buff.ly/4hj6PrJ slides: buff.ly/3WUzZ8D video:
Smol Agents and Hugging Face - Anote AI Day Summit 2025


David Berenstein davidberenstein.hf.co · Feb 12
Slides for my MLOps community talk on smolagents! Slides: buff.ly/3WUzZ8D
from bells and whistles to agents and tools


David Berenstein davidberenstein.hf.co · Feb 12
🚀 Find banger tools for your smolagents! I created the Tools gallery, which makes tools specifically developed by/for smolagents searchable and visible. This will help with: - inspiration - best practices - finding cool tools Space: buff.ly/41cYctx


David Berenstein davidberenstein.hf.co · Feb 10
🔥 Come and get those AI agents certificates! Join the cohort of 66K students: buff.ly/4hxb6rK


David Berenstein davidberenstein.hf.co · Feb 10
Documents or images to structured data using Vision Language Models. Outlines has an integration with transformers, which facilitates structured generation based on limiting token sampling probabilities. Blog: buff.ly/4jFHMkr
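The general idea behind "limiting token sampling probabilities" can be sketched in a few lines of plain Python. This is not the Outlines API; the function name, tokens, and logits are assumptions for illustration. At each step, tokens that would break the target structure have their logits set to negative infinity, so the renormalised distribution assigns them zero probability.

```python
import math

def constrained_probs(logits, allowed):
    """Renormalise over allowed tokens only; disallowed tokens get probability 0."""
    if not any(token in allowed for token in logits):
        raise ValueError("at least one allowed token is required")
    # Mask disallowed tokens by setting their logits to -inf.
    masked = {t: (l if t in allowed else -math.inf) for t, l in logits.items()}
    peak = max(masked.values())  # finite, because one token is allowed
    exps = {t: math.exp(l - peak) for t, l in masked.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# Hypothetical step: the output must start a JSON value, so only "{" and "[" are valid.
logits = {"{": 1.0, "hello": 3.0, "[": 0.5}
print(constrained_probs(logits, allowed={"{", "["}))
```

Even though "hello" has the highest raw score here, it can never be sampled, which is what keeps the generated text parseable as structured data.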


David Berenstein davidberenstein.hf.co · Feb 10
Local Docker deployments for the synthetic data generator 🫱🏾‍🫲🏼 We would love to hear your thoughts! PR: buff.ly/4hRMny6


David Berenstein davidberenstein.hf.co · Feb 7
Wondering "Why 🚀"? The effortlessness of smolagents combined with the power of 400,000 AI tools available on the Hub! library: buff.ly/4hj6PrJ


David Berenstein davidberenstein.hf.co · Feb 6
WOW, this will rock the world! Hibiki is a model for simultaneous speech2speech translation. And it actually works. Available in French-English but super excited to see what the community will do. Hub: buff.ly/3EtmM0f Paper: buff.ly/4jIXNGd


David Berenstein davidberenstein.hf.co · Feb 6
Agentic RAG: Applied, visual, and step-by-step! 🐾 Get familiar with the Agents and tools, not the bells and whistles! Retrieve - Augment and now GENERATE. Parts: 1: buff.ly/40XNIxM 2: buff.ly/40HkB0x 3:
Agentic RAG Stack (3/5) - Generate responses using a SmolLM

A Blog post by David Berenstein on Hugging Face


David Berenstein davidberenstein.hf.co · Feb 6
🤯 Bring your own AI data, even if you have none! Describe your dataset for RAG, LLMs or Text Classification Bring your own context! Press play and wait Space: buff.ly/3Y1S99z GitHub: buff.ly/49IDSmd


David Berenstein davidberenstein.hf.co · Feb 5
Anyone can create free hosted tools for their AI agents! 🔥 Agentic RAG stack part 2 - augment. Augmenting retrieval results by reranking optimises content without adding much latency. part2: buff.ly/40HkB0x part1: buff.ly/40XNIxM code: buff.ly/4hEajpj


David Berenstein davidberenstein.hf.co · Feb 5
🔥 How to find and install the latest AI apps from the AI app store 1. go to buff.ly/42CnUbU 2. search for the app you like 3. go to the settings at the bottom 4. open the URL 5. press the search bar to install More info: buff.ly/3Csqc2J


David Berenstein davidberenstein.hf.co · Feb 4
Retrievers and rankers are a crucial part of optimising RAG. Easier to fine-tune than LLMs. More predictable than prompts. Training data is hard to find, so we offer private and free synthetic data on your own documents! Blog:
Fine-tune ModernBERT for RAG with Synthetic Data

A Blog post by Sara Han Díaz on Hugging Face


David Berenstein davidberenstein.hf.co · Feb 4
Creating an agentic RAG stack on the Hugging Face Hub - part 1 - retrieval (1/5). 🚀 Web apps and microservices included! Chunk, embed and index documents at a huge scale without overhead. Blog:
Index and retrieve documents for vector search using Sentence Transformers and DuckDB

A Blog post by David Berenstein on Hugging Face


David Berenstein davidberenstein.hf.co · Jan 30
Shit! 24B is the new small. Mistral drops their new model on Hugging Face! Great performance, and low latency. Model: buff.ly/4hwAzBa Code: buff.ly/3CEohrF


David Berenstein davidberenstein.hf.co · Jan 30
Deploy a DeepSeek Web App with minimal code! AI Gradio is a Python package that makes it easy for developers to create AI apps powered by various AI providers. Code: buff.ly/40BDsde Library: buff.ly/3CvOQ2n


David Berenstein davidberenstein.hf.co · Jan 30
No data for fine-tuning retrieval models? We help you generate it! - Load from Hub - Upload your own files - Generate from a prompt Space: buff.ly/3Y1S99z Code: buff.ly/3PRg4TX


David Berenstein davidberenstein.hf.co · Jan 29
The Game is Afoot! Qwen2.5-Max! A model that beats DeepSeek V3 on benchmarks. As of now, only available on Alibaba Cloud 🔐 Space: buff.ly/42xUhZ2 Blog:
Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model

It is widely recognized that continuously scaling both data size and model size can lead to significant improvements in model intelligence. However, the research and…


David Berenstein davidberenstein.hf.co · Jan 29
⚡️ Embed 1 million records in <10 minutes Load the data, use static embeddings, and reupload. Ready for vector search but might require some reranking. Library: buff.ly/42miwte


David Berenstein davidberenstein.hf.co · Jan 29
🔥 The synthetic data for SmolLM and open DeepSeek-R1 relies on this awesome package! - 1.2K distilabel datasets on the Hub: buff.ly/3PW46si - reproducible and sharable pipelines - any LLM provider - scale however you want library: buff.ly/3MXAB8G


David Berenstein davidberenstein.hf.co · Jan 28
Today, we are launching the integration of four awesome serverless Inference Providers – fal, Replicate, Sambanova, Together AI! Want to know how it works? Read the blog: buff.ly/3CreCES


David Berenstein davidberenstein.hf.co · Jan 28
🐳 DeepSeek is on Hugging Face 🤗 Free for inference! - 1K requests for free - 20K requests with PRO Code: buff.ly/4glAAa5 900 more models: buff.ly/40x1rua


David Berenstein davidberenstein.hf.co · Jan 28
🐳 DeepSeek-R1 is also available on your Apple device via Hugging Chat! And, so are Meta, Qwen, SmolLM and many many others! Perfect to test and compare for your use case without lock-in. App Store:
‎Hugging Chat

‎Chat for free with the best open source AIs from Meta, Microsoft, Google and Mistral! With Hugging Chat, you're in control of your AI assistants. Keep in your pocket the most popular open source…


Reposted by David Berenstein
Florent Daudens fdaudens.bsky.social · Jan 27
Yes, DeepSeek R1's release is impressive. But the real story is what happened in just 7 days after: Original release: 8 models, 540K downloads. Just the beginning... The community turned those open-weight models into +550 NEW models on @huggingface. Total downloads? 2.5M—nearly 5X the originals.



David Berenstein davidberenstein.hf.co · Jan 27
Let's uncover the post-training dataset from DeepSeek-R1 with Magpie! Pass pre-query tokens `<｜begin▁of▁sentence｜>User: ` and let the model generate the rest. We get realistic examples! Gist: buff.ly/40nPHu0 Library: buff.ly/3MXAB8G
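A minimal sketch of the pre-query trick described above, not the code from the gist or the library. The `complete` callable is a hypothetical stand-in for whatever text-completion backend you use, and `fake_complete` only returns a canned string so the example runs without a model. The only value taken from the post is the pre-query prefix itself.

```python
from typing import Callable

# Pre-query tokens quoted in the post: the chat template up to the start of the user turn.
PRE_QUERY = "<｜begin▁of▁sentence｜>User: "

def magpie_instruction(complete: Callable[[str], str], pre_query: str = PRE_QUERY) -> str:
    """Let the model continue the bare user-turn prefix and return the instruction it writes."""
    completion = complete(pre_query)
    return completion.strip()

def fake_complete(prompt: str) -> str:
    # Stand-in for a real model call (assumption); returns a canned continuation.
    return " What is the capital of France?\n"

print(magpie_instruction(fake_complete))
```

Because the prompt ends exactly where a user message would begin, an instruction-tuned model tends to fill in a plausible user query, which can then be sent back to the model to obtain a matching response.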


David Berenstein davidberenstein.hf.co · Jan 26
Is RAG less useful due to longer context models?! Qwen at least sees a place for competition and releases its long-context version of Qwen2.5, supporting 1M-token context lengths. 🔥 Models:
Qwen2.5-1M - a Qwen Collection

The long-context version of Qwen2.5, supporting 1M-token context lengths


David Berenstein davidberenstein.hf.co · Jan 25
Awesome! A fully open reproduction of DeepSeek-R1 by the Hugging Face Science team. Three steps - Distill data from R1 - RL pipeline to create R1-Zero - RL-tuned via multi-stage training Repo: buff.ly/4jtbp8x Paper session: buff.ly/4awj2H8
GitHub - huggingface/open-r1: Fully open reproduction of DeepSeek-R1

Fully open reproduction of DeepSeek-R1. Contribute to huggingface/open-r1 development by creating an account on GitHub.


David Berenstein davidberenstein.hf.co · Jan 24
🤯 Vector search on top of millions of docs in seconds. No pre-indexing! Model2Vec is an embedding powerhouse that distils good models and makes them up to 500x faster and 15x smaller. Vector Search on Hub Datasets demo: buff.ly/4gYhVlY Library: buff.ly/42miwte
Vectorsearch Hub Datasets - a Hugging Face Space by davidberenstein1957

Add vectors to Hub datasets and do in memory vector search.


David Berenstein davidberenstein.hf.co · Jan 23
You might have thought VLMs could not get smoller? 🐁 Hugging Face proves you wrong and launches SmolVLM 256M & 500M. You can fine-tune it on your laptop and run it on your toaster! 👇 🐘 Beats SOTA 80B from less than 2 years ago! Model: buff.ly/4g9bGur


David Berenstein davidberenstein.hf.co · Jan 23
ColPali and VLMs are great for multi-modal RAG with truly effective document retrieval. Want to set up this pipeline yourself? Read the blog: buff.ly/42rNPTG


David Berenstein davidberenstein.hf.co · Jan 22
For a while, companies have been showing off their AI competence on the Hub with their datasets, models, and Spaces. Now, you can do the same with more nuance by linking blogs to your organisation! blog: buff.ly/3C3IzLe


David Berenstein davidberenstein.hf.co · Jan 22
Bootstrap, optimise and maintain domain-specific embedding and reranking in your RAG pipeline through synthetic data generation and evaluation. RAG optimisation can start easily by focusing on smaller and more manageable models. notebook: buff.ly/3PRg4TX UI: buff.ly/3Y1S99z


David Berenstein davidberenstein.hf.co · Jan 21
The RAG's in the bag! You can now use the Synthetic Data Generator with your own domain-specific seed data to generate a dataset for fine-tuning a retrieval or reranking model. GitHub: buff.ly/49IDSmd Spaces: buff.ly/3Y1S99z

