Benjamin Warner
R&D at answer.ai
- There isn't a canonical version, but there are retrieval models from GTE and Nomic which might work for your task. GTE: huggingface.co/Alibaba-NLP/... Nomic: huggingface.co/nomic-ai/mod...
- One of the questions we debated while training ModernBERT was whether a modern encoder could unlock zero-shot reasoning using only its generative head. Spoilers: the answer is yes.
- Can all encoders be instruction-tuned? Can we replicate ModernBERT's results with an older model like RoBERTa or a peer model like GTE-en-MLM? No. And it's not close.
- For more details, including our simple training method, see Benjamin Clavié's twitter announcement, our model, blog post, and paper. Twitter: x.com/bclavie/stat... Model: huggingface.co/answerdotai/... Blog: www.answer.ai/posts/2025-0... Paper: arxiv.org/abs/2502.03793
- After instruction tuning on Flan, ModernBERT-Large-Instruct outperforms similarly sized LLMs on MMLU & MMLU-Pro, and achieves ~90 percent of Llama 3.2 1B's performance with ~65 percent fewer parameters.
- When we finetune ModernBERT-Large-Instruct on task-specific datasets, the generative MLM head is better than or nearly equal to standard classification heads.
- With @bclavie.bsky.social and @ncoop57.bsky.social, we tried to answer two questions: - Can an instruction-tuned ModernBERT zero-shot tasks using the MLM-head? - Could we then fine-tune instruction-tuned ModernBERT to complete any task? Detailed answers: arxiv.org/abs/2502.03793
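A minimal sketch of what zero-shot answering through the MLM head looks like, assuming the instruct checkpoint lives at answerdotai/ModernBERT-Large-Instruct (the links above are truncated) and using a made-up prompt template; the paper's actual template may differ:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-Large-Instruct"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Multiple-choice question phrased so the answer letter lands on the mask token
prompt = (
    "Question: What is the capital of France?\n"
    "A. Berlin\nB. Paris\nC. Madrid\nD. Rome\n"
    f"Answer: {tokenizer.mask_token}"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Score only the candidate answer letters at the masked position
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
choices = ["A", "B", "C", "D"]
# Encode with a leading space so a byte-level BPE picks the right token variant
choice_ids = [tokenizer.encode(" " + c, add_special_tokens=False)[0] for c in choices]
pred = choices[logits[0, mask_pos, choice_ids].argmax().item()]
print(pred)  # ideally "B"
```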
- In addition to being the best retrieval model under 300M params on MTEB (without extra work) and top 10 under 1B, here's a fun tidbit from Alibaba's GTE ModernBERT model card: gte-modernbert-base beats gte-qwen1.5-7b on LoCo long-context retrieval with roughly 7B fewer parameters.
- You can find the models on Hugging Face here: - gte-modernbert-base: huggingface.co/Alibaba-NLP/... - gte-reranker-modernbert-base: huggingface.co/Alibaba-NLP/...
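For anyone who just wants embeddings, here is a minimal retrieval sketch; the repo id Alibaba-NLP/gte-modernbert-base is inferred from the truncated links above:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed repo id, based on the (truncated) model card links above
model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")

docs = [
    "ModernBERT natively supports an 8192-token context window.",
    "The original BERT-base is limited to 512 tokens.",
]
query = "Which encoder handles long context?"

doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

scores = doc_emb @ query_emb  # cosine similarity, since embeddings are normalized
print(docs[int(np.argmax(scores))])
```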
- ModernBERT is officially released in Transformers v4.48.0. You no longer need to install from git to use it. If you are plugging ModernBERT into an existing encoder finetuning pipeline, try increasing the learning rate. We've found that ModernBERT tends to prefer a higher LR than older models.
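If it helps, here is roughly what "try a higher LR" looks like in a standard Trainer setup; the checkpoint id and the exact numbers are illustrative, not a tuned recipe:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "answerdotai/ModernBERT-base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

args = TrainingArguments(
    output_dir="modernbert-classifier",
    learning_rate=8e-5,   # older encoders are often tuned around 2e-5; try 5e-5 to 1e-4 here
    num_train_epochs=3,
    per_device_train_batch_size=32,
    bf16=True,
)

# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=..., tokenizer=tokenizer)
# trainer.train()
```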
- What's ModernBERT? It's a drop-in replacement for existing BERT models, but smarter, faster, and supports longer context. Check out our announcement post for more details: huggingface.co/blog/modernb...
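The "drop-in replacement" part, sketched: swap the checkpoint id in an existing fill-mask pipeline (repo id assumed from the announcement):

```python
from transformers import pipeline

# Assumed repo id; any BERT-style checkpoint could sit in this slot
fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

masked = f"The capital of France is {fill.tokenizer.mask_token}."
for candidate in fill(masked, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```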
- The good: 32GB The bad: $2,000 The Ugly*: PCIe 5 without NVLink
- *Actually, that’s good compared to the 4090’s PCIe 4 without NVLink
- This week we released ModernBERT, the first encoder to reach SOTA on most common benchmarks across language understanding, retrieval, and code, while running twice as fast as DeBERTaV3 on short context and three times faster than NomicBERT & GTE on long context.
- ModernBERT is a “foundation model” so you’ll either need to finetune it for entailment/NLI or wait for someone else to finetune it. I suspect it would be good at NLI once finetuned.
- We evaluated ModernBERT on MLDR with ColBERT-style retrieval using that code. That process was smaller scale than a full ColBERT finetune, which would need additional contrastive training, likely multiple teacher models, etc., as detailed here by @bclavie.bsky.social www.answer.ai/posts/2024-0...
- Thanks. ModernBERT is a base model. It’ll need additional contrastive pretraining to really shine as a retrieval model, but our early results in the paper look promising. Hopefully there will be multiple open source retrieval tuned models to choose from early next year, including ColBERT finetunes.
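To make "additional contrastive pretraining" concrete, here is a toy in-batch-negatives sketch with sentence-transformers; this is just the general shape of contrastive training, not our recipe:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base checkpoint; loading a plain encoder adds a default mean-pooling head
model = SentenceTransformer("answerdotai/ModernBERT-base")

# Toy (query, positive passage) pairs; real runs use millions of mined pairs
train_examples = [
    InputExample(texts=["what is modernbert", "ModernBERT is a fast long-context encoder."]),
    InputExample(texts=["capital of france", "Paris is the capital of France."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch items act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```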
- Thanks for the kind words. We tried to fit as much information within our page limit as possible and have a comprehensive appendix. As far as the name goes, all I’ll say is be careful not to use an overly strong code name.
- Thanks. It’ll need additional contrastive pretraining to really shine as a retrieval model, but our early results look promising. Hopefully there will be multiple open source retrieval tuned models to choose from early next year.
- (early results in our paper)
- I'm looking forward to seeing what you all will build with a modern encoder.
- PS: BlueSky needs to make their really long account tags not count against the character limit.
- Thanks to my two co-leads: @nohtow.bsky.social, @bclavie.bsky.social, & the rest of our stacked author cast: @orionweller.bsky.social, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, @tomaarsen.com, @ncoop57.bsky.social, Griffin Adams, @howard.fm, & Iacopo Poli
- A big thanks to Iacopo Poli and @lightonai.bsky.social for sponsoring the compute to train ModernBERT, @bclavie.bsky.social for organizing the ModernBERT project, and to everyone who offered assistance and advice along the way. Also h/t to Johno Whitaker for the illustrations.
- Last, we trained ModernBERT on a variety of data sources, including web docs, code, & scientific articles, for a total of 2 trillion tokens of English text & code: 1.7 trillion tokens at a short 1024 sequence length, followed by 300 billion tokens at a long 8192 sequence length.
- For all the model design, training, and evaluation details, check out our Arxiv preprint: arxiv.org/abs/2412.13663
- How did we do it? First, we brought all the modern LLM architectural improvements to encoders, including alternating global & local attention, RoPE, and GeGLU layers, and added full model unpadding using Flash Attention for maximum performance (illustrated in the next post).
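For a flavor of one of those pieces, here is a rough GeGLU feed-forward block in PyTorch; this is a generic sketch, not ModernBERT's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    """Gated GELU feed-forward: one fused input projection, split into gate and value."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.wi = nn.Linear(dim, 2 * hidden, bias=False)  # projects to gate + value at once
        self.wo = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.wi(x).chunk(2, dim=-1)
        return self.wo(F.gelu(gate) * value)

block = GeGLU(dim=768, hidden=1152)
print(block(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```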
- Second, we carefully designed ModernBERT's architecture to run efficiently across most common GPUs. Many older models don't consider the hardware they will run on and are slower than they should be. Not so with ModernBERT. (Full model sequence packing illustrated below)
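A conceptual sketch of the unpadding/sequence-packing idea referenced above: drop pad tokens from the batch and keep cumulative sequence offsets so a varlen attention kernel (e.g. Flash Attention's) can still attend within each original sequence. The real implementation differs in the details:

```python
import torch

def unpad_batch(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Flatten a padded batch into one packed sequence plus cu_seqlens offsets."""
    seqlens = attention_mask.sum(dim=1)                    # true length of each sequence
    packed_ids = input_ids[attention_mask.bool()]          # 1D tensor, padding removed
    cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0), (1, 0))  # [0, l1, l1+l2, ...]
    return packed_ids, cu_seqlens.to(torch.int32)

batch = torch.tensor([[101, 7, 8, 102, 0, 0],
                      [101, 5, 102, 0, 0, 0]])
mask = (batch != 0).long()
packed, cu = unpad_batch(batch, mask)
print(packed.tolist(), cu.tolist())  # [101, 7, 8, 102, 101, 5, 102] [0, 4, 7]
```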
- ModernBERT-base is the first encoder to beat DeBERTaV3-base on GLUE. ModernBERT is also competitive or top scoring on single vector retrieval, ColBERT retrieval, and programming benchmarks.
- ModernBERT was designed from the ground up for speed and memory efficiency. ModernBERT is both faster and more memory efficient than every major encoder released since the original BERT.
- ModernBERT is available to use today on Transformers (pip install from main). More details in our announcement post. huggingface.co/blog/modernb...
- Good codenames are dangerous, as they have staying power.
- I feel the need for speed.