Jade
Researcher @nousresearch.com
- Reposted by Jade: We're excited to round out the OLMo 2 family with its smallest member, OLMo 2 1B, surpassing peer models like Gemma 3 1B or Llama 3.2 1B. The 1B model should enable rapid iteration for researchers, more local development, and a more complete picture of how our recipe scales.
- Reposted by Jade: We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
- Reposted by Jade: I really feel like fashion retail websites should let you browse in latent space. Don't select a category like "shirt". Instead, see some product images, and select one that's a shirt. Then see a grid of shirts, and pick your favorites. See a grid of shirts like these. Repeat.
- "Has anyone tried... oh, it's already in the NanoGPT speedruns"
- Dynamically scaled softplus attention: builds on log-scaled attention, replacing the softmax exp with softplus while keeping the normalization. Reminiscent of sigmoid attention, which the paper also reports as outperforming softmax when the same modifications are applied. arxiv.org/abs/2501.13428
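A minimal sketch of the core swap described above, in case it helps: keep the row-wise normalization of attention but replace the exp inside softmax with softplus. The log(n)-based scale factor is my reading of the "dynamically scaled" part and may not match the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def softplus_attention(q, k, v):
    """Sketch: softmax's exp() swapped for softplus, normalization kept.
    The log(n) scaling is an assumption, not the paper's exact recipe."""
    d = q.shape[-1]
    n = k.shape[-2]
    scale = torch.log(torch.tensor(float(n))) / d ** 0.5    # assumed dynamic scaling
    scores = (q @ k.transpose(-2, -1)) * scale
    weights = F.softplus(scores)                             # exp -> softplus
    weights = weights / weights.sum(dim=-1, keepdim=True)    # keep normalization
    return weights @ v

# toy usage: 2 heads, 5 tokens, head dim 16
q, k, v = (torch.randn(2, 5, 16) for _ in range(3))
out = softplus_attention(q, k, v)    # shape (2, 5, 16)
```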
- New Mistral model just dropped mistral.ai/news/mistral...
- Reposted by Jade: Recent AI breakthroughs challenge the status quo narrative that only closed, mega labs have the ability to push the frontier of superintelligence. Today we announce Nous Psyche built on @solana.com www.youtube.com/watch?v=XMWI...
- A general framework for constructing sequence models via test-time regression. This sort of idea has been floating around for a while, but we haven't done much with it, aside from a few mostly recent exceptions (e.g. TTT, Titans). arxiv.org/abs/2501.12352
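To make the framing concrete, here is a hedged one-layer sketch (mine, not the paper's code): the model's memory is a regressor fitted on (key, value) pairs at test time, and the output is its prediction at the query. One gradient step per token on the squared loss gives a delta-rule / linear-attention-style update; swapping the regressor or optimizer is what yields other members of the family.

```python
import numpy as np

def test_time_regression_layer(keys, values, queries, lr=1.0):
    """Memory = a linear regressor value ~ W @ key, fitted online at test time."""
    d_k, d_v = keys.shape[1], values.shape[1]
    W = np.zeros((d_v, d_k))
    outputs = []
    for k, v, q in zip(keys, values, queries):
        W += lr * np.outer(v - W @ k, k)   # one GD step on 0.5 * ||v - W @ k||^2
        outputs.append(W @ q)              # read out: predicted value at the query
    return np.stack(outputs)

# toy usage
rng = np.random.default_rng(0)
T, d = 8, 4
out = test_time_regression_layer(rng.normal(size=(T, d)),
                                 rng.normal(size=(T, d)),
                                 rng.normal(size=(T, d)))
print(out.shape)   # (8, 4)
```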
- Reposted by Jade: the o3 announcement feels very analogous to the GPT-3 announcement in early 2020: "here is new tech, it's very powerful, but it's super expensive to run and we're super selective with who we give access to"
- Reposted by Jade: “They said it could not be done”. We’re releasing Pleias 1.0, the first suite of models trained on open data (either permissively licensed or uncopyrighted): Pleias-3b, Pleias-1b and Pleias-350m, all based on the two-trillion-token set from Common Corpus.
- Interesting and somewhat unintuitive: Distilling reasoning by placing the teacher CoT post-answer performs better than prefixing, suggesting that the student isn't learning to do CoT. Further, permuting the CoT tokens doesn't harm performance, and only a small number of the tokens are needed.
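For concreteness, a hedged sketch of the two target formats being compared (templates and names are illustrative, not taken from the paper): with the CoT placed after the answer, the reasoning tokens cannot causally feed into the answer tokens at generation time, yet training on them still helps.

```python
def prefix_cot_target(question: str, teacher_cot: str, answer: str) -> str:
    # student generates the reasoning first, so the answer can attend to the CoT
    return f"{question}\nReasoning: {teacher_cot}\nAnswer: {answer}"

def post_answer_cot_target(question: str, teacher_cot: str, answer: str) -> str:
    # student commits to the answer before the CoT appears, so the CoT acts
    # more like an auxiliary training signal than a scratchpad
    return f"{question}\nAnswer: {answer}\nReasoning: {teacher_cot}"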
- Qwen team at the forefront of open AI, as usual
- QwQ: Reflect Deeply on the Boundaries of the Unknown. What you see with these new models is that they don't need reflection tokens, but simply benefit a lot from self-talk. qwenlm.github.io/blog/qwq-32b...
- Reposted by Jade: How to publish your Bluesky scrape without anybody noticing: zenodo.org/records/1108...
- There now seems to be an HF group for such projects: huggingface.co/bluesky-comm...
- Reposted by Jade: there's a new kaggle challenge for building an efficient chess bot that fits within 64KiB www.kaggle.com/competitions...
- Reposted by Jade: So first version of an ml anon starter pack. go.bsky.app/VgWL5L Kept half-anons (like me and Vic). Not all anime pfp, but generally drawn. at://did:plc:vg3thtvfbgfrr3u6pf6hy3yk/app.bsky.graph.starterpack/3lbphjvucu32k
- arxiv.org/abs/2411.12537 arxiv.org/abs/2405.17394 Allowing gates to take negative values widens the class of regular languages linear RNNs can recognize, specifically enabling periodic state tracking (e.g. modular counting, bitstring parity). I had a similar thought back when I was first toying with linear RNNs, so I had tried tanh.
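A tiny illustration (mine, not from either paper) of why the sign matters: a one-dimensional gated linear recurrence can track bitstring parity only if its gate can take the value -1; gates confined to [0, 1] can only shrink or preserve the state and never make it oscillate.

```python
def parity_via_linear_rnn(bits):
    """Linear recurrence h_t = a(x_t) * h_{t-1} with an input-dependent gate
    in {-1, +1}; the sign of the final state encodes the parity of the input."""
    h = 1.0
    for x in bits:
        gate = -1.0 if x == 1 else 1.0   # flips sign on every 1
        h = gate * h                      # purely linear update, no nonlinearity
    return 0 if h > 0 else 1

assert parity_via_linear_rnn([1, 0, 1, 1]) == 1   # three ones -> odd parity
assert parity_via_linear_rnn([1, 1, 0, 0]) == 0   # two ones  -> even parity
```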
- try to fix NaNs -> model trains until it randomly kills the machine
  try to fix model killing the machine -> model too slow
  try to fix slowness -> NaN
  try to fix NaNs -> loss doesn't go down 🙃
- Yet another proposal for an attention variant allowing for negative weights arxiv.org/abs/2411.07176