Jade
Researcher @nousresearch.com
- "Has anyone tried... oh, it's already in the NanoGPT speedruns"
- the woke lesson
- Dynamically scaled softplus attention: builds on log-scaled attention, replacing the softmax exp with softplus but keeping the normalization. Reminiscent of sigmoid attention, which the paper reports also outperforms softmax when the same modifications are applied. arxiv.org/abs/2501.13428
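  A minimal sketch of the core change described above (my own illustration, not the paper's code; the dynamic/log scaling from the paper is omitted): swap softmax's exp for softplus but keep normalizing the weights over keys.
  ```python
  import torch
  import torch.nn.functional as F

  def softplus_attention(q, k, v):
      """Attention where softmax's exp() is swapped for softplus, while the
      weights are still normalized to sum to 1 over keys. Illustrative only;
      the paper's dynamic scaling term is not included here."""
      d = q.shape[-1]
      scores = q @ k.transpose(-2, -1) / d**0.5               # (batch, seq, seq)
      weights = F.softplus(scores)                             # non-negative, grows ~linearly rather than exponentially
      weights = weights / weights.sum(dim=-1, keepdim=True)    # keep the softmax-style normalization
      return weights @ v

  # usage
  q, k, v = (torch.randn(2, 8, 16) for _ in range(3))
  out = softplus_attention(q, k, v)   # (2, 8, 16)
  ```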
- Some focus has been given to sigmoid attention recently due to newer results, but I first came across it in Shatter from 2021, where it was proposed for single-headed attention. arxiv.org/abs/2108.13032
- New Mistral model just dropped mistral.ai/news/mistral...
- 24B
- A general framework for constructing sequence models via test-time regression. This sort of idea has been floating around for a while, but we haven't done much with it, aside from a few mostly-recent exceptions (e.g. TTT, Titans). arxiv.org/abs/2501.12352
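  One concrete instance of the test-time-regression framing, as a hedged sketch (my own illustration, not the paper's code): treat the layer's state as a linear memory W fitted online so that W @ k_t ≈ v_t, take one gradient step on the squared regression error per token (a delta-rule / fast-weight style update), and read the memory out with the query.
  ```python
  import numpy as np

  def test_time_regression_layer(keys, values, queries, lr=0.5):
      """Sequence layer as online (test-time) regression -- illustrative sketch.
      The state W is a linear memory fitted on the fly so that W @ k_t ~ v_t.
      Each step takes one gradient step on ||W k_t - v_t||^2 (the classic
      delta-rule update); the output at step t is W's prediction for q_t."""
      d_k, d_v = keys.shape[1], values.shape[1]
      W = np.zeros((d_v, d_k))
      outputs = []
      for k_t, v_t, q_t in zip(keys, values, queries):
          err = W @ k_t - v_t                # regression residual on the new (key, value) pair
          W = W - lr * np.outer(err, k_t)    # gradient step: W -= lr * (W k - v) k^T
          outputs.append(W @ q_t)            # read out the fitted memory with the query
      return np.stack(outputs)

  # usage
  rng = np.random.default_rng(0)
  T, d = 6, 4
  out = test_time_regression_layer(rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d)))
  print(out.shape)  # (6, 4)
  ```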
- like TTT?
- Interesting and somewhat unintuitive: Distilling reasoning by placing the teacher CoT post-answer performs better than prefixing, suggesting that the student isn't learning to do CoT. Further, permuting the CoT tokens doesn't harm performance, and only a small number of the tokens are needed.
- It's also not just a matter of more tokens: tokens not drawn from the CoT rationale don't match the CoT's performance, even though heavily corrupted CoTs work fine
- Qwen team at the forefront of open AI, as usual
- QwQ: Reflect Deeply on the Boundaries of the Unknown. What you see with these new models is that they don't need reflection tokens, but simply benefit a lot from self-talk. qwenlm.github.io/blog/qwq-32b...
- OpenAI really needs to rename
- There now seems to be an HF group for such projects: huggingface.co/bluesky-comm...
- I use "Popular with friends" by default. I really didn't like Discover
- technically I am not anon
- this was an instant follow
- arxiv.org/abs/2411.12537 arxiv.org/abs/2405.17394 Allowing gates to take negative values widens the class of regular languages the model can track, specifically enabling periodic state tracking (e.g. modular counting, bitstring parity). I had a similar thought back when I was first toying with linear RNNs, so I had tried tanh.
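  A toy illustration of the point (my own sketch, not from either paper): a diagonal linear RNN h_t = a_t * h_{t-1} + b_t * x_t can track bitstring parity if the gate on a '1' token is allowed to be -1 (tanh-range), which a gate confined to [0, 1] (sigmoid-range) cannot do.
  ```python
  def parity_with_signed_gate(bits):
      """Scalar diagonal linear RNN h_t = a_t * h_{t-1} (input term b_t = 0 here).
      Choosing a_t = -1 on every '1' bit flips the sign of the state, so
      sign(h) tracks parity. Gates restricted to [0, 1] can only shrink or
      preserve the state and can never implement this periodic flip."""
      h = 1.0  # positive state = even parity so far
      for x in bits:
          a = -1.0 if x == 1 else 1.0   # signed gate, only possible with a tanh-like range
          h = a * h
      return 0 if h > 0 else 1          # 0 = even number of ones, 1 = odd

  # usage
  print(parity_with_signed_gate([1, 0, 1, 1]))  # 1 (three ones -> odd)
  print(parity_with_signed_gate([1, 1, 0, 0]))  # 0 (two ones -> even)
  ```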
- Regarding periodicity in general, another thing I had experimented with at some point was just inserting periodic position encodings, inspired by arxiv.org/abs/2402.00236
- I wonder if keeping track of cumulative multipliers like in LRNNs, but then using them as a soft mask for attention, would help this in transformers. Sort of like CoPE arxiv.org/abs/2405.18719
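  A rough, purely speculative sketch of that idea (all names here are mine; gates are assumed to be per-token scalars in (0, 1)): accumulate log-gates along the sequence and add the query-to-key cumulative product to the attention logits, i.e. a soft, data-dependent decay mask on top of the usual causal mask.
  ```python
  import torch
  import torch.nn.functional as F

  def gated_soft_mask_attention(q, k, v, gates):
      """Speculative sketch: LRNN-style cumulative gate products as a soft attention mask.
      gates: (batch, seq) in (0, 1). The mask for query i on key j <= i is
      prod_{t=j+1..i} gates[t], applied in log space to the logits, so keys
      further back (or behind small gates) are softly down-weighted."""
      b, t, d = q.shape
      cum = torch.cumsum(torch.log(gates.clamp_min(1e-6)), dim=-1)   # prefix sums of log-gates
      log_mask = cum.unsqueeze(-1) - cum.unsqueeze(-2)               # [i, j] = sum of log gates over j+1..i
      causal = torch.tril(torch.ones(t, t, dtype=torch.bool, device=q.device))
      scores = q @ k.transpose(-2, -1) / d**0.5 + log_mask           # soft gate-derived mask
      scores = scores.masked_fill(~causal, float("-inf"))            # hard causal mask
      return F.softmax(scores, dim=-1) @ v

  # usage
  b, t, d = 1, 5, 8
  out = gated_soft_mask_attention(torch.randn(b, t, d), torch.randn(b, t, d),
                                  torch.randn(b, t, d), torch.rand(b, t) * 0.9 + 0.05)
  ```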
- try to fix NaNs -> model trains until it randomly kills the machine
  try to fix model killing the machine -> model too slow
  try to fix slowness -> NaN
  try to fix NaNs -> loss doesn't go down 🙃
- Yet another proposal for an attention variant allowing for negative weights arxiv.org/abs/2411.07176