- OpenAI's o3: Over-optimization is back and weirder than ever.
  1. RL for control: happens because our environments are brittle and tasks are unrealistic.
  2. RLHF: happens because our reward functions suck.
  3. RLVR: happens and makes our models super effective and weird as f. buff.ly/W9vcF9S
- did you define what rlvr was in the writeup?
- No lol was lazy, I should edit (Apr 19, 2025 18:56)