- OpenAI's o3: Over-optimization is back and weirder than ever.
  1. RL for control: happens because our environments are brittle and tasks are unrealistic.
  2. RLHF: happens because our reward functions suck.
  3. RLVR: happens and makes our models super effective and weird as f. buff.ly/W9vcF9S
- did you define what rlvr was in the writeup?
- No lol was lazy, I should edit (Apr 19, 2025 18:56)