AI Summary
- Josh McGrath of OpenAI describes post-training's evolution from the 2023 PPO-vs-DPO debates to the current RLVR era, where data quality and trust in the reward signal matter more than the choice of optimization method
- RLHF and RLVR are both policy gradient methods; the difference is the input data (verifiable signals such as math answers vs human preferences); GRPO, from the DeepSeek Math paper, represents an underappreciated shift toward trustworthy rewards
- Token efficiency now matters more than wall-clock time for scaling; GPT-5 to 5.1 improved evals while reducing token usage; Codex changed workflows from 40-minute design sessions to 15-minute agent sprints
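The GRPO method mentioned above replaces a learned value baseline with a group-relative advantage, which is part of why its rewards are easier to trust. A minimal sketch of that normalization step (the function name and example rewards are illustrative, not from the episode):

```python
# Sketch of GRPO's group-relative advantage, as described in DeepSeek Math.
# For each prompt, sample a group of completions, score each with a verifiable
# reward (e.g. exact-match on a math answer), and normalize rewards within
# the group -- no learned value model required.

def group_relative_advantages(rewards):
    """Normalize a group's rewards to zero mean and unit variance."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:  # every completion scored the same: no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one math prompt, reward 1.0 if verified correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions get positive advantages and incorrect ones negative, relative only to their own group, which keeps the signal grounded in the verifier rather than a separately trained critic.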
Guests on This Episode
Josh McGrath