AI Summary
- Josh McGrath of OpenAI describes post-training's evolution from the 2023 PPO-vs-DPO debates to the current RLVR era, where data quality and trust in the reward signal matter more than the choice of optimization method
- RLHF and RLVR are both policy gradient methods; the difference is the input data (verifiable signals such as math answers vs human preferences); GRPO, from the DeepSeek Math paper, represents an underappreciated shift toward trustworthy rewards
- Token efficiency now matters more than wall-clock time for scaling; GPT-5 to 5.1 improved evals while reducing token usage; Codex changed workflows from 40-minute design sessions to 15-minute agent sprints
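The GRPO method mentioned above replaces a learned value baseline with a group-relative advantage, which is part of why its rewards are easier to trust. A minimal sketch of that normalization step (the function name and example rewards are illustrative, not from the episode):

```python
# Sketch of GRPO's group-relative advantage, as described in DeepSeek Math.
# For each prompt, sample a group of completions, score each with a verifiable
# reward (e.g. exact-match on a math answer), and normalize rewards within
# the group -- no learned value model required.

def group_relative_advantages(rewards):
    """Normalize a group's rewards to zero mean and unit variance."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:  # every completion scored the same: no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one math prompt, reward 1.0 if verified correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions get positive advantages and incorrect ones negative, relative only to their own group, which keeps the signal grounded in the verifier rather than a separately trained critic.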
Guests on This Episode
Josh McGrath