AI Summary
- Discusses code editing benchmarks (WebArena, Sotopia), OpenDevin agent framework, and tensions between academic research and industry implementation of AI systems
- Covers SWE-bench for software engineering tasks, methods for detecting dataset contamination, the GAIA benchmark, and Moritz Hardt's research on the science of benchmarking
- Explores the Self-RAG approach to reasoning and post-training, examining how LLMs can learn to retrieve, generate, and critique their own outputs through self-reflection mechanisms
Guests on This Episode
- Aman Sanger (1 podcast appearance)
- Graham Neubig (1 podcast appearance)
- Moritz Hardt (1 podcast appearance)