AI Summary
- John Yang recaps how SWE-bench went from being ignored in October 2023 to becoming an industry standard after Devin's launch, and how it expanded from a Django-heavy task set to 9 languages across 40 repos
- Discusses the limitations of unit tests for verification and proposes long-running agent tournaments (CodeClash) in which agents maintain codebases and compete iteratively
- Details the proliferation of SWE-bench variants, including Pro, Live, and Multimodal/Multilingual versions, which Cognition, OpenAI, and Anthropic have adopted for evaluating coding agents
Guests on This Episode
John Yang
1 podcast appearance