[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang — Latent Space: The AI Engineer Podcast

✨ AI Summary

John Yang recaps SWE-bench's evolution from being largely ignored at its release (Oct 2023) to becoming an industry standard after Devin's launch, and its expansion from a Django-heavy task set to 9 languages across 40 repos
Discusses the limitations of unit tests for verification and proposes long-running agent tournaments (CodeClash) in which agents maintain codebases and compete iteratively
Details the proliferation of SWE-bench variants, including Pro, Live, Multimodal, and Multilingual versions, adopted by Cognition, OpenAI, and Anthropic for evaluating coding agents