Latent Space: The AI Engineer Podcast

[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang

Dec 31, 2025 · 17m
AI Summary
  • John Yang recaps how SWE-bench went from largely ignored at its October 2023 release to an industry standard after Devin's launch, and how it expanded from a Django-heavy dataset to 9 languages across 40 repositories
  • Discusses the limitations of unit tests as a verification signal and proposes long-running agent tournaments (CodeClash), in which agents maintain their own codebases and compete over repeated rounds
  • Details the proliferation of SWE-bench variants, including Pro, Live, Multimodal, and Multilingual versions, adopted by Cognition, OpenAI, and Anthropic for evaluating coding agents

Guests on This Episode

John Yang