Latent Space: The AI Engineer Podcast

Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge

Jul 12, 2024 · 58m
AI Summary
  • Clémentine Fourrier leads Hugging Face's Open LLM Leaderboard, which standardizes model evaluation by running a fixed suite of high-quality benchmarks with reproducible, centralized scoring, replacing hard-to-compare lab-reported numbers (a minimal reproduction sketch follows this summary)
  • Frontier models have plateaued at ~90% on MMLU and HumanEval, suggesting these benchmarks are stale and that models are likely memorizing test data; centralized leaderboards also address the non-reproducibility of self-reported scores
  • The evolution from static benchmarks to dynamic leaderboards to arena-based evaluation reflects how the pace of model development has outstripped benchmark updates
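
The leaderboard's reproducible scoring is built on EleutherAI's lm-evaluation-harness. As a minimal sketch of what "centralized, reproducible scoring" looks like in practice (the model and task here are illustrative choices, not from the episode), a benchmark score can be reproduced locally with:

  # install EleutherAI's evaluation harness (published on PyPI as lm-eval)
  pip install lm-eval

  # illustrative model and task; substitute any Hugging Face model id
  lm_eval --model hf \
      --model_args pretrained=mistralai/Mistral-7B-v0.1 \
      --tasks mmlu \
      --num_fewshot 5 \
      --batch_size 8

Pinning the harness version, tasks, and few-shot settings is what makes scores comparable across models, which is the point of centralizing evaluation rather than trusting each lab's own setup.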
