✨ AI Summary
- Clémentine Fourrier leads Hugging Face's Open LLM Leaderboard, which standardizes model evaluation on high-quality benchmarks with reproducible, centralized scoring, replacing lab-specific self-reports
- Frontier models have plateaued at ~90% on MMLU and HumanEval, indicating that these benchmarks are stale and that models are likely memorizing them; leaderboards address the non-reproducibility of self-reported scores
- The evolution from static benchmarks to dynamic leaderboards to arena-based evaluation addresses how the pace of model development has outstripped benchmark updates