AI Summary
- Discusses code editing benchmarks (WebArena, Sotopia), OpenDevin agent framework, and tensions between academic research and industry implementation of AI systems
- Covers SWE-bench for software engineering tasks, methods for detecting dataset contamination, the GAIA benchmark, and Moritz Hardt's research on the science of benchmarking
- Explores the Self-RAG approach to reasoning and post-training, examining how LLMs can learn to retrieve, generate, and critique their own outputs through self-reflection mechanisms
Guests on This Episode
- Aman Sanger (1 podcast appearance)
- Graham Neubig (1 podcast appearance)
- Moritz Hardt (1 podcast appearance)