Latent Space: The AI Engineer Podcast

FlashAttention 2: making Transformers 800% faster w/o approximation - with Tri Dao of Together AI

Jul 26, 2023 · 54m
AI Summary
  • Tri Dao explains FlashAttention: an I/O-aware attention algorithm that cuts attention memory from O(N²) to O(N) in sequence length while computing exact attention, with no approximation (a minimal sketch of the idea follows this list)
  • FlashAttention-2 achieves speedups of up to 800%; the technique has been adopted by most open models (LLaMA, Falcon, RedPajama, MPT) and has become a foundational optimization for LLM efficiency
  • The episode launches a Papers Explained series covering foundational research; FlashAttention demonstrates how algorithmic innovation can dramatically improve practical LLM deployment at scale
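
To make the memory claim concrete, here is a minimal NumPy sketch of the online-softmax tiling idea behind FlashAttention: keys and values are processed in blocks and per-row softmax statistics are accumulated as you go, so the full N×N score matrix is never materialized. The function name, block size, and test harness are illustrative assumptions, not the CUDA kernels discussed in the episode (which also tile over queries and explicitly manage GPU memory movement).

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Exact attention computed tile-by-tile with an online softmax.

    Illustrative sketch only: tiles over keys/values and keeps running
    softmax statistics per query row, so no N x N matrix is ever built.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per row

    for start in range(0, n, block_size):
        Kb = K[start:start + block_size]      # one tile of keys
        Vb = V[start:start + block_size]      # matching tile of values
        scores = (Q @ Kb.T) * scale           # (n, block) partial scores

        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)        # rescale prior state
        probs = np.exp(scores - new_max[:, None])     # unnormalized tile probs

        row_sum = row_sum * correction + probs.sum(axis=1)
        out = out * correction[:, None] + probs @ Vb
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against a naive O(N^2)-memory reference implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
scores = (Q @ K.T) / np.sqrt(32)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (weights / weights.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```

The key design point the episode emphasizes is that this rescaling trick yields the exact same result as standard attention; the speedup comes from avoiding reads and writes of the large intermediate matrix, not from approximating it.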
