Latent Space: The AI Engineer Podcast

FlashAttention 2: making Transformers 800% faster w/o approximation - with Tri Dao of Together AI

Jul 26, 2023 · 54m
AI Summary
  • Tri Dao explains FlashAttention: an I/O-aware attention algorithm that cuts attention memory from O(N²) to O(N) in sequence length while computing exact attention, with no approximation (a minimal sketch of the idea follows this list)
  • FlashAttention-2 achieves speedups of up to 800%; the technique has been adopted by most open models (LLaMA, Falcon, RedPajama, MPT) and has become a foundational optimization for LLM efficiency
  • The episode launches a Papers Explained series covering foundational research; FlashAttention demonstrates how algorithmic innovation can dramatically improve practical LLM deployment at scale
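
To make the memory claim concrete, here is a minimal NumPy sketch of the online-softmax tiling idea behind FlashAttention: keys and values are processed in blocks and per-row softmax statistics are accumulated as you go, so the full N×N score matrix is never materialized. The function name, block size, and test harness are illustrative assumptions, not the CUDA kernels discussed in the episode (which also tile over queries and explicitly manage GPU memory movement).

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Exact attention computed tile-by-tile with an online softmax.

    Illustrative sketch only: tiles over keys/values and keeps running
    softmax statistics per query row, so no N x N matrix is ever built.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per row

    for start in range(0, n, block_size):
        Kb = K[start:start + block_size]      # one tile of keys
        Vb = V[start:start + block_size]      # matching tile of values
        scores = (Q @ Kb.T) * scale           # (n, block) partial scores

        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)        # rescale prior state
        probs = np.exp(scores - new_max[:, None])     # unnormalized tile probs

        row_sum = row_sum * correction + probs.sum(axis=1)
        out = out * correction[:, None] + probs @ Vb
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against a naive O(N^2)-memory reference implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
scores = (Q @ K.T) / np.sqrt(32)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (weights / weights.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```

The key design point the episode emphasizes is that this rescaling trick yields the exact same result as standard attention; the speedup comes from avoiding reads and writes of the large intermediate matrix, not from approximating it.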
