Hardware-Efficient Attention for Fast Decoding

Abstract

LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the interplay among arithmetic intensity, parallelization, and model quality and question whether current architectures fully exploit modern hardware. This work redesigns attention to perform more computation per byte loaded from memory to maximize hardware efficiency without trading off parallel scalability. We first propose Grouped-Tied Attention (GTA), a simple variant that combines and reuses key and value states, reducing memory transfers without compromising model quality. We then introduce Grouped Latent Attention (GLA), a parallel-friendly latent attention paired with low-level optimizations for fast decoding while maintaining high model quality. Experiments show that GTA matches Grouped-Query Attention (GQA) quality while using roughly half the KV cache and that GLA matches Multi-head Latent Attention (MLA) and is easier to shard. Our optimized GLA kernel is up to 2$\times$ faster than FlashMLA, for example, in a speculative decoding setting when the query length exceeds one. Furthermore, by fetching a smaller KV cache per device, GLA reduces end-to-end latency and increases throughput in online serving benchmarks by up to 2$\times$.
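
To make the "roughly half the KV cache" claim concrete, the following is a back-of-envelope sketch, not the paper's implementation. It assumes that tying keys and values means each KV head stores one shared state instead of separate key and value states, and it ignores any extra positional components a real design may keep. The function name and the model dimensions (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are hypothetical values chosen for illustration.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim,
                             bytes_per_elem=2, tied=False):
    """Bytes of KV cache written per generated token (illustrative only).

    tied=False: separate key and value states per KV head (GQA-style).
    tied=True:  ASSUMPTION: one shared state per KV head stands in for both.
    """
    states_per_head = 1 if tied else 2
    return n_layers * n_kv_heads * head_dim * states_per_head * bytes_per_elem


# Hypothetical model configuration, not taken from the paper.
gqa_bytes = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
tied_bytes = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128,
                                      tied=True)

print(f"separate K and V: {gqa_bytes} bytes/token")
print(f"tied K and V:     {tied_bytes} bytes/token")
print(f"ratio:            {tied_bytes / gqa_bytes:.2f}")
```

Because decoding must read this cache from memory for every generated token, halving the bytes per token roughly halves the memory traffic for the same attention computation, which is the sense in which such a design performs more computation per byte loaded.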
