DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Abstract

Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a. Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, which primarily focus on recent tokens and attention sinks and are referred to as Streaming Heads, do not require full attention. Based on this insight, we introduce DuoAttention, a framework that applies a full KV cache only to retrieval heads while using a lightweight, constant-length KV cache for streaming heads. This reduces both the decoding and pre-filling memory and latency of LLMs without compromising their long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method reduces long-context inference memory by up to 2.55x for MHA models and 1.67x for GQA models, speeds up decoding by up to 2.18x and 1.50x, and accelerates pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with a context length of 3.3 million tokens on a single A100 GPU. Code is available at https://github.com/mit-han-lab/duo-attention.
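
The two cache policies described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `HeadKVCache`, the helper `fill_caches`, and the default sizes `sink_size=4` and `recent_size=8` are hypothetical choices made only for illustration. A retrieval head keeps every token, while a streaming head keeps only the first few tokens (attention sinks) plus a window of the most recent tokens, so its cache length stays constant.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class HeadKVCache:
    """Toy KV cache for a single attention head (hypothetical illustration)."""

    is_retrieval: bool
    sink_size: int = 4    # assumed number of attention-sink tokens to keep
    recent_size: int = 8  # assumed number of most recent tokens to keep
    entries: List[Tuple[int, Any, Any]] = field(default_factory=list)

    def append(self, position: int, key: Any, value: Any) -> None:
        """Store one token's key/value, then evict if this is a streaming head."""
        self.entries.append((position, key, value))
        if self.is_retrieval:
            return  # retrieval head: full cache, nothing is evicted
        limit = self.sink_size + self.recent_size
        if len(self.entries) > limit:
            # Streaming head: keep the sink tokens plus the recent window.
            sinks = self.entries[: self.sink_size]
            recent = self.entries[len(self.entries) - self.recent_size:]
            self.entries = sinks + recent

    def positions(self) -> List[int]:
        """Return the token positions currently held in the cache."""
        return [pos for pos, _, _ in self.entries]


def fill_caches(num_tokens: int, retrieval_flags: List[bool]) -> List[HeadKVCache]:
    """Feed num_tokens placeholder tokens through one cache per head."""
    caches = [HeadKVCache(is_retrieval=flag) for flag in retrieval_flags]
    for pos in range(num_tokens):
        for cache in caches:
            # Strings stand in for the real key/value tensors.
            cache.append(pos, f"k{pos}", f"v{pos}")
    return caches


if __name__ == "__main__":
    retrieval_head, streaming_head = fill_caches(100, [True, False])
    print(len(retrieval_head.entries))  # grows with context length
    print(streaming_head.positions())   # sinks plus the recent window only
```

With these assumed sizes, the streaming head's cache holds 12 entries regardless of context length, while the retrieval head's cache grows linearly. The memory saving therefore depends on what fraction of heads are classified as streaming heads.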
