Abstract
Transformer-based large Language Models (LLMs) become increasingly importantin various domains. However, the quadratic time complexity of attentionoperation poses a significant challenge for scaling to longer contexts due tothe extremely high inference latency and GPU memory consumption for cachingkey-value (KV) vectors. This paper proposes RetrievalAttention, a training-freeapproach to accelerate attention computation. To leverage the dynamic sparseproperty of attention, RetrievalAttention builds approximate nearest neighborsearch (ANNS) indexes upon KV vectors in CPU memory and retrieves the mostrelevant ones via vector search during generation. Due to theout-of-distribution (OOD) between query vectors and key vectors, off-the-shelfANNS indexes still need to scan O(N) (usually 30% of all keys) data foraccurate retrieval, which fails to exploit the high sparsity.RetrievalAttention first identifies the OOD challenge of ANNS-based attention,and addresses it via an attention-aware vector search algorithm that can adaptto queries and only access 1--3% of data, thus achieving a sub-linear timecomplexity. RetrievalAttention greatly reduces the inference cost oflong-context LLM with much lower GPU memory requirements while maintaining themodel accuracy. Especially, RetrievalAttention only needs 16GB GPU memory forserving 128K tokens in LLMs with 8B parameters, which is capable of generatingone token in 0.188 seconds on a single NVIDIA RTX4090 (24GB).