RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Abstract

Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely high inference latency and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach that both accelerates attention computation and reduces GPU memory consumption. By leveraging the dynamic sparsity of the attention mechanism, RetrievalAttention builds approximate nearest neighbor search (ANNS) indexes over KV vectors in CPU memory and retrieves the most relevant ones with vector search during generation. Unfortunately, we observe that off-the-shelf ANNS indexes are often ineffective for such retrieval tasks because query vectors are out-of-distribution (OOD) with respect to key vectors in the attention mechanism. RetrievalAttention addresses the OOD challenge by designing an attention-aware vector search algorithm that can adapt to the distribution of query vectors. Our evaluation shows that RetrievalAttention only needs to access 1–3% of the data while maintaining high model accuracy. This leads to a significant reduction in the inference cost of long-context LLMs, with a much lower GPU memory footprint. In particular, RetrievalAttention only needs a single NVIDIA RTX 4090 (24GB) to serve 128K tokens in LLMs with 8B parameters, generating one token in 0.188 seconds.
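The core idea, attending only to a small retrieved subset of keys, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: exact top-k selection by inner product stands in for the ANNS index, the attention-aware search that handles the OOD issue is not modeled, and the function names (`full_attention`, `retrieval_attention`) and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


def full_attention(q, K, V):
    """Standard scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V


def retrieval_attention(q, K, V, k):
    """Attend only to the k keys with the largest inner product with q.

    Exact top-k selection is used here as a stand-in (an assumption) for
    the approximate nearest neighbor lookup described in the abstract.
    Returns the attention output and the indices of the selected keys.
    """
    scores = K @ q / np.sqrt(q.shape[0])
    top = np.argpartition(scores, -k)[-k:]  # indices of the k largest scores
    return softmax(scores[top]) @ V[top], top


# Synthetic example: one key dominates, so the attention weights are sparse.
rng = np.random.default_rng(0)
n, d = 1000, 64
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
q = 8.0 * K[123]  # query strongly aligned with key 123

exact = full_attention(q, K, V)
approx, selected = retrieval_attention(q, K, V, k=20)  # 2% of the keys
print("max abs error:", np.max(np.abs(exact - approx)))
```

When the attention weights are concentrated on a few keys, as in this synthetic setup, the output computed from the retrieved subset stays close to full attention. In a real system the exact top-k scan would be replaced by an index lookup, since scanning every key defeats the purpose.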
