KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Abstract

LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in ultra-low precisions, such as sub-4-bit. In this work, we present KVQuant, which addresses this problem by incorporating novel methods for quantizing cached KV activations, including: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges; and (v) Q-Norm, where we normalize quantization centroids in order to mitigate distribution shift, providing additional benefits for 2-bit quantization. By applying our method to the LLaMA, LLaMA-2, and Mistral models, we achieve $<0.1$ perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving the LLaMA-7B model with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system.
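
To make the idea behind Per-Channel Key Quantization (item (i) above) concrete, the following is a minimal sketch rather than the KVQuant implementation. It assumes a toy Key matrix (rows are tokens, columns are channels) in which one channel carries consistently larger magnitudes than the others; that matrix, the helper names `quantize_uniform`, `quantize_matrix`, and `mean_squared_error`, and the use of plain uniform quantization in place of the non-uniform datatypes described in the abstract are all illustrative assumptions.

```python
def quantize_uniform(values, bits):
    """Round-trip a list of floats through asymmetric uniform quantization.

    One (min, max) range is shared by every value in `values`.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant vector: nothing to quantize
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_matrix(matrix, bits, axis):
    """Quantize a tokens-by-channels matrix.

    axis="token":   one quantization range per row (per token).
    axis="channel": one quantization range per column (per channel).
    """
    if axis == "token":
        return [quantize_uniform(row, bits) for row in matrix]
    if axis == "channel":
        columns = [quantize_uniform(list(col), bits) for col in zip(*matrix)]
        return [list(row) for row in zip(*columns)]
    raise ValueError("axis must be 'token' or 'channel'")


def mean_squared_error(a, b):
    """Mean squared difference between two equally shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Hypothetical Key activations: 4 tokens x 4 channels.
# Channel 2 is assumed to hold large values for every token.
keys = [
    [0.10, -0.20, 8.0, 0.05],
    [0.30, 0.10, 7.5, -0.15],
    [-0.25, 0.20, 9.0, 0.10],
    [0.05, -0.10, 8.5, 0.20],
]

per_token_mse = mean_squared_error(keys, quantize_matrix(keys, 3, "token"))
per_channel_mse = mean_squared_error(keys, quantize_matrix(keys, 3, "channel"))

print("3-bit per-token MSE:  ", per_token_mse)
print("3-bit per-channel MSE:", per_channel_mse)
```

On this toy input, per-token quantization stretches every row's range to cover the large channel, so the small channels collapse onto a single level. Per-channel quantization gives each column its own range and reconstructs the matrix with lower error. Whether real Key activations behave this way depends on the model, so the example only illustrates why the choice of quantization dimension can matter.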
