xKV: Cross-Layer SVD for KV-Cache Compression

Abstract

Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption to store the Key and Value states (KV-Cache). Recent studies attempted to merge the KV-Cache from multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on assumptions of high per-token cosine similarity across layers, which generally does not hold in practice. We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple post-training method that applies Singular Value Decomposition (SVD) on the KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through extensive evaluations on the RULER long-context benchmark with widely-used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates than the state-of-the-art inter-layer technique while improving accuracy by 2.7%. Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA) (e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rate on coding tasks without performance degradation. These results highlight xKV's strong capability and versatility in addressing memory bottlenecks for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.
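
The idea of factoring a group of layers into one shared low-rank subspace can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository above for that). It assumes the per-layer caches are joined along the feature axis before a truncated SVD, and the names `cross_layer_svd` and `reconstruct`, the synthetic data, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def cross_layer_svd(kv_layers, rank):
    """Factor a group of per-layer caches into a shared basis plus per-layer maps.

    kv_layers: list of arrays, each of shape (tokens, dim).
    Returns (shared, per_layer): shared has shape (tokens, rank) and
    per_layer is a list of (rank, dim) matrices, one per layer.
    """
    # Assumption: layers are joined along the feature axis so they share token rows.
    stacked = np.concatenate(kv_layers, axis=1)  # (tokens, n_layers * dim)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    shared = u[:, :rank] * s[:rank]  # fold singular values into the shared basis
    per_layer = np.split(vt[:rank], len(kv_layers), axis=1)
    return shared, per_layer


def reconstruct(shared, per_layer, layer_idx):
    """Approximate one layer's cache from the shared basis."""
    return shared @ per_layer[layer_idx]


# Synthetic stand-in for a KV-Cache: four layers built from one common latent
# factor plus small noise, so a shared subspace exists by construction.
rng = np.random.default_rng(0)
tokens, dim, n_layers, true_rank = 256, 64, 4, 8
latent = rng.standard_normal((tokens, true_rank))
layers = [
    latent @ rng.standard_normal((true_rank, dim))
    + 0.01 * rng.standard_normal((tokens, dim))
    for _ in range(n_layers)
]

shared, per_layer = cross_layer_svd(layers, rank=true_rank)

# Worst-case relative reconstruction error over the grouped layers.
rel_err = max(
    np.linalg.norm(reconstruct(shared, per_layer, i) - x) / np.linalg.norm(x)
    for i, x in enumerate(layers)
)

# Stored elements before and after factoring the group.
original_size = n_layers * tokens * dim
compressed_size = shared.size + sum(m.size for m in per_layer)
ratio = original_size / compressed_size
print(f"relative error: {rel_err:.4f}, compression ratio: {ratio:.1f}x")
```

Because the synthetic layers are constructed to share a subspace, the error and ratio printed here say nothing about real models; they only show how one shared basis plus small per-layer matrices can stand in for several full caches.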
