TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization

Abstract

The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations that potentially incur substantial quantization error. This observation leads to a key insight: loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading. TailorKV develops an inference framework along with a hardware-friendly implementation that leverages these complementary characteristics. Extensive long-context evaluations show that TailorKV achieves nearly lossless performance under aggressive compression settings, outperforming the state-of-the-art. In particular, Llama-3.1-8B with a 128k context can be served on a single RTX 3090 GPU, reaching 82 ms per token during decoding.
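To make the memory overhead concrete, the following back-of-the-envelope sketch estimates the KV cache size for the 128k-context case mentioned above. The model shape (32 layers, 8 KV heads, head dimension 128) is taken from the publicly released Llama-3.1-8B configuration rather than from this abstract, "128k" is assumed to mean 128 × 1024 tokens, and the 2-bit setting is a hypothetical illustration that ignores quantization metadata such as scales and zero points. It is not the paper's accounting.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2.0):
    """Approximate KV cache size in bytes for one sequence.

    The leading factor 2 counts both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3

# Assumed model shape (public Llama-3.1-8B config, not stated in the abstract).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)
seq_len = 128 * 1024  # assumption: "128k" means 131072 tokens

fp16_bytes = kv_cache_bytes(seq_len=seq_len, bytes_per_elem=2.0, **cfg)
# Hypothetical 2-bit quantization: 0.25 bytes per element, metadata ignored.
int2_bytes = kv_cache_bytes(seq_len=seq_len, bytes_per_elem=0.25, **cfg)

print(f"FP16 KV cache:  {fp16_bytes / GIB:.1f} GiB")
print(f"2-bit KV cache: {int2_bytes / GIB:.1f} GiB")
```

Under these assumptions the FP16 cache alone comes to 16 GiB, on top of roughly 16 GB of FP16 weights for an 8B-parameter model, which together exceed the 24 GB of memory on an RTX 3090. This is why some combination of offloading and compression is needed in this setting.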
