DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs

Abstract

Efficient KV cache management in LLMs is crucial for long-context tasks like RAG and summarization. Existing KV cache compression methods enforce a fixed pattern, neglecting task-specific characteristics and reducing the retention of essential information. However, we observe distinct activation patterns across layers in various tasks, highlighting the need for adaptive strategies tailored to each task's unique demands. Based on this insight, we propose DynamicKV, a method that dynamically optimizes token retention by adjusting the number of tokens retained at each layer to adapt to the specific task. DynamicKV establishes global and per-layer maximum KV cache budgets, temporarily retaining the maximum budget for the current layer, and periodically updating the KV cache sizes of all preceding layers during inference. Our method retains only 1.7% of the KV cache size while achieving ~85% of the Full KV cache performance on LongBench. Notably, even under extreme compression (0.9%), DynamicKV surpasses state-of-the-art (SOTA) methods by 11% in the Needle-in-a-Haystack test using Mistral-7B-Instruct-v0.2. The code will be released.
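
To make the idea of combining a global budget with per-layer maximum budgets concrete, the following is a minimal illustrative sketch, not the released DynamicKV implementation. It assumes each layer already has one importance score per token (for example, derived from attention weights), and it splits a global token budget across layers by pooling each layer's top candidates and keeping the highest-scoring ones overall. The function name `allocate_layer_budgets`, its parameters, and the toy scores are all hypothetical, and the periodic update of preceding layers during inference is not modeled.

```python
from typing import List, Tuple


def allocate_layer_budgets(
    layer_scores: List[List[float]],
    global_budget: int,
    per_layer_max: int,
) -> Tuple[List[int], List[List[int]]]:
    """Split a global token budget across layers by pooled top-k selection.

    layer_scores[l][t] is a hypothetical importance score for token t at
    layer l. Returns the per-layer budgets and the kept token indices.
    """
    candidates = []  # entries are (score, layer, token_index)
    for layer, scores in enumerate(layer_scores):
        # Per-layer cap: only the top `per_layer_max` tokens are candidates.
        ranked = sorted(range(len(scores)), key=lambda t: (-scores[t], t))
        for t in ranked[:per_layer_max]:
            candidates.append((scores[t], layer, t))

    # Global cap: keep the highest-scoring candidates across all layers.
    # Ties are broken by layer index, then token index, for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    kept: List[List[int]] = [[] for _ in layer_scores]
    for _, layer, t in candidates[:global_budget]:
        kept[layer].append(t)

    kept = [sorted(tokens) for tokens in kept]
    budgets = [len(tokens) for tokens in kept]
    return budgets, kept


# Toy example: 3 layers, 4 tokens each (scores are made up).
scores = [
    [0.9, 0.1, 0.8, 0.2],   # layer 0
    [0.3, 0.2, 0.1, 0.05],  # layer 1
    [0.7, 0.6, 0.5, 0.4],   # layer 2
]
budgets, kept = allocate_layer_budgets(scores, global_budget=5, per_layer_max=3)
print(budgets)  # [2, 0, 3]
print(kept)     # [[0, 2], [], [0, 1, 2]]
```

In this toy run, the layer whose scores are uniformly low receives no budget, while the others share the five retained tokens. This is the kind of non-uniform, layer-dependent allocation that a fixed compression pattern cannot express.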
