Abstract
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400$\times$, end-to-end speedup of up to 3.7$\times$ as well as peak memory reduction of up to 32.6% in the decode phase on an NVIDIA A100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks. We also propose a variant of RocketKV for multi-turn scenarios, which consistently outperforms other existing methods and achieves accuracy nearly on par with an oracle top-k attention scheme. The source code is available here: https://github.com/NVlabs/RocketKV.
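
To make the two-stage structure concrete, the following is a minimal single-query, single-head sketch in plain Python. It is an illustration under stated assumptions, not the RocketKV implementation: the function names, the precomputed `importance` scores used for eviction, and the toy data are all hypothetical, and the second stage ranks keys by exact attention scores rather than by the approximation via head and sequence dimensionality reductions described above.

```python
import math

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, in ascending index order."""
    k = min(k, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def coarse_evict(keys, values, importance, budget):
    """Stage 1 (sketch): permanently keep only `budget` tokens.

    `importance` is a hypothetical per-token score supplied by the caller;
    how such scores are obtained is not specified here.
    """
    keep = top_k_indices(importance, budget)
    return [keys[i] for i in keep], [values[i] for i in keep], keep

def sparse_attention(query, keys, values, k):
    """Stage 2 (sketch): softmax attention restricted to the top-k keys.

    Assumption: keys are ranked by exact scaled dot-product scores,
    not by approximated scores.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]
    chosen = top_k_indices(scores, k)
    peak = max(scores[i] for i in chosen)  # subtract max for numerical stability
    weights = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, chosen):
        for d in range(len(out)):
            out[d] += (w / total) * values[i][d]
    return out, chosen

# Toy example: 4 cached tokens, keep 3 after eviction, attend to 2 of them.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
importance = [0.9, 0.1, 0.8, 0.5]

kept_keys, kept_values, kept = coarse_evict(keys, values, importance, budget=3)
output, chosen = sparse_attention([1.0, 0.0], kept_keys, kept_values, k=2)
```

In this sketch the eviction in the first stage is irreversible (evicted tokens are never reconsidered), while the selection in the second stage is recomputed for every query, which mirrors the coarse-grain versus fine-grain distinction drawn above. Note that `chosen` indexes into the retained tokens; `kept` maps those positions back to the original sequence.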