Abstract
The growing size of the Key-Value (KV) cache during long-context inference is the main obstacle to balancing deployment cost and task accuracy in Large Language Models. To reduce the KV cache size in such scenarios, most previous efforts rely on attention weights to evict non-critical cache tokens. These methods come with a trade-off, however: they usually require major modifications to the inference infrastructure and incur significant computation overhead. Based on the fact that Large Language Models are autoregressive, we propose LagKV, a KV compression strategy that relies only on straightforward comparisons among the keys and values themselves. It is a completely attention-free method that integrates easily into mainstream inference platforms and delivers performance comparable to that of more complicated KV compression methods. Results on the RULER benchmark show that our approach outperforms SnapKV and StreamingLLM at different compression ratios. In particular, on the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method $H_2O$ by over $50\%$ at the same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.