PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference

Abstract

KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured KV cache pruning strategy that enhances the memory efficiency of vLLM's PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long context tasks.
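
To make the idea of block-wise eviction concrete, the following is a minimal sketch of evicting whole pages from a paged KV cache once a block budget is exceeded. It is not the paper's algorithm: the abstract does not specify how blocks are ranked, so the scoring function here (mean L2 norm of a block's value vectors) is a hypothetical stand-in chosen only because it needs no attention scores. The names `BLOCK_SIZE`, `block_score`, and `append_token` are likewise illustrative assumptions.

```python
import math

BLOCK_SIZE = 4  # tokens per page; an assumed value for illustration


def block_score(block):
    """Hypothetical importance proxy (an assumption, not the paper's metric):
    mean L2 norm of the value vectors stored in the block."""
    return sum(math.sqrt(sum(x * x for x in v)) for _, v in block) / len(block)


def append_token(blocks, key, value, max_blocks):
    """Append one token's (key, value) pair to the paged cache.

    `blocks` is a list of pages, each a list of (key, value) pairs.
    If the number of pages exceeds `max_blocks`, one whole page is evicted,
    so no page is ever left partially pruned.
    """
    # Open a new page when the cache is empty or the last page is full.
    if not blocks or len(blocks[-1]) == BLOCK_SIZE:
        blocks.append([])
    blocks[-1].append((key, value))

    if len(blocks) > max_blocks:
        # The newest page is excluded from the candidates; among the older
        # pages, the one with the lowest score is dropped.
        victim = min(range(len(blocks) - 1),
                     key=lambda i: block_score(blocks[i]))
        del blocks[victim]
    return blocks
```

Because eviction removes an entire page at a time, the remaining pages keep their original layout, which is the property that would let such a scheme sit on top of a paged attention kernel without changing it.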
