Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

Abstract

Many computational factors limit broader deployment of large language models. In this paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a computational shortcut that requires storing previous KV pairs during decoding. While existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs to dramatically reduce the memory footprint of the cache, they can have limited success in tasks that require recollecting a majority of previous tokens. To alleviate this issue, we propose LESS, a simple integration of a (nearly free) constant-sized cache with eviction-based cache methods, such that all tokens can be queried at later decoding steps. Its ability to retain information throughout time shows merit on a variety of tasks where we demonstrate LESS can help reduce the performance gap from caching everything, sometimes even matching it, all while being efficient.
