Inference-Time Hyper-Scaling with KV Cache Compression

Abstract

Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8$\times$ compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.
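
As a rough illustration of why compressing the KV cache frees up token budget, the sketch below computes the per-token cache footprint and the number of tokens that fit in a fixed memory budget. It is a back-of-the-envelope estimate, not the paper's method: the function names and the model dimensions (64 layers, 8 KV heads, head dimension 128, 2-byte values, 8 GiB budget) are assumptions chosen for illustration, and the compression ratio is treated as a uniform multiplier on the number of tokens whose retained cache entries fit in memory.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache stored per token (keys and values, all layers)."""
    # Factor 2 accounts for storing both a key and a value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_generated_tokens(budget_bytes: int, bytes_per_token: int,
                         compression_ratio: int = 1) -> int:
    """Approximate number of tokens whose retained cache fits in the budget.

    Assumes (hypothetically) that a compression ratio of r keeps 1/r of
    the cache entries on average, so r times more tokens fit.
    """
    uncompressed_slots = budget_bytes // bytes_per_token
    return uncompressed_slots * compression_ratio


# Hypothetical configuration, for illustration only.
per_token = kv_bytes_per_token(num_layers=64, num_kv_heads=8,
                               head_dim=128, bytes_per_value=2)
budget = 8 * 2**30  # 8 GiB reserved for the KV cache (assumed)

print(per_token)                                   # 262144 bytes (256 KiB)
print(max_generated_tokens(budget, per_token))     # 32768 tokens
print(max_generated_tokens(budget, per_token, 8))  # 262144 tokens at 8x
```

Under these assumptions, an 8$\times$ compression ratio lets the same memory budget hold the cache for eight times as many generated tokens, which is the headroom that hyper-scaling spends on longer or more parallel sequences.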
