SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

  • 2025-09-02 11:29:34
  • Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu

Abstract

Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis, using strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained variations in importance across feature dimensions (i.e., the channel axis), which limits their ability to balance efficiency and model accuracy. In practice, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SparK, a training-free, plug-and-play method that applies unstructured sparsity by pruning the KV cache at the channel level and dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, so it can be combined with them for further acceleration. By reducing channel-level redundancy, SparK enables processing of longer sequences within the same memory budget. For sequences of equal length, SparK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even at an aggressive pruning ratio of 80%, SparK keeps performance degradation below 5% relative to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
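
To make the idea of query-aware channel pruning with recovery concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify the saliency metric or the recovery rule, so both are assumptions here: saliency is taken to be |q_c * k_c| for one representative query, and each pruned entry is restored as the mean of that token's pruned values. The function names `prune_key_channels` and `attention_scores` are hypothetical, and only the key side of the cache is shown.

```python
import math


def prune_key_channels(keys, query, prune_ratio):
    """Per token, keep the channels with the largest |q_c * k_c| (assumed metric).

    Returns one (kept, fill) pair per token: `kept` maps channel index to the
    stored value, and `fill` is the mean of the pruned values, used later as an
    assumed stand-in when the pruned entries are restored.
    """
    d = len(query)
    n_keep = max(1, d - int(prune_ratio * d))
    pruned = []
    for k in keys:
        saliency = [abs(q * v) for q, v in zip(query, k)]
        # Channels sorted from most to least salient for this token.
        order = sorted(range(d), key=lambda c: saliency[c], reverse=True)
        kept_idx, dropped_idx = order[:n_keep], order[n_keep:]
        kept = {c: k[c] for c in kept_idx}
        fill = sum(k[c] for c in dropped_idx) / len(dropped_idx) if dropped_idx else 0.0
        pruned.append((kept, fill))
    return pruned


def attention_scores(pruned, query):
    """Softmax attention weights, restoring pruned entries on the fly."""
    d = len(query)
    scale = 1.0 / math.sqrt(d)
    logits = []
    for kept, fill in pruned:
        # Use the stored value where a channel was kept, else the fill value.
        dot = sum(query[c] * kept.get(c, fill) for c in range(d))
        logits.append(dot * scale)
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


if __name__ == "__main__":
    keys = [[1.0, 0.0, 2.0, -1.0], [0.5, 0.5, 0.5, 0.5]]
    query = [1.0, 0.0, 1.0, 0.0]
    sparse = prune_key_channels(keys, query, prune_ratio=0.5)
    print(attention_scores(sparse, query))
```

Because the kept channels differ from token to token, the resulting sparsity pattern is unstructured. A real implementation would also need to store the per-token channel indices compactly and would have to estimate saliency without access to future queries; neither concern is modeled in this sketch.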

 
