Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

  • 2025-11-04 12:04:06
  • Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao

Abstract

Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per-token latency in long-context LLM decoding.
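
To make the idea of adaptive budgeting concrete, the following is a minimal sketch of plain top-$p$ selection applied to the attention weights of a single query. It is an illustration under stated assumptions, not Twilight's hierarchical implementation: the function names `softmax` and `top_p_token_indices`, the threshold `p=0.9`, and the toy score vectors are all hypothetical. The point it shows is that the number of kept tokens follows the shape of the attention distribution rather than a fixed budget.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_token_indices(scores, p=0.9):
    """Return the smallest set of token indices whose attention weights
    sum to at least p (hypothetical helper, for illustration only)."""
    weights = softmax(scores)
    # Visit tokens from the largest attention weight to the smallest.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += weights[i]
        if cumulative >= p:
            break
    return sorted(kept)


# Toy inputs (assumed): one query attending over 8 cached tokens.
peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # one dominant token
flat = [0.0] * 8                                   # uniform attention

kept_peaked = top_p_token_indices(peaked, p=0.9)
kept_flat = top_p_token_indices(flat, p=0.9)
print(len(kept_peaked), len(kept_flat))
```

With the same threshold, the peaked distribution keeps a single token while the uniform one keeps all eight, which is the contrast with a fixed top-$k$ budget that the abstract describes.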

 
