Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Abstract

Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been an active research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that applying top-$p$ sampling (nucleus sampling) to sparse attention can, surprisingly, achieve adaptive budgeting. Based on this, we propose Twilight, a framework that brings adaptive sparsity to any existing sparse attention algorithm without sacrificing its accuracy. Empirical results show that Twilight can adaptively prune up to 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per-token latency in long-context LLM decoding.
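To illustrate why a top-$p$ rule yields an adaptive budget, the following is a minimal sketch in plain Python. It is not the Twilight implementation and does not include the hierarchical pruning named in the title; the helper names `softmax` and `top_p_select`, the example scores, and the threshold of 0.9 are assumptions chosen for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_select(weights, p):
    """Return sorted indices of the smallest set of tokens whose
    attention weights sum to at least p (hypothetical helper)."""
    # Visit tokens from the largest weight to the smallest.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += weights[i]
        if mass >= p:
            break
    return sorted(kept)

# A peaked attention distribution: one token dominates.
peaked = softmax([8.0, 1.0, 0.5, 0.0, -1.0, -2.0])
# A flat attention distribution: weight is spread across all tokens.
flat = softmax([1.0, 0.9, 0.8, 0.7, 0.6, 0.5])

peaked_kept = top_p_select(peaked, 0.9)
flat_kept = top_p_select(flat, 0.9)
print(len(peaked_kept), len(flat_kept))
```

With the same threshold, the peaked distribution keeps a single token while the flat one keeps all six, so the number of retained tokens follows the shape of the attention weights rather than a fixed budget. A fixed top-$k$ rule would instead keep the same count in both cases.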
