Abstract
Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle to balance accuracy and efficiency because measuring block importance is itself costly. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks, including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation, XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block-sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code is available at https://github.com/mit-han-lab/x-attention.
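
As a rough illustration of the antidiagonal scoring idea, the following Python sketch scores each block of a square attention matrix by the sum of its main antidiagonal and then keeps, per row of blocks, the highest-scoring blocks. It is a simplified sketch and not the released implementation: the function names `antidiagonal_block_scores` and `select_blocks`, the use of a single antidiagonal per block, and the `keep_fraction` selection rule are all assumptions made for illustration.

```python
def antidiagonal_block_scores(attn, block):
    """Score each (block x block) tile of a square matrix by the sum of its
    main antidiagonal, read from lower-left to upper-right.

    `attn` is a list of equal-length rows (hypothetical input format).
    """
    n = len(attn)
    assert n % block == 0, "sketch assumes the size is a multiple of the block size"
    nb = n // block
    scores = [[0.0] * nb for _ in range(nb)]
    for bi in range(nb):
        for bj in range(nb):
            r0, c0 = bi * block, bj * block
            # Start at the tile's lower-left corner and step up and to the right.
            scores[bi][bj] = sum(
                attn[r0 + block - 1 - k][c0 + k] for k in range(block)
            )
    return scores


def select_blocks(scores, keep_fraction):
    """Illustrative selection rule (an assumption, not taken from the text):
    for each row of blocks, keep the highest-scoring blocks until they cover
    at least `keep_fraction` of that row's total score.

    Rows whose total score is zero keep all of their blocks.
    """
    mask = []
    for row in scores:
        total = sum(row)
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        kept = [False] * len(row)
        covered = 0.0
        for j in order:
            if total > 0 and covered >= keep_fraction * total:
                break
            kept[j] = True
            covered += row[j]
        mask.append(kept)
    return mask
```

Calling `antidiagonal_block_scores(attn, block)` on an attention matrix and passing the result to `select_blocks(scores, keep_fraction)` yields a boolean block mask; blocks marked `False` would be skipped by a block-sparse attention kernel. Only `block` entries per tile are read for scoring, rather than all `block * block`, which is why this kind of proxy is cheap to compute.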