Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

  • 2025-08-04 16:14:03
  • Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu

Abstract

Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on LLaDA and Dream series demonstrate Sparse-dLLM achieves up to 10$\times$ higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.
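
As a rough illustration of attention-guided cache eviction in general, the sketch below scores each cached token by the attention it receives from the queries of the current block and keeps only the highest-scoring positions. The abstract does not specify the scoring rule, the retention budget, or the delayed update schedule, so the max-over-queries aggregation, the `keep_ratio` parameter, and the function names `select_kept_positions` and `evict` are all assumptions made for this example, not the authors' method.

```python
from typing import List, Sequence


def select_kept_positions(attn: Sequence[Sequence[float]], keep_ratio: float) -> List[int]:
    """Pick which cached positions to retain.

    attn[q][k] is the attention weight from query q (current block)
    to cached token k. The aggregation rule here is an assumption.
    """
    if not attn:
        return []
    num_cached = len(attn[0])
    # Saliency of a cached token: the largest weight any current query assigns to it.
    saliency = [max(row[k] for row in attn) for k in range(num_cached)]
    # Retention budget (hypothetical parameter); keep at least one entry.
    num_keep = max(1, round(keep_ratio * num_cached))
    # Rank by descending saliency, breaking ties by lower position index.
    ranked = sorted(range(num_cached), key=lambda k: (-saliency[k], k))
    # Return kept positions in their original order.
    return sorted(ranked[:num_keep])


def evict(cache: Sequence, kept: Sequence[int]) -> list:
    """Drop every cache entry whose position is not in `kept`."""
    return [cache[k] for k in kept]


if __name__ == "__main__":
    attn = [
        [0.10, 0.60, 0.05, 0.25],
        [0.30, 0.10, 0.10, 0.50],
    ]
    kept = select_kept_positions(attn, keep_ratio=0.5)
    print(kept, evict(["kv0", "kv1", "kv2", "kv3"], kept))
```

Because the abstract reports that token saliency is stable across decoding steps, a selection of this kind could plausibly be reused for several steps rather than recomputed at each one; how often it is refreshed is left open here.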

 
