Abstract
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on the LLaDA and Dream series demonstrate that Sparse-dLLM achieves up to 10$\times$ higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.