Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

Abstract

Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on the LLaDA and Dream series demonstrate that Sparse-dLLM achieves up to 10$\times$ higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.
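
The attention-guided eviction idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the scoring function, the retention ratio, or the delayed update schedule. The sketch assumes that saliency is the mean attention a cached token receives from the current block's queries, and that a fixed fraction of tokens is retained. The names `token_saliency`, `select_kept_tokens`, `evict`, and `keep_ratio` are hypothetical.

```python
from typing import List, Sequence


def token_saliency(attn: Sequence[Sequence[float]]) -> List[float]:
    """Mean attention each cached token receives from the current queries.

    attn[q][k] is the attention weight from query q to cached token k.
    Assumption: mean-over-queries is used as the saliency score.
    """
    num_queries = len(attn)
    num_keys = len(attn[0])
    return [sum(row[k] for row in attn) / num_queries for k in range(num_keys)]


def select_kept_tokens(saliency: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the most salient cached tokens, in original order."""
    # Assumption: a fixed retention fraction; always keep at least one token.
    num_keep = max(1, round(len(saliency) * keep_ratio))
    ranked = sorted(range(len(saliency)), key=lambda k: saliency[k], reverse=True)
    return sorted(ranked[:num_keep])


def evict(cache: Sequence, kept: Sequence[int]) -> list:
    """Drop every cache entry whose index is not in `kept`."""
    return [cache[k] for k in kept]


# Two queries attending to four cached tokens.
attn = [
    [0.5, 0.1, 0.3, 0.1],
    [0.4, 0.1, 0.4, 0.1],
]
kept = select_kept_tokens(token_saliency(attn), keep_ratio=0.5)
pruned_cache = evict(["kv0", "kv1", "kv2", "kv3"], kept)
```

Because the abstract reports that token saliency is stable across decoding steps, a selection like this would plausibly be computed once and reused for several steps rather than recomputed at every step; the exact schedule is not given in the abstract.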
