Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Abstract

Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
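The confidence-aware decoding idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `confidence_aware_step`, the `MASK` sentinel, the toy probability tables, and the 0.9 threshold are all hypothetical, and the fallback that unmasks the single most confident position when nothing clears the threshold is an assumption added here so that every step makes progress. In a real system the distributions would come from one forward pass of the diffusion model.

```python
from typing import Dict, List, Sequence

MASK = -1  # hypothetical sentinel id for a position that is still masked


def confidence_aware_step(
    tokens: List[int],
    probs: Dict[int, Sequence[float]],
    threshold: float = 0.9,
) -> List[int]:
    """Return a copy of `tokens` with the confident masked positions filled in.

    `probs` maps a masked position to the model's distribution over the
    vocabulary at that position (toy data in this sketch).
    """
    # For each masked position, find the most likely token and its
    # probability, which serves as the confidence score.
    best = {}
    for pos, dist in probs.items():
        if tokens[pos] != MASK:
            continue
        token_id = max(range(len(dist)), key=dist.__getitem__)
        best[pos] = (token_id, dist[token_id])

    if not best:
        return list(tokens)

    # Decode in parallel every position whose confidence clears the threshold.
    chosen = [pos for pos, (_, conf) in best.items() if conf >= threshold]

    # Assumption: if no position is confident enough, decode only the single
    # most confident one so that the loop always terminates.
    if not chosen:
        chosen = [max(best, key=lambda p: best[p][1])]

    out = list(tokens)
    for pos in chosen:
        out[pos] = best[pos][0]
    return out


# Toy usage: position 1 is confident (0.95), position 2 is not (0.40).
tokens = [5, MASK, MASK, 7]
step1 = confidence_aware_step(tokens, {1: [0.02, 0.95, 0.03], 2: [0.40, 0.35, 0.25]})
print(step1)  # [5, 1, -1, 7]

# Next step: nothing clears the threshold, so the fallback decodes position 2.
step2 = confidence_aware_step(step1, {2: [0.40, 0.35, 0.25]})
print(step2)  # [5, 1, 0, 7]
```

Raising the threshold decodes fewer tokens per step, which keeps each step closer to one-token-at-a-time decoding; lowering it decodes more tokens per step and relies more heavily on the conditional independence assumption.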
