Abstract
Diffusion language models have emerged as a promising approach for textgeneration. One would naturally expect this method to be an efficientreplacement for autoregressive models since multiple tokens can be sampled inparallel during each diffusion step. However, its efficiency-accuracy trade-offis not yet well understood. In this paper, we present a rigorous theoreticalanalysis of a widely used type of diffusion language model, the MaskedDiffusion Model (MDM), and find that its effectiveness heavily depends on thetarget evaluation metric. Under mild conditions, we prove that when usingperplexity as the metric, MDMs can achieve near-optimal perplexity in samplingsteps regardless of sequence length, demonstrating that efficiency can beachieved without sacrificing performance. However, when using the sequenceerror rate--which is important for understanding the "correctness" of asequence, such as a reasoning chain--we show that the required sampling stepsmust scale linearly with sequence length to obtain "correct" sequences, therebyeliminating MDM's efficiency advantage over autoregressive models. Our analysisestablishes the first theoretical foundation for understanding the benefits andlimitations of MDMs. All theoretical findings are supported by empiricalstudies.