Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Abstract

Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms/
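
As a rough illustration of the interpolation described above, one can write the model as autoregressive over blocks, with each block's conditional handled by discrete denoising diffusion. The notation here is an assumption for illustration and does not appear in the abstract: the sequence x is split into B blocks x^1, ..., x^B, and x^{<b} denotes all blocks preceding block b.

```latex
% Assumed notation: x = (x^1, ..., x^B), B blocks; x^{<b} = blocks before b.
% Each conditional p_\theta(x^b | x^{<b}) is modeled by a discrete diffusion
% process over the tokens of block b.
\log p_\theta(x) \;=\; \sum_{b=1}^{B} \log p_\theta\!\left(x^{b} \mid x^{<b}\right)
```

Under this reading, blocks of a single token reduce to an ordinary autoregressive model, while a single block spanning the whole sequence reduces to a standard diffusion model. Conditioning each block only on earlier blocks is what would let keys and values for completed blocks be cached, and tokens within a block be sampled in parallel.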
