Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

  • 2025-03-12 18:43:40
  • Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov

Abstract

Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms/
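
The block-by-block generation described in the abstract, with tokens inside each block sampled in parallel, can be illustrated with a toy sketch. This is not the authors' implementation: the names `toy_denoiser` and `generate`, the use of masking as the noise process, and the unmasking schedule are all assumptions made for illustration, and a random choice stands in for a trained denoising model.

```python
import random

MASK = None  # placeholder for a token that has not been denoised yet


def toy_denoiser(context, block, vocab, rng):
    """Hypothetical stand-in for a trained model.

    Proposes a token for every masked slot in `block`. A real model would
    condition on `context` (the previously generated blocks); this toy
    version ignores it and samples uniformly from `vocab`.
    """
    return [rng.choice(vocab) if tok is MASK else tok for tok in block]


def generate(num_blocks, block_size, steps_per_block, vocab, seed=0):
    """Generate blocks left to right, denoising each block in a few steps."""
    rng = random.Random(seed)
    sequence = []  # finished blocks; a real model could cache their keys/values
    for _ in range(num_blocks):  # autoregressive over blocks
        block = [MASK] * block_size  # each new block starts fully masked
        for step in range(steps_per_block):
            proposal = toy_denoiser(sequence, block, vocab, rng)
            masked = [i for i, tok in enumerate(block) if tok is MASK]
            # Reveal several positions per step (parallel sampling); the
            # ceiling division guarantees nothing is left masked at the end.
            remaining_steps = steps_per_block - step
            k = -(-len(masked) // remaining_steps)
            for i in rng.sample(masked, k):
                block[i] = proposal[i]
        sequence.extend(block)
    return sequence


if __name__ == "__main__":
    print(generate(num_blocks=3, block_size=4, steps_per_block=2,
                   vocab=["a", "b", "c"]))
```

In this sketch the output length is `num_blocks * block_size`, so it is not fixed in advance by the model. With `block_size=1` the loop reduces to token-by-token autoregressive sampling, and with a single block covering the whole sequence it reduces to plain diffusion-style denoising, which is the interpolation the title refers to.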

 
