Diffusion Beats Autoregressive in Data-Constrained Settings

Abstract

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR's fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
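
The implicit data augmentation reading can be illustrated with a toy sketch. It is not the paper's training code: the helper names (`ar_examples`, `masked_diffusion_example`), the `[MASK]` placeholder, and the uniform masking-ratio schedule are assumptions chosen for illustration. It shows only that repeated passes over one sequence give an AR model the same prediction tasks each time, while a masked objective draws a new corruption pattern on every pass.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, for illustration only


def ar_examples(tokens):
    """Left-to-right factorization: a fixed list of (prefix, next-token) pairs."""
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


def masked_diffusion_example(tokens, rng):
    """Draw a masking ratio, mask tokens independently, and return the
    corrupted sequence plus the (position, token) pairs to be predicted.
    The uniform ratio is an assumed schedule, not taken from the paper."""
    ratio = rng.random()
    masked = [rng.random() < ratio for _ in tokens]
    corrupted = tuple(MASK if m else tok for tok, m in zip(tokens, masked))
    targets = tuple(
        (i, tok) for i, (tok, m) in enumerate(zip(tokens, masked)) if m
    )
    return corrupted, targets


tokens = ["the", "cat", "sat", "on", "the", "mat"]
rng = random.Random(0)

# 100 "epochs" over the same sequence: AR always sees identical tasks.
ar_epochs = {tuple(ar_examples(tokens)) for _ in range(100)}

# Masked diffusion sees a freshly sampled corruption pattern each epoch.
md_epochs = {masked_diffusion_example(tokens, rng) for _ in range(100)}

print("distinct AR task sets:", len(ar_epochs))
print("distinct masked task sets:", len(md_epochs))
```

In this sketch the AR set collapses to a single entry, whereas the masked set contains many distinct (corrupted sequence, targets) pairs, which is one way to picture why repeated data may stay informative for longer under a masked objective.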
