Abstract
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. Finally, we explain why diffusion models excel in this regime: their randomized masking objective implicitly trains over a rich distribution of token orderings, acting as an implicit data augmentation that AR's fixed left-to-right factorization lacks. Our results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.