Abstract
Diffusion models have gained significant attention in the realm of image generation due to their exceptional performance. Their success has recently been extended to text generation by generating all tokens within a sequence concurrently. However, natural language exhibits a far more pronounced sequential dependency than images, and the majority of existing language models are trained with a left-to-right auto-regressive approach. To account for the inherent sequential characteristic of natural language, we introduce Auto-Regressive Diffusion (AR-Diffusion). AR-Diffusion ensures that the generation of tokens on the right depends on the generated ones on the left, a mechanism achieved by employing a dynamic number of denoising steps that varies with token position. This results in tokens on the left undergoing fewer denoising steps than those on the right, thereby enabling them to be generated earlier and subsequently influence the generation of tokens on the right. In a series of experiments on various text generation tasks, including text summarization, machine translation, and common sense generation, AR-Diffusion clearly demonstrates superiority over existing diffusion language models and can be $100\times\sim600\times$ faster when achieving comparable results. Our code will be publicly released.
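
As a purely illustrative sketch of the position-dependent step count described above, the toy code below assigns each token position a number of denoising steps that grows from left to right, and then reports which tokens are already finished after a given number of global steps. The linear schedule, the function names `denoising_steps_by_position` and `finished_tokens`, and the parameters `min_steps` and `max_steps` are assumptions made for this example; they are not taken from the AR-Diffusion formulation or its code.

```python
def denoising_steps_by_position(seq_len, min_steps, max_steps):
    """Hypothetical linear schedule (an assumption, not the paper's):
    the leftmost token gets min_steps, the rightmost gets max_steps."""
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    if min_steps > max_steps:
        raise ValueError("min_steps must not exceed max_steps")
    if seq_len == 1:
        return [min_steps]
    span = max_steps - min_steps
    # Interpolate linearly over positions 0 .. seq_len - 1.
    return [min_steps + round(span * n / (seq_len - 1)) for n in range(seq_len)]


def finished_tokens(steps, tick):
    """Positions whose required step count has been reached after `tick`
    global denoising steps, assuming every unfinished token takes one
    step per tick."""
    return [pos for pos, needed in enumerate(steps) if needed <= tick]


if __name__ == "__main__":
    steps = denoising_steps_by_position(seq_len=5, min_steps=2, max_steps=10)
    print(steps)                      # [2, 4, 6, 8, 10]
    print(finished_tokens(steps, 6))  # [0, 1, 2]
```

Under this assumed schedule, the left tokens are fully denoised while the right tokens are still noisy, so the later steps for the right tokens can condition on an already generated left context.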