Abstract
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear whether dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pre-trained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient-based RL algorithm called diffu-GRPO. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and logical reasoning benchmarks. We find that d1 yields the best performance and significantly improves the performance of a state-of-the-art dLLM.