Abstract
Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and preliminary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thought, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to operate in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusion models and looped transformers. We attribute the contradiction between theoretical expressiveness and empirical performance to practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, it introduces the additional difficulty of decoding tokens from the continuous representation space into the discrete token space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining the two modalities, CCDD is expressive, with rich semantics in the latent space, while retaining good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which show strong empirical performance in extensive language modeling experiments on real-world tasks.