Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions

Abstract

In recent years, masked diffusion models (MDMs) have emerged as a promisingalternative approach for generative modeling over discrete domains. Compared toautoregressive models (ARMs), MDMs trade off complexity at training time withflexibility at inference time. At training time, they must learn to solve anexponentially large number of infilling problems, but at inference time, theycan decode tokens in essentially arbitrary order. In this work, we closelyexamine these two competing effects. On the training front, we theoreticallyand empirically demonstrate that MDMs indeed train on computationallyintractable subproblems compared to their autoregressive counterparts. On theinference front, we show that a suitable strategy for adaptively choosing thetoken decoding order significantly enhances the capabilities of MDMs, allowingthem to sidestep hard subproblems. On logic puzzles like Sudoku, we show thatadaptive inference can boost solving accuracy in pretrained MDMs from $<7$% to$\approx 90$%, even outperforming ARMs with $7\times$ as many parameters andthat were explicitly trained via teacher forcing to learn the right order ofdecoding.

Quick Read (beta)

loading the full paper ...