Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Abstract

Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k \times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground-truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2$\times$ faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.
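
To make the entity definitions above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It shows one plausible way to group a grid of patch embeddings into cells and subsamples, and one common flow-matching convention for producing a noisy entity. The function names (`to_cells`, `to_subsamples`, `noisy_entity`), the strided interpretation of a "subsample", and the linear noise-to-data interpolation are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def to_cells(grid: np.ndarray, k: int) -> np.ndarray:
    """Group an (H, W, D) patch grid into non-overlapping k x k cells.

    Returns an array of shape (H//k * W//k, k*k, D): one row per cell,
    cells ordered row-major over the grid.
    """
    H, W, D = grid.shape
    assert H % k == 0 and W % k == 0, "grid size must be divisible by k"
    cells = grid.reshape(H // k, k, W // k, k, D).transpose(0, 2, 1, 3, 4)
    return cells.reshape((H // k) * (W // k), k * k, D)


def to_subsamples(grid: np.ndarray, s: int) -> np.ndarray:
    """Group patches by strided subsampling (an assumed reading of
    "non-local grouping of distant patches").

    Entity (i, j) holds the patches at rows i::s and columns j::s.
    Returns an array of shape (s*s, (H//s)*(W//s), D).
    """
    H, W, D = grid.shape
    assert H % s == 0 and W % s == 0, "grid size must be divisible by s"
    subs = [grid[i::s, j::s].reshape(-1, D) for i in range(s) for j in range(s)]
    return np.stack(subs)


def noisy_entity(x: np.ndarray, t: float, rng: np.random.Generator):
    """Interpolate linearly between Gaussian noise (t=0) and the clean
    entity (t=1). This is an assumed flow-matching convention.

    Returns the noisy entity x_t and the constant velocity target x - noise.
    """
    noise = rng.standard_normal(x.shape)
    x_t = (1.0 - t) * noise + t * x
    velocity = x - noise
    return x_t, velocity


if __name__ == "__main__":
    # A toy 4 x 4 grid of 1-dimensional "patch embeddings" numbered 0..15.
    grid = np.arange(16, dtype=np.float64).reshape(4, 4, 1)

    cells = to_cells(grid, k=2)        # 4 cells of 4 neighboring patches
    subs = to_subsamples(grid, s=2)    # 4 subsamples of 4 distant patches
    print(cells[0].ravel())            # patches 0, 1, 4, 5
    print(subs[0].ravel())             # patches 0, 2, 8, 10

    rng = np.random.default_rng(0)
    x_t, v = noisy_entity(cells[0], t=0.3, rng=rng)
    # Following the velocity for the remaining time recovers the clean cell.
    print(np.allclose(x_t + (1.0 - 0.3) * v, cells[0]))
```

In this sketch a cell collects spatially adjacent patches while a subsample collects patches spread across the whole grid, which is the contrast the abstract draws between local and non-local entities. In a training loop built on these assumptions, earlier entities would be passed through something like `noisy_entity` before being used as context, rather than being supplied as clean ground truth.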
