Abstract
Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than diffusion-based models. A primary limitation is the substantial number of image tokens that AR models require, which constrains training and inference efficiency as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in the Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from the visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to decrease the number of input tokens, and token-unshuffle, which untangles the inferred tokens after the Transformer blocks to restore the spatial arrangement for output. Trained jointly with textual prompts, our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction manner while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. On the GenAI-benchmark, our 2.7B model achieves an overall score of 0.77 on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our model's strong image generation ability in terms of text alignment, visual flaws, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
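
The two operations named above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes tokens are stored row-major over an h x w grid and that a hypothetical window size `s` divides both h and w, and it omits any learned projections the full method may apply before or after merging. The function names `token_shuffle` and `token_unshuffle`, their signatures, and the list-of-lists representation are illustrative choices only.

```python
def token_shuffle(tokens, h, w, s):
    """Merge each s x s window of tokens into one token along the channel axis.

    tokens: list of h * w feature vectors (lists of length d), row-major.
    Returns (h // s) * (w // s) vectors, each of length s * s * d.
    """
    assert len(tokens) == h * w and h % s == 0 and w % s == 0
    merged = []
    for i in range(0, h, s):          # top row of each window
        for j in range(0, w, s):      # left column of each window
            fused = []
            for a in range(s):
                for b in range(s):
                    # Concatenate the window's tokens channel-wise.
                    fused.extend(tokens[(i + a) * w + (j + b)])
            merged.append(fused)
    return merged


def token_unshuffle(merged, h, w, s):
    """Inverse of token_shuffle: split fused tokens back onto the h x w grid."""
    assert len(merged) == (h // s) * (w // s)
    d = len(merged[0]) // (s * s)     # per-token channel width
    tokens = [None] * (h * w)
    k = 0
    for i in range(0, h, s):
        for j in range(0, w, s):
            fused = merged[k]
            k += 1
            for a in range(s):
                for b in range(s):
                    start = (a * s + b) * d
                    tokens[(i + a) * w + (j + b)] = fused[start:start + d]
    return tokens
```

Under these assumptions, a window size of s shrinks the sequence seen by the Transformer by a factor of s * s (for example, 16 tokens become 4 when s = 2), and `token_unshuffle` restores the original spatial arrangement exactly.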