Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

  • 2025-04-24 18:59:56
  • Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu

Abstract

Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than Diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along channel dimension to decrease the input token number, and token-unshuffle, which untangles the inferred tokens after Transformer blocks to restore the spatial arrangement for output. Jointly training with textual prompts, our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction way while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text-alignment, visual flaw, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
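
The two operations named in the abstract can be illustrated with a small sketch of the spatial rearrangement alone. This is not the authors' implementation: the function names, the nested-list token grid, and the window size `s` are assumptions made for illustration, and any learned layers that a real model would apply around the merge are omitted. It only shows how fusing each s x s window of tokens along the feature (channel) dimension cuts the token count by a factor of s*s, and how the inverse step restores the original spatial layout.

```python
# Illustrative sketch only: names and data layout are hypothetical,
# and no learned projection layers are modeled.

def token_shuffle(grid, s):
    """Merge each s x s window of tokens into one token by
    concatenating their feature vectors (row-major within the window)."""
    h, w = len(grid), len(grid[0])
    if h % s or w % s:
        raise ValueError("grid height and width must be divisible by s")
    merged = []
    for i in range(0, h, s):
        row = []
        for j in range(0, w, s):
            fused = []
            for di in range(s):
                for dj in range(s):
                    fused.extend(grid[i + di][j + dj])
            row.append(fused)
        merged.append(row)
    return merged


def token_unshuffle(merged, s):
    """Split each fused token back into its s x s window of tokens,
    restoring the original spatial arrangement."""
    h, w = len(merged) * s, len(merged[0]) * s
    d = len(merged[0][0]) // (s * s)  # per-token feature size
    grid = [[None] * w for _ in range(h)]
    for i, row in enumerate(merged):
        for j, fused in enumerate(row):
            for k in range(s * s):
                di, dj = divmod(k, s)
                grid[i * s + di][j * s + dj] = fused[k * d:(k + 1) * d]
    return grid


# A 4 x 4 grid of 2-dimensional tokens; each token stores its (row, col).
grid = [[[r, c] for c in range(4)] for r in range(4)]
merged = token_shuffle(grid, 2)       # 2 x 2 grid of 8-dimensional tokens
restored = token_unshuffle(merged, 2)  # back to 4 x 4 grid of 2-dim tokens
```

With `s = 2`, the 16 input tokens become 4 fused tokens, so the Transformer in this toy setting would process a sequence four times shorter, and `token_unshuffle` recovers the original grid exactly.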

 
