Abstract
We present Lumina-mGPT, a family of multimodal autoregressive models capableof various vision and language tasks, particularly excelling in generatingflexible photorealistic images from text descriptions. Unlike existingautoregressive image generation approaches, Lumina-mGPT employs a pretraineddecoder-only transformer as a unified framework for modeling multimodal tokensequences. Our key insight is that a simple decoder-only transformer withmultimodal Generative PreTraining (mGPT), utilizing the next-token predictionobjective on massive interleaved text-image sequences, can learn broad andgeneral multimodal capabilities, thereby illuminating photorealistictext-to-image generation. Building on these pretrained models, we proposeFlexible Progressive Supervised Finetuning (FP-SFT) on high-quality image-textpairs to fully unlock their potential for high-aesthetic image synthesis at anyresolution while maintaining their general multimodal capabilities.Furthermore, we introduce Ominiponent Supervised Finetuning (Omni-SFT),transforming Lumina-mGPT into a foundation model that seamlessly achievesomnipotent task unification. The resulting model demonstrates versatilemultimodal capabilities, including visual generation tasks like flexibletext-to-image generation and controllable generation, visual recognition taskslike segmentation and depth estimation, and vision-language tasks likemultiturn visual question answering. Additionally, we analyze the differencesand similarities between diffusion-based and autoregressive methods in a directcomparison.