Abstract
Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which allow it to harness the strengths of both autoregressive models for text generation and diffusion models for high-quality image synthesis. These practical improvements also make Mogao particularly effective at processing arbitrarily interleaved sequences of text and images. To further unlock the potential of unified models, we introduce an efficient training strategy on a large-scale, in-house dataset specifically curated for joint text and image generation. Extensive experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs. Its emergent capabilities in zero-shot image editing and compositional generation highlight Mogao as a practical omni-modal foundation model, paving the way for the future development and scaling of unified multi-modal systems.