Abstract
Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding and subsequently on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset, BLIP3o-60k, for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction-tuning datasets.