Abstract
Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.
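
As a rough illustration of the interface described above, the following toy sketch shows how a set of learnable queries might be appended to a prompt, passed through a frozen backbone, and mapped to conditioning vectors for a diffusion decoder. It is not the actual implementation: `frozen_mllm`, `connector`, `meta_query_conditioning`, the causal-averaging stand-in for the MLLM, and the sizes `DIM` and `NUM_QUERIES` are all hypothetical names and values chosen for illustration.

```python
import random

random.seed(0)

DIM = 4          # toy hidden size (assumption, not from the paper)
NUM_QUERIES = 3  # number of learnable queries (assumption)


def frozen_mllm(tokens):
    """Hypothetical stand-in for a frozen MLLM.

    Each output mixes the current token with the mean of all tokens seen
    so far, so later positions depend on earlier ones (causal mixing).
    Nothing in this function is trained.
    """
    outputs = []
    running = [0.0] * DIM
    for i, tok in enumerate(tokens, start=1):
        running = [r + t for r, t in zip(running, tok)]
        outputs.append([0.5 * t + 0.5 * r / i for t, r in zip(tok, running)])
    return outputs


def connector(latents, weight):
    """Linear map from backbone latents to decoder conditioning vectors."""
    return [[sum(w * x for w, x in zip(row, lat)) for row in weight]
            for lat in latents]


def meta_query_conditioning(prompt_embeddings, queries, weight):
    """Append the queries to the prompt, run the frozen backbone, and keep
    only the latents at the query positions as decoder conditioning."""
    hidden = frozen_mllm(prompt_embeddings + queries)
    query_latents = hidden[-len(queries):]
    return connector(query_latents, weight)


# Toy inputs: a 5-token prompt, randomly initialised queries, identity connector.
prompt = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
queries = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
weight = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]

cond = meta_query_conditioning(prompt, queries, weight)
print(len(cond), len(cond[0]))  # one conditioning vector per query
```

In this sketch, only `queries` and `weight` would receive gradients from the diffusion objective, which is computed by a decoder conditioned on `cond`; the backbone stand-in has no trainable state, mirroring the frozen-MLLM setting.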