OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

Abstract

In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes the training complexity and overhead by bridging the off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a light-weight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.
