MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

  • 2025-09-19 17:58:00
  • Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen
  • 0

Abstract

Unified multimodal Large Language Models (LLMs) that can both understand andgenerate visual content hold immense potential. However, existing open-sourcemodels often suffer from a performance trade-off between these capabilities. Wepresent Manzano, a simple and scalable unified framework that substantiallyreduces this tension by coupling a hybrid image tokenizer with a well-curatedtraining recipe. A single shared vision encoder feeds two lightweight adaptersthat produce continuous embeddings for image-to-text understanding and discretetokens for text-to-image generation within a common semantic space. A unifiedautoregressive LLM predicts high-level semantics in the form of text and imagetokens, with an auxiliary diffusion decoder subsequently translating the imagetokens into pixels. The architecture, together with a unified training recipeover understanding and generation data, enables scalable joint learning of bothcapabilities. Manzano achieves state-of-the-art results among unified models,and is competitive with specialist models, particularly on text-richevaluation. Our studies show minimal task conflicts and consistent gains fromscaling model size, validating our design choice of a hybrid tokenizer.

 

Quick Read (beta)

loading the full paper ...