OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models

  • 2025-03-11 18:59:46
  • Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang

Abstract

Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between the two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out with outstanding inference efficiency, achieving up to a 119.2-fold speedup and a 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at https://github.com/hustvl/OmniMamba.
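
The abstract names task-specific LoRA but does not describe how the adapters are wired in. As a rough illustration only, the sketch below shows the general LoRA idea with one low-rank adapter per task on a single frozen linear map: the output is W x + (alpha / r) B_task A_task x, with B initialized to zero as in standard LoRA. The class name, task labels, dimensions, and the choice of one adapter pair per task are assumptions made for this example, not details taken from OmniMamba.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class TaskLoRALinear:
    """Frozen base weight plus one low-rank adapter per task.

    Hypothetical illustration: names and layout are assumptions,
    not the OmniMamba implementation.
    """

    def __init__(self, d_in, d_out, rank, tasks, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Base weight: shared by all tasks and treated as frozen.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        self.scale = alpha / rank
        # One (A, B) pair per task; B starts at zero so each adapter
        # initially leaves the base output unchanged.
        self.adapters = {
            task: {
                "A": [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)],
                "B": [[0.0] * rank for _ in range(d_out)],
            }
            for task in tasks
        }

    def forward(self, x, task):
        base = matvec(self.W, x)
        adapter = self.adapters[task]
        # Low-rank update: project down with A, then back up with B.
        delta = matvec(adapter["B"], matvec(adapter["A"], x))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = TaskLoRALinear(d_in=4, d_out=3, rank=2, tasks=["understanding", "generation"])
x = [1.0, 2.0, 3.0, 4.0]
y_understanding = layer.forward(x, "understanding")
y_generation = layer.forward(x, "generation")
```

Only the small A and B matrices would be trained for each task in such a setup, which is what makes the adaptation parameter-efficient; selecting the adapter by task keeps the shared weights common to both understanding and generation.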

 
