OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models

Abstract

Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between the two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out for its inference efficiency, achieving up to a 119.2× speedup and a 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at https://github.com/hustvl/OmniMamba
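
The abstract names task-specific LoRA but gives no implementation details, so the following is only a minimal, framework-free sketch of the general idea: a frozen base projection shared by both tasks, plus one low-rank adapter per task that is selected at call time. The class name `TaskLoRALinear`, the task labels, the dimensions, and the initialization scale are all hypothetical and are not taken from the OmniMamba code base; which layers the adapters attach to in the actual model is likewise not stated here.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TaskLoRALinear:
    """Frozen linear projection with one low-rank adapter per task.

    Hypothetical illustration only; not the OmniMamba implementation.
    """

    def __init__(self, d_in, d_out, rank, tasks, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Base weight shared by all tasks; it would stay frozen during adaptation.
        self.weight = [[rng.gauss(0, 0.02) for _ in range(d_in)]
                       for _ in range(d_out)]
        self.scale = alpha / rank
        # Usual LoRA initialization: A is random, B is zero,
        # so every adapter starts out as a no-op.
        self.adapters = {
            task: {
                "A": [[rng.gauss(0, 0.02) for _ in range(d_in)]
                      for _ in range(rank)],
                "B": [[0.0] * rank for _ in range(d_out)],
            }
            for task in tasks
        }

    def forward(self, x, task):
        """Return W x + scale * B_task (A_task x) for the selected task."""
        base = matvec(self.weight, x)
        adapter = self.adapters[task]
        delta = matvec(adapter["B"], matvec(adapter["A"], x))
        return [b + self.scale * d for b, d in zip(base, delta)]


# Usage: the same input is routed through a different adapter per task.
layer = TaskLoRALinear(d_in=4, d_out=3, rank=2,
                       tasks=("understanding", "generation"))
x = [1.0, 0.5, -0.5, 2.0]
y_understanding = layer.forward(x, "understanding")
y_generation = layer.forward(x, "generation")
```

Because each `B` matrix starts at zero, both tasks initially produce the base projection's output; only the small `A` and `B` matrices of the active task would need to be trained, which is what makes the adaptation parameter-efficient.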
