Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

Abstract

Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLMs or vision encoders. We propose a seeding strategy to carve Mamba from a trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates 20.6$\times$ speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves 13.5$\times$ speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba
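
To make the memory argument concrete, the following is a minimal back-of-the-envelope sketch of why a Key-Value cache grows with sequence length while a recurrent state-space layer keeps a fixed-size state. The function names and all model dimensions below are hypothetical, chosen only for illustration. They are not the HoVLE or mmMamba configurations, and the sketch ignores weights, activations, and other buffers, so it does not reproduce the reported 75.8% figure.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of a Transformer KV cache: one key and one value tensor per layer,
    each holding seq_len * n_kv_heads * head_dim elements."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem


def ssm_state_bytes(n_layers, n_heads, head_dim, state_dim, bytes_per_elem=2):
    """Size of a recurrent state: one (head_dim x state_dim) matrix per head
    per layer. Note that seq_len does not appear."""
    return n_layers * n_heads * head_dim * state_dim * bytes_per_elem


# Hypothetical dimensions (assumptions, not taken from the paper).
N_LAYERS, N_HEADS, HEAD_DIM, STATE_DIM = 24, 8, 128, 128

fixed_state = ssm_state_bytes(N_LAYERS, N_HEADS, HEAD_DIM, STATE_DIM)
for seq_len in (1_000, 10_000, 103_000):
    kv = kv_cache_bytes(seq_len, N_LAYERS, N_HEADS, HEAD_DIM)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:8.1f} MiB, "
          f"recurrent state {fixed_state / 2**20:6.1f} MiB")
```

Under these assumptions the KV cache scales linearly with the number of tokens, whereas the recurrent state stays the same size at any context length. A hybrid model that keeps some Transformer layers would sit between the two.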
