Abstract
Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLMs or vision encoders. We propose a seeding strategy to carve Mamba from a trained Transformer and a three-stage distillation recipe, which together effectively transfer knowledge from Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates a 20.6$\times$ speedup and a 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves a 13.5$\times$ speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba
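To make the memory argument concrete, the following is a minimal sketch of why a Key-Value cache grows with context length while a recurrent state does not. The function names and every dimension are hypothetical illustrations, not the configuration of mmMamba or HoVLE, and the sketch does not attempt to reproduce the memory figures reported above.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for a Transformer decoder.

    Each layer stores one key and one value tensor, each of shape
    (num_tokens, num_kv_heads, head_dim), so the size grows linearly
    with the number of cached tokens.
    """
    return 2 * num_layers * num_tokens * num_kv_heads * head_dim * bytes_per_elem


def recurrent_state_bytes(num_layers, num_heads, head_dim, state_dim, bytes_per_elem=2):
    """Approximate state size for a linear-complexity recurrent layer stack.

    Each layer keeps one fixed-size state of shape
    (num_heads, head_dim, state_dim), independent of sequence length.
    """
    return num_layers * num_heads * head_dim * state_dim * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model dimensions, chosen only for illustration.
    layers, heads, head_dim, state_dim = 24, 16, 128, 128
    state = recurrent_state_bytes(layers, heads, head_dim, state_dim)
    for tokens in (1_000, 10_000, 100_000):
        cache = kv_cache_bytes(tokens, layers, heads, head_dim)
        print(f"{tokens:>7} tokens: KV cache {cache / 2**20:.1f} MiB, "
              f"recurrent state {state / 2**20:.1f} MiB")
```

Under these assumptions the cache size scales in proportion to the token count, whereas the recurrent state stays fixed, which is the source of the memory gap at long contexts.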