Abstract
We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution is a redesign of the Mamba formulation that enhances its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones. Code: https://github.com/NVlabs/MambaVision