MambaVision: A Hybrid Mamba-Transformer Vision Backbone

  • 2025-03-25 18:54:37
  • Ali Hatamizadeh, Jan Kautz

Abstract

We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones while demonstrating favorable performance. Code: https://github.com/NVlabs/MambaVision
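
The layer ordering described in the abstract (Mamba-style blocks first, self-attention blocks in the final layers) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stage_layout`, the block labels, and the default `attn_fraction=0.5` are hypothetical choices made here for illustration, since the abstract states only that attention is placed in the final layers.

```python
def stage_layout(depth, attn_fraction=0.5):
    """Return the block type of each layer in one stage.

    Hypothetical helper: mixer blocks come first and self-attention
    blocks occupy the final layers. The split ratio is an assumption.
    """
    if depth < 1:
        raise ValueError("depth must be >= 1")
    if not 0.0 <= attn_fraction <= 1.0:
        raise ValueError("attn_fraction must be in [0, 1]")
    n_attn = int(depth * attn_fraction)  # rounds down
    n_mixer = depth - n_attn
    return ["mamba_mixer"] * n_mixer + ["self_attention"] * n_attn


if __name__ == "__main__":
    # Print the per-layer block types for an 8-layer stage.
    for index, block in enumerate(stage_layout(8)):
        print(index, block)
```

Setting `attn_fraction=0.0` gives a stage with no attention blocks, which is one way to picture the kind of baseline an ablation over attention placement would compare against.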

 
