MambaOut: Do We Really Need Mamba for Vision?

Abstract

Mamba, an architecture with an RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification does not align with either characteristic, we hypothesize that Mamba is not necessary for this task. Detection and segmentation tasks are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks. The code is available at https://github.com/yuweihao/MambaOut
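
To make the long-sequence distinction concrete, the following is a rough back-of-the-envelope sketch, not taken from the paper's code. It assumes a patch size of 16 and two illustrative input resolutions (224x224 for classification, 800x1280 for detection); these numbers and the helper names `num_tokens` and `attention_pair_count` are assumptions chosen for illustration only.

```python
def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Count non-overlapping patch tokens for an image.

    Assumes (hypothetically) a fixed square patch size; any remainder
    pixels that do not fill a whole patch are ignored.
    """
    return (height // patch) * (width // patch)


def attention_pair_count(n: int) -> int:
    """Token-to-token interactions in full self-attention: grows as n^2."""
    return n * n


# Illustrative resolutions (assumptions, not values stated in the abstract).
cls_tokens = num_tokens(224, 224)    # classification-sized input
det_tokens = num_tokens(800, 1280)   # detection-sized input

print("classification tokens:", cls_tokens)
print("detection tokens:", det_tokens)
print("attention pairs (classification):", attention_pair_count(cls_tokens))
print("attention pairs (detection):", attention_pair_count(det_tokens))
```

Under these assumptions the detection-sized input yields roughly twenty times more tokens than the classification-sized one, and the quadratic attention cost grows by the square of that factor. This is the sense in which detection and segmentation can be regarded as long-sequence tasks while ImageNet classification is not.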
