Abstract
Mamba, an architecture with an RNN-like token mixer based on the state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and was subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification does not align with either characteristic, we hypothesize that Mamba is not necessary for this task. Detection and segmentation tasks are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named \emph{MambaOut} by stacking Mamba blocks while removing their core token mixer, the SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks. The code is available at https://github.com/yuweihao/MambaOut