Abstract
Long-range sequence processing poses a significant challenge for Transformers due to their quadratic complexity in input length. A promising alternative is Mamba, which demonstrates high performance and achieves Transformer-level capabilities while requiring substantially fewer computational resources. In this paper we explore the length-generalization capabilities of Mamba, which we find to be relatively limited. Through a series of visualizations and analyses we identify that the limitations arise from a restricted effective receptive field, dictated by the sequence length used during training. To address this constraint, we introduce DeciMamba, a context-extension method specifically designed for Mamba. This mechanism, built on top of a hidden filtering mechanism embedded within the S6 layer, enables the trained model to extrapolate well even without additional training. Empirical experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths that are significantly longer than the ones seen during training, while enjoying faster inference.