Stochastic Layer-Wise Shuffle for Improving Vision Mamba Training

Abstract

Recent Vision Mamba (Vim) models exhibit nearly linear complexity in sequence length, making them highly attractive for processing visual data. However, their training methodologies and potential remain insufficiently explored. In this paper, we investigate training strategies for Vim and propose Stochastic Layer-Wise Shuffle (SLWS), a novel regularization method that effectively improves Vim training. Without architectural modifications, this approach enables the non-hierarchical Vim to achieve leading performance on ImageNet-1K compared with counterparts of a similar type. Our method operates through four simple steps per layer: probability allocation to assign layer-dependent shuffle rates, operation sampling via Bernoulli trials, sequence shuffling of input tokens, and order restoration of outputs. SLWS distinguishes itself through three principles: (1) Plug-and-play: no architectural modifications are needed, and it is deactivated during inference. (2) Simple but effective: the four-step process introduces only random permutations and negligible overhead. (3) Intuitive design: shuffling probabilities grow linearly with layer depth, aligning with the hierarchical semantic abstraction in vision models. Our work underscores the importance of tailored training strategies for Vim models and provides a helpful way to explore their scalability.
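
The four per-layer steps named in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the exact schedule formula, the `max_prob` value, the function names `shuffle_probability` and `slws_layer`, and the toy `running_sum` layer (a stand-in for an order-sensitive sequence layer) are all assumptions made for the example. The abstract only states that the shuffle probability grows linearly with depth and that the method is disabled at inference.

```python
import random


def shuffle_probability(layer_idx, num_layers, max_prob=0.5):
    """Step 1, probability allocation (assumed linear schedule).

    layer_idx is 1-based; the last layer receives max_prob.
    max_prob is a hypothetical hyperparameter, not a value from the paper.
    """
    return max_prob * layer_idx / num_layers


def slws_layer(layer_fn, tokens, prob, training=True, rng=random):
    """Apply layer_fn to tokens, shuffling the sequence with probability prob."""
    # Step 2, operation sampling: one Bernoulli trial per layer call.
    # At inference the shuffle is always skipped.
    if not training or rng.random() >= prob:
        return layer_fn(tokens)

    # Step 3, sequence shuffling: permute the input tokens.
    perm = list(range(len(tokens)))
    rng.shuffle(perm)
    shuffled = [tokens[i] for i in perm]

    out = layer_fn(shuffled)

    # Step 4, order restoration: the output at shuffled position `pos`
    # belongs to original position perm[pos].
    restored = [None] * len(tokens)
    for pos, src in enumerate(perm):
        restored[src] = out[pos]
    return restored


def running_sum(seq):
    """Toy order-sensitive layer: each output depends on all earlier inputs."""
    total, out = 0, []
    for x in seq:
        total += x
        out.append(total)
    return out


if __name__ == "__main__":
    tokens = [1, 2, 3, 4, 5]
    num_layers = 4
    for layer_idx in range(1, num_layers + 1):
        p = shuffle_probability(layer_idx, num_layers)
        tokens_out = slws_layer(running_sum, tokens, p, training=True)
        print(layer_idx, p, tokens_out)
```

Because order is restored after the layer, each output stays aligned with its original token position; only the context the layer saw during that forward pass was permuted. A tensor-based version would typically replace the list operations with a random permutation and its inverse index.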
