Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Abstract

While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.
