Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Abstract

Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either the quadratic computation complexity or the limited extrapolation ability on length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms the state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and show improved token predictions up to 1M context length. As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available in https://github.com/microsoft/Samba.
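
To illustrate why a sliding-window attention layer scales linearly with sequence length, the following is a minimal sketch of a causal sliding-window mask. It is not taken from the Samba implementation: the function names `sliding_window_mask` and `attended_pairs` and the toy sizes are hypothetical, chosen only for illustration. Each query position attends to itself and at most `window - 1` preceding positions, so the number of attended pairs grows as roughly `seq_len * window` rather than `seq_len ** 2`.

```python
# Illustrative sketch only; names and sizes are assumptions, not the paper's code.

def sliding_window_mask(seq_len, window):
    """Return mask[i][j], True when query position i may attend to key position j.

    Position i sees keys j with j <= i (causal) and i - j < window (local).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(seq_len, window):
    """Count the (query, key) pairs that the mask allows."""
    mask = sliding_window_mask(seq_len, window)
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    # Print the mask for a toy sequence: 'x' marks an allowed pair.
    for row in sliding_window_mask(seq_len=6, window=3):
        print("".join("x" if allowed else "." for allowed in row))
```

With a window equal to the sequence length the mask reduces to ordinary causal attention; a fixed window keeps the per-token cost constant as the sequence grows.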
