Abstract
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories; in particular, they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
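
To make the two ingredients named above more concrete, the following is a minimal sketch of one recurrent step of a matrix-memory cell with exponential gating, a normalizer state, and a log-space stabilizer. It is an illustration under stated assumptions, not the reference implementation: the function name `mlstm_step`, the argument names, the choice of an exponential (rather than sigmoid) forget gate, and the omission of the output gate and of any learned projections are all assumptions made here for brevity.

```python
import math

def mlstm_step(C, n, m, q, k, v, i_pre, f_pre):
    """One step of a matrix-memory cell (hypothetical sketch, no output gate).

    C: matrix memory (list of rows), n: normalizer vector, m: stabilizer scalar,
    q, k, v: query, key, value vectors, i_pre, f_pre: gate pre-activations.
    """
    d_k, d_v = len(k), len(v)
    # Exponential gates handled in log space (assumption: both gates are exponential).
    log_i, log_f = i_pre, f_pre
    # Stabilizer state: subtracting the running maximum keeps the exponents bounded.
    m_new = max(log_f + m, log_i)
    i_gate = math.exp(log_i - m_new)
    f_gate = math.exp(log_f + m - m_new)
    # Covariance update rule: C <- f * C + i * (v k^T).
    C_new = [[f_gate * C[r][c] + i_gate * v[r] * k[c] for c in range(d_k)]
             for r in range(d_v)]
    # Normalizer state: n <- f * n + i * k.
    n_new = [f_gate * n[c] + i_gate * k[c] for c in range(d_k)]
    # Normalized readout: h = (C q) / max(|n . q|, 1).
    numerator = [sum(C_new[r][c] * q[c] for c in range(d_k)) for r in range(d_v)]
    denominator = max(abs(sum(n_new[c] * q[c] for c in range(d_k))), 1.0)
    h = [x / denominator for x in numerator]
    return C_new, n_new, m_new, h

# Usage: store one key-value pair, then query with the same key.
C0 = [[0.0, 0.0], [0.0, 0.0]]
n0 = [0.0, 0.0]
key, value = [1.0, 0.0], [2.0, 3.0]
C1, n1, m1, h1 = mlstm_step(C0, n0, 0.0, key, key, value, i_pre=0.0, f_pre=0.0)
# h1 recovers the stored value: [2.0, 3.0]
```

Because the gates are only ever exponentiated after the stabilizer is subtracted, a very large input-gate pre-activation does not overflow in this sketch; it simply makes the new key-value pair dominate the memory.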