xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference

Abstract

Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.
