When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

Abstract

Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency due to the sequential processing nature of autoregressive LLMs during generation. While linear attention and speculative decoding offer potential solutions, their applicability and synergistic potential for enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs. Notably, our approach achieves up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2$\times$ speedup during generation compared to prior linear attention methods. Codes and models are available at https://github.com/GATECH-EIC/Linearized-LLM.
