ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Abstract

Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs. Codes and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.
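
To make the core idea concrete, the following is a minimal NumPy sketch of a matrix-vector product computed from binary matrices and power-of-two scaling factors. It is not the authors' implementation: the function names `bcq_quantize` and `shift_add_matvec` are hypothetical, the greedy residual fitting is a generic binary-coding scheme rather than the multi-objective optimization described above, one scale per output row stands in for group-wise scaling, and the lookup-table queries are replaced by direct signed sums. `np.ldexp` plays the role of the shift, since it scales a float by a power of two by adjusting its exponent.

```python
import numpy as np

def bcq_quantize(W, bits=3):
    """Greedy binary-coding quantization with power-of-two scales.

    Returns `bits` sign matrices in {-1, +1} and, for each, an int32
    exponent per output row, so that W ~= sum_i 2**k_i * B_i.
    (Assumption: one scale per row, a simplification of group-wise scaling.)
    """
    residual = np.asarray(W, dtype=np.float64).copy()
    signs, exponents = [], []
    for _ in range(bits):
        B = np.where(residual >= 0, 1.0, -1.0)
        # Least-squares scale for a fixed sign matrix is the mean magnitude.
        alpha = np.abs(residual).mean(axis=1)
        # Snap the scale to the nearest power of two (in the log domain).
        k = np.round(np.log2(np.maximum(alpha, 1e-12))).astype(np.int32)
        signs.append(B)
        exponents.append(k)
        residual -= np.ldexp(B, k[:, None])  # subtract +/- 2**k
    return signs, exponents

def shift_add_matvec(signs, exponents, x):
    """Approximate W @ x using only signed additions and power-of-two scaling."""
    y = np.zeros(signs[0].shape[0])
    for B, k in zip(signs, exponents):
        # Add x[j] where the bit is +1, subtract it where the bit is -1.
        acc = np.where(B > 0, x, -x).sum(axis=1)
        # Scaling by 2**k is an exponent shift, not a general multiplication.
        y += np.ldexp(acc, k)
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16))
x = rng.standard_normal(16)
signs, exponents = bcq_quantize(W, bits=3)
print(np.abs(shift_add_matvec(signs, exponents, x) - W @ x).max())
```

The printed value is the largest absolute deviation from the exact product for this random example. Because each snapped scale stays within a factor of √2 of the least-squares scale, every extra bit in this greedy scheme reduces the weight reconstruction error.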
