SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Abstract

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy or do not run efficiently on hardware. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs that can be implemented efficiently. We observe that systematic outliers appear at fixed activation channels. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the GEMMs in LLMs, including OPT-175B, BLOOM-176B and GLM-130B. SmoothQuant has better hardware efficiency than existing techniques using mixed-precision activation quantization or weight-only quantization. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. Thanks to the hardware-friendly design, we integrate SmoothQuant into FasterTransformer, a state-of-the-art LLM serving framework, and achieve faster inference speed with half the number of GPUs compared to FP16. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code will be released at: https://github.com/mit-han-lab/smoothquant.
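
The smoothing idea described above can be illustrated with a small NumPy sketch. It is not the released implementation. It assumes a single linear layer `Y = X @ W`, a per-input-channel scale `s` that divides the activations and multiplies the matching weight rows (so the product is unchanged), and simple symmetric per-tensor INT8 quantization. The function names, the `alpha` balance parameter, and the specific rule used to pick `s` are illustrative assumptions, since the abstract does not specify them.

```python
import numpy as np

def smooth(X, W, alpha=0.5, eps=1e-8):
    """Rescale per input channel so that X_s @ W_s equals X @ W.

    X: (tokens, c_in) activations, W: (c_in, c_out) weights.
    The rule for s below is an assumed balance between activation and
    weight ranges; alpha is a hypothetical knob, not taken from the abstract.
    """
    act_max = np.maximum(np.abs(X).max(axis=0), eps)  # range of each activation channel
    w_max = np.maximum(np.abs(W).max(axis=1), eps)    # range of each matching weight row
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    return X / s, W * s[:, None]

def quantize_int8(T):
    """Symmetric per-tensor INT8 quantization (assumed scheme for this sketch)."""
    scale = np.abs(T).max() / 127.0
    q = np.clip(np.round(T / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_matmul(X, W):
    """Quantize both operands to INT8, accumulate in INT32, then dequantize."""
    qx, sx = quantize_int8(X)
    qw, sw = quantize_int8(W)
    acc = qx.astype(np.int32) @ qw.astype(np.int32)
    return acc.astype(np.float64) * (sx * sw)

# Synthetic data: one activation channel carries a systematic outlier.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
X[:, 3] *= 50.0
W = rng.normal(size=(16, 8))

ref = X @ W                    # full-precision reference
X_s, W_s = smooth(X, W)        # mathematically equivalent rescaling

err_plain = np.abs(w8a8_matmul(X, W) - ref).mean()
err_smooth = np.abs(w8a8_matmul(X_s, W_s) - ref).mean()
print(f"mean abs error, plain W8A8:    {err_plain:.4f}")
print(f"mean abs error, smoothed W8A8: {err_smooth:.4f}")
```

In this toy setting the outlier channel forces a coarse quantization step on every activation. Dividing it by a larger `s` shrinks the activation range, at the cost of a wider weight range, which is the sense in which the difficulty is migrated from activations to weights.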
