AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Abstract

Large language models (LLMs) have shown excellent performance on various tasks, but the astronomical model size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth). In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations, not the weights. AWQ does not rely on any backpropagation or reconstruction, so it preserves LLMs' generalization ability across different domains and modalities without overfitting to the calibration set; it also does not rely on any data layout reordering, which maintains hardware efficiency. AWQ outperforms existing work on various language modeling, common sense QA, and domain-specific benchmarks. Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. We also implement efficient tensor core kernels with reorder-free online dequantization to accelerate AWQ, achieving a 1.45x speedup over GPTQ and running 1.85x faster than the cuBLAS FP16 implementation. Our method provides a turn-key solution to compress LLMs to 3/4 bits for efficient deployment.
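
The per-channel scaling idea can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not specify the search procedure, so the grid search over an exponent `alpha`, the use of the mean absolute activation as the saliency signal, the symmetric per-row round-to-nearest quantizer, and the function names `quantize_weights` and `search_scale` are all assumptions made for illustration.

```python
import numpy as np


def quantize_weights(w, n_bits=4):
    """Symmetric round-to-nearest quantization per output row (assumed scheme).

    Returns the dequantized weights, i.e. the values the quantized layer
    would effectively use.
    """
    q_max = 2 ** (n_bits - 1) - 1
    step = np.abs(w).max(axis=1, keepdims=True) / q_max
    step = np.maximum(step, 1e-8)  # guard against all-zero rows
    return np.clip(np.round(w / step), -q_max - 1, q_max) * step


def search_scale(w, x, n_bits=4, n_grid=20):
    """Pick a per-input-channel scale from activation statistics.

    w: (out_features, in_features) weights
    x: (n_samples, in_features) calibration activations
    Hypothetical search: s = mean(|x|) ** alpha, alpha in [0, 1).
    """
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-4)
    reference = x @ w.T  # full-precision layer output
    best_s, best_alpha, best_err = None, None, np.inf
    for i in range(n_grid):
        alpha = i / n_grid  # alpha = 0 is plain round-to-nearest
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())  # keep scales centered around 1
        # Scale weights up before quantizing, then fold 1/s back in;
        # this is equivalent to dividing the activations by s.
        w_eff = quantize_weights(w * s, n_bits) / s
        err = float(np.mean((reference - x @ w_eff.T) ** 2))
        if err < best_err:
            best_s, best_alpha, best_err = s, alpha, err
    return best_s, best_alpha, best_err


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(64, 16))
    x[:, 3] *= 30.0  # one channel with much larger activations
    w = rng.normal(size=(8, 16))
    s, alpha, err = search_scale(w, x)
    print("chosen alpha:", alpha, "output MSE:", err)
```

Because `alpha = 0` is part of the grid, the selected scale can never do worse on the calibration data than unscaled round-to-nearest under this error measure. No gradients are computed; only forward passes over the calibration activations are needed.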
