Extreme Compression of Large Language Models via Additive Quantization

Abstract

The emergence of accurate open large language models (LLMs) has led to a race towards quantization techniques that enable such models to run on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression (defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter) from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our work builds on top of Additive Quantization, a classic algorithm from the MCQ family, and adapts it to the quantization of language models. The resulting algorithm advances the state of the art in LLM compression, outperforming all recently proposed techniques in terms of accuracy at a given compression budget. For instance, when compressing Llama 2 models to 2 bits per parameter, our algorithm quantizes the 7B model to 6.93 perplexity (a 1.29 improvement relative to the best prior work, and 1.81 points from FP16), the 13B model to 5.70 perplexity (a 0.36 improvement) and the 70B model to 3.94 perplexity (a 0.22 improvement) on WikiText2. We release our implementation of Additive Quantization for Language Models (AQLM) as a baseline to facilitate future research in LLM quantization.
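
To make the "bits per parameter" budget concrete, the following is a minimal sketch of the general additive (multi-codebook) quantization idea: a small group of weights is stored as one index per codebook and reconstructed as the sum of the selected codewords. The configuration used here (2 codebooks of 256 entries over groups of 8 weights), the random codebooks, and the helper names `bits_per_parameter` and `decode_group` are illustrative assumptions, not the settings or API of the released AQLM implementation.

```python
import math
import random


def bits_per_parameter(num_codebooks, codebook_size, group_size):
    """Index bits spent per weight, ignoring codebook storage overhead."""
    return num_codebooks * math.log2(codebook_size) / group_size


def decode_group(codes, codebooks):
    """Reconstruct one weight group as the sum of one codeword per codebook."""
    group_size = len(codebooks[0][0])
    out = [0.0] * group_size
    for m, idx in enumerate(codes):
        for j, value in enumerate(codebooks[m][idx]):
            out[j] += value
    return out


# Hypothetical configuration (an assumption for illustration only).
M, K, G = 2, 256, 8  # codebooks, entries per codebook, weights per group

# Random codebooks stand in for learned ones.
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0.0, 1.0) for _ in range(G)] for _ in range(K)]
    for _ in range(M)
]

codes = [3, 200]  # one 8-bit index per codebook for this group
approx = decode_group(codes, codebooks)
bpp = bits_per_parameter(M, K, G)  # 2 * log2(256) / 8 index bits per weight
```

Under these assumed settings, each group of 8 weights costs two 8-bit indices, which is 2 bits per parameter before counting the storage for the codebooks themselves.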
