Quantization of Large Language Models with an Overdetermined Basis

Abstract

In this paper, we introduce an algorithm for data quantization based on the principles of Kashin representation. This approach hinges on decomposing any given vector, matrix, or tensor into two factors. The first factor maintains a small infinity norm, while the second exhibits a similarly constrained norm when multiplied by an orthogonal matrix. Surprisingly, the entries of the factors after decomposition are well concentrated around several peaks, which allows us to efficiently replace them with the corresponding centroids for quantization purposes. We study the theoretical properties of the proposed approach and rigorously evaluate our compression algorithm in the context of next-word prediction tasks and on a set of downstream tasks for text classification. Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance while ensuring data compression, marking a significant advancement in the field of data quantization.
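
To make the decomposition described above concrete, the following is a minimal sketch of a Kashin-style split x ≈ u + Qv, in which both u and v are kept small in the infinity norm. It is not the algorithm from the paper: it assumes a simple alternating truncation scheme, a random orthogonal matrix Q obtained from a QR factorization, and hypothetical names (`kashin_decompose`, `n_iter`, `level`) chosen only for illustration.

```python
import numpy as np


def kashin_decompose(x, Q, n_iter=20, level=1.0):
    """Illustrative alternating-truncation split x = u + Q @ v + r.

    Assumption: this is a simple sketch, not the paper's method.
    At each step the residual r is clipped to a threshold proportional
    to ||r||_2 / sqrt(n), first in the standard basis (added to u),
    then in the basis given by the columns of Q (added to v).
    """
    n = x.size
    u = np.zeros_like(x)
    v = np.zeros_like(x)
    r = x.copy()
    for _ in range(n_iter):
        # Truncate the residual in the standard basis.
        t = level * np.linalg.norm(r) / np.sqrt(n)
        du = np.clip(r, -t, t)
        u += du
        r -= du
        # Truncate the remaining residual in the rotated basis.
        t = level * np.linalg.norm(r) / np.sqrt(n)
        c = Q.T @ r
        dv = np.clip(c, -t, t)
        v += dv
        r -= Q @ dv
    return u, v, r


rng = np.random.default_rng(0)
n = 256
# Random orthogonal matrix from the QR factorization of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
# A heavy-tailed test vector, standing in for a row of model weights.
x = rng.standard_normal(n) ** 3

u, v, r = kashin_decompose(x, Q)
print("max |x|:", np.abs(x).max())
print("max |u|:", np.abs(u).max(), " max |v|:", np.abs(v).max())
print("relative residual:", np.linalg.norm(r) / np.linalg.norm(x))
```

By construction the identity x = u + Qv + r holds exactly up to floating-point error, and the residual norm never increases from one step to the next, because clipping in an orthonormal basis can only shrink each coordinate. In a quantization pipeline along the lines the abstract describes, the entries of u and v would then be replaced by a small set of centroids; that step is omitted here.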
