Abstract
We introduce GPTQv2, a novel finetuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which calibrates each layer independently, we always match the quantized layer's output to the exact output in the full-precision model, resulting in a scheme that we call asymmetric calibration. Such a scheme can effectively reduce the quantization error accumulated in previous layers. We analyze this problem using optimal brain compression to derive a closed-form solution. The new solution explicitly minimizes the quantization error as well as the accumulated asymmetry error. Furthermore, we utilize various techniques to parallelize the solution calculation, including channel parallelization, neuron decomposition, and Cholesky reformulation for matrix fusion. As a result, GPTQv2 is easy to implement, requiring only 20 more lines of code than GPTQ while improving its performance under low-bit quantization. Remarkably, on a single GPU, we quantize a 405B language transformer as well as EVA-02, the top-ranked vision transformer, which achieves 90% pretraining ImageNet accuracy. Code is available at github.com/Intelligent-Computing-Lab-Yale/GPTQv2.
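
The contrast between the two calibration schemes can be sketched as a pair of layer-wise objectives. The notation here is an assumption for illustration and is not defined in the abstract: $W$ is the full-precision weight, $\hat{W}$ its quantized counterpart, $X$ the layer input produced by the full-precision model, and $\hat{X}$ the input produced by the already-quantized preceding layers.

```latex
% Assumed notation (illustrative only): W, \hat{W}, X, \hat{X} as described above.
\begin{align}
  % Independent per-layer calibration: target and quantized layer see the same input.
  \min_{\hat{W}} \; & \lVert W\hat{X} - \hat{W}\hat{X} \rVert_F^2 \\
  % Asymmetric calibration: the target is the exact full-precision output.
  \min_{\hat{W}} \; & \lVert WX - \hat{W}\hat{X} \rVert_F^2 \\
  % The asymmetric residual splits into two parts.
  WX - \hat{W}\hat{X} \; &= \; (W - \hat{W})\hat{X} \; + \; W(X - \hat{X})
\end{align}
```

Under these assumptions, the first term of the decomposition is the quantization error of the current layer, and the second is the input mismatch inherited from previous layers, which is one way to read the "accumulated asymmetry error" named above.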