Abstract
Quantization and pruning form the foundation of compression for neural networks, enabling efficient inference for large language models (LLMs). Recently, various quantization and pruning techniques have demonstrated remarkable performance in a post-training setting. They rely upon calibration data, a small set of unlabeled examples that are used to generate layer activations. However, no prior work has systematically investigated how the calibration data impacts the effectiveness of model compression methods. In this paper, we present the first extensive empirical study on the effect of calibration data upon LLM performance. We trial a variety of quantization and pruning methods, datasets, tasks, and models. Surprisingly, we find substantial variations in downstream task performance, contrasting existing work that suggests a greater level of robustness to the calibration data. Finally, we make a series of recommendations for the effective use of calibration data in LLM quantization and pruning.