Efficient Post-training Quantization with FP8 Formats

Abstract

Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor: https://github.com/intel/neural-compressor.
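The trade-off between dynamic range and precision across the three formats can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper or from Intel Neural Compressor: the `MiniFloat` class, its method names, and the `reserve_top_exponent` flag are hypothetical, and the bias is assumed to follow the IEEE-style rule 2^(e-1) - 1. Because FP8 encodings differ in how they treat the highest exponent code, the maximum value is computed under two labeled assumptions rather than asserted for any particular specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniFloat:
    """Hypothetical helper describing a 1-sign / e-exponent / m-mantissa format."""
    name: str
    exp_bits: int
    man_bits: int

    @property
    def bias(self) -> int:
        # Assumption: IEEE-style exponent bias.
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def epsilon(self) -> float:
        # Gap between 1.0 and the next representable value (precision).
        return 2.0 ** -self.man_bits

    @property
    def min_subnormal(self) -> float:
        # Smallest positive value, assuming subnormals are supported.
        return 2.0 ** (1 - self.bias - self.man_bits)

    def max_normal(self, reserve_top_exponent: bool = True) -> float:
        if reserve_top_exponent:
            # Assumption A: the all-ones exponent is reserved for inf/NaN.
            top_exp = 2 ** self.exp_bits - 2 - self.bias
            top_frac = 2.0 - 2.0 ** -self.man_bits
        else:
            # Assumption B: only the all-ones exponent AND all-ones mantissa
            # pattern is reserved, so the top exponent still holds numbers.
            top_exp = 2 ** self.exp_bits - 1 - self.bias
            top_frac = 2.0 - 2.0 ** (1 - self.man_bits)
        return top_frac * 2.0 ** top_exp


FORMATS = [
    MiniFloat("E5M2", exp_bits=5, man_bits=2),
    MiniFloat("E4M3", exp_bits=4, man_bits=3),
    MiniFloat("E3M4", exp_bits=3, man_bits=4),
]

if __name__ == "__main__":
    for fmt in FORMATS:
        print(
            f"{fmt.name}: eps={fmt.epsilon}, "
            f"min_subnormal={fmt.min_subnormal}, "
            f"max(A)={fmt.max_normal(True)}, "
            f"max(B)={fmt.max_normal(False)}"
        )
```

Each extra mantissa bit halves the spacing between neighboring values, while each extra exponent bit roughly squares the representable range. That is the trade-off the three formats explore.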
