A White Paper on Neural Network Quantization

Abstract

While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. In this white paper, we introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance while maintaining low-bit weights and activations. We start with a hardware-motivated introduction to quantization and then consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ requires no re-training or labeled data and is thus a lightweight push-button approach to quantization. In most cases, PTQ is sufficient for achieving 8-bit quantization with close to floating-point accuracy. QAT requires fine-tuning and access to labeled training data but enables lower-bit quantization with competitive results. For both solutions, we provide tested pipelines based on existing literature and extensive experimentation that lead to state-of-the-art performance for common deep learning models and tasks.
