Quantizing deep convolutional networks for efficient inference: A whitepaper

Abstract

We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations. Per-channel quantization of weights and per-layer quantization of activations to 8 bits of precision post-training produces classification accuracies within 2% of floating point networks for a wide variety of CNN architectures. Model sizes can be reduced by a factor of 4 by quantizing weights to 8 bits, even when 8-bit arithmetic is not supported. This can be achieved with simple, post-training quantization of weights. We benchmark latencies of quantized networks on CPUs and DSPs and observe a speedup of 2x-3x for quantized implementations compared to floating point on CPUs. Speedups of up to 10x are observed on specialized processors with fixed point SIMD capabilities, like the Qualcomm QDSPs with HVX. Quantization-aware training can provide further improvements, reducing the gap to floating point to 1% at 8-bit precision. Quantization-aware training also allows for reducing the precision of weights to four bits with accuracy losses ranging from 2% to 10%, with higher accuracy drop for smaller networks. We introduce tools in TensorFlow and TensorFlowLite for quantizing convolutional networks and review best practices for quantization-aware training to obtain high accuracy with quantized weights and activations. We recommend that per-channel quantization of weights and per-layer quantization of activations be the preferred quantization scheme for hardware acceleration and kernel optimization. We also propose that future processors and hardware accelerators for optimized inference support precisions of 4, 8 and 16 bits.
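
To make the distinction between per-layer and per-channel weight quantization concrete, the following is a minimal sketch in plain Python. It assumes a symmetric signed 8-bit quantizer with round-to-nearest and a scale chosen from the largest weight magnitude; the abstract does not specify these details, so they are illustrative choices only. All function names and the toy weight values are hypothetical and are not part of the TensorFlow or TensorFlowLite tooling.

```python
from typing import List, Tuple

Weights = List[List[float]]  # one inner list per output channel (assumed layout)
QMAX = 127  # assumed symmetric signed 8-bit code range [-127, 127]


def scale_for(values: List[float]) -> float:
    """Pick a scale so that the largest magnitude maps to QMAX."""
    max_abs = max(abs(v) for v in values)
    return max_abs / QMAX if max_abs > 0 else 1.0


def quantize_symmetric(values: List[float], scale: float) -> List[int]:
    """Round each value to the nearest integer code and clamp to the 8-bit range."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


def quantize_per_layer(weights: Weights) -> Tuple[List[List[int]], List[float]]:
    """Use a single scale shared by every output channel of the layer."""
    scale = scale_for([v for channel in weights for v in channel])
    codes = [quantize_symmetric(channel, scale) for channel in weights]
    return codes, [scale] * len(weights)


def quantize_per_channel(weights: Weights) -> Tuple[List[List[int]], List[float]]:
    """Use a separate scale for each output channel."""
    scales = [scale_for(channel) for channel in weights]
    codes = [quantize_symmetric(ch, s) for ch, s in zip(weights, scales)]
    return codes, scales


def channel_errors(weights: Weights, codes: List[List[int]],
                   scales: List[float]) -> List[float]:
    """Largest absolute reconstruction error within each channel."""
    return [
        max(abs(w - q * s) for w, q in zip(channel, channel_codes))
        for channel, channel_codes, s in zip(weights, codes, scales)
    ]


# Toy layer: two output channels whose weight ranges differ by orders of magnitude.
weights = [[0.5, -1.0, 0.25], [0.004, -0.002, 0.001]]

pl_codes, pl_scales = quantize_per_layer(weights)
pc_codes, pc_scales = quantize_per_channel(weights)

print("per-layer errors:  ", channel_errors(weights, pl_codes, pl_scales))
print("per-channel errors:", channel_errors(weights, pc_codes, pc_scales))
```

In this toy case the shared per-layer scale is set by the wide-range channel, so the narrow-range channel collapses onto only a few integer codes, while a per-channel scale lets it use the full 8-bit range. Storing each code in one byte instead of a 4-byte 32-bit float corresponds to the factor-of-4 size reduction stated above, ignoring the small overhead of the scales themselves.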
