Pruning Ternary Quantization

Abstract

Inference time, model size, and accuracy are three key factors in deep model compression. Most existing work addresses these factors separately, as it is difficult to optimize all of them at the same time. For example, low-bit quantization aims at obtaining a faster model; weight-sharing quantization aims at improving compression ratio and accuracy; and mixed-precision quantization aims at balancing accuracy and inference time. To simultaneously optimize bit-width, model size, and accuracy, we propose pruning ternary quantization (PTQ): a simple, effective, symmetric ternary quantization method. We integrate L2 normalization, pruning, and the weight decay term to reduce the weight discrepancy in the gradient estimator during quantization, thus producing highly compressed ternary weights. Our method achieves the highest test accuracy and the highest compression ratio. For example, it produces a 939 KB (49$\times$) 2-bit ternary ResNet-18 model with only a 4\% accuracy drop on the ImageNet dataset. It compresses a 170 MB Mask R-CNN to 5 MB (34$\times$) with only a 2.8\% average precision drop. Our method is verified on image classification and object detection/segmentation tasks with different network structures such as ResNet-18, ResNet-50, and MobileNetV2.
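To make the idea of symmetric ternary weights concrete, the following is a minimal illustrative sketch, not the authors' implementation. It L2-normalizes a weight vector, maps each weight to -1, 0, or +1 with a symmetric threshold (weights below the threshold in magnitude are pruned to 0), and recomputes the Mask R-CNN compression ratio quoted above. The function names, the toy weights, and the threshold value of 0.2 are assumptions chosen for illustration; the gradient estimator and the weight decay term mentioned in the abstract are not modeled here.

```python
import math


def l2_normalize(weights):
    """Scale a weight vector to unit L2 norm (all-zero input is returned unchanged)."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        return list(weights)
    return [w / norm for w in weights]


def ternarize(weights, threshold):
    """Map each weight to -1, 0, or +1 using a symmetric threshold.

    Weights with |w| <= threshold are pruned to 0. The threshold is a
    hypothetical hyperparameter, not a value taken from the paper.
    """
    codes = []
    for w in weights:
        if w > threshold:
            codes.append(1)
        elif w < -threshold:
            codes.append(-1)
        else:
            codes.append(0)
    return codes


def compression_ratio(original_size, compressed_size):
    """Ratio of original to compressed model size (same units for both)."""
    return original_size / compressed_size


# Toy example with made-up weights.
weights = [0.9, -0.05, 0.4, -0.7, 0.02, 0.0]
unit = l2_normalize(weights)
codes = ternarize(unit, threshold=0.2)
sparsity = codes.count(0) / len(codes)

print(codes)                      # ternary codes for the toy weights
print(sparsity)                   # fraction of weights pruned to zero
print(compression_ratio(170, 5))  # Mask R-CNN sizes from the abstract, in MB
```

In this sketch the zero codes are what pruning contributes: the more weights fall inside the threshold, the sparser the ternary tensor and the smaller the stored model.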
