Non-structured DNN Weight Pruning Considered Harmful

Abstract

Large deep neural network (DNN) models pose a key challenge to energy efficiency because off-chip DRAM accesses consume significantly more energy than arithmetic or SRAM operations. This motivates intensive research on model compression, which follows two main approaches. Weight pruning leverages the redundancy in the number of weights and can be performed in a non-structured manner, which offers higher flexibility and a higher pruning rate but incurs index accesses due to the irregular weight layout, or in a structured manner, which preserves the full matrix structure at a lower pruning rate. Weight quantization leverages the redundancy in the number of bits per weight. Compared to pruning, quantization is much more hardware-friendly and has become a "must-do" step for FPGA and ASIC implementations. This paper provides, for the first time, a definitive answer to the question of whether non-structured pruning is worthwhile. First, we build ADMM-NN-S by extending and enhancing ADMM-NN, a recently proposed joint weight pruning and quantization framework. Second, we develop a methodology for a fair and fundamental comparison of non-structured and structured pruning in terms of both storage and computation efficiency. Our results show that ADMM-NN-S consistently outperforms the prior art: (i) it achieves 348x, 36x, and 8x overall weight pruning on LeNet-5, AlexNet, and ResNet-50, respectively, with (almost) zero accuracy loss; (ii) we demonstrate, for the first time, that fully binarized DNNs (binarized in all layers) can be lossless in accuracy in many cases. These results provide a strong baseline and lend credibility to our study. Based on the proposed comparison framework, with the same accuracy and quantization, the results show that non-structured pruning is not competitive in terms of either storage or computation efficiency. Thus, we conclude that non-structured pruning is considered harmful. We urge the community not to continue pursuing DNN inference acceleration for non-structured sparsity.
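
The storage trade-off described in the abstract can be illustrated with a toy sketch. This is not the paper's comparison methodology: the 4x4 matrix, the threshold, the choice of rows to keep, and the 4-bit weight and index widths are all hypothetical values chosen for illustration, and the function names are our own. It shows only the general idea that non-structured pruning must store an index alongside each surviving weight, whereas structured pruning leaves a smaller dense matrix that needs no indices.

```python
def prune_nonstructured(matrix, threshold):
    """Keep every weight with |w| >= threshold, wherever it sits.

    The survivors are irregular, so each one is stored with its position.
    """
    return [(r, c, w)
            for r, row in enumerate(matrix)
            for c, w in enumerate(row)
            if abs(w) >= threshold]


def prune_structured(matrix, keep_rows):
    """Drop whole rows; the result is a smaller dense matrix (no indices)."""
    return [list(matrix[r]) for r in keep_rows]


def storage_bits_nonstructured(entries, weight_bits, index_bits):
    # Assumption: index_bits is the total index cost per surviving weight.
    return len(entries) * (weight_bits + index_bits)


def storage_bits_structured(dense, weight_bits):
    return sum(len(row) for row in dense) * weight_bits


# Hypothetical 4x4 weight matrix (16 weights).
W = [
    [0.90, 0.00, 0.10, -0.80],
    [0.05, 0.70, 0.00, 0.02],
    [0.00, 0.03, -0.60, 0.00],
    [0.01, 0.00, 0.04, 0.50],
]

sparse = prune_nonstructured(W, threshold=0.5)   # 5 weights survive
dense = prune_structured(W, keep_rows=[0, 1])    # 8 weights survive

ns_bits = storage_bits_nonstructured(sparse, weight_bits=4, index_bits=4)
s_bits = storage_bits_structured(dense, weight_bits=4)
print(len(sparse), ns_bits)  # 5 weights, 40 bits
print(sum(len(r) for r in dense), s_bits)  # 8 weights, 32 bits
```

In this toy case the non-structured result keeps fewer weights (5 versus 8) yet needs more bits (40 versus 32), because once weights are quantized to a few bits the per-weight index is as large as the weight itself. Whether this holds for real networks depends on the actual pruning rates and index encoding, which is what the paper's comparison framework evaluates.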
