Non-structured DNN Weight Pruning Considered Harmful

  • 2019-07-03 20:27:51
  • Yanzhi Wang, Shaokai Ye, Zhezhi He, Xiaolong Ma, Linfeng Zhang, Sheng Lin, Geng Yuan, Sia Huat Tan, Zhengang Li, Deliang Fan, Xuehai Qian, Xue Lin, Kaisheng Ma


Yanzhi Wang1, Shaokai Ye2, Zhezhi He3, Xiaolong Ma1, Linfeng Zhang2, Sheng Lin1, Geng Yuan1, Sia Huat Tan2,
Zhengang Li1, Deliang Fan3, Xuehai Qian4, Xue Lin1, Kaisheng Ma2
1Dept. of Electrical & Computer Engineering, Northeastern University, Boston, MA, USA
2Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
3Dept. of Electrical & Computer Engineering, University of Central Florida, Orlando, FL, USA
4Dept. of Electrical & Computer Engineering, University of Southern California, Los Angeles, CA, USA
1 [email protected], {ma.xiaol, lin.sheng, yuan.geng, li.zhen}@husky.neu.edu, [email protected]
2 [email protected], [email protected], [email protected], [email protected]
3 [email protected], [email protected], 4 [email protected]
Abstract

Large deep neural network (DNN) models pose a key challenge to energy efficiency because off-chip DRAM accesses consume significantly more energy than arithmetic or SRAM operations. This motivates intensive research on model compression, with two main approaches. Weight pruning leverages the redundancy in the number of weights and can be performed in a non-structured manner, which has higher flexibility and pruning rate but incurs index accesses due to irregular weights, or in a structured manner, which preserves the full matrix structure with a lower pruning rate. Weight quantization leverages the redundancy in the number of bits in weights. Compared to pruning, quantization is much more hardware-friendly, and has become a “must-do” step for FPGA and ASIC implementations. Thus, any evaluation of the effectiveness of pruning should be on top of quantization. The key open question is, with quantization, what kind of pruning (non-structured vs. structured) is most beneficial? This question is fundamental because the answer will determine the design aspects that we should really focus on to avoid diminishing returns from certain optimizations.

This paper provides a definitive answer to the question for the first time. First, we build ADMM-NN-S by extending and enhancing ADMM-NN, a recently proposed joint weight pruning and quantization framework, with algorithmic support for structured pruning, dynamic ADMM regulation, and masked mapping and retraining. Second, we develop a methodology for fair and fundamental comparison of non-structured and structured pruning in terms of both storage and computation efficiency. Our results show that ADMM-NN-S consistently outperforms the prior art: (i) it achieves 348×, 36×, and 8× overall weight pruning on LeNet-5, AlexNet, and ResNet-50, respectively, with (almost) zero accuracy loss; (ii) we demonstrate the first fully binarized (for all layers) DNNs that are lossless in accuracy in many cases. These results provide a strong baseline and lend credibility to our study. Based on the proposed comparison framework, with the same accuracy and quantization, the results show that non-structured pruning is not competitive in terms of both storage and computation efficiency. Thus, we conclude that non-structured pruning is considered harmful. We urge the community not to continue pursuing DNN inference acceleration for non-structured sparsity.

Deep neural networks (DNNs) with very large model sizes are the key enabler for the recent success of deep learning. However, large models incur excessive DRAM accesses, which consume significantly more energy than arithmetic or SRAM operations. Thus, model compression of DNNs has become an active and intensively studied research topic. These techniques, which are applied during the training phase of the DNNs, exploit the redundancy in weights. The aim is to simultaneously reduce the model size (and thus the storage requirement) and accelerate the computation for inference, all with minor loss in classification accuracy. These techniques are of particular interest for hardware acceleration of DNN inference engines [1–70], since it is more challenging to achieve high processing throughput for compressed models. Two important model compression techniques are weight pruning and weight quantization.

Weight pruning leverages the redundancy in the number of weights. The pioneering work [71] used heuristic and iterative weight pruning to achieve considerable weight parameter reduction with negligible accuracy loss. It has been extended in [72, 73, 74, 75] with more sophisticated heuristics. On the downside, such non-structured methods lead to irregular, sparse weight matrices (as shown in Figure 1 (a), arbitrary weights can be pruned), which rely on indices to be stored in a compressed format. As a result, they are less compatible with the data-parallel execution model of GPUs and multicore CPUs. This drawback is confirmed by the throughput degradation reported in recent works [76, 77]. To overcome the limitation of non-structured pruning, recent works [76, 78] proposed incorporating regularity or “structures” in weight pruning, such as filter pruning, channel pruning, and filter shape pruning, shown in Figure 1 (b). The structured approaches maintain a full matrix with reduced dimensions, so indices are no longer needed. As a result, they lead to much higher speedups on GPUs.

Figure 1: (a) Non-structured weight pruning (arbitrary weights can be pruned) and (b) three types of structured weight pruning.

Weight quantization is an orthogonal compression technique that leverages the redundancy in the number of bits of weight representation [79, 80, 81, 82, 83, 84, 85, 86]. Compared to weight pruning, weight quantization is inherently more hardware-friendly, since both storage and computation of DNNs will be reduced proportionally to the weight precision without additional overhead due to indices. Moreover, multiplication operations may be eliminated with binary, ternary, or power-of-2 weight quantizations [84, 85, 86]. Thanks to these advantages, weight quantization has been a “must-do” step for DNN inference engines. Besides FPGA and ASIC, it is also well supported in GPU, CPU, and mobile devices, e.g., [87, 88].

Given the pros and cons of non-structured/structured weight pruning and weight quantization, they need to be investigated jointly to fully understand the interactions between them. In particular, since weight quantization is a must-do step, especially for FPGA and ASIC, weight pruning will not be performed alone. The key open question is, with quantization, what kind of pruning (non-structured vs. structured) is most beneficial? The answer to the question is far from obvious. Using LeNet-5 (for the MNIST data set) as an example, we achieve an unprecedented 348× (non-structured) weight reduction with 3-bit quantization, maintaining 99%+ accuracy. However, each index needs to be at least 9 bits on account of the 348× weight pruning. This makes index storage larger than weight storage (in addition, indices cannot be further quantized). In this example, non-structured weight pruning results in larger actual storage than structured pruning. Thus, we can see the importance of answering this question: it will determine the design aspects that we should really focus on to avoid diminishing returns from certain optimizations. As shown in Figure 2, we need clear answers for all platforms.
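The LeNet-5 storage argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that the 9-bit figure corresponds to ceil(log2(348)) and that one index is stored per surviving weight; the helper name `min_index_bits` is hypothetical and not from the paper.

```python
import math

def min_index_bits(pruning_rate: float) -> int:
    """Bits needed to encode a gap as large as the average spacing between kept weights.

    Assumption: the index width is ceil(log2(pruning_rate)).
    """
    return math.ceil(math.log2(pruning_rate))

weight_bits = 3                     # 3-bit quantization, from the text
index_bits = min_index_bits(348)    # 348x pruning rate, from the text

# Assumption: exactly one index is stored per surviving weight.
bits_per_kept_weight = weight_bits + index_bits
index_share = index_bits / bits_per_kept_weight

print(index_bits, bits_per_kept_weight, index_share)
```

Under these assumptions the indices account for three quarters of the stored bits, which is the effect the paragraph describes.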

Two recent works, ADMM-NN [89] and [79], which perform systematic joint weight pruning and quantization, are in the best position to perform this study. Using the advanced variable-splitting optimization method ADMM (Alternating Direction Method of Multipliers) [90, 91, 92], they achieve state-of-the-art results (e.g., 21× weight reduction [93] in AlexNet), outperforming heuristic counterparts. Unfortunately, the current framework is insufficient to perform such a study. First, ADMM-NN lacks the algorithmic mechanisms to enforce structured weight pruning and to guarantee solution feasibility. Second, we lack a methodology to fairly and fundamentally compare non-structured and structured pruning in an “apples-to-apples” manner. This paper is the first study to provide the answer to the open question, with two key contributions.

Figure 2: Is non-structured pruning beneficial at all?

The first contribution of the paper is the development of ADMM-NN-S by extending and enhancing ADMM-NN [89]. It is extended with algorithmic support for structured pruning. We achieve this by adjusting the constraints in each layer to express the structured requirements. For example, for filter pruning, the constraint for a layer can be specified as: the number of nonzero filters is less than or equal to a threshold. Moreover, we develop a systematic framework of dynamic ADMM regulation, masked mapping and retraining to guarantee solution feasibility (satisfying all constraints) and provide high solution quality (ensuring pruning and quantization rate under the same accuracy).

The second contribution is the methodology for the fair and fundamental comparison of non-structured and structured weight pruning with quantization in place. We focus on two metrics at the same accuracy: 1) total storage (weights + indices), which is computed based on both absolute and relative indices; 2) computation efficiency, which is captured by a new metric called the pruning-to-performance ratio (PPR). Suppose that, after pruning, α× weight reduction results in β× speedup; the PPR value is then defined as α/β. Intuitively, the smaller the PPR value, the higher the computation efficiency, since the same speedup can be achieved with a smaller pruning rate. For structured pruning, the PPR value is approximately 1 due to the absence of indices. For non-structured pruning, recent accelerators based on non-structured sparsity [94, 95, 96, 97] show that PPR values are larger than 2.7. We can fairly compare non-structured and structured pruning by conservatively comparing PPR: non-structured pruning is more beneficial if it can achieve a 2.7× or higher pruning rate than structured pruning. No prior work has conducted such a study, and the answer to the above comparison is unknown.
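To make the break-even condition explicit, the comparison can be written as a short derivation. The subscripts ns and s (non-structured, structured) are notation introduced here for illustration only; the PPR values 2.7 and 1 are the ones quoted above.

```latex
\[
\mathrm{PPR} = \frac{\alpha}{\beta}
\quad\Longrightarrow\quad
\beta_{\mathrm{ns}} = \frac{\alpha_{\mathrm{ns}}}{\mathrm{PPR}_{\mathrm{ns}}} \le \frac{\alpha_{\mathrm{ns}}}{2.7},
\qquad
\beta_{\mathrm{s}} \approx \alpha_{\mathrm{s}} .
\]
\[
\beta_{\mathrm{ns}} > \beta_{\mathrm{s}}
\quad\text{requires}\quad
\frac{\alpha_{\mathrm{ns}}}{\alpha_{\mathrm{s}}} > 2.7 .
\]
```

In words, a non-structured design can only be faster if its pruning rate exceeds the structured one by more than the PPR gap.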

The fairness of the proposed methodology is ensured for three reasons: 1) it is performed with our new ADMM-NN-S framework, which significantly outperforms prior art (in both non-structured and structured pruning); 2) the comparison of storage and computation is hardware implementation-agnostic; 3) the comparison is performed at the same accuracy. We also strengthen weight quantization after non-structured pruning by selectively leveraging a state-of-the-art ternary quantization solution [98].

Based on the proposed ideas, we perform extensive and representative testing of our comparison framework with AlexNet, VGGNet, ResNet-18/50, MobileNet, and LeNet-5 models on the ImageNet, CIFAR-10, and MNIST data sets. Due to space limitations, we focus on the convolutional (CONV) layers, which are the most computationally intensive layers in DNNs and are also becoming the major storage component in state-of-the-art ResNet and MobileNet models. We do observe a similar (and more significant) effect on fully-connected (FC) layers and on RNNs. In the following, we highlight our results and findings.

First, ADMM-NN-S framework guarantees solution feasibility while providing high solution quality. Our results consistently and significantly outperform prior art. This is the key to ensure the credibility of our conclusion. Specifically, we 1) achieve unprecedented 348×, 36×, and 8× overall weight pruning on LeNet-5, AlexNet, and ResNet-50 models, respectively, with (almost) zero accuracy loss; 2) derive the first lossless, fully binarized (for all layers) LeNet-5 for MNIST and VGG-16 for CIFAR-10; and 3) derive the first fully binarized (for all layers) ResNet for ImageNet with reasonable accuracy loss.

Second, comparing non-structured and structured pruning, we find that the storage overhead of indices for non-structured pruning is always more than its additional weight storage reduction; thus, the total storage for non-structured pruning is actually larger. In terms of computation efficiency, we find that the PPR for structured pruning in all models is less than 2.7×. For the first time, our results show that, despite its greater flexibility and higher weight pruning rate, non-structured pruning is not competitive in terms of both storage and computation efficiency under quantization and the same accuracy. In a few cases, the storage size of non-structured pruning is comparable to (or slightly better than) that of structured pruning; however, it is still not a desirable choice considering the additional hardware design complexity needed to support non-structured sparsity. Moreover, we explain in detail in a later section that the conclusion is unlikely to change across hardware platforms (e.g., GPUs, multi-core CPUs, FPGA, or ASIC), application scenarios, and DNN types, and that it will still hold with potential improvements in pruning/quantization algorithms. Based on these findings, we conclude that non-structured weight pruning is considered harmful, and we recommend not continuing to investigate DNN inference engines that use non-structured sparsity. We release the code and all the models of this work at an anonymous link: http://bit.ly/2WMQSRi.

Figure 3: Compressed sparse row (CSR) format with (a) absolute indices and (b) relative indices.

Non-structured weight pruning. The early work by Han et al. [71] achieved 9× reduction in the number of parameters in AlexNet and 13× in VGG-16. However, most reduction is achieved in FC layers, and the 2.7× reduction achieved in CONV layers will not lead to an overall acceleration in GPUs [76]. Extensions of iterative weight pruning, such as [74] (dynamic network surgery), [72] (NeST) and [99], use more delicate algorithms such as selective weight growing and pruning. But the weight pruning rates on CONV layers are still limited, e.g., 3.1× in [74], 3.23× in [72], and 4.16× in [99] for AlexNet with no accuracy degradation. This level of non-structured weight pruning cannot guarantee sufficient speedups in GPUs. In fact, based on the enhanced ADMM-NN framework, we can achieve 11.2× non-structured weight pruning in CONV layers with almost no accuracy degradation. Ironically, it even results in 20% speed degradation on an NVIDIA 1080Ti GPU.

Structured weight pruning. To overcome the limitation of non-structured, irregular weight pruning, SSL [76] proposes to learn structured sparsity at the levels of filters, channels, filter shapes, layer depth, etc. This work is among the first to report actually measured GPU acceleration. This is because CONV layers after structured pruning are transformed into a full matrix multiplication with reduced matrix size. However, the weight pruning rate is limited in the prior work on structured pruning. The average weight pruning rate on the CONV layers of AlexNet is only 1.4× without accuracy loss. More recently, [78] achieved 2× channel pruning with 1% accuracy degradation on VGGNet. More importantly, structured weight pruning has never been evaluated together with weight quantization.

Weight quantization. This method takes advantage of the inherent redundancy in the number of bits used for weight representation. Many of the prior works [79, 80, 81, 82, 83, 84, 85, 86] focused on quantizing weights to binary values, ternary values, or powers of 2 to facilitate hardware implementation, with acceptable accuracy loss. The state-of-the-art techniques [86, 79] adopt an iterative quantization and retraining framework, with some degree of randomness incorporated into the quantization step. This method results in less than 3% accuracy loss on AlexNet for binary weight quantization [79].

Compared to weight pruning, weight quantization is the major DNN model compression technique utilized in industry, due to its “hardware-friendliness” and the proportional reduction of computation and storage. Thus, weight quantization has been a must-do step in FPGA and ASIC designs of DNN inference engines. Also, it is well supported in GPUs and mobile devices, e.g., PyTorch [88] in NVIDIA GPU and TensorFlow Lite [87] for mobile devices.

Recent works [89, 79] have incorporated ADMM for DNN weight pruning and weight quantization, respectively. ADMM is a powerful optimization tool that decomposes an original problem into two subproblems that can be solved separately and efficiently. For example, consider the optimization problem $\min_{\mathbf{x}} f(\mathbf{x}) + g(\mathbf{x})$. In ADMM, this problem is decomposed into two subproblems on $\mathbf{x}$ and $\mathbf{z}$ (an auxiliary variable), which are solved iteratively until convergence. The first subproblem derives $\mathbf{x}$ given $\mathbf{z}$: $\min_{\mathbf{x}} f(\mathbf{x}) + q_1(\mathbf{x}|\mathbf{z})$. The second subproblem derives $\mathbf{z}$ given $\mathbf{x}$: $\min_{\mathbf{z}} g(\mathbf{z}) + q_2(\mathbf{z}|\mathbf{x})$. Both $q_1$ and $q_2$ are quadratic functions.
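The alternation between the two subproblems can be illustrated on a toy problem. The sketch below is not the paper's DNN setting: it assumes a least-squares $f$ (so the first subproblem has a closed form) and takes $g$ to be the indicator of "at most $k$ nonzeros"; the function names, the penalty `rho`, and the iteration count are all illustrative choices.

```python
import numpy as np

def project_top_k(v, k):
    """Euclidean projection onto {z : at most k nonzeros} (assumes k >= 1)."""
    z = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]    # indices of the k largest magnitudes
    z[keep] = v[keep]
    return z

def admm_sparse_least_squares(A, b, k, rho=1.0, iters=100):
    """Scaled-form ADMM for: min 0.5*||Ax - b||^2  s.t.  x has at most k nonzeros."""
    n = A.shape[1]
    z = np.zeros(n)                      # auxiliary variable
    u = np.zeros(n)                      # scaled dual variable
    lhs = A.T @ A + rho * np.eye(n)      # fixed matrix of the x-subproblem
    Atb = A.T @ b
    for _ in range(iters):
        # x-subproblem: f(x) + (rho/2)*||x - z + u||^2, solved in closed form
        x = np.linalg.solve(lhs, Atb + rho * (z - u))
        # z-subproblem: g(z) + (rho/2)*||x - z + u||^2, i.e. a projection
        z = project_top_k(x + u, k)
        # dual update
        u = u + x - z
    return z

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
x_true = np.zeros(10)
x_true[[2, 7]] = [3.0, -2.0]
b = A @ x_true
z = admm_sparse_least_squares(A, b, k=2)
print(np.count_nonzero(z))
```

The returned `z` always satisfies the sparsity constraint because it is the output of the projection; whether it also matches `x_true` depends on the problem instance, since the constraint is non-convex.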

ADMM is conventionally utilized to accelerate the convergence of convex optimization problems and to enable distributed optimization, for which optimality and a fast convergence rate have been proven [90, 92]. As a special property, ADMM can effectively deal with a subset of combinatorial constraints and yield optimal (or at least high-quality) solutions [100, 101]. Fortunately, the constraints associated with DNN weight pruning and quantization belong to this subset of combinatorial constraints, making ADMM applicable to DNN model compression. However, due to the non-convex nature of the objective function for DNN training, the prior work [89, 79] still lacks guarantees on solution feasibility and solution quality. Moreover, [89] only supports non-structured pruning.

Indices are used to represent weight matrices in the sparse format, thereby achieving storage reduction in non-structured weight pruning. A representative sparse representation format is the compressed sparse row (CSR) format, which was also utilized in prior work [71, 6]. As shown in Figure 3 (a), it represents a matrix by three arrays, which respectively contain the nonzero (weight) values, the column indices, and the extents of rows. This representation requires $2n + r + 1$ numbers, where $n$ is the number of nonzero values and $r$ is the number of rows.
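A minimal sketch of CSR with absolute indices follows; it builds the three arrays by hand for a small example matrix and confirms the $2n + r + 1$ count. The helper `to_csr` and the example matrix are illustrative and not taken from the paper.

```python
import numpy as np

def to_csr(dense):
    """Return (values, col_idx, row_ptr) for a dense 2-D array."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        cols = np.flatnonzero(row)           # absolute column positions of nonzeros
        values.extend(row[cols].tolist())
        col_idx.extend(cols.tolist())
        row_ptr.append(len(values))          # extent of each row
    return values, col_idx, row_ptr

W = np.array([[0, 2, 0, 0],
              [1, 0, 0, 3],
              [0, 0, 0, 0]])
values, col_idx, row_ptr = to_csr(W)

n, r = len(values), W.shape[0]
total_numbers = len(values) + len(col_idx) + len(row_ptr)
print(values, col_idx, row_ptr, total_numbers)
```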

We call the above representation CSR with absolute indices. Instead of storing the absolute position, we can compute the index difference and store the indices as relative positions. This representation requires $2n$ numbers, where $n$ is the number of nonzero (weight) values. For further compression, one can restrict the number of bits (3 bits in this example) used to represent the relative position and add a dummy zero weight when the relative position exceeds the largest value (8 in this example) that can be represented; both are shown in Figure 3 (b). These cases are called CSR with relative indices.
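The relative-index scheme with dummy zeros can be sketched as follows. This is an illustration under stated assumptions: the weights are treated as one flattened list, gaps are counted from the previously stored entry, and a gap between 1 and `max_gap` is taken to fit in the index field (8 for the 3-bit example in the text). The function names are hypothetical.

```python
def encode_relative(flat, max_gap=8):
    """Encode a flat weight list as (value, gap) pairs with gaps bounded by max_gap."""
    pairs = []
    prev = -1                        # position of the last stored entry
    for pos, w in enumerate(flat):
        if w == 0:
            continue
        gap = pos - prev
        while gap > max_gap:         # gap does not fit: insert a dummy zero weight
            pairs.append((0, max_gap))
            prev += max_gap
            gap -= max_gap
        pairs.append((w, gap))
        prev = pos
    return pairs

def decode_relative(pairs, length):
    """Rebuild the flat weight list from (value, gap) pairs."""
    flat = [0] * length
    pos = -1
    for w, gap in pairs:
        pos += gap
        flat[pos] = w                # dummy entries simply write a zero
    return flat

flat = [0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0]
pairs = encode_relative(flat)
print(pairs)
```

Each dummy entry costs one extra (value, index) pair, so the stored count slightly exceeds $2n$ whenever gaps overflow the index field.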

Comparing the two options, CSR with relative indices is good for compression [71], while CSR with absolute indices leads to better hardware acceleration [94, 96, 97]. In this work, we aim to allow the highest freedom for non-structured pruning in storage and computation evaluations: we allow CSR with relative indices in the storage calculation and CSR with absolute indices in the computation estimation for non-structured pruning.

Wen et al. [76] introduced three types of structured pruning: filter pruning, channel pruning, and filter shape pruning, as shown in Figure 1 (b). Filter pruning removes whole filter(s); channel pruning removes whole channels; and filter shape pruning removes the weights in the same locations of all filters in one specific layer. Moreover, as shown in Figure 4, filter pruning and channel pruning are correlated. Pruning a filter in layer $i$ is equivalent to pruning the corresponding channel in layer $i+1$, which is generated by this specific filter. As a result, filter pruning (and channel pruning) has a roughly quadratic effect on the weight parameter reduction (and the amount of computation) of the DNN.

Figure 4: Relation between filter pruning and channel pruning. Pruned filters in layer $i$ result in pruned feature maps and therefore pruned (inactivated) channels in layer $i+1$.

The CONV operations in (one layer of) DNNs are commonly transformed into matrix multiplications, known as general matrix multiplication or GEMM, by converting weight tensors and feature map tensors to matrices [52], as shown in Figure 5. From Figure 5 (b), filter pruning corresponds to removing one row, and thus is also termed row pruning. Filter shape pruning corresponds to removing one column, and thus is also termed column pruning. Channel pruning corresponds to removing multiple consecutive columns. The three structured pruning techniques, along with their combinations, reduce the dimensions in GEMM while maintaining a full matrix format. Thus, indices are not needed. This is why structured pruning techniques are in general more suitable for hardware acceleration.
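The row/column correspondence can be seen on a small tensor. The sketch below assumes a row-major reshape of a (filters, channels, height, width) tensor into the GEMM weight matrix; the sizes and the pruned indices are arbitrary illustrative values.

```python
import numpy as np

A, B, C, D = 4, 3, 2, 2                      # filters, channels, height, width
W = np.arange(A * B * C * D, dtype=float).reshape(A, B, C, D)

# GEMM view: one row per filter, one column per (channel, row, col) position.
W_mat = W.reshape(A, B * C * D)

# Filter (row) pruning: drop filter 1.
row_pruned = np.delete(W_mat, 1, axis=0)

# Filter shape (column) pruning: drop position (b=0, c=1, d=0) in every filter.
col = np.ravel_multi_index((0, 1, 0), (B, C, D))
col_pruned = np.delete(W_mat, col, axis=1)

# Channel pruning: drop the C*D consecutive columns that belong to channel 2.
ch = 2
chan_pruned = np.delete(W_mat, np.arange(ch * C * D, (ch + 1) * C * D), axis=1)

print(W_mat.shape, row_pruned.shape, col_pruned.shape, chan_pruned.shape)
```

In every case the result is still a dense matrix of smaller size, so no index array is required.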

On one hand, the major advantage of filter/channel pruning is its superlinear effect on storage/computation reduction, i.e., α× filter pruning on all layers results in over α× reduction in the number of weight parameters. On the other hand, column pruning has a higher degree of flexibility. These techniques can largely be combined to achieve the highest reduction rates in computation and storage, and an effective heuristic for finding the desirable combination is needed.
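A short count shows where the superlinear effect comes from. The sketch assumes a plain chain of CONV layers in which the channel count of layer $i+1$ equals the filter count of layer $i$, and uniform α× filter pruning in every layer; both are simplifying assumptions, and the symbol $\#\mathbf{W}_{i+1}$ (number of weights in layer $i+1$) is notation introduced here.

```latex
\[
\#\mathbf{W}_{i+1} = A_{i+1} B_{i+1} C_{i+1} D_{i+1}
\;\longrightarrow\;
\frac{A_{i+1}}{\alpha}\cdot\frac{B_{i+1}}{\alpha}\cdot C_{i+1} D_{i+1}
= \frac{\#\mathbf{W}_{i+1}}{\alpha^{2}} .
\]
```

The first layer only sees an α× reduction because its input channels are fixed by the data, which is why the overall effect is roughly, rather than exactly, quadratic.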

In this section, we build ADMM-NN-S, a unified solution framework of both non-structured and structured weight pruning, as well as weight quantization problems by extending ADMM-NN, the state-of-the-art ADMM-based framework [89]. The differences between ADMM-NN-S and ADMM-NN are: 1) it supports structured pruning; 2) it can guarantee solution feasibility and provide high solution quality; and 3) we propose effective techniques for enhancing convergence.

This section discusses the extension of ADMM-NN with structured pruning constraints. Consider an $N$-layer DNN with both CONV and FC layers. The weights and biases of the $i$-th layer are respectively denoted by $\mathbf{W}_i$ and $\mathbf{b}_i$, and the loss function associated with the DNN is denoted by $f(\{\mathbf{W}_i\}_{i=1}^{N}, \{\mathbf{b}_i\}_{i=1}^{N})$; see [93]. In our discussion, $\{\mathbf{W}_i\}_{i=1}^{N}$ and $\{\mathbf{b}_i\}_{i=1}^{N}$ respectively characterize the collection of weights and biases from layer 1 to layer $N$. Then DNN weight pruning or weight quantization is formulated as the following optimization problem:

$$\underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}} \quad f(\{\mathbf{W}_i\}_{i=1}^{N}, \{\mathbf{b}_i\}_{i=1}^{N}), \qquad \text{subject to} \quad \mathbf{W}_i \in \mathcal{S}_i, \; i = 1, \ldots, N. \tag{1}$$

Next we introduce the constraint sets $\mathcal{S}_i$ corresponding to non-structured weight pruning, different types of structured pruning, and weight quantization. We use CONV layers as an illustrative example since CONV layers are the most computationally intensive. The problem formulation applies equally well to FC layers [93].

Figure 5: (a) To support GEMM computation, the weight tensor representation of a CONV layer is transformed into the weight matrix representation. (b) How different structured weight pruning schemes are implemented on the weight matrix representation.

The collection of weights in the $i$-th CONV layer is a four-dimensional tensor, i.e., $\mathbf{W}_i \in \mathbb{R}^{A_i \times B_i \times C_i \times D_i}$, where $A_i$, $B_i$, $C_i$, and $D_i$ are respectively the number of filters, the number of channels in a filter, the height of the filter, and the width of the filter in layer $i$. In the following, if $\mathbf{X}$ denotes the weight tensor in a specific layer, let $(\mathbf{X})_{a,:,:,:}$ denote the $a$-th filter in $\mathbf{X}$, $(\mathbf{X})_{:,b,:,:}$ denote the $b$-th channel, and $(\mathbf{X})_{:,b,c,d}$ denote the collection of weights located at position $(:,b,c,d)$ in every filter of $\mathbf{X}$, as illustrated in Figure 5 (b).

Weight pruning: For non-structured weight pruning, the constraint on the weights in the $i$-th layer is $\mathbf{W}_i \in \mathcal{S}_i := \{\mathbf{X} \mid \text{the number of nonzero elements in } \mathbf{X} \text{ is less than or equal to } \alpha_i\}$. For filter pruning (row pruning), the constraint in the $i$-th CONV layer becomes $\mathbf{W}_i \in \mathcal{S}_i := \{\mathbf{X} \mid \text{the number of nonzero filters in } \mathbf{X} \text{ is less than or equal to } \beta_i\}$. For channel pruning, the constraint becomes $\mathbf{W}_i \in \mathcal{S}_i := \{\mathbf{X} \mid \text{the number of nonzero channels in } \mathbf{X} \text{ is less than or equal to } \gamma_i\}$. Finally, for filter-shape pruning (column pruning), the constraint in the $i$-th CONV layer is $\mathbf{W}_i \in \mathcal{S}_i := \{\mathbf{X} \mid \text{the number of nonzero vectors in } \{\mathbf{X}_{:,b,c,d}\}_{b,c,d=1}^{B_i,C_i,D_i} \text{ is less than or equal to } \theta_i\}$. The values $\alpha_i$, $\beta_i$, $\gamma_i$, and $\theta_i$ are hyperparameters determined a priori; the determination procedure is discussed in a later section.

Weight quantization: For weight quantization, the elements of $\mathbf{W}_i$ assume one of the values $q_{i,1}, q_{i,2}, \ldots, q_{i,M_i}$, where $M_i$ denotes the number of these fixed values. Here, the $q_{i,j}$ values are the quantization levels of the weights of layer $i$ in increasing order, and we focus on equal-distance quantization (the same distance between adjacent quantization levels) to facilitate hardware implementation.

In problem (1), the constraint is combinatorial. As a result, this problem cannot be solved directly by stochastic gradient descent methods as in original DNN training. However, the form of the combinatorial constraints on $\mathbf{W}_i$ is compatible with ADMM, which has recently been shown to be an effective method for dealing with such clustering-like constraints [100, 101].

Despite such compatibility, it is still challenging to directly apply ADMM due to the non-convexity of the objective function. To overcome this challenge, we propose dynamic ADMM regularization, masked mapping and retraining steps for both non-structured and structured pruning. By integrating these techniques, ADMM-NN-S can guarantee solution feasibility (satisfying all constraints) and provide high solution quality (pruning/quantization rate under the same accuracy). The procedure of ADMM-NN-S is shown in Figure 6.

Figure 6: Procedure of the ADMM-NN-S framework.

ADMM Regularization Step: The ADMM regularization decomposes the original problem (1) into two subproblems through (i) defining the indicator function

$$g_i(\mathbf{W}_i) = \begin{cases} 0 & \text{if } \mathbf{W}_i \in \mathcal{S}_i, \\ +\infty & \text{otherwise} \end{cases}$$

corresponding to every set $\mathcal{S}_i$; (ii) incorporating auxiliary variables $\mathbf{Z}_i$, $i = 1, \ldots, N$; and (iii) adopting the augmented Lagrangian [92]. (The details of ADMM are presented in [92, 93]; we omit them due to space limitations.) These decomposed subproblems are solved iteratively until convergence. The first subproblem is

$$\underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}} \quad f(\{\mathbf{W}_i\}_{i=1}^{N}, \{\mathbf{b}_i\}_{i=1}^{N}) + \sum_{i=1}^{N} \frac{\rho_i}{2} \left\| \mathbf{W}_i - \mathbf{Z}_i^{k} + \mathbf{U}_i^{k} \right\|_F^2, \tag{2}$$

where $\mathbf{U}_i^{k} := \mathbf{U}_i^{k-1} + \mathbf{W}_i^{k} - \mathbf{Z}_i^{k}$. The first term in the objective function of (2) is the differentiable loss function of the DNN, and the second term is a quadratic regularization term of the $\mathbf{W}_i$'s, which is differentiable and convex. As a result, (2) can be solved by stochastic gradient descent as in original DNN training. Please note that this first subproblem maintains the same form and solution for (non-structured and structured) weight pruning and quantization problems.

On the other hand, the second subproblem is given by

$$\underset{\{\mathbf{Z}_i\}}{\text{minimize}} \quad \sum_{i=1}^{N} g_i(\mathbf{Z}_i) + \sum_{i=1}^{N} \frac{\rho_i}{2} \left\| \mathbf{W}_i^{k+1} - \mathbf{Z}_i + \mathbf{U}_i^{k} \right\|_F^2. \tag{3}$$

Note that $g_i(\cdot)$ is the indicator function of $\mathcal{S}_i$; thus this subproblem can be solved analytically and optimally [92]. For $i = 1, \ldots, N$, the optimal solution is the Euclidean projection of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k}$ onto $\mathcal{S}_i$. For non-structured weight pruning, we can prove that the Euclidean projection results in keeping the $\alpha_i$ elements of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k}$ with the largest magnitudes and setting the remaining weights to zero. For filter pruning, we first calculate $O_a = \| (\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k})_{a,:,:,:} \|_F^2$ for $a = 1, \ldots, A_i$, where $\| \cdot \|_F$ denotes the Frobenius norm. We then keep the $\beta_i$ filters $(\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k})_{a,:,:,:}$ corresponding to the $\beta_i$ largest values in $\{O_a\}_{a=1}^{A_i}$ and set the rest to zero. For channel pruning, we first calculate $O_b = \| (\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k})_{:,b,:,:} \|_F^2$ for $b = 1, \ldots, B_i$. We then keep the $\gamma_i$ channels $(\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k})_{:,b,:,:}$ corresponding to the $\gamma_i$ largest values in $\{O_b\}_{b=1}^{B_i}$ and set the rest to zero. The optimal solution of the second subproblem for filter shape pruning is similar and is omitted due to space limitations. For weight quantization, we can prove that the Euclidean projection results in mapping every element of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k}$ to the quantization level closest to that element.
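The projections described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' released code: `V` stands for $\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k}$ with shape (filters, channels, height, width), and all function names are hypothetical.

```python
import numpy as np

def project_nonstructured(V, alpha):
    """Keep the alpha largest-magnitude entries of V and zero the rest."""
    flat = V.ravel()
    Z = np.zeros(flat.size, dtype=V.dtype)
    keep = np.argsort(np.abs(flat))[-alpha:]
    Z[keep] = flat[keep]
    return Z.reshape(V.shape)

def project_filters(V, beta):
    """Keep the beta filters with the largest squared Frobenius norm O_a."""
    scores = np.sum(V ** 2, axis=(1, 2, 3))
    Z = np.zeros_like(V)
    keep = np.argsort(scores)[-beta:]
    Z[keep] = V[keep]
    return Z

def project_channels(V, gamma):
    """Keep the gamma channels with the largest squared Frobenius norm O_b."""
    scores = np.sum(V ** 2, axis=(0, 2, 3))
    Z = np.zeros_like(V)
    keep = np.argsort(scores)[-gamma:]
    Z[:, keep] = V[:, keep]
    return Z

def project_quantization(V, levels):
    """Map every entry of V to its nearest quantization level."""
    levels = np.asarray(levels, dtype=V.dtype)
    nearest = np.argmin(np.abs(V[..., None] - levels), axis=-1)
    return levels[nearest]

V = np.random.default_rng(0).standard_normal((4, 3, 2, 2))
Z_ns = project_nonstructured(V, alpha=5)
Z_f = project_filters(V, beta=2)
Z_c = project_channels(V, gamma=1)
Z_q = project_quantization(np.array([0.1, -0.6, 0.9]), [-1.0, 0.0, 1.0])
```

Each function assumes the budget is at least 1; filter shape pruning would follow the same pattern with scores summed over the filter axis only.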

After both subproblems are solved, we update the dual variables $\mathbf{U}_i$'s according to the ADMM rule [92] and thereby complete one iteration of ADMM regularization. Overall, the ADMM regularization step can be understood as a smart, dynamic L2 regularization, in which the regularization target $\mathbf{Z}_i^{k} - \mathbf{U}_i^{k}$ changes judiciously and analytically in each iteration. In contrast, conventional regularization methods (based on L1 or L2 norms or their combinations) use a fixed regularization target, and the penalty is applied to all the weights, which inevitably causes accuracy degradation. Sample comparison results are provided with the AlexNet experiments later in the paper.
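To make one full ADMM iteration concrete, the following toy sketch runs the two subproblems and the dual update on a stand-in loss. Everything here is an assumption for illustration: the loss $f(\mathbf{W}) = \frac{1}{2}\|\mathbf{W} - \mathbf{t}\|^2$ replaces the DNN loss, plain gradient descent replaces SGD, $\rho$ is held fixed, and the names `project_topk` and `admm_prune` are hypothetical.

```python
def project_topk(values, keep):
    """Euclidean projection onto vectors with at most `keep` non-zero entries."""
    order = sorted(range(len(values)), key=lambda j: abs(values[j]), reverse=True)
    kept = set(order[:keep])
    return [v if j in kept else 0.0 for j, v in enumerate(values)]


def admm_prune(target, keep, rho=1.0, outer_iters=30, inner_steps=20, lr=0.25):
    """Toy ADMM regularization for the stand-in loss f(W) = 0.5 * ||W - target||^2."""
    w = list(target)
    z = project_topk(w, keep)
    u = [0.0] * len(w)
    for _ in range(outer_iters):
        # Subproblem 1: gradient descent on f(W) + (rho / 2) * ||W - Z + U||^2.
        for _ in range(inner_steps):
            w = [wj - lr * ((wj - tj) + rho * (wj - zj + uj))
                 for wj, tj, zj, uj in zip(w, target, z, u)]
        # Subproblem 2: Euclidean projection of W + U onto the constraint set.
        z = project_topk([wj + uj for wj, uj in zip(w, u)], keep)
        # Dual update: U <- U + W - Z.
        u = [uj + wj - zj for uj, wj, zj in zip(u, w, z)]
    return w, z, u
```

In this toy setting the regularization target $\mathbf{Z} - \mathbf{U}$ moves between iterations, so only the entries destined for pruning are pulled toward zero while the retained entries are left essentially untouched.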

Masked mapping and retraining: After ADMM regularization, we obtain intermediate $\mathbf{W}_i$ solutions. The subsequent step of masked mapping and retraining guarantees solution feasibility and improves solution quality. For non-structured and structured weight pruning, the procedure is straightforward. We first perform the Euclidean projection (mapping) described above to guarantee that the pruning constraints are satisfied. Next, we mask the zero weights and retrain the DNN with the non-zero weights using the training set, while keeping the masked weights at zero. In this way test accuracy (solution quality) can be (partially) restored, and solution feasibility (constraints) is maintained.
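A minimal sketch of the masking idea, assuming flat weight lists and a plain gradient step in place of the actual training loop; `build_mask` and `masked_step` are hypothetical helper names, not functions from the paper's implementation.

```python
def build_mask(projected_weights):
    """True where a weight survived the projection, False where it was pruned."""
    return [w != 0.0 for w in projected_weights]


def masked_step(weights, grads, mask, lr):
    """One gradient step that updates surviving weights and keeps pruned ones at zero."""
    return [w - lr * g if m else 0.0 for w, g, m in zip(weights, grads, mask)]
```

Because pruned positions are forced back to zero after every step, the pruning constraint stays satisfied throughout retraining.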

For weight quantization, the procedure is more complicated. The reason is that the retraining process affects the quantization results, and thereby solution feasibility. To deal with this issue, we first perform Euclidean projection (mapping) of the weights that are close enough (as defined by a threshold value ϵ) to nearby quantization levels. Then we retrain the remaining, unquantized weights (with the quantized weights fixed) to improve accuracy. Finally, we perform Euclidean mapping on the remaining weights as well. In this way solution feasibility is guaranteed.
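The three-stage quantization procedure can be sketched as follows. This is an illustrative outline only: the retraining between the two mapping stages is omitted, weights are a flat list, and the names `nearest_level`, `quantize_within_threshold`, and `quantize_remaining` are hypothetical.

```python
def nearest_level(v, levels):
    """Quantization level closest to v."""
    return min(levels, key=lambda q: abs(q - v))


def quantize_within_threshold(values, levels, eps):
    """Stage 1: snap weights within eps of a level and mark them as fixed."""
    out, fixed = [], []
    for v in values:
        q = nearest_level(v, levels)
        close = abs(q - v) <= eps
        out.append(q if close else v)
        fixed.append(close)
    return out, fixed


def quantize_remaining(values, levels):
    """Stage 3: map every weight (already-quantized ones are unchanged) to its level."""
    return [nearest_level(v, levels) for v in values]
```

Stage 2 (not shown) would retrain only the positions where `fixed` is False before `quantize_remaining` is applied.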

In this section we discuss two techniques for enhancing convergence (rate and results): the multi-rho method in ADMM regularization, and progressive weight pruning. We abandon the extragradient descent method in [79] because we did not find an advantage in convergence speed, not to mention the additional hyperparameters introduced by this method.

Increasing ρ in ADMM regularization: The $\rho_i$ values are the most critical hyperparameters in ADMM regularization. We start from smaller $\rho_i$ values, say $\rho_1 = \cdots = \rho_N = 1.5 \times 10^{-3}$, and gradually increase them over the ADMM iterations. This coincides with the theory of ADMM convergence [100, 101]. Convergence in general takes 8 - 12 ADMM iterations, corresponding to 100 - 150 epochs in PyTorch. This convergence rate is comparable with that of the original DNN training.

Progressive weight pruning: The ADMM regularization is an L2 regularization. As a result, there is a large portion of very small weight values after one round of ADMM-based (non-structured or structured) weight pruning. This gives rise to the opportunity to perform a second round of weight pruning. In practice, we perform two rounds of ADMM-based weight pruning consecutively, where the weight pruning results of the first round are the starting point of the second round (weights that are already pruned to zero will not be recovered). This method has the additional benefit of reducing the search space in each step, thereby accelerating convergence.

Hyperparameter determination mainly refers to determining the pruning rate (e.g., the $\alpha_i$ value) and/or the number of quantization levels for each layer of the DNN. In general, this is a more challenging task for pruning than for quantization. For quantization, the same number of quantization levels is typically used for all (or most) layers, as in binarized or ternarized weights, because this is preferred by hardware. For weight pruning, on the other hand, the pruning rate values are flexible and must be determined judiciously.

As hyperparameter determination is not our primary focus, we use a heuristic method as follows. We observe that we can achieve at least 3× more weight pruning than prior, heuristic weight pruning methods without accuracy loss. Hence, we adopt the per-layer pruning rates reported in prior work and increase them proportionally. In the progressive pruning procedure, we set the target of the first round to 1.5× the pruning rate of prior work, and the target of the second round to double that of the first round. We further increase the pruning rates if there is still margin for weight pruning without accuracy loss.
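The proportional heuristic amounts to simple arithmetic on per-layer rates. The sketch below assumes the rates are given as a list with one entry per layer; the function name `progressive_targets` and the example rates are hypothetical and are not values from the paper.

```python
def progressive_targets(prior_rates, first_factor=1.5, second_factor=2.0):
    """Per-layer pruning-rate targets for the two progressive rounds.

    Round 1 scales the prior per-layer rates by `first_factor`;
    round 2 scales the round-1 targets by `second_factor`.
    """
    first = [r * first_factor for r in prior_rates]
    second = [r * second_factor for r in first]
    return first, second


# Hypothetical per-layer rates from some prior method (illustrative only).
round1, round2 = progressive_targets([2.0, 4.0, 10.0])
```

With the default factors, the second-round targets are 3× the prior rates, which matches the "at least 3×" observation in the text.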

In this section, we demonstrate the effectiveness of ADMM-NN-S for non-structured pruning and quantization, based on the ImageNet ILSVRC-2012, CIFAR-10, and MNIST data sets, using AlexNet [102], VGGNet [103], ResNet-18/ResNet-50 [104], MobileNet V2 [105], and LeNet-5 DNN models. Due to space limitation, we only show the pruning results for the overall DNN model (which has the most prior work for comparison) and the binarized quantization of DNNs. Our implementations are based on PyTorch, and the baseline accuracy results are in many cases higher than those used in prior work, which reflects recent training advances. For example, for the AlexNet model we use a baseline with Top-1 accuracy 60.0% and Top-5 accuracy 82.2%, both higher than in prior work (57.2% Top-1 and 80.2% Top-5). The comparison is fair because we focus on accuracy relative to our own baseline rather than on absolute accuracy (in which we already outperform prior work).

Thanks to the compatibility of ADMM-NN-S with DNN training, directly training a DNN model using the framework achieves the same result as starting from a pre-trained DNN model. When a pre-trained DNN model is used, we limit the number of epochs in both steps of the progressive framework to 120, which is similar to the original DNN training in PyTorch and much lower than the iterative pruning heuristic [71].

AlexNet Results for ImageNet Dataset: The table below compares the overall pruning rates of the whole AlexNet model (CONV and FC layers) vs. accuracy, between the proposed framework and various prior methods. We can clearly observe that the proposed framework outperforms prior methods, including the prior ADMM method [93]. With almost no accuracy loss, even relative to the high baseline accuracy, we achieve a 36× overall pruning rate. We achieve a notable 61× weight reduction with 79.7% Top-5 accuracy, just slightly below the baseline accuracy in prior work.

Table: Overall weight pruning rate comparisons on AlexNet model for ImageNet data set.
Method | Top-5 accuracy | Relative accuracy loss | Overall prun. rate
Iter. prun. [71] | 80.3% | -0.1% | 9.1×
NeST [72] | 80.3% | -0.1% | 15.7×
Dyn. surg. [74] | 80.0% | +0.2% | 17.7×
ADMM [93] | 80.2% | -0.0% | 17.7×
Our method | 82.0% | +0.2% | 36×
Our method | 80.8% | +1.4% | 44×
Our method | 79.7% | +2.5% | 61×
Figure: Top-5 accuracies for different pruning methods on AlexNet for ImageNet dataset.

The figure above illustrates the absolute Top-5 accuracy for different pruning methods on the AlexNet model for the ImageNet dataset. These methods include our proposed solution, iterative pruning [71], fixed regularization techniques such as L1 and L2 regularization, and projected gradient descent (PGD). The results clearly show that the proposed method outperforms the others both in absolute accuracy and in relative accuracy loss.

ResNet-50 Results for ImageNet Dataset: Due to the lack of existing effective pruning results, we conduct uniform weight pruning, i.e., we use the same pruning rate for all CONV and FC layers. The results are shown in the table below. We achieve an 8× overall pruning rate (also an 8× pruning rate on CONV layers) on ResNet-50 without accuracy loss. These results clearly outperform the prior work.

Table: Comparisons of overall weight pruning results on ResNet-50 for ImageNet data set.
Method | Top-5 Acc. Loss | Pruning rate
Uncompressed | 0.0% | 1×
Fine-grained [99] | 0.1% | 2.6×
Our method | 0.0% | 8×
Our method | 0.7% | 17.4×

MobileNet V2 Results for CIFAR-10 Dataset: The baseline accuracy is as high as 95.07% due to the adoption of the mixup technique. Because there is no prior work for a fair comparison, the table below presents only our own results. We achieve 5.7× weight pruning with almost no accuracy loss, starting from the high-accuracy baseline. We achieve 10× weight pruning (which is highly challenging for MobileNet) with only 1.3% accuracy loss.

Table: Our weight pruning results on MobileNet V2 for CIFAR-10 data set.
Method | Accuracy | Pruning rate
Uncompressed | 95.07% | 1×
Our method | 94.95% | 5.7×
Our method | 94.70% | 6.7×
Our method | 93.75% | 10×

LeNet-5 Results for MNIST Dataset: The table below shows the comparison results on the LeNet-5 model using the MNIST data set. We achieve an unprecedented 348× overall weight reduction with almost no accuracy loss. This clearly outperforms prior methods, including the one-shot ADMM-based method [93].

Table: Comparisons of overall weight pruning results on LeNet-5 for MNIST data set.
Method | Accuracy | Pruning rate
Uncompressed | 99.2% | 1×
Network Pruning [71] | 99.2% | 12.5×
ADMM [93] | 99.2% | 71.2×
Our method | 99.2% | 246×
Our method | 99.0% | 348×

Due to space limitation, we mainly show the results on fully binarized DNN models (i.e., weights in all layers, including the first and the last, are binarized), which is a highly challenging task for prior work.

Weight Quantization Results on LeNet-5 and CIFAR-10: To the best of our knowledge, we achieve the first lossless, fully binarized LeNet-5 model. The accuracy is still 99.21%, lossless compared with the baseline. In prior work, achieving lossless binarization is challenging even for MNIST. For example, recent work [106] results in 2.3% accuracy degradation on MNIST for full binarization, with a baseline accuracy of 98.66%. We also achieve the first lossless, fully binarized VGG-16 for CIFAR-10. The accuracy is 93.53%. We would like to point out that fully ternarized quantization results in 93.66% accuracy. The table below shows our results and comparisons.

Table: Comparisons of fully binary (ternary) weight quantization results on VGG-16 for CIFAR-10 data set.
Method | Accuracy | Num. of bits
Baseline of [106] | 84.80% | 32
Binary [106] | 81.56% | 1
Our baseline | 93.70% | 32
Our ternary | 93.66% | 2 (ternary)
Our binary | 93.53% | 1

Binary Weight Quantization Results on ResNet for ImageNet: The binarization of ResNet models on the ImageNet data set is widely acknowledged as an extremely challenging task. As a result, there is very limited prior work (e.g., the prior ADMM-based method [79]) with binarization results on ResNet models. As [79] targets ResNet-18, we make a fair comparison on the same model. The table below shows the comparison results (Top-5 accuracy loss). In prior work, by default the first and last layers are not quantized (to 8 bits) as these layers have a significant effect on overall accuracy. When leaving the first and last layers unquantized, we observe higher accuracy compared with the prior method. The Top-1 accuracy shows a similar result: 3.8% degradation with our method and 4.3% in [79].

Furthermore, we can derive a fully binarized ResNet-18, in which weights in all layers are binarized. The accuracy degradation is 5.8%, which is noticeable and shows that the full binarization of ResNet is a challenging task even for the proposed framework. We did not find prior work to compare with this result.

Table: Comparisons of weight quantization results on ResNet-18 for ImageNet data set.
Method | Relative Top-5 acc. loss | Num. of bits
Uncompressed | 0.0% | 32
ADMM [79] | 2.9% | 1 (32 for the first and last)
Our method | 2.5% | 1 (32 for the first and last)
Our method | 5.8% | 1

Summary: The results presented in this section show that ADMM-NN-S achieves results comparable to or better than the state of the art. In certain cases, ADMM-NN-S achieves unprecedented weight reduction. These results provide a strong baseline for our study and lend it credibility.

A Motivation Example: The previous section has shown superior results on joint weight pruning and quantization. Using LeNet-5 (MNIST data set) as an example, we achieve an unprecedented 348× non-structured weight reduction together with 3-bit quantization, maintaining 99%+ accuracy. When indices are not accounted for, the overall compression rate is an unprecedented 3,712× compared with the original LeNet-5 model without compression. However, each index needs at least 9 bits considering the 348× weight pruning. This makes the index storage even larger than the weight storage, and indices cannot be further quantized. As a result, non-structured weight pruning in fact results in larger actual storage than structured pruning.
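The arithmetic behind this example can be checked with a short calculation. The sketch assumes 32-bit original weights and, for the indexed case, exactly one 9-bit index per remaining weight with no padding entries; the resulting indexed compression rate is an illustration under those assumptions and is not a figure reported in the paper.

```python
import math

ORIGINAL_BITS = 32   # assumption: uncompressed weights are 32-bit
PRUNING_RATE = 348   # non-structured pruning rate from the text
WEIGHT_BITS = 3      # quantization level from the text

# Compression rate when index storage is ignored.
rate_no_index = PRUNING_RATE * ORIGINAL_BITS / WEIGHT_BITS

# Bits needed to address an average gap of about 348 positions.
index_bits = math.ceil(math.log2(PRUNING_RATE))

# Compression rate when every remaining weight also stores one index.
rate_with_index = PRUNING_RATE * ORIGINAL_BITS / (WEIGHT_BITS + index_bits)
```

Under these assumptions each remaining weight carries three times as many index bits as weight bits, which is why the index storage dominates.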

The fundamental phenomenon shown here is that, with quantization, the weight reduction achieved by non-structured pruning is offset by the extra index storage. This motivates us to study whether this is a common trend when weight quantization is in place. If the answer is yes, then the value of non-structured weight pruning will be further in doubt. This is because non-structured pruning is already less preferred for GPUs and multi-core CPUs [76, 77]; its only benefit is the potentially higher pruning rate due to greater pruning flexibility. If this benefit is also lost, there will be nearly no merit in non-structured sparsity for hardware acceleration of DNNs, considering the impacts on computation efficiency and degraded parallelism. Importantly, such a conclusion will also hold for FPGA and ASIC designs and guide us to the design aspects that we should really focus on.

In this section, we conduct the first (to the best of our knowledge) comprehensive study to understand the value of non-structured and structured pruning, with quantization in place and at the same accuracy. It is worth noting that this study would not be possible without the ADMM-NN-S framework: we need a framework that achieves competitive results and can jointly perform both weight pruning and quantization.

A Hardware Implementation-Agnostic Comparison Methodology: We conduct a fair comparison between non-structured and structured weight pruning with quantization in place, based on the unified solution framework. Note that the comparison framework is more FPGA and ASIC oriented, as flexible weight quantization is assumed. However, we would like to point out that a moderate, fixed weight quantization, e.g., 8 bits, as supported in GPUs [88], TPUs [107], and mobile devices [87], will result in a similar conclusion. Please refer to the discussion later in the paper for more details.

The key characteristic of our comparison framework is that it is hardware implementation-agnostic. Our intention is that the comparison results are independent of specific hardware implementations, and as a result, the conclusion is unlikely to change with architectural advances in either type of pruning. Therefore, we directly compare the amounts of storage and the estimated computation efficiency for non-structured and structured weight pruning with quantization in place, which capture the fundamental trade-offs. Storage is measured as the total weight and index storage with quantization in place. Storage of intermediate results is not considered, and this favors non-structured pruning, because structured filter/channel pruning will likely benefit more from the reduction of intermediate-result storage.

On the other hand, computation efficiency is estimated using the pruning-to-performance ratio (PPR) values derived from prior work on non-structured sparsity accelerators [94, 95, 96, 97]. For structured pruning, α× weight reduction results in around α× speedup (slightly higher or lower depending on platform and problem), so the PPR value is approximately 1. For non-structured pruning, α× weight reduction only results in β× speedup with β<α. In the state-of-the-art tapeouts [94], the PPR value α/β>3: it is close to 3 at a low pruning rate and higher than 4 at a high pruning rate. In synthesis results [95, 96, 97], this PPR value ranges from 2.7 to 3.5. We use the smallest value, 2.7, which favors non-structured pruning the most. In other words, if non-structured pruning achieves a pruning rate more than 2.7× that of structured pruning (or equivalently, the structured pruning rate is less than 37% of the non-structured one) under the same accuracy and quantization level, non-structured pruning is preferred in terms of computation. Otherwise, structured pruning is preferred.
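The decision rule in this paragraph can be written as a small helper. This is a sketch of the stated rule only; the function names are hypothetical, and the example rates (11.2× non-structured vs. 5.1× structured) are the AlexNet CONV-layer pruning rates reported in the comparison tables of this paper.

```python
PPR_NONSTRUCTURED = 2.7  # smallest PPR from the synthesis results cited in the text
PPR_STRUCTURED = 1.0     # structured pruning: speedup roughly equals pruning rate


def estimated_speedup(pruning_rate, ppr):
    """Estimated speedup obtained from a given pruning rate and PPR value."""
    return pruning_rate / ppr


def preferred_for_computation(nonstructured_rate, structured_rate):
    """Return which pruning scheme has the higher estimated speedup."""
    ns = estimated_speedup(nonstructured_rate, PPR_NONSTRUCTURED)
    st = estimated_speedup(structured_rate, PPR_STRUCTURED)
    return "non-structured" if ns > st else "structured"


# AlexNet CONV layers: 11.2x non-structured vs. 5.1x structured pruning.
choice = preferred_for_computation(11.2, 5.1)
```

Here the non-structured estimate is 11.2 / 2.7, which is below 5.1, so structured pruning comes out ahead under this rule.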

Figure: Procedure for maintaining accuracy.

Maintaining the Same Accuracy for Comparison: The proposed comparison is performed under the same accuracy for non-structured and structured pruning with quantization in place. The precise accuracy control, which is challenging for prior work, is enabled by the unified solution framework. For most cases, we would like to have (almost) no accuracy degradation compared with the baseline DNN model without pruning or quantization. For non-structured pruning, this is achieved in two steps: 1) perform weight pruning to the maximum extent such that there is no accuracy loss; and 2) perform weight quantization that (ideally) does not cause accuracy loss. For structured pruning, we give priority to column pruning, and perform three steps: 1) perform column pruning to the maximum extent without accuracy loss; 2) perform filter pruning and remove the corresponding redundant channels; and 3) perform weight quantization that (ideally) does not cause accuracy loss. The figure above illustrates the procedure for maintaining accuracy. Of course, the proposed framework is also applicable if a certain accuracy degradation is allowed. A larger margin of accuracy loss in general favors structured pruning, because higher pruning rates can be achieved for both pruning schemes, but non-structured pruning then requires more bits for (relative) indices.

There is more subtlety in the combination of non-structured pruning and quantization. If a weight is non-zero after pruning but is quantized to zero, this weight can be added to the pruned list to achieve a higher pruning rate. Please note that this phenomenon does not apply to structured pruning. To better exploit this phenomenon and achieve even higher storage/computation reduction for non-structured pruning (plus quantization), we leverage the state-of-the-art ternary quantization technique [98] with dedicated optimizations. We apply this technique for weight quantization after non-structured pruning in cases where it outperforms our proposed method, thereby giving non-structured weight pruning every opportunity and optimization.

Due to space limitation, we focus on CONV layers, which are the most computationally intensive layers in DNNs and are becoming the major storage component as well in state-of-the-art ResNet and MobileNet models. We do observe a similar (and more significant) effect on FC layers and on RNNs, with more discussion later in the paper.

As discussed in the earlier experimental section, our implementations are based on PyTorch with high baseline accuracies. We limit the number of epochs in both structured pruning and non-structured pruning to 240 (much lower than the iterative pruning heuristic [71]), and the number of epochs in weight quantization to 120. We adopt the hyperparameter determination heuristic discussed earlier for both structured and non-structured pruning.

For non-structured weight pruning, we show results on both CSR with relative indices and with absolute indices. The former is more appropriate for storage reduction, but the latter achieves higher computation efficiency. For absolute indices we assume 4K=64×64 blocks that are reasonable for hardware [94]. Besides the comparison between two pruning schemes, our results also consistently outperform prior work, in terms of both non-structured and structured pruning, as well as combination with weight quantization.

The two tables below show the comparison results using AlexNet and ResNet-18 models on the ImageNet dataset. In these tables, "CONV Prune Rate" refers to the reduction ratio in the number of weights in the overall CONV layers, and the number of remaining weights is "CONV No. of Weights". "CONV Quant Bits" refers to the number of bits used for equal-distance weight quantization, while "CONV Weight Store" is the storage required only for weights (not accounting for indices). "Index Bits" refers to the number of bits in CSR with relative indices. In our results, we have already optimized this index bit value to minimize the overall storage (accounting for the additional dummy zeros as well). The next two columns refer to the total storage size accounting for relative indices and absolute indices, respectively. For structured pruning, they are the same as the weight storage. The final column "CONV Compress Rate" refers to the storage compression rate compared with the original baseline DNN model without compression, assuming relative indices, which are more favorable to non-structured pruning. We use "N/A" if the specific prior work only focuses on weight pruning without performing quantization.
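The trade-off behind the "Index Bits" column can be illustrated with a small storage estimator. This sketch makes several assumptions that are not taken from the paper: it uses the common relative-index convention in which a dummy zero entry is inserted whenever the gap between consecutive non-zero weights exceeds what one index can encode, it assumes an index of b bits can encode gaps up to 2^b, and all function names are hypothetical.

```python
def relative_index_entries(nonzero_positions, index_bits):
    """Count stored entries (real weights plus dummy zeros) under relative indexing."""
    max_gap = 2 ** index_bits  # assumption: largest gap a single index can encode
    entries = 0
    prev = -1
    for pos in nonzero_positions:
        gap = pos - prev
        # Insert one dummy zero each time the remaining gap overflows the index.
        while gap > max_gap:
            entries += 1
            gap -= max_gap
        entries += 1
        prev = pos
    return entries


def storage_bits(nonzero_positions, weight_bits, index_bits):
    """Total bits: every stored entry carries one weight and one relative index."""
    n = relative_index_entries(nonzero_positions, index_bits)
    return n * (weight_bits + index_bits)


def best_index_bits(nonzero_positions, weight_bits, candidates=range(1, 17)):
    """Index width that minimizes total storage for this sparsity pattern."""
    return min(candidates, key=lambda b: storage_bits(nonzero_positions, weight_bits, b))
```

Too few index bits inflate the entry count with dummy zeros, while too many waste bits on every entry, so the best width depends on how sparse the layer is.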

It can be observed that we achieve significant pruning rate gains for both non-structured and structured pruning. Especially for structured pruning, we achieve 5.1× and 2.5× structured weight pruning in the CONV layers of AlexNet and ResNet-18 models, respectively, without accuracy loss. On ResNet-18 we further achieve 4.3× structured pruning with a minor accuracy loss of around 1%. For ResNet on the ImageNet dataset, it is difficult for prior work to achieve lossless structured pruning. For example, [78] results in 1% accuracy loss with 2× structured pruning, on the ResNet-50 model, which has more redundancy.

When comparing non-structured vs. structured pruning, the overall CONV compression rate is comparable for the AlexNet case and the 1% accuracy loss case for ResNet-18. For the lossless case in ResNet-18, non-structured pruning is slightly better in storage, especially when relative indices are utilized. This is because the number of bits for indexing is relatively small in this case, and the slight benefit will diminish if certain accuracy loss is tolerable. The occasional gain cannot outweigh the difficulty in hardware support of non-structured sparsity. It would be difficult to choose non-structured pruning over the other one even if the storage results are comparable.

Table: Comparison Results on Non-Structured vs. Structured Pruning using AlexNet on ImageNet Dataset
Type | Method | Top-5 Accuracy | CONV Prune Rate | CONV No. of Weights | CONV Quant Bits | CONV Weight Store | Index Bits | Weight+Index Storage (Relative) | Weight+Index Storage (Absolute) | CONV Compress Rate
Baseline | AlexNet | 82.2% | 1.0× | 2.3M | 32 | 9.3MB | - | 9.3MB | 9.3MB | 1.0×
Non-structured | Han [108] | 80.3% | 2.7× | 0.86M | 8 | 0.86MB | 4 | 1.3MB | N/A | 7.1×
Non-structured | Dyn. surg. [74] | 80.0% | 3.1× | 0.74M | N/A | N/A | N/A | N/A | N/A | N/A
Non-structured | Nest [72] | 80.3% | 3.23× | 0.71M | N/A | N/A | N/A | N/A | N/A | N/A
Non-structured | Fine-grained [99] | 80.3% | 4.16× | 0.55M | N/A | N/A | N/A | N/A | N/A | N/A
Non-structured | ours | 81.9% | 11.2× | 0.3M | 7 | 0.26MB | 6 | 0.51MB | 0.61MB | 25.5×
Structured | SSL [76] | 80.4% | 1.4× | 1.6M | N/A | N/A | - | N/A | N/A | N/A
Structured | ours | 81.8% | 5.1× | 0.65M | 7 | 0.56MB | - | 0.56MB | 0.56MB | 23.3×

Table: Comparison Results on Non-Structured vs. Structured Pruning using ResNet-18 on ImageNet Dataset
Type | Method | Accuracy | CONV Prune Rate | CONV No. of Weights | CONV Quant Bits | CONV Weight Store | Index Bits | Weight+Index Storage (Relative) | Weight+Index Storage (Absolute) | CONV Compress Rate
Baseline | ResNet-18 | 89.1% | 1.0× | 11.2M | 32 | 44.7MB | - | 44.7MB | 44.7MB | 1.0×
Non-structured | ours | 89.1% | 6.4× | 1.75M | 6 | 1.32MB | 5 | 2.47MB | 3.11MB | 18.1×
Non-structured | ours | 87.9% | 8.9× | 1.26M | 6 | 0.94MB | 5 | 1.89MB | 2.29MB | 23.6×
Structured | ours | 89.1% | 2.5× | 4.46M | 6 | 3.34MB | - | 3.34MB | 3.34MB | 13.4×
Structured | ours | 87.8% | 4.3× | 2.60M | 6 | 1.95MB | - | 1.95MB | 1.95MB | 22.9×

The VGG-16 and ResNet-18 tables below show the comparison results on the CIFAR-10 dataset. We observe that very significant pruning rates can be achieved compared with prior work (over 30× improvement in certain cases). We investigated further and found that the underlying reason is the CIFAR-10 dataset itself, in that it is both "simple" and "difficult". "Simple" means that the input image scale is small and the number of classes is only 10, while "difficult" means that the input images are blurred and feature extraction is not straightforward. As a result, researchers tend to migrate large-scale DNN models originally designed for ImageNet, such as VGG-16 and ResNet-18 (prior work even used ResNet-50). Consequently, there is a significant margin for model compression, which can be exploited by the proposed systematic framework but is difficult for heuristic methods to exploit.

Another observation is that non-structured pruning has only a marginal gain in pruning rate (reduction in the number of weights) compared with structured pruning. Our hypothesis is that this is due to the large search space in non-structured pruning. Together with the large number of index bits required by high pruning rates, this makes non-structured pruning not preferable to structured pruning in terms of total storage size. The storage size gap becomes surprisingly large when absolute indices are utilized.

The MobileNet-V2 table below shows the comparison results on the CIFAR-10 dataset. MobileNet is already compact and relatively difficult to prune further, but we still achieve about 5× structured pruning along with 4-bit quantization. Again, non-structured pruning only shows a minor gain in weight reduction, and it is not preferable considering the unavoidable indexing overheads.

Table: Comparison Results on Non-Structured vs. Structured Pruning using VGG-16 on CIFAR-10 Dataset
Type | Method | Accuracy | CONV Prune Rate | CONV No. of Weights | CONV Quant Bits | CONV Weight Store | Index Bits | Weight+Index Storage (Relative) | Weight+Index Storage (Absolute) | CONV Compress Rate
Baseline | VGG-16 | 93.7% | 1.0× | 14.7M | 32 | 58.8MB | - | 58.8MB | 58.8MB | 1.0×
Non-structured | ours | 93.1% | 57.4× | 0.26M | 5 | 0.16MB | 7 | 0.54MB | 0.72MB | 109×
Structured | 2PFPCE [109] | 92.8% | 4× | 3.7M | N/A | N/A | - | N/A | N/A | N/A
Structured | 2PFPCE [109] | 91.0% | 8.3× | 1.8M | N/A | N/A | - | N/A | N/A | N/A
Structured | ours | 93.1% | 50.0× | 0.29M | 5 | 0.18MB | - | 0.18MB | 0.18MB | 327×

Table: Comparison Results on Non-Structured vs. Structured Pruning using ResNet-18 (ResNet-50 in prior work AMC) on CIFAR-10 Dataset
Type | Method | Accuracy | CONV Prune Rate | CONV No. of Weights | CONV Quant Bits | CONV Weight Store | Index Bits | Weight+Index Storage (Relative) | Weight+Index Storage (Absolute) | CONV Compress Rate
Baseline | ResNet-18 | 93.9% | 1.0× | 11.2M | 32 | 44.6MB | - | 44.6MB | 44.6MB | 1.0×
Non-structured | ours | 93.3% | 69.0× | 0.16M | 5 | 0.10MB | 8 | 0.33MB | 0.53MB | 135×
Structured | AMC [110] | 93.5% | 1.7× | N/A | N/A | N/A | - | N/A | N/A | N/A
Structured | ours | 93.3% | 59.8× | 0.19M | 5 | 0.12MB | - | 0.12MB | 0.12MB | 372×

Table: Comparison Results on Non-Structured vs. Structured Pruning using MobileNet-V2 on CIFAR-10 Dataset
Type | Method | Accuracy | CONV Prune Rate | CONV No. of Weights | CONV Quant Bits | CONV Weight Store | Index Bits | Weight+Index Storage (Relative) | Weight+Index Storage (Absolute) | CONV Compress Rate
Baseline | MobileNet-V2 | 95.1% | 1.0× | 2.2M | 32 | 9.0MB | - | 9.0MB | 9.0MB | 1.0×
Non-structured | ours | 94.9% | 6.1× | 0.37M | 4 | 0.19MB | 4 | 0.48MB | 0.55MB | 18.8×
Structured | ours | 95.1% | 4.9× | 0.45M | 4 | 0.23MB | - | 0.23MB | 0.23MB | 39.2×

The table below shows the comparison results using the LeNet-5 model on the MNIST data set. It is a simple dataset, and we achieve 87.9× structured pruning on CONV layers, together with 3-bit quantization. Non-structured pruning is again not preferred due to the large number of index bits and the marginal increase in weight reduction rate. Ironically, it results in multiple times the storage of structured pruning when weight quantization is in place.

Table: Comparison Results on Non-Structured vs. Structured Pruning using LeNet-5 on MNIST Dataset
Type | Method | Accuracy | CONV Prune Rate | CONV No. of Weights | CONV Quant Bits | CONV Weight Store | Index Bits | Weight+Index Storage (Relative) | Weight+Index Storage (Absolute) | CONV Compress Rate
Baseline | LeNet-5 | 99.2% | 1.0× | 25.5K | 32 | 102KB | - | 102KB | 102KB | 1.0×
Non-structured | Han [108] | 99.2% | 7.7× | 3.33K | 8 | 3.33KB | 5 | 7.0KB | N/A | 14.5×
Non-structured | ours | 99.0% | 114.3× | 223 | 3 | 0.08KB | 8 | 0.39KB | 0.93KB | 262×
Structured | SSL [76] | 99.0% | 26.1× | 975 | N/A | N/A | - | N/A | N/A | N/A
Structured | ours | 99.0% | 87.9× | 290 | 3 | 0.11KB | - | 0.11KB | 0.11KB | 944×

We have shown that non-structured pruning is not preferable in terms of storage even assuming the storage-friendly CSR format with relative indices, not to mention absolute indices. Based on our methodology, we find that computation efficiency shows a similar trend.

As discussed before, structured pruning has higher computation efficiency if its pruning rate is more than 37% of that of non-structured pruning. In all our tests, the ratio between the weight pruning rates of structured and non-structured pruning ranges from 40% to 87%, with a large variation but consistently higher than 37%. Even for the 40% case, the choice is clear considering the difficulty of hardware design for non-structured sparsity.

In this section, we discuss additional factors and variations in different platforms, and explain why our conclusion is unlikely to change. As a result, we draw the final conclusion that non-structured weight pruning is in general not preferred compared with structured pruning across different platforms, application scenarios, DNN types, etc.

We consider the following question: will our conclusion change if there is further algorithmic improvement (that outperforms the ADMM-based unified solution in this paper)? Also, what about using other recently proposed generalization enhancement techniques, such as warmup, mixup, and cosine decay from the bag of tricks [111]? Mixup is already utilized in the MobileNet V2 training in this work and can notably enhance convergence and stability in training (the original MobileNet training is very difficult). We hypothesize that the conclusion is likely to remain unchanged, as these techniques are likely to enhance the results for both non-structured and structured weight pruning schemes. As the pruning rates increase, the number of bits for index representation will also increase. The results will likely favor structured pruning to an even greater extent.

In many critical applications of deep learning, such as autonomous driving and medical imaging, there is not as much labelled training data as in standard image classification tasks. As a result, the transfer learning technique [112, 113, 114] is widely applied via (i) pre-training a DNN model using a standard data set (say ImageNet); (ii) transferring to the target application domain; and (iii) performing fine-tuning using target domain data. It has recently been shown [115] that a sufficient number of weight parameters is needed in order to maintain generality, i.e., the ability to transfer across domains. This coincides with the practice that VGGNet and deep ResNets, rather than MobileNet, are the major model types for transfer learning. From the DNN security aspect, recent work [116] shows that a sufficient number of parameters is required to maintain the robustness of DNNs against adversarial attacks.

We hypothesize that structured pruning may be preferred in these scenarios because it leaves a larger number of remaining weight parameters than non-structured pruning, which gives a higher probability of satisfying the generality and adversarial robustness requirements. We believe it will be a challenge to quantify such requirements and to derive the best combination of structured pruning and quantization that optimizes performance while satisfying them.

The comparison results in this paper focus on CONV layers, which constitute the major computation in DNNs. However, FC layers are not negligible in DNNs. Moreover, FC layers constitute the major computation in recurrent neural networks (RNNs), which are as important as convolutional neural networks [107]. Our preliminary investigation shows that the gain of structured pruning in FC layers and in RNNs is even higher. This is intuitive because FC layers have a higher degree of redundancy and would require more bits for indices if non-structured pruning were used. It is also worth mentioning that a number of structured matrix-based techniques, such as block-circulant matrices [117] and cyclic matrices [118], are good candidates for structured pruning in FC layers. Superior results have already been demonstrated in FC layers using these methods.
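To make the block-circulant idea concrete, the following minimal sketch stores only one defining vector per b x b block of an FC weight matrix and multiplies it with an input vector. The function names and the choice of the first column as the defining vector are assumptions made for illustration. The sketch also uses a direct summation for clarity, whereas implementations such as [117] rely on FFT-based computation.

```python
def circulant_matvec(first_col, x):
    # y = C x, where C[i][j] = first_col[(i - j) % n]; only n values are stored.
    n = len(first_col)
    return [sum(first_col[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]

def block_circulant_matvec(blocks, x, b):
    # blocks[p][q] is the defining vector of the circulant block at (p, q).
    rows, cols = len(blocks), len(blocks[0])
    y = [0] * (rows * b)
    for p in range(rows):
        for q in range(cols):
            part = circulant_matvec(blocks[p][q], x[q * b:(q + 1) * b])
            for i in range(b):
                y[p * b + i] += part[i]
    return y

def dense_from_blocks(blocks, b):
    # Expand to the equivalent dense matrix (used here only for checking).
    rows, cols = len(blocks), len(blocks[0])
    w = [[0] * (cols * b) for _ in range(rows * b)]
    for p in range(rows):
        for q in range(cols):
            for i in range(b):
                for j in range(b):
                    w[p * b + i][q * b + j] = blocks[p][q][(i - j) % b]
    return w

if __name__ == "__main__":
    blocks = [[[1, 2], [3, 4]]]  # a 2 x 4 weight matrix stored as 4 values
    x = [1, 0, 0, 1]
    dense = dense_from_blocks(blocks, 2)
    reference = [sum(wij * xj for wij, xj in zip(row, x)) for row in dense]
    print(block_circulant_matvec(blocks, x, 2), reference)
```

In this representation a block of size b stores b values instead of b squared, and the positions of the weights are implied by the structure, so no per-weight index is needed.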

In current industry practice, weight quantization is the major method of DNN model compression and is typically prioritized over weight pruning. As a result, weight pruning is unlikely to be conducted alone (especially for FPGA/ASIC systems) without quantization. However, such systems may use a fixed quantization level (or a set of levels) to accommodate different DNN models and applications; e.g., the TPU supports 8-bit and 16-bit computation. Such moderate, fixed weight quantization (e.g., 8 bits) is unlikely to change the general conclusion of this paper, especially considering the difficulty of developing dedicated hardware that supports non-structured sparsity. For GPUs, multi-core CPUs, and even mobile devices, 8-bit/16-bit weight quantization is already well supported, and structured pruning is known to be more suitable for such systems.
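As a point of reference for what a fixed 8-bit weight format means in practice, the sketch below applies symmetric uniform quantization with one scale per weight tensor. This scheme and the function names are assumptions chosen for illustration; the paper does not prescribe a particular fixed-level quantizer for such platforms.

```python
def quantize_symmetric(weights, bits=8):
    # Map floats to signed integers in [-qmax, qmax] using a single scale.
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the integer codes.
    return [v * scale for v in q]

if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.3, 1.0]
    q, scale = quantize_symmetric(weights, bits=8)
    print(q, dequantize(q, scale))
```

Because the bit width is fixed by the platform, storage and computation savings beyond this point have to come from pruning, which is where the regular layout of structured pruning matters.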

At the other extreme, researchers are investigating quantization-only solutions, including binary and ternary quantization. As pointed out earlier in this paper, binary/ternary quantization can be almost lossless in many cases. However, we observe that structured pruning still offers a large margin, as shown in the compression results on CIFAR-10, and such a compression rate cannot be achieved by weight quantization alone. As a result, we recommend performing structured pruning in combination with weight quantization.

Non-structured weight pruning, structured weight pruning, and weight quantization are the major methods for model compression, but the interaction among these techniques has never been clearly understood. This paper is the first to investigate the value of non-structured and structured DNN weight pruning when weight quantization is in place. We build ADMM-NN-S, a joint weight pruning and quantization framework with algorithmic support for structured pruning, dynamic ADMM regulation, and masked mapping and retraining. To perform a fair and fundamental comparison between non-structured and structured pruning in a hardware implementation-agnostic manner, we propose a methodology that captures storage overhead and computation efficiency. We perform extensive and representative testing of ADMM-NN-S with AlexNet, VGGNet, ResNet-18/50, MobileNet, and LeNet-5 models on the ImageNet, CIFAR-10, and MNIST data sets. We show that ADMM-NN-S can significantly outperform the state-of-the-art results for non-structured pruning with quantization. More importantly, we show for the first time that, with quantization in place and at the same accuracy, non-structured pruning is not preferable in terms of both storage overhead and computation efficiency. We also explain in detail why the conclusion is unlikely to change across hardware platforms, application scenarios, DNN types, etc. Thus, we recommend that the community not continue investigating DNN inference engines based on non-structured sparsity. We release the code and all the models of this work at an anonymous link: http://bit.ly/2WMQSRi.

  • [1] Youjie Li, Jongse Park, Mohammad Alian, Yifan Yuan, Zheng Qu, Peitian Pan, Ren Wang, Alexander Schwing, Hadi Esmaeilzadeh, and Nam Sung Kim. A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 175–188. IEEE, 2018.
  • [2] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. From high-level deep neural models to fpgas. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1–13. IEEE Computer Society, 2016.
  • [3] Haiyu Mao, Mingcong Song, Tao Li, Yuting Dai, and Jiwu Shu. Lergan: A zero-free, low data movement and pim-based gan architecture. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 669–681. IEEE, 2018.
  • [4] Kartik Hegde, Rohit Agrawal, Yulun Yao, and Christopher W Fletcher. Morph: Flexible acceleration for 3d cnn-based video understanding. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 933–946. IEEE, 2018.
  • [5] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. In ACM SIGARCH Computer Architecture News, volume 44, pages 27–39. IEEE Press, 2016.
  • [6] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 243–254. IEEE, 2016.
  • [7] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. ACM SIGARCH Computer Architecture News, 44(3):1–13, 2016.
  • [8] Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, and Shaojun Wei. Rana: towards efficient neural acceleration with refresh-optimized embedded dram. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 340–352. IEEE Press, 2018.
  • [9] Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, and Reetuparna Das. Neural cache: bit-serial in-cache acceleration of deep neural networks. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 383–396. IEEE Press, 2018.
  • [10] Mark Buckler, Philip Bedoukian, Suren Jayasuriya, and Adrian Sampson. Eva2: Exploiting temporal redundancy in live computer vision. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pages 533–546. IEEE, 2018.
  • [11] Amir Yazdanbakhsh, Kambiz Samadi, Nam Sung Kim, and Hadi Esmaeilzadeh. Ganax: A unified mimd-simd acceleration for generative adversarial networks. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 650–661. IEEE Press, 2018.
  • [12] Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan, Michael Pellauer, and Christopher W Fletcher. Ucnn: Exploiting computational reuse in deep neural networks via weight repetition. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 674–687. IEEE Press, 2018.
  • [13] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh. Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 764–775. IEEE Press, 2018.
  • [14] Chao Zhang, Tong Meng, and Guangyu Sun. Pm3: Power modeling and power management for processing-in-memory. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 558–570. IEEE, 2018.
  • [15] Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. Hypar: Towards hybrid parallelism for deep learning accelerator array. arXiv preprint arXiv:1901.02067, 2019.
  • [16] Xiaowei Wang, Jiecao Yu, Charles Augustine, Ravi Iyer, and Reetuparna Das. Bit prudent in-cache acceleration of deep convolutional neural networks. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 81–93. IEEE, 2019.
  • [17] Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Temam, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. Pudiannao: A polyvalent machine learning accelerator. In ACM SIGARCH Computer Architecture News, volume 43, pages 369–381. ACM, 2015.
  • [18] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. Tetris: Scalable and efficient neural network acceleration with 3d memory. ACM SIGOPS Operating Systems Review, 51(2):751–764, 2017.
  • [19] Ao Ren, Zhe Li, Caiwen Ding, Qinru Qiu, Yanzhi Wang, Ji Li, Xuehai Qian, and Bo Yuan. Sc-dcnn: Highly-scalable deep convolutional neural network using stochastic computing. ACM SIGOPS Operating Systems Review, 51(2):405–418, 2017.
  • [20] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 461–475. ACM, 2018.
  • [21] Ruizhe Cai, Ao Ren, Ning Liu, Caiwen Ding, Luhao Wang, Xuehai Qian, Massoud Pedram, and Yanzhi Wang. Vibnn: Hardware acceleration of bayesian neural networks. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 476–488. ACM, 2018.
  • [22] Yu Ji, Youhui Zhang, Wenguang Chen, and Yuan Xie. Bridge the gap between neural networks and neuromorphic hardware with a neural network compiler. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 448–460. ACM, 2018.
  • [23] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing fpga-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 161–170. ACM, 2015.
  • [24] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 16–25. ACM, 2016.
  • [25] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, et al. Going deeper with embedded fpga platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 26–35. ACM, 2016.
  • [26] Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, Mani Srivastava, Rajesh Gupta, and Zhiru Zhang. Accelerating binarized convolutional neural networks with software-programmable fpgas. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 15–24. ACM, 2017.
  • [27] Jialiang Zhang and Jing Li. Improving the performance of opencl-based fpga accelerator for convolutional neural network. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 25–34. ACM, 2017.
  • [28] Chi Zhang and Viktor Prasanna. Frequency domain acceleration of convolutional neural networks on cpu-fpga shared memory system. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 35–44. ACM, 2017.
  • [29] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. Optimizing loop operation and dataflow in fpga acceleration of deep convolutional neural networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 45–54. ACM, 2017.
  • [30] Utku Aydonat, Shane O’Connell, Davor Capalija, Andrew C Ling, and Gordon R Chiu. An opencl™ deep learning accelerator on arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 55–64. ACM, 2017.
  • [31] Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 65–74. ACM, 2017.
  • [32] Chang Gao, Daniel Neil, Enea Ceolini, Shih-Chii Liu, and Tobi Delbruck. Deltarnn: A power-efficient recurrent neural network accelerator. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 21–30. ACM, 2018.
  • [33] Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, and Chunyuan Zhang. Towards a uniform template-based architecture for accelerating 2d and 3d cnns on fpga. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 97–106. ACM, 2018.
  • [34] Hanqing Zeng, Ren Chen, Chi Zhang, and Viktor Prasanna. A framework for generating high throughput cnn implementations on fpgas. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 117–126. ACM, 2018.
  • [35] Eriko Nurvitadhi, Jeffrey Cook, Asit Mishra, Debbie Marr, Kevin Nealis, Philip Colangelo, Andrew Ling, Davor Capalija, Utku Aydonat, Aravind Dasu, et al. In-package domain-specific asics for intel® stratix® 10 fpgas: A case study of accelerating deep learning using tensortile asic. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL), pages 106–1064. IEEE, 2018.
  • [36] Zhe Chen, Andrew Howe, Hugh T Blair, and Jason Cong. Fpga-based lstm acceleration for real-time eeg signal processing. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 288–288. ACM, 2018.
  • [37] Yankang Du, Qinrang Liu, Shuai Wei, and Chen Gao. Software-defined fpga-based accelerator for deep convolutional neural networks. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 291–291. ACM, 2018.
  • [38] Shuanglong Liu, Xinyu Niu, and Wayne Luk. A low-power deconvolutional accelerator for convolutional neural network based segmentation on fpga. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 293–293. ACM, 2018.
  • [39] Yifan Yang, Qijing Huang, Bichen Wu, Tianjun Zhang, Liang Ma, Giulio Gambardella, Michaela Blott, Luciano Lavagno, Kees Vissers, John Wawrzynek, et al. Synetgy: Algorithm-hardware co-design for convnet accelerators on embedded fpgas. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 23–32. ACM, 2019.
  • [40] Junzhong Shen, You Huang, Mei Wen, and Chunyuan Zhang. Accelerating 3d cnn-based lung nodule segmentation on a multi-fpga system.
  • [41] Lu Jing, Jun Liu, and FuHai Yu. A deep learning inference accelerator based on model compression on fpga. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 118–118. ACM, 2019.
  • [42] Weijie You and Chang Wu. A reconfigurable accelerator for sparse convolutional neural networks. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 119–119. ACM, 2019.
  • [43] Xuechao Wei, Yun Liang, Peng Zhang, Cody Hao Yu, and Jason Cong. Overcoming data transfer bottlenecks in dnn accelerators via layer-conscious memory managment. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 120–120. ACM, 2019.
  • [44] Jialiang Zhang and Jing Li. Unleashing the power of soft logic for convolutional neural network acceleration via product quantization. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 120–120. ACM, 2019.
  • [45] Shulin Zeng, Yujun Lin, Shuang Liang, Junlong Kang, Dongliang Xie, Yi Shan, Song Han, Yu Wang, and Huazhong Yang. A fine-grained sparse accelerator for multi-precision dnn. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 185–185. ACM, 2019.
  • [46] Hiroki Nakahara, Akira Jinguji, Masayuki Shimoda, and Shimpei Sato. An fpga-based fine tuning accelerator for a sparse cnn. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 186–186. ACM, 2019.
  • [47] Liqiang Lu, Yun Liang, Ruirui Huang, Wei Lin, Xiaoyuan Cui, and Jiansong Zhang. Speedy: An accelerator for sparse convolutional neural networks on fpgas. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 187–187. ACM, 2019.
  • [48] Zhucheng Tang, Guojie Luo, and Ming Jiang. Ftconv: Fpga acceleration for transposed convolution layers in deep neural networks. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 189–189. ACM, 2019.
  • [49] Kaiyuan Guo, Shuang Liang, Jincheng Yu, Xuefei Ning, Wenshuo Li, Yu Wang, and Huazhong Yang. Compressed cnn training with fpga-based accelerator. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 189–189. ACM, 2019.
  • [50] Ephrem Wu, Xiaoqian Zhang, David Berman, Inkeun Cho, and John Thendean. Compute-efficient neural-network acceleration. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 191–200. ACM, 2019.
  • [51] Sebastian Vogel, Jannik Springer, Andre Guntoro, and Gerd Ascheid. Efficient acceleration of cnns for semantic segmentation on fpgas. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 309–309. ACM, 2019.
  • [52] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
  • [53] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM Sigplan Notices, 49:269–284, 2014.
  • [54] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M Aamodt, and Andreas Moshovos. Stripes: Bit-serial deep neural network computing. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1–12. IEEE Computer Society, 2016.
  • [55] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 609–622. IEEE Computer Society, 2014.
  • [56] Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, et al. Scaledeep: A scalable compute architecture for learning and evaluating deep networks. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 13–26. IEEE, 2017.
  • [57] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 267–278. IEEE, 2016.
  • [58] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. Shidiannao: Shifting vision processing closer to the sensor. In Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on, pages 92–104. IEEE, 2015.
  • [59] Mingcong Song, Kan Zhong, Jiaqi Zhang, Yang Hu, Duo Liu, Weigong Zhang, Jing Wang, and Tao Li. In-situ ai: Towards autonomous and incremental deep learning for iot systems. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on, pages 92–103. IEEE, 2018.
  • [60] Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. Tabla: A unified template-based framework for accelerating statistical machine learning. In High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pages 14–26. IEEE, 2016.
  • [61] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.
  • [62] Bert Moons, Roel Uytterhoeven, Wim Dehaene, and Marian Verhelst. 14.5 envision: A 0.26-to-10tops/w subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm fdsoi. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International, pages 246–247. IEEE, 2017.
  • [63] Giuseppe Desoli, Nitin Chawla, Thomas Boesch, Surinder-pal Singh, Elio Guidetti, Fabio De Ambroggi, Tommaso Majo, Paolo Zambotti, Manuj Ayodhyawasi, Harvinder Singh, et al. 14.1 a 2.9 tops/w deep convolutional neural network soc in fd-soi 28nm for intelligent embedded systems. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International, pages 238–239. IEEE, 2017.
  • [64] Paul N Whatmough, Sae Kyu Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu-Yeon Wei. 14.3 a 28nm soc with a 1.2 ghz 568nj/prediction sparse deep-neural-network engine with> 0.1 timing error rate tolerance for iot applications. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International, pages 242–243. IEEE, 2017.
  • [65] Jaehyeong Sim, Jun-Seok Park, Minhye Kim, Dongmyung Bae, Yeongjae Choi, and Lee-Sup Kim. 14.6 a 1.42 tops/w deep convolutional neural network recognition processor for intelligent ioe systems. In Solid-State Circuits Conference (ISSCC), 2016 IEEE International, pages 264–265. IEEE, 2016.
  • [66] Suyoung Bang, Jingcheng Wang, Ziyun Li, Cao Gao, Yejoong Kim, Qing Dong, Yen-Po Chen, Laura Fick, Xun Sun, Ron Dreslinski, et al. 14.7 a 288μw programmable deep-learning processor with 270kb on-chip weight storage using non-uniform memory hierarchy for mobile intelligence. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International, pages 250–251. IEEE, 2017.
  • [67] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. In Proceedings of the 35th International Conference on Computer-Aided Design, page 12. ACM, 2016.
  • [68] Chen Zhang, Di Wu, Jiayu Sun, Guangyu Sun, Guojie Luo, and Jason Cong. Energy-efficient cnn implementation on a deeply pipelined fpga cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pages 326–331. ACM, 2016.
  • [69] http://www.techradar.com/news/computing-components/processors/google-s-tensor-processing-unit-explained-this-is-what-the-future-of-computing-looks-like-1326915.
  • [70] https://www.sdxcentral.com/articles/news/intels-deep-learning-chips-will-arrive-2017/2016/11/.
  • [71] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
  • [72] Xiaoliang Dai, Hongxu Yin, and Niraj K Jha. Nest: a neural network synthesis tool based on a grow-and-prune paradigm. arXiv preprint arXiv:1711.02017, 2017.
  • [73] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016.
  • [74] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
  • [75] Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 4857–4867, 2017.
  • [76] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
  • [77] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 548–560. IEEE, 2017.
  • [78] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
  • [79] Cong Leng, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely low bit neural network: Squeeze the last bit out with admm. arXiv preprint arXiv:1707.09870, 2017.
  • [80] Eunhyeok Park, Junwhan Ahn, and Sungjoo Yoo. Weighted-entropy-based quantization for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7197–7205, 2017.
  • [81] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. In International Conference on Learning Representations (ICLR), 2017.
  • [82] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
  • [83] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.
  • [84] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
  • [85] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.
  • [86] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
  • [87] https://www.tensorflow.org/mobile/tflite/.
  • [88] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. Pytorch, 2017.
  • [89] Ao Ren, Tianyun Zhang, Shaokai Ye, Jiayu Li, Wenyao Xu, Xuehai Qian, Xue Lin, and Yanzhi Wang. Admm-nn: An algorithm-hardware co-design framework of dnns using alternating direction method of multipliers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems.
  • [90] Hua Ouyang, Niao He, Long Tran, and Alexander Gray. Stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pages 80–88, 2013.
  • [91] Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In International Conference on Machine Learning, pages 392–400, 2013.
  • [92] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
  • [93] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. arXiv preprint arXiv:1804.03294, 2018.
  • [94] Zhe Yuan, Jinshan Yue, Huanrui Yang, Zhibo Wang, Jinyang Li, Yixiong Yang, Qingwei Guo, Xueqing Li, Meng-Fan Chang, Huazhong Yang, et al. Sticker: A 0.41-62.1 tops/w 8bit neural network processor with multi-sparsity compatible convolution arrays and online tuning acceleration for fully connected layers. In 2018 IEEE Symposium on VLSI Circuits, pages 33–34. IEEE, 2018.
  • [95] Ao Ren, Tianyun Zhang, Shaokai Ye, Jiayu Li, Wenyao Xu, Xuehai Qian, Xue Lin, and Yanzhi Wang. Admm-nn: An algorithm-hardware co-design framework of dnns using alternating direction method of multipliers. arXiv preprint arXiv:1812.11677, 2018.
  • [96] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. Cambricon-x: An accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 20. IEEE Press, 2016.
  • [97] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J Dally. Scnn: An accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 27–40. IEEE, 2017.
  • [98] Zhezhi He and Deliang Fan. Simultaneously optimizing weight and quantizer of ternary neural network using truncated gaussian approximation. arXiv preprint arXiv:1810.01018, 2018.
  • [99] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
  • [100] Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
  • [101] Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred Hero. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. In International Conference on Artificial Intelligence and Statistics, pages 288–297, 2018.
  • [102] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [103] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [104] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [105] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  • [106] Hsin-Pai Cheng, Yuanjun Huang, Xuyang Guo, Feng Yan, Yifei Huang, Wei Wen, Hai Li, and Yiran Chen. Differentiable fine-grained quantization for deep neural network compression. In NIPS 2018 CDNNRIA Workshop, 2018.
  • [107] Google supercharges machine learning tasks with TPU custom chip, https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.
  • [108] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2016.
  • [109] Chuhan Min, Aosen Wang, Yiran Chen, Wenyao Xu, and Xin Chen. 2pfpce: Two-phase filter pruning based on conditional entropy. arXiv preprint arXiv:1809.02220, 2018.
  • [110] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In The European Conference on Computer Vision (ECCV), September 2018.
  • [111] Junyuan Xie, Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, and Mu Li. Bag of tricks for image classification with convolutional neural networks. arXiv preprint arXiv:1812.01187, 2018.
  • [112] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • [113] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
  • [114] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.
  • [115] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
  • [116] Shaokai Ye, Kaidi Xu, Sijia Liu, Hao Cheng, Jan-Henrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi Wang, and Xue Lin. Second rethinking of network pruning in the adversarial setting. arXiv preprint arXiv:1903.12561, 2019.
  • [117] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, et al. CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 395–408. ACM, 2017.
  • [118] Chunhua Deng, Siyu Liao, Yi Xie, Keshab K Parhi, Xuehai Qian, and Bo Yuan. Permdnn: Efficient compressed deep neural network architecture with permuted diagonal matrices. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018.