Abstract
Although deep neural networks (NNs) have achieved state-of-the-art accuracy in many visual recognition tasks, the growing computational complexity and energy consumption of networks remains an issue, especially for applications on platforms with limited resources and requiring real-time processing. Channel pruning techniques have recently shown promising results for the compression of convolutional NNs (CNNs). However, these techniques can result in low accuracy and complex optimisations because some only prune after training CNNs, while others prune from scratch during training by integrating sparsity constraints or modifying the loss function. The progressive soft filter pruning technique provides greater training efficiency, but its soft pruning strategy does not handle the backward pass, which is needed for better optimization. In this paper, a new Progressive Gradient Pruning (PGP) technique is proposed for iterative channel pruning during training. It relies on a criterion that measures the change in channel weights and improves on existing progressive pruning, and on effective hard and soft pruning strategies to adapt momentum tensors during the backward propagation pass. Experimental results obtained after training various CNNs on the MNIST and CIFAR10 datasets indicate that the PGP technique can achieve a better tradeoff between classification accuracy and network (time and memory) complexity than state-of-the-art channel pruning techniques.
An Improved Tradeoff Between Accuracy and Complexity
with Progressive Gradient Pruning
Abstract
Although deep neural networks (NNs) have achieved state-of-the-art accuracy in many visual recognition tasks, the growing computational complexity and energy consumption of networks remains an issue, especially for applications on platforms with limited resources and requiring real-time processing. Filter pruning techniques have recently shown promising results for the compression of convolutional NNs (CNNs). However, these techniques involve numerous steps and complex optimisations because some only prune after training CNNs, while others prune from scratch during training by integrating sparsity constraints or modifying the loss function. The progressive soft filter pruning (PSFP) technique provides greater training efficiency, but its soft pruning strategy does not handle the backward pass, i.e. momentum pruning, which is needed for better optimization. We propose a new Progressive Gradient Pruning (PGP) technique for iterative filter pruning during training. To improve on PSFP, it relies on a novel filter selection criterion that measures the change in filter weights, and on new hard and soft pruning strategies to effectively adapt momentum tensors during the backward propagation pass. Experimental results obtained after training various CNNs on image data for classification and object detection benchmarks indicate that the PGP technique can achieve a better tradeoff between classification accuracy and network (time and memory) complexity than PSFP and other state-of-the-art filter pruning techniques.
1 Introduction
Convolutional neural networks (CNNs) learn discriminant feature representations from labeled training data, and have achieved state-of-the-art accuracy across a wide range of visual recognition tasks, e.g., image classification, object detection, and assisted medical diagnosis. Since the breakthrough results achieved with AlexNet for the 2012 ImageNet Challenge [26], the accuracy of CNNs has been continually improved with architectures like VGG [48], ResNet [12] and DenseNet [22], at the expense of growing complexity (deeper and wider networks) that requires more training samples and computational resources [23]. In particular, the speed of CNNs can significantly degrade with such increased complexity.
In order to deploy these powerful CNN architectures on compact platforms with limited resources (e.g., embedded systems, mobile phones, portable devices) and for real-time processing (e.g., video surveillance and monitoring, virtual reality), the time and memory complexity and energy consumption of CNNs should be reduced. For instance, the application of CNN-based architectures to real-time face detection in video surveillance remains a challenging task [39]: while the more accurate detectors such as region proposal networks are too slow for real-time applications [45, 8], faster detectors such as single-shot detectors are less accurate [32, 43]. Consequently, effective methods to accelerate and compress deep networks in general, and CNNs in particular, are required to provide a reasonable tradeoff between accuracy and efficiency.
Several techniques have recently been proposed to reduce the complexity of CNNs, ranging from the design of specialized compact architectures like MobileNet [19], to the distillation of knowledge from larger architectures to smaller ones [17]. Among these, pruning techniques provide an automated approach to remove insignificant network elements, e.g., filters, input channels, etc. This paper focuses on filter-level pruning techniques: while they do not provide the compression level of unstructured pruning, their reduction in parameters can be converted into a real speed-up while preserving network accuracy [29, 38]. These techniques attempt to remove the filters and input channels at each convolution layer using various criteria based on, e.g., the L1 norm [29], or the product of feature maps and gradients computed from a validation dataset [38].
Pruning techniques can be applied under two different scenarios: either (1) from a pre-trained network, or (2) from scratch. In the first scenario, pruning is applied as a post-processing procedure, once the network has already been trained, through a one-time pruning (followed by fine-tuning) [29] or a complex iterative process [38] using a validation dataset [29, 34], or by minimizing the reconstruction error [35]. In the second scenario, pruning is applied from scratch by introducing sparsity constraints and/or modifying the loss function used to train the network [33, 52, 59]. The latter scenario can have more difficulty converging to accurate network solutions (due to the modified loss function), and thereby increases the computational complexity required for the optimisation process. For greater training efficiency, the progressive soft filter pruning (PSFP) method was recently introduced [13], allowing for iterative pruning from scratch, where filters are set to zero (instead of being removed) so that the network can preserve a greater learning capacity. This method, however, does not account for the optimization of soft-pruned weights, which can have a negative impact on accuracy, because pruned weights are still being optimized with old momentum values accumulated from previous epochs.
In this paper, a new Progressive Gradient-based Pruning (PGP) technique is proposed for iterative filter pruning to provide a better tradeoff between accuracy and complexity. To this end, the filters are efficiently pruned in a progressive fashion while training a network from scratch, and accuracy is maintained without requiring validation data or additional optimisation constraints. In particular, PGP improves on PSFP by integrating hard and soft pruning strategies to effectively adapt the momentum tensor during the backward propagation pass. It also integrates an improved version of the Taylor selection criterion [38] that relies on the gradient w.r.t. the weights (instead of the output feature maps), and is more suitable for progressive filter-based pruning. For performance evaluation, the accuracy and complexity of the proposed and state-of-the-art filter pruning techniques are compared using ResNet, LeNet and VGG networks trained to address benchmark image classification (MNIST and CIFAR10 datasets) and object detection (PASCAL VOC dataset) problems.
2 Compression and Acceleration of CNNs
In general, the time complexity of a CNN depends more on the convolutional layers, while the fully connected layers contain most of the parameters. Therefore, CNN acceleration methods typically target the complexity of the convolutional layers, while compression methods usually target the complexity of the fully connected layers [10, 11]. This section provides an overview of recent acceleration and compression approaches for CNNs, namely quantization, low-rank approximation, knowledge distillation, compact network design and pruning. Finally, a brief survey of filter pruning methods and challenges is presented.
2.1 Overview of methods:
Quantization:
A deep neural network can be accelerated by reducing the precision of its parameters. Such techniques are often used on general embedded systems, where a low-precision representation, e.g., 8-bit integer, provides faster processing than a higher-precision representation, e.g., 32-bit floating point. There are two main approaches to quantizing a neural network: the first quantizes the weights only [10, 58], and the second quantizes both weights and activations [9, 6]. These techniques can be either scalable [10, 58] or non-scalable [3, 9, 6, 42], where scalable means that an already quantized network can be further compressed.
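As an illustration of the precision reduction described above, the following is a minimal sketch of symmetric uniform 8-bit weight quantization. It is a generic textbook scheme chosen for illustration, not the method of any of the cited works; the function names `quantize_int8` and `dequantize` and the single shared scale are assumptions.

```python
def quantize_int8(weights):
    """Map float weights to integer codes in [-127, 127] with one shared scale.

    Assumption: symmetric, per-tensor quantization (illustrative only).
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [int(round(w / scale)) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in q_weights]


weights = [0.5, -1.0, 0.25, 0.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
```

Each restored weight differs from the original by at most about half a quantization step (`scale / 2`), which is the accuracy cost traded for the smaller integer representation.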
Low-rank decomposition:
Low-rank approximation (LRA) can accelerate CNNs by decomposing a tensor into lower-rank approximations expressed as vector products [24, 49, 27]. There are different ways of decomposing a convolution tensor. Techniques like [24, 49] focus on approximating a tensor by a low-rank tensor that can be obtained either in a layer-by-layer fashion [24] or by scanning the whole network [49]. [53] proposes to force filters to coordinate more information into a lower-rank space during training, and then decomposes the model once it is trained. Another technique employs the CP-decomposition (Canonical Polyadic Decomposition), which achieves a good tradeoff between accuracy and efficiency [27].
Knowledge distillation:
This family of techniques focuses on training a small network, called the student, using a larger model, called the teacher [18]. Unlike in traditional supervised learning, the student is trained by the teacher. These methods can obtain considerable improvements in terms of sparsity and generalization of the produced networks. Most distillation techniques use large pre-trained models as teachers [18, 46]. More recently, there has been interest in training online student-teacher models on the fly [54, 57], or in using GANs to increase the training speed and accuracy [51]. Knowledge distillation has been applied to multiple problems, including object detection [4], NLP [25] and differential privacy [50].
Compact network design:
Compact model design is an alternative way to produce fast deep neural networks. The aim of these techniques is to produce light models for high-speed processing. Different methods have been applied to produce compact models. For instance, MobileNet [19], MobileNetV2 [47] and Xception [5] can achieve real-time speed using depthwise convolutions to reduce computation. Other architectures like ShuffleNet [56, 36] and CondenseNet [21] instead use convolutions that are locally connected in groups to reduce computation.
Pruning:
Pruning is a family of techniques that removes non-useful parameters from a neural network. There are several approaches to pruning deep neural networks. The first is weight pruning, where individual weights are pruned. This approach has been shown to significantly compress and accelerate deep neural networks [10, 55, 11]. Weight pruning techniques usually employ sparse convolution algorithms [30, 44]. The other approach is output channel or filter pruning, where complete output channels or filters are pruned [29, 35, 13, 59]. Since this paper proposes a method for filter pruning, we provide more details on this approach in the next section.
2.2 Filter pruning:
Filter-level pruning techniques attempt to remove the output and input channels at each convolution layer using various criteria, such as the L1 norm [29], entropy [34], L2, APoZ [20], or a combination of feature maps and gradients [38]. These pruning methods have the advantage of being independent of a sparse convolution algorithm since the convolution remains dense, which provides a platform-independent speed-up: a sparse algorithm cannot be easily optimized on parallel computing devices, e.g. GPUs.
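To make the "convolution remains dense" point concrete, here is a minimal sketch of removing filters from one layer together with the matching input channels of the next layer. Weights are represented as nested lists indexed `[out_channel][in_channel]`, with string labels standing in for the k×k kernels; the function name `hard_prune` and this representation are assumptions made for illustration.

```python
def hard_prune(w_cur, w_next, drop):
    """Remove the filters in `drop` from the current layer and the
    corresponding input channels from the next layer.

    Both results are smaller but still dense tensors, so a standard
    (non-sparse) convolution can be applied to them.
    """
    drop = set(drop)
    pruned_cur = [f for i, f in enumerate(w_cur) if i not in drop]
    pruned_next = [[k for j, k in enumerate(f) if j not in drop] for f in w_next]
    return pruned_cur, pruned_next


# Current layer: 4 filters over 3 input channels; next layer: 2 filters over 4 input channels.
w_cur = [[f"a{i}{j}" for j in range(3)] for i in range(4)]
w_next = [[f"b{i}{j}" for j in range(4)] for i in range(2)]
w_cur_p, w_next_p = hard_prune(w_cur, w_next, drop=[1, 3])
```

After pruning filters 1 and 3, the current layer has 2 filters and every filter of the next layer has 2 input channels, so parameter and FLOP counts shrink on any platform without a sparse kernel.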
Following the work on Optimal Brain Damage [7], one of the first papers to show the efficiency of filter-level pruning was [29], where the weight norm is used to identify and prune weak filters, i.e. filters that do not contribute much to the network. Afterwards, several works proposed pruning procedures and filter importance metrics. These methods can be organized into five pruning approaches: 1) pruning as a one-time post-processing step followed by fine-tuning: this approach is simple and easy to apply [29]; 2) pruning iteratively once the model has been trained: iterative pruning and fine-tuning increase the chance of recovering the accuracy lost directly after a filter is pruned [38, 37]; 3) pruning by minimizing the reconstruction error: minimizing the reconstruction error at each layer allows the model to approximate the original performance [35, 16, 59]; 4) pruning using sparsity constraints with a modified objective function, to let the network consider pruning during optimization [33, 2, 1, 28]; 5) pruning progressively while training from scratch or from a pre-trained model: soft pruning [14, 15] is applied, where filters are set to zero instead of actually being removed (hard manner), which leaves the network with more capacity to learn [13].
While the first three approaches are capable of reducing the complexity of a model, they are only applied once the model is already trained; it would be more beneficial to be able to start pruning from scratch during training. The fourth approach can start pruning from scratch by adding sparsity constraints and modifying the optimization objective, but this makes the loss harder and more sensitive to optimize. This can be complicated when the original loss function is already hard to optimize, since modifying it makes it potentially harder for the model to converge to a good solution. The fifth approach eases this process by not removing filters and by using the original loss function. However, we think that this approach can be improved since, currently, it does not handle pruning in the backward pass and only sets the weak filters to zero. Also, the current approach calculates the L2 criterion separately from the parameter updates, i.e. not while iterating inside an epoch. In our approach, we want to compute the criterion directly during the updates, i.e. while iterating inside an epoch and updating parameters.
Another important part of filter pruning is the capacity to evaluate the importance of a filter. Many criteria have been used in the literature to evaluate the importance of filters, e.g. L1 [29], APoZ [20], entropy [34], L2 [13] and Taylor [38]. Among these, we think that the Taylor criterion [38] has the most potential for pruning during training, since it results from trying to minimize the impact of pruning a filter, although we argue that it can be improved for progressive pruning.
3 Progressive Gradient Pruning
3.1 Pruning strategy with momentum:
In a regular CNN, the weight tensor of a convolutional layer $l$ can be defined as $\mathbf{W}\in \mathbb{R}^{n_{\text{out}}\times n_{\text{in}}\times k\times k}$, where $n_{\text{in}}$ and $n_{\text{out}}$ are the number of input and output channels (filters), respectively. The weight tensor of filter $i$ can then be defined as $\mathbf{W}_{i}\in \mathbb{R}^{n_{\text{in}}\times k\times k}$. In order to select the weak filters of a layer, we evaluate the importance of a filter using a criterion function $c$, usually defined as $c(\mathbf{W}_{i}):\mathbb{R}^{n_{\text{in}}\times k\times k}\to \mathbb{R}$. Given a filter, it yields a scalar that represents its rank, e.g. the L1 norm [29] or the gradient norm in our case.
In order to prune convolution layers progressively, an exponential decay function is defined such that there is always a solution in $\mathbb{R}$. (It differs slightly from the one in [13], where the decay function can have solutions in $\mathbb{C}$.) This decay function determines the number of weak filters at each epoch. It is defined as the ratio of filters remaining after training on epoch $t$:
$p_{t}=\exp\left(\frac{\log(1-t_{\text{prune}})}{T}\,t\right),$  (1)
where $t_{\text{prune}}$ is a hyperparameter that defines the ratio of filters to be pruned, and $t\in \{1,2,\dots,T\}$ is the epoch. Since we progressively prune layer by layer and epoch by epoch, we calculate the number of weak filters (or, equivalently, the number of remaining filters) at each layer, $n_{\text{wc}}$. Given the ratio $p_{t}$ at epoch $t$, the number of weak filters for any layer is defined as:
$n_{\text{wc}}=n_{c}(1-p_{t}),$  (2)
where $n_{c}$ is the original number of filters of the layer. Using the number of weak filters $n_{\text{wc}}$ and a pruning criterion function $c$, we obtain the subset of filters $\mathbf{W}_{\text{weak}}\in \mathbb{R}^{n_{\text{wc}}\times n_{\text{in}}\times k\times k}$ with the smallest criterion values. This subset is further divided into two subsets, using a hyperparameter $r$ that sets the ratio of filters to be hard pruned. The subset $\mathbf{W}_{\text{rh}}\in \mathbb{R}^{(n_{\text{wc}}\cdot r)\times n_{\text{in}}\times k\times k}$ is removed completely, while the subset $\mathbf{W}_{\text{rs}}\in \mathbb{R}^{(n_{\text{wc}}\cdot (1-r))\times n_{\text{in}}\times k\times k}$ is reset to zero, and the index sets $R_{h}$ and $R_{s}$ are kept for the backward pass. Additionally, hard pruning is performed on the input channels of the next layer using $R_{h}$.
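The schedule of Equ. 1 and 2 and the hard/soft split can be sketched as follows. This is a minimal illustration under stated assumptions: the rounding of $n_{\text{wc}}$, the truncation of $n_{\text{wc}}\cdot r$, and the choice to hard prune the weakest part of the weak subset are not specified in the text, and the function names are hypothetical.

```python
import math


def remaining_ratio(t, T, t_prune):
    """Equ. 1: ratio of filters kept after training on epoch t (t = 1..T)."""
    return math.exp(math.log(1.0 - t_prune) / T * t)


def split_weak_filters(scores, t, T, t_prune, r):
    """Return (hard_idx, soft_idx) for one layer at epoch t.

    scores: one criterion value per filter (smaller = weaker).
    Assumptions: n_wc is rounded to the nearest integer, and the weakest
    int(n_wc * r) filters are the ones that are hard pruned.
    """
    n_c = len(scores)
    n_wc = int(round(n_c * (1.0 - remaining_ratio(t, T, t_prune))))  # Equ. 2
    weak = sorted(range(n_c), key=lambda i: scores[i])[:n_wc]
    n_hard = int(n_wc * r)
    return weak[:n_hard], weak[n_hard:]


scores = [0.9, 0.1, 0.5, 0.2, 0.8, 0.3, 0.7, 0.05]
hard_idx, soft_idx = split_weak_filters(scores, t=10, T=10, t_prune=0.5, r=0.5)
```

At the final epoch ($t=T$) the remaining ratio equals $1-t_{\text{prune}}$, so with 8 filters and $t_{\text{prune}}=0.5$ four filters are weak; with $r=0.5$ two of them are removed and two are reset to zero.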
Figure 1 illustrates the hard and soft pruning strategy of the PGP technique, with the momentum tensor defined as $\mathbf{M}\in \mathbb{R}^{n_{\text{out}}\times n_{\text{in}}\times k\times k}$, which has the same dimensions as the weight tensor. Using the indexes $R_{s}$, we set to zero the subset $\mathbf{M}_{\text{rs}}\in \mathbb{R}^{\dim(R_{s})\times n_{\text{in}}\times k\times k}$, and we hard prune the subset $\mathbf{M}_{\text{rh}}\in \mathbb{R}^{\dim(R_{h})\times n_{\text{in}}\times k\times k}$ using the indexes $R_{h}$. Current progressive pruning techniques like [13] only set the weights to zero during training, without handling the previously accumulated momentum, which is critical for the optimization. As illustrated in Figure 2, momentum pruning is important for the optimization process.
Let us take a closer look at the typical update equations for the weight and momentum tensors:
$\mathbf{W}_{t+1}=\mathbf{W}_{t}-\alpha *\mathbf{M}_{t}$  (3)
$\mathbf{M}_{t}=\beta *\mathbf{M}_{t-1}+(1-\beta )*\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{t}}$  (4)
where $\mathbf{W}_{t}$ and $\mathbf{M}_{t}$ are respectively the weight and momentum tensors at iteration $t$, and $\alpha $ and $\beta $ are the learning rate and momentum hyperparameters, respectively. By expanding $\mathbf{M}_{t-1}$ in Equ. 4:
$\begin{aligned}\mathbf{M}_{t}&=\beta *\mathbf{M}_{t-1}+(1-\beta )*\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{t}}\\ &=\beta *\left(\beta *\mathbf{M}_{t-2}+(1-\beta )*\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{t-1}}\right)+(1-\beta )*\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{t}}\end{aligned}$  (5)
The tensor $\mathbf{M}_{t-1}$ depends on the previous gradient of the weights at time $t-1$. With a soft pruning technique (like PSFP), the momentum tensor $\mathbf{M}_{t-1}$ computed from $\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{t-1}}$ is meaningless if $\mathbf{W}$ is soft pruned at $t$: the weight is reset, so the optimization point is no longer the same. It is therefore important to adapt the momentum tensor during soft pruning. Our solution is to soft prune the momentum as well, so that the weight tensor is correctly optimized.
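The effect of stale momentum can be seen on a single scalar weight. The sketch below applies the updates of Equ. 3 and 4, soft prunes the weight, and compares keeping the old momentum with resetting it; the learning rate, momentum value, gradients and variable names are arbitrary choices for illustration.

```python
def momentum_step(w, m, grad, lr=0.1, beta=0.9):
    """One update following Equ. 4 (momentum) then Equ. 3 (weight)."""
    m = beta * m + (1.0 - beta) * grad
    return w - lr * m, m


# Accumulate momentum over a few iterations before pruning.
w, m = 1.0, 0.0
for _ in range(5):
    w, m = momentum_step(w, m, grad=2.0)

# Soft prune the weight (set it to zero), then take one step with a zero
# gradient so that only the momentum term can move the weight.
w_stale, _ = momentum_step(0.0, m, grad=0.0)    # old momentum kept
w_reset, _ = momentum_step(0.0, 0.0, grad=0.0)  # momentum zeroed as well
```

With the old momentum kept, the pruned weight immediately moves away from zero in a direction determined by gradients computed at the pre-pruning point; with the momentum zeroed, it stays at zero until new gradients arrive.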
3.2 Selection criteria:
Molchanov et al. [38] proposed the following criterion $\Delta \mathcal{L}(\mathbf{H}_{i})$ to measure the importance of a feature map $\mathbf{H}_{i}$ produced by a filter $\mathbf{W}_{i}$, computed at each layer and for each filter:
$\left|\Delta \mathcal{L}(\mathbf{H}_{i})\right|=\left|\mathcal{L}(\mathcal{D}\mid \mathbf{H}_{i}=0)-\mathcal{L}(\mathcal{D}\mid \mathbf{H}_{i})\right|\approx \left|\frac{\partial \mathcal{L}}{\partial \mathbf{H}_{i}}\mathbf{H}_{i}\right|$  (6)
The term $\mathcal{L}(\mathcal{D}\mid \mathbf{H}_{i}=0)$ refers to the loss of a model when a labeled dataset $\mathcal{D}$ is given with a pruned feature map $\mathbf{H}_{i}=0$, and $\mathcal{L}(\mathcal{D}\mid \mathbf{H}_{i})$ is the original loss before the model is pruned. In summary, the criterion of Equ. 6 is the difference between the loss of the pruned model and that of the original model. The criterion grows with the impact of the feature map. This criterion has been shown to work well on some trained networks. However, in the scenario where the network is pruned from scratch, we argue that the information measured from the feature map $\mathbf{H}_{i}$ is not informative since the model is not yet trained. Empirical results in Section 4 also indicate that the criterion of Equ. 6 is not as effective as other criteria for progressive pruning.
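For reference, the approximation in Equ. 6 can be sketched as a first-order Taylor expansion of the loss around the current feature map, evaluated at $\mathbf{H}_{i}=0$. This follows the usual derivation of the Taylor criterion [38] and ignores the higher-order remainder:

```latex
\begin{align}
\mathcal{L}(\mathcal{D}\mid \mathbf{H}_{i}=0)
  &\approx \mathcal{L}(\mathcal{D}\mid \mathbf{H}_{i})
   + \frac{\partial \mathcal{L}}{\partial \mathbf{H}_{i}}\,(0-\mathbf{H}_{i}), \\
\left|\mathcal{L}(\mathcal{D}\mid \mathbf{H}_{i}=0)-\mathcal{L}(\mathcal{D}\mid \mathbf{H}_{i})\right|
  &\approx \left|\frac{\partial \mathcal{L}}{\partial \mathbf{H}_{i}}\,\mathbf{H}_{i}\right| .
\end{align}
```

The criterion therefore only needs the feature map and its gradient, both of which are available from a regular forward-backward pass.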
Instead of using $\mathbf{H}_{i}=0$ to prune a feature map [31] or filter, we can replace $\mathbf{H}_{i}$ with $\mathbf{W}_{i}$, since setting a filter to zero is the same as pruning it [13]. The same Taylor expansion from [38] can then be applied with $\mathbf{W}_{i}$, resulting in:
$\text{TW}=\left|\mathcal{L}(\mathcal{D}\mid \mathbf{W}_{i}=0)-\mathcal{L}(\mathcal{D}\mid \mathbf{W}_{i})\right|\approx \left|\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{i}}\mathbf{W}_{i}\right|$  (7)
Equ. 7 can be further simplified when taking into account the nature of soft pruning. We can decompose this equation because $\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{i}}\mathbf{W}_{i}$ is an element-wise multiplication:
$\left|\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{i}}\mathbf{W}_{i}\right|=\left|\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{i}}\right|\left|\mathbf{W}_{i}\right|$  (8)
where $\left|\mathbf{W}_{i}\right|$ is the absolute value of the weights of filter $i$. This means that $\left|\mathbf{W}_{i}\right|$ can be zero or very close to zero if $\mathbf{W}_{i}$ is one of the filters that were soft pruned. In this case, $\mathbf{W}_{i}$ has little chance to recover, since it will likely be pruned again. In order to encourage more recovery of soft-pruned filters, we propose to remove the $\left|\mathbf{W}_{i}\right|$ term:
$\text{GN}_{i}=\left|\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{i}}\right|$  (9)
where $\text{GN}_{i}$ is the criterion of our approach for filter $i$. There are two ways of calculating our criterion:

• PGP: performs a training epoch without updating the model, and computes the pruning criterion. This amounts to a batch gradient descent without updating the parameters at the end, and can provide better performance since the optimization is less noisy than SGD.

• RPGP: computes the pruning criterion directly during a forward-backward pass of training (while updating). This approach uses an SGD optimizer and calculates the criterion directly during the optimization and update of the model.
In either case, the criterion is applied over several iterations, so there are two ways of interpreting Equ. 9. One natural way is to accumulate gradients, where the gradients are summed up into the total gradient of a filter. Since PGP goes through the entire epoch without updates, we can use an L1 norm to sum up the variation inside a filter, using the criterion:
${GN_{G}}^{i}=\left\|\sum_{j}^{N}\mathbf{G}_{ij}\right\|_{1}$  (10)
where $\mathbf{G}_{ij}$ is the gradient tensor of filter $i$ at iteration $j$ inside an epoch. Equ. 10 measures the amount of global change for a filter at the end of an epoch, which makes it most suitable for PGP. The second way is to accumulate the actual changes of a filter at each update, using the criterion:
${GN_{S}}^{i}=\sum_{j}^{N}\left\|\mathbf{G}_{ij}\right\|_{1}$  (11)
Equ. 11 calculates the L1 norm of the gradient tensor of a filter at each iteration during an epoch. Thus, instead of measuring the global change only at the end as in Equ. 10, it measures the gradual changes during an epoch. This criterion is most suitable for RPGP, since the weights are updated at the same time as the gradients are accumulated. PGP is summarized in Algo. 1. The algorithm for RPGP is similar, but the criterion is calculated directly at the training step.
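The difference between the two criteria can be illustrated with a small sketch, where each filter's gradient at an iteration is flattened to a list of floats. The function names `gn_global` and `gn_step` and the toy gradients are assumptions for illustration.

```python
def l1_norm(values):
    """L1 norm of a flattened tensor."""
    return sum(abs(v) for v in values)


def gn_global(grads):
    """Equ. 10: L1 norm of the gradients summed over all iterations."""
    summed = [sum(column) for column in zip(*grads)]
    return l1_norm(summed)


def gn_step(grads):
    """Equ. 11: sum over iterations of the per-iteration L1 norm."""
    return sum(l1_norm(g) for g in grads)


# Two iterations, two weights per filter.
cancelling = [[1.0, -2.0], [-1.0, 2.0]]  # gradients cancel over the epoch
consistent = [[1.0, -2.0], [1.0, -2.0]]  # gradients agree over the epoch

scores = {
    "cancelling": (gn_global(cancelling), gn_step(cancelling)),
    "consistent": (gn_global(consistent), gn_step(consistent)),
}
```

When gradients point in a consistent direction, both criteria agree; when they cancel out, Equ. 10 yields zero while Equ. 11 still registers the activity at each step. By the triangle inequality, the value of Equ. 10 never exceeds that of Equ. 11.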
4 Experiments
In this section, we compare the experimental results obtained using the proposed PGP and RPGP techniques against state-of-the-art filter pruning techniques that are representative of each family described in Section 2.2: L1-norm Pruning (prunes once), Taylor Pruning (prunes iteratively), DCP (specialised loss function and minimization of the reconstruction error) and PSFP (progressive pruning). Performance is measured in terms of accuracy, and in terms of time and memory complexity (number of parameters and number of FLOPS). For techniques like our PGP, as well as PSFP, DCP and L1, it is possible to set a target pruning rate hyperparameter $t_{\text{prune}}$. At a fixed pruning rate, the complexity (number of FLOPS and parameters) is identical for these techniques, so we can compare them in terms of accuracy for a given complexity. In contrast, techniques like Taylor prune until the end, and then select the proportion of filters to be pruned. Our experiments consider two visual recognition tasks: (1) image classification (using the MNIST and CIFAR10 datasets), and (2) object detection (using the PASCAL VOC dataset). Since pruning ResNet needs a special strategy, we decided to follow the popular pruning strategy proposed in [29]: pruning the downsampling layer and then using the same indexes to prune the last convolution of the residual. For the pruning of Faster R-CNN [40], we skip the pruning of the last layer since it would mean pruning the input of the RPN layer, which we found empirically to result in a significant performance reduction. Techniques are compared using ResNet, LeNet and VGG networks trained to address benchmark problems. The source code for our paper will be available at https://github.com/Anon6627/PruningPGP.
4.1 Experimental Results
Performance on MNIST classification data:
In this case, we use the same hyperparameters as in the original papers. The same settings were used for LeNet5 and ResNet20. With PGP and RPGP, we use a learning rate of 0.01, a momentum of 0.9 and 40 epochs, with a remove rate of 50%. For PSFP, we use the same settings, except for the 50% remove rate. For Taylor [38], we iteratively remove 5 filters each time, and then fine-tune for 5 epochs. This varies slightly from the original procedure because this configuration does not collapse and returns the best result. For L1 pruning, we fine-tune for 20 epochs after pruning. For DCP, we ran the authors' code for MNIST over 40 epochs, with 20 epochs for filter pruning and 20 epochs for fine-tuning.
Table 1: Results for LeNet5 on MNIST.

| Methods | $t_{\text{prune}}$ | Params | FLOPS | Error % ($\pm$ gap) |
|---|---|---|---|---|
| Baseline LeNet5 | 0% | 61K | 446K | 0.84 (0) |
| L1 [29] | 30% | 34.1K | 304K | 0.9 (+0.06) |
| | 50% | 18K | 152K | 1.05 (+0.21) |
| | 70% | 84K | 82K | 2.22 (+1.38) |
| Taylor [38] | 30% | 38K | 286K | 0.9 (+0.06) |
| | 50% | 24K | 76K | 1.05 (+0.21) |
| | 70% | 13K | 34K | 1.22 (+0.38) |
| DCP [59] | 30% | 42.7K | 325K | 2.75 (+1.91) |
| | 50% | 30.5K | 232K | 4.18 (+3.34) |
| | 70%$^{1}$ | 30.5K | 232K | 6.28 (+5.44) |
| PSFP [13] | 30% | 34.1K | 304K | 1.32 (+0.48) |
| | 50% | 18K | 152K | 2.27 (+1.43) |
| | 70% | 84K | 82K | 2.99 (+2.15) |
| PGP_GN$_{G}$ (ours) | 30% | 34.1K | 304K | 0.87 (+0.03) |
| | 50% | 18K | 152K | 1.08 (+0.24) |
| | 70% | 84K | 82K | 1.74 (+0.9) |
| RPGP_GN$_{S}$ (ours) | 30% | 34.1K | 304K | 0.9 (+0.06) |
| | 50% | 18K | 152K | 1.25 (+0.41) |
| | 70% | 84K | 82K | 1.75 (+0.91) |

$^{1}$ Since DCP's code, provided by the authors, did not handle non-residual architectures, we had to modify the original code. Pruning rates above 50% are stuck on LeNet and VGG19.
Table 2: Results for ResNet20 on MNIST.

| Methods | $t_{\text{prune}}$ | Params | FLOPS | Error % ($\pm$ gap) |
|---|---|---|---|---|
| Baseline ResNet20 | 0% | 272K | 41M | 0.74 (0) |
| L1 [29] | 30% | 137K | 22M | 0.75 (+0.01) |
| | 50% | 68K | 10M | 1.09 (+0.35) |
| | 70% | 27K | 4.2M | 2.02 (+1.28) |
| Taylor [38] | 30% | 149K | 17.7M | 0.87 (+0.13) |
| | 50% | 87K | 7.8M | 0.95 (+0.21) |
| | 70% | 36K | 2.6M | 1.04 (+0.30) |
| DCP [59] | 30% | 193K | 30.3M | 1.11 (+0.37) |
| | 50% | 138K | 21.1M | 0.62 (-0.12) |
| | 70% | 87.7K | 13.5M | 1.19 (+0.45) |
| PSFP [13] | 30% | 137K | 22M | 0.5 (-0.24) |
| | 50% | 68K | 10M | 0.61 (-0.13) |
| | 70% | 27K | 4.2M | 0.72 (-0.02) |
| PGP_GN$_{G}$ (ours) | 30% | 137K | 22M | 0.4 (-0.34) |
| | 50% | 68K | 10M | 0.51 (-0.23) |
| | 70% | 27K | 4.2M | 0.57 (-0.17) |
| RPGP_GN$_{S}$ (ours) | 30% | 137K | 22M | 0.4 (-0.34) |
| | 50% | 68K | 10M | 0.48 (-0.29) |
| | 70% | 27K | 4.2M | 0.5 (-0.24) |
Results in Tab. 1 show that our PGP methods compare favorably against state-of-the-art techniques like L1, Taylor and PSFP. Similar tendencies are seen in Tab. 2. We also see that PGP performs slightly better than DCP in some cases. Finally, since PGP_GN$_{G}$ and RPGP_GN$_{S}$ share the same criterion, the difference in their results comes from their procedures. The slightly better performance of PGP_GN$_{G}$ can be explained by the fact that its pruning criterion is calculated using batch gradient descent instead of stochastic gradient descent.
Performance on CIFAR10 classification data:
In this case, we use a VGG19 for CIFAR10, with a learning rate of 0.1, a momentum of 0.9 and 400 epochs, and we decrease the learning rate by a factor of 10 at 160 and 240 epochs. We also use a ResNet56 adapted to CIFAR10 with the same settings, except with 500 epochs. For PGP and RPGP, we set the remove rate hyperparameter $r$ to 0.5 (50%), fine-tune for 100 epochs after pruning, and keep the best score. We use the same settings for PSFP except for the remove rate $r$. For Taylor, 5 filters are iteratively removed each time, followed by 5 epochs of fine-tuning. We slightly changed the procedure compared to the original paper, because the original procedure prunes one feature map per iteration, which is inefficient on a large model. Empirically, we found that removing 5 feature maps at a time gives the best accuracy. For L1 pruning, 100 epochs of fine-tuning are used after pruning to find the best score. With DCP, the settings provided by the original authors are found to have the best performance.
Table 3: Results for VGG19 on CIFAR10.

| Methods | $t_{\text{prune}}$ | Params | FLOPS | Error % ($\pm$ gap) |
|---|---|---|---|---|
| Baseline VGG19 | 0% | 20M | 400M | 6.23 (0) |
| L1 [29] | 30% | 9M | 198M | 16.94 (+8.41) |
| | 50% | 5M | 100M | 16.51 (+7.98) |
| | 70% | 1M | 37M | 16.17 (+7.64) |
| Taylor [38] | 30% | 10M | 156M | 9.82 (+2.29) |
| | 50% | 5M | 72M | 11.94 (+3.41) |
| | 70% | 1.9M | 24M | 16.85 (+8.32) |
| DCP [59] | 30% | 10M | 221M | 5.8 (-0.65) |
| | 50% | 6M | 158M | 7.76 (+1.53) |
| | 70%$^{\dagger}$ | 6M | 158M | 7.86 (+1.63) |
| PSFP [13] | 30% | 9M | 198M | 8.98 (+2.75) |
| | 50% | 5M | 100M | 11.2 (+4.97) |
| | 70% | 1M | 37M | 12.06 (+5.83) |
| PGP_GN$_{G}$ (ours) | 30% | 9M | 198M | 7.37 (+1.14) |
| | 50% | 5M | 100M | 8.38 (+2.15) |
| | 70% | 1M | 37M | 9.7 (+3.47) |
| RPGP_GN$_{S}$ (ours) | 30% | 9M | 198M | 7.65 (+1.42) |
| | 50% | 5M | 100M | 8.79 (+2.56) |
| | 70% | 1M | 37M | 10.56 (+4.33) |

$^{\dagger}$ Same limitation of the modified DCP code as noted for LeNet: pruning rates above 50% are stuck on VGG19.
Methods  $t_{\text{pruned}}$  Params  FLOPS  Error % ($\pm$ gap)

Baseline ResNet56  0%  855K  128M  6.02 (0)
L1 [29]  30%  431K  67M  13.34 (+7.32)
L1 [29]  50%  215K  32M  15.57 (+9.55)
L1 [29]  70%  84K  13M  17.89 (+11.87)
Taylor [38]  40%  491K  51M  13.9 (+7.88)
Taylor [38]  50%  268K  23M  15.34 (+9.32)
Taylor [38]  70%  100K  8M  22.1 (+16.08)
DCP [59]  30%  600K  90M  5.67 (-0.35)
DCP [59]  50%  430K  65M  6.43 (+0.41)
DCP [59]  70%  270K  41M  7.18 (+1.16)
PSFP [13]  30%  431K  67M  8.94 (+2.92)
PSFP [13]  50%  215K  32M  10.93 (+4.91)
PSFP [13]  70%  84K  13M  14.18 (+8.16)
PGP_GN_{G} (ours)  30%  431K  67M  8.95 (+2.93)
PGP_GN_{G} (ours)  50%  215K  32M  10.59 (+4.57)
PGP_GN_{G} (ours)  70%  84K  13M  13.02 (+7.00)
RPGP_GN_{S} (ours)  30%  431K  67M  9.37 (+3.35)
RPGP_GN_{S} (ours)  50%  215K  32M  10.46 (+4.44)
RPGP_GN_{S} (ours)  70%  84K  13M  14.16 (+8.14)
From Tabs. 3 and 4, our techniques consistently perform better than the state-of-the-art techniques L1, Taylor and PSFP on VGGNet. On ResNet, PSFP uses a different pruning strategy: it does not prune the downsampling layer, and therefore does not prune the last convolutional layer of the residual connection. This translates into slightly better accuracy in some settings. Our ablation study also provides a comparison of techniques using the same pruning strategy on ResNet, and shows the importance of momentum pruning. DCP performs better than our techniques on this dataset, mainly because of the additional losses that help select discriminative filters. However, it is difficult to compare directly since the methods do not yield the same number of FLOPS and parameters, and DCP starts from a trained model and requires more computation.
Performance on PASCAL VOC detection data:
In this case, the PGP, RPGP and PSFP techniques are adapted for an object detection problem. We progressively prune a Faster R-CNN with a VGG16 backbone using a learning rate of 0.001 and momentum of 0.9, with 10 epochs of progressive pruning and early stopping for fine-tuning over a few epochs. For L1 pruning, 50% of the filters of a trained model are pruned, and the network is then fine-tuned on PASCAL VOC. For this experiment, we set the pruning rate hyperparameter $r$ to 0.5 (50%), and report the mean average precision (mAP) for comparison. In Tab. 5, PGP and RPGP perform better than PSFP, the current state-of-the-art progressive pruning technique. However, PGP needs more time to prune due to the calculation of the criterion in a separate epoch. RPGP provides slightly better performance (possibly due to stochasticity), with much less pruning time. The difference in accuracy between RPGP and PSFP highlights the importance of momentum pruning with these approaches. The significant difference in training time between RPGP and PSFP also suggests that adding hard pruning to existing soft pruning during training can reduce training time. Overall, our proposed techniques work best on smaller architectures. They achieve a better tradeoff in terms of pruning time, compression and accuracy than state-of-the-art progressive pruning techniques. They also achieve performance comparable to state-of-the-art techniques that start from a trained model, like DCP, while starting from scratch. Also, while DCP has better performance, it would be very costly to deploy DCP in a production environment with limited computational power; our algorithm therefore offers a better tradeoff in terms of pruning time and accuracy.
4.2 Ablation study:
The training and pruning times of a model are important factors for a technique, for instance when deploying or adapting a model in an operational environment. One advantage of progressive pruning techniques is the reduction of processing time at each epoch, since filters are removed during training. Tab. 6 presents the training and pruning times for the evaluated techniques. For progressive pruning techniques, values represent both pruning and training times, while for DCP, L1 and iterative (Taylor) pruning, values represent (training time) + pruning and retraining times. Experiments are conducted on the CIFAR10 dataset with the same settings as above, running on an isolated computer (Intel Xeon Gold 5118 @ 2.3 GHz) with an Nvidia Tesla P100 GPU card.
Methods  VGG19 ($t_{\text{pruned}}=0.5$)  VGG19 ($t_{\text{pruned}}=0.9$)  ResNet56 ($t_{\text{pruned}}=0.5$)  ResNet56 ($t_{\text{pruned}}=0.9$)

Baseline  219m  219m  307m  307m
L1 [29]  (219m) + 32m  (219m) + 32m  (307m) + 48m  (307m) + 48m
Taylor [38]  (219m) + 254m  (219m) + 457m  (307m) + 488m  (307m) + 878m
DCP [59]  -  -  (307m) + 489m  (307m) + 443m
PSFP [13]  219m  219m  307m  307m
PGP (ours)  329m  329m  441m  441m
RPGP (ours)  211m  168m  263m  241m
From Tab. 6, the fastest pruning method (without considering training time) is L1. However, it should be noted that the original training of the model takes around 219 min for VGG19 and 307 min for ResNet56. When training time is also taken into account, L1 is slower than our RPGP approach. Other techniques like Taylor prune iteratively, alternating between pruning a few feature maps and fine-tuning; this can be very slow, depending on the number of filters pruned at each iteration. DCP is particularly slow since it needs to start from an already trained model, and its pruning process then requires both a filter selection optimization and fine-tuning after pruning. PSFP takes a similar time to the original training since it does not actually change the size of the model during training. The difference between PGP and RPGP is that PGP uses an entire epoch to compute the pruning criterion, whereas RPGP computes the criterion directly during a training epoch. Also, since we hard-prune filters at each epoch, epochs become faster as the model is pruned and trained. Overall, the progressive pruning methods train and prune in considerably less time than other methods.
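To make the end-to-end comparison concrete, the short sketch below adds the training time to the pruning and retraining time for the methods that start from a trained model, using only the values reported in Tab. 6 for a target pruning rate of 0.5. The dictionary layout and the helper name `total_minutes` are illustrative choices, not part of the original experiments.

```python
# Times in minutes, copied from Tab. 6 (target pruning rate 0.5).
TRAIN = {"VGG19": 219, "ResNet56": 307}

# Methods that prune a trained model: extra pruning + retraining time.
EXTRA = {
    "L1": {"VGG19": 32, "ResNet56": 48},
    "Taylor": {"VGG19": 254, "ResNet56": 488},
}

# Progressive methods: a single figure covers training and pruning.
END_TO_END = {
    "PSFP": {"VGG19": 219, "ResNet56": 307},
    "PGP": {"VGG19": 329, "ResNet56": 441},
    "RPGP": {"VGG19": 211, "ResNet56": 263},
}


def total_minutes(method, net):
    """Total time from a randomly initialized model to a pruned one."""
    if method in END_TO_END:
        return END_TO_END[method][net]
    return TRAIN[net] + EXTRA[method][net]


if __name__ == "__main__":
    for method in ("L1", "Taylor", "PSFP", "PGP", "RPGP"):
        print(method, total_minutes(method, "VGG19"), total_minutes(method, "ResNet56"))
```

For VGG19, this gives 219 + 32 = 251 min for L1 against 211 min for RPGP.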
To compare the selection criteria, we use the same configuration as the general comparison for RPGP on CIFAR10, except that we vary the criterion and set a pruning rate of 50%.
Networks  L2  Taylor  TW  GN_G  GN_S 

VGG19  8.47%  9.27%  8.78%  8.47%  8.79% 
ResNet56  10.30%  10.97%  10.46%  10.24%  10.28% 
In Tab. 7, we can see that our criterion performs better than the others in the context of progressive pruning, and similarly to the L2 norm. The comparison between Taylor Weight (TW) and Gradient Norm (GN) shows that a small gradient norm during training may be a good indicator of the importance of a filter. From the table we can also see that Taylor Weight performs better than the original Taylor criterion. Overall, $GN_G$, which uses the batch gradient to capture changes, seems to work best with progressive pruning. The similarity between L2 and $GN$ is explained in the Supplementary Material.
In this experiment of momentum pruning, the same strategy, hyperparameters and L2 criterion are used for both RPGP and PSFP. The only difference is that RPGP performs momentum pruning.
Method  VGG19  ResNet56 

PSFP  11.20%  10.93% 
RPGP  8.47%  10.09% 
From Tab. 8, in both cases (VGG19 and ResNet56), our proposed method performs better than the state-of-the-art PSFP method. Since everything is the same in this setting except momentum pruning, this clearly shows the advantage of pruning momentum during progressive pruning.
As described, PSFP does not prune the downsampling layer of ResNet56; thus, it does not prune the last layer of the residual connection. The performance of PSFP and RPGP is compared using the same strategy on ResNet56, i.e., the downsampling layer and the last layer of the residual connection are not pruned, on the CIFAR10 dataset with the same hyperparameters as in previous experiments. The results in Tab. 9 indicate that the RPGP approach typically performs better than PSFP. Interestingly, when no pruning is performed on the downsampling layer and the last layer of the residual connection, our method performs much better. The residual connection is sensitive to pruning, and may require a different pruning strategy.
$t_{\text{prune}}\times 100$%

Methods  30%  50%  70%  90% 
PSFP  8.94  10.93  14.18  28.09 
RPGP(GN_S)  8.87  10.09  11.02  13.94 
5 Conclusion
In this paper, we show that it is possible to efficiently prune a deep learning model from scratch with the PGP technique while improving the tradeoff between compression, accuracy and training time. PGP is a new progressive pruning technique that relies on the change in filter weights to apply hard and soft pruning strategies that allow for pruning along the backpropagation path. The filter selection criterion is well adapted for progressive pruning from scratch when the norm of the gradient is considered. Results obtained from pruning various CNNs on image data for classification and object detection problems show that the proposed PGP maintains a high level of accuracy with compact networks. Results show that PGP can achieve better CNN optimisations than PSFP, often translating to a higher level of accuracy for the same pruning rate as PSFP and other state-of-the-art techniques. Future research will involve analyzing the performance of different CNNs pruned using the proposed method on larger datasets from real-world visual recognition problems (e.g., tracking and recognition of persons in video surveillance).
References
 [1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2270–2278. Curran Associates, Inc., 2016.
 [2] J. M. Alvarez and M. Salzmann. Compression-aware training of deep networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 856–867. Curran Associates, Inc., 2017.
 [3] Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. CoRR, abs/1702.00953, 2017.
 [4] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker. Learning efficient object detection models with knowledge distillation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 742–751. Curran Associates, Inc., 2017.
 [5] F. Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016.
 [6] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.
 [7] Y. L. Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605. Morgan Kaufmann, 1990.
 [8] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 379–387. Curran Associates, Inc., 2016.
 [9] J. Faraone, N. J. Fraser, M. Blott, and P. H. W. Leong. SYQ: learning symmetric quantization for efficient deep neural networks. CoRR, abs/1807.00301, 2018.
 [10] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
 [11] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135–1143. Curran Associates, Inc., 2015.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
 [13] Y. He, X. Dong, G. Kang, Y. Fu, and Y. Yang. Progressive deep neural networks acceleration via soft filter pruning. CoRR, abs/1808.07471, 2018.
 [14] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang. Soft filter pruning for accelerating deep convolutional neural networks. CoRR, abs/1808.06866, 2018.
 [15] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [16] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. CoRR, abs/1707.06168, 2017.
 [17] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
 [18] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
 [19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
 [20] H. Hu, R. Peng, Y. Tai, and C. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. CoRR, abs/1607.03250, 2016.
 [21] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. CoRR, abs/1711.09224, 2017.
 [22] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
 [23] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy tradeoffs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
 [24] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. CoRR, abs/1405.3866, 2014.
 [25] Y. Kim and A. M. Rush. Sequence-level knowledge distillation. CoRR, abs/1606.07947, 2016.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, May 2017.
 [27] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. CoRR, abs/1412.6553, 2014.
 [28] C. Lemaire, A. Achkar, and P.-M. Jodoin. Structured pruning of neural networks with budget-aware regularization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [29] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016.
 [30] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [31] C. Liu and H. Wu. Channel pruning based on mean gradient for accelerating convolutional neural networks. Signal Processing, 156:84–91, 2019.
 [32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. In ECCV (1), volume 9905 of Lecture Notes in Computer Science, pages 21–37. Springer, 2016.
 [33] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. CoRR, abs/1708.06519, 2017.
 [34] J. Luo and J. Wu. An entropy-based pruning method for CNN compression. CoRR, abs/1706.05791, 2017.
 [35] J. Luo, J. Wu, and W. Lin. Thinet: A filter level pruning method for deep neural network compression. CoRR, abs/1707.06342, 2017.
 [36] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In The European Conference on Computer Vision (ECCV), September 2018.
 [37] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz. Importance estimation for neural network pruning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [38] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient transfer learning. CoRR, abs/1611.06440, 2016.
 [39] L. T. Nguyen-Meidine, E. Granger, M. Kiran, and L. Blais-Morin. A comparison of cnn-based face and head detectors for real-time video surveillance applications. In 2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA), pages 1–7, Nov 2017.
 [40] J. Park, S. R. Li, W. Wen, H. Li, Y. Chen, and P. Dubey. Holistic sparsecnn: Forging the trident of accuracy, speed, and size. CoRR, abs/1608.01409, 2016.
 [41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPSW, 2017.
 [42] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016.
 [43] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. CoRR, abs/1804.02767, 2018.
 [44] M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun. Sbnet: Sparse blocks network for fast inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [45] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
 [46] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.
 [47] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, abs/1801.04381, 2018.
 [48] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 [49] C. Tai, T. Xiao, X. Wang, and W. E. Convolutional neural networks with low-rank regularization. CoRR, abs/1511.06067, 2015.
 [50] J. Wang, W. Bao, L. Sun, X. Zhu, B. Cao, and P. S. Yu. Private model compression via knowledge distillation. CoRR, abs/1811.05072, 2018.
 [51] X. Wang, R. Zhang, Y. Sun, and J. Qi. Kdgan: Knowledge distillation with generative adversarial networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 775–786. Curran Associates, Inc., 2018.
 [52] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. CoRR, abs/1608.03665, 2016.
 [53] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li. Coordinating filters for faster deep neural networks. CoRR, abs/1703.09746, 2017.
 [54] X. Lan, X. Zhu, and S. Gong. Knowledge distillation by on-the-fly native ensemble. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7517–7527. Curran Associates, Inc., 2018.
 [55] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In The European Conference on Computer Vision (ECCV), September 2018.
 [56] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017.
 [57] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [58] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless cnns with lowprecision weights. CoRR, abs/1702.03044, 2017.
 [59] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discrimination-aware channel pruning for deep neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 883–894. Curran Associates, Inc., 2018.
Supplementary Material
Appendix A Additional Experimental Results
A.1 Implementation Details
One of the problems of pruning during training is how to handle the shape of the gradient and momentum tensors during the backward pass. In the case of PyTorch [41], the shapes of the gradient and momentum tensors are usually handled by the optimizer, which does not necessarily update them during the forward pass. Also, naively defining a new optimizer for the newly pruned model would result in losing all values accumulated in the momentum buffer. One way to overcome this is to also prune the gradient and momentum tensors using the indices that were used to prune the weight tensor, and then transfer them to a newly defined optimizer.
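The idea of pruning the weight, gradient and momentum tensors with the same indices can be sketched without any framework. The toy `MomentumSGD` class and the helpers `keep_indices`, `select` and `prune_and_rebuild` below are hypothetical names introduced for illustration; they do not reproduce PyTorch's optimizer API or the authors' implementation, and a layer is simplified to a list of rows with one row per filter.

```python
def keep_indices(num_filters, pruned):
    """Indices of the filters that survive pruning."""
    pruned = set(pruned)
    return [i for i in range(num_filters) if i not in pruned]


def select(rows, keep):
    """Copy only the rows (filters) listed in `keep`."""
    return [list(rows[i]) for i in keep]


class MomentumSGD:
    """Toy SGD with momentum over one weight matrix (one row per filter)."""

    def __init__(self, weights, lr=0.1, momentum=0.9, buffer=None):
        self.weights = weights
        self.lr = lr
        self.momentum = momentum
        # A fresh optimizer would start from an all-zero momentum buffer.
        if buffer is None:
            buffer = [[0.0] * len(row) for row in weights]
        self.buffer = buffer

    def step(self, grads):
        for w, v, g in zip(self.weights, self.buffer, grads):
            for k in range(len(w)):
                v[k] = self.momentum * v[k] + g[k]
                w[k] -= self.lr * v[k]


def prune_and_rebuild(opt, grads, pruned):
    """Prune weights, gradients and momentum with the same indices,
    then hand the surviving momentum to a newly defined optimizer."""
    keep = keep_indices(len(opt.weights), pruned)
    new_opt = MomentumSGD(
        select(opt.weights, keep),
        lr=opt.lr,
        momentum=opt.momentum,
        buffer=select(opt.buffer, keep),
    )
    return new_opt, select(grads, keep)


if __name__ == "__main__":
    opt = MomentumSGD([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
    grads = [[0.5, 0.5], [0.1, 0.1], [0.2, 0.2]]
    opt.step(grads)
    new_opt, new_grads = prune_and_rebuild(opt, grads, pruned=[1])
    print(new_opt.buffer)  # momentum of the two surviving filters
```

Passing `buffer=None` to the new optimizer instead would correspond to the naive re-creation mentioned above, where the accumulated momentum is lost.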
A.2 Graphical comparison on CIFAR10 with VGG:
The results presented in this section are similar to the ones shown in Tabs. 1 to 4 of our paper. In the main paper, we could only compare the performance of methods with 4 pruning rates due to space constraints. In this section, we compare the performance of methods using the same experimental settings (as in our paper), but with 10 data points ($t_{\text{pruned}}=0.1,0.2,\dots,1.0$) for L1 [29], Taylor [38], PSFP [13] and our approach. Since the number of remaining parameters can differ slightly from one algorithm to another, some of the values on the X-axis are rounded up for better visualization.
Results in Figure 3 show the proposed PGP and RPGP pruning methods consistently outperforming the other methods. Note that the proposed methods maintain a low level of error even with a large increase in the pruning rate.
A.3 L2 vs Gradient Norm:
From the ablation study, we noticed that the performance of L2 and Gradient norm is very similar in the case of soft pruning. This can be understood considering the following:
$$\begin{aligned}
\left\|\mathbf{W}_{i}^{j}\right\|_{2} &= \left\|\mathbf{W}_{i}^{j-1}-\alpha \frac{\partial \mathcal{L}^{j-1}}{\partial \mathbf{W}_{i}^{j-1}}\right\|_{2} \\
&= \left\|\mathbf{W}_{i}^{j-2}-\alpha \frac{\partial \mathcal{L}^{j-2}}{\partial \mathbf{W}_{i}^{j-2}}-\alpha \frac{\partial \mathcal{L}^{j-1}}{\partial \mathbf{W}_{i}^{j-1}}\right\|_{2} \\
&= \left\|\mathbf{W}_{i}^{0}-\alpha \sum_{k=0}^{j-1}\frac{\partial \mathcal{L}^{k}}{\partial \mathbf{W}_{i}^{k}}\right\|_{2}
\end{aligned} \tag{12}$$
where $\mathbf{W}_{i}^{j}$ represents the weights of a filter $i$ at iteration $j$ in an epoch, $\alpha$ is the learning rate, and $\mathcal{L}^{k}$ denotes the loss function at iteration $k$. From Eq. 12 we can observe that the difference between the L2 norm and the Gradient Norm lies in the initial value $\mathbf{W}_{i}^{0}$. Taking into account the partially soft pruning nature of our approach, $\mathbf{W}_{i}^{0}$ can be zero when the filter is soft pruned. Therefore the two criteria tend to have similar values (since $\alpha$ is a scalar, it is not important in this context).
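The argument can be checked numerically on a toy filter. The sketch below assumes plain SGD updates as in Eq. 12 (no momentum or weight decay), takes the Gradient Norm to be the norm of the accumulated update, and uses fixed random vectors as stand-ins for the gradients; the helper names are hypothetical. When the filter starts at zero, as after soft pruning, the L2 norm of the final weights coincides with $\alpha$ times the norm of the summed gradients.

```python
import math
import random


def l2(v):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(x * x for x in v))


def simulate(w0, grads, alpha):
    """Apply plain SGD updates W <- W - alpha * g and return the final weights."""
    w = list(w0)
    for g in grads:
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w


def accumulated_gradient_norm(grads, alpha):
    """alpha * || sum_k g_k ||_2, the norm of the total weight change."""
    total = [sum(column) for column in zip(*grads)]
    return alpha * l2(total)


random.seed(0)
alpha = 0.1
# 20 iterations of stand-in gradients for a filter with 9 weights.
grads = [[random.gauss(0, 1) for _ in range(9)] for _ in range(20)]

# Soft-pruned filter: its weights start at zero.
w_soft = simulate([0.0] * 9, grads, alpha)
# Ordinary filter: its weights start at arbitrary non-zero values.
w_dense = simulate([random.gauss(0, 1) for _ in range(9)], grads, alpha)

gn = accumulated_gradient_norm(grads, alpha)

if __name__ == "__main__":
    print("soft-pruned start: L2 =", l2(w_soft), " GN =", gn)
    print("non-zero start:    L2 =", l2(w_dense), " GN =", gn)
```

For the filter with a non-zero start, the two quantities generally differ because of the $\mathbf{W}_{i}^{0}$ term.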
A.4 Progressive pruning from scratch vs trained:
Tab. 10 compares the performance obtained by a model that was randomly initialized (scratch) with that of one that was pretrained on CIFAR10, using the same settings as before ($t_{\text{pruned}}=50\%$, $r=0.5$).
Training Scenario  VGG19  ResNet56 

Scratch  8.79 %  10.46 % 
Pretrained  8.23 %  9.51 % 
From Tab. 10, the difference in accuracy between a network pruned starting from scratch and a network pruned after training is small and can vary depending on the architecture. Overall, instead of starting from a trained model and pruning it, the proposed techniques can attain similar performance starting from a randomly initialized model, with reduced training and pruning time, and are therefore more suitable for fast deployment.
A.5 Hard vs soft pruning:
RPGP is used with our gradient criterion, a target pruning rate of 50% and the same hyperparameters. The removal rate $r$ is varied in order to see the impact of having more or less recovery.
Networks  $r=0.3$  $r=0.5$  $r=0.7$  $r=1.0$ 

VGG19  8.74%  8.79%  8.99%  8.92% 
ResNet56  10.57%  10.46%  11.03%  10.78% 
The results in Tab. 11 show that a remove rate of 0.3 (30%) or 0.5 (50%) provides the best balance between the amount of hard pruning and soft pruning. It is also interesting to see that, without any soft pruning ($r=1.0$), the performance of the approach is still close to that obtained with other removal rates.
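One way to picture the role of $r$ is sketched below. It rests on an assumption about the mechanism that this section does not spell out: among the filters selected for pruning, a fraction $r$ of the least important ones is hard pruned (removed), and the remainder is soft pruned (zeroed, but able to recover). The function name `split_hard_soft`, the one-shot selection and the flooring of counts are illustrative choices only.

```python
def split_hard_soft(scores, target_rate, r):
    """Split the selected filters into hard-pruned and soft-pruned sets.

    scores      -- one importance score per filter (lower = less important)
    target_rate -- fraction of filters selected for pruning
    r           -- fraction of the selected filters that is hard pruned
    """
    n_select = int(target_rate * len(scores))
    # Rank filters from least to most important.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    selected = ranked[:n_select]
    n_hard = int(r * n_select)
    hard = sorted(selected[:n_hard])   # removed from the model
    soft = sorted(selected[n_hard:])   # zeroed, may recover later
    return hard, soft


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.2, 0.8, 0.3, 0.7, 0.4]
    for r in (0.3, 0.5, 0.7, 1.0):
        print(r, split_hard_soft(scores, target_rate=0.5, r=r))
```

Under this reading, $r=1.0$ leaves no soft-pruned filters, which matches the "without any soft pruning" case discussed above.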