An Improved Trade-off Between Accuracy and Complexity with Progressive Gradient Pruning

  • 2019-08-12 13:46:47
  • Le Thanh Nguyen-Meidine, Eric Granger, Madhu Kiran, Louis-Antoine Blais-Morin

Abstract

Although deep neural networks (NNs) have achieved state-of-the-art accuracy in many visual recognition tasks, the growing computational complexity and energy consumption of networks remain an issue, especially for applications on platforms with limited resources and requiring real-time processing. Channel pruning techniques have recently shown promising results for the compression of convolutional NNs (CNNs). However, these techniques can result in low accuracy and complex optimisations because some only prune after training CNNs, while others prune from scratch during training by integrating sparsity constraints or modifying the loss function. The progressive soft filter pruning technique provides greater training efficiency, but its soft pruning strategy does not handle the backward pass, which is needed for better optimization. In this paper, a new Progressive Gradient Pruning (PGP) technique is proposed for iterative channel pruning during training. It relies on a criterion that measures the change in channel weights, which improves on existing progressive pruning, and on effective hard and soft pruning strategies to adapt momentum tensors during the backward propagation pass. Experimental results obtained after training various CNNs on the MNIST and CIFAR10 datasets indicate that the PGP technique can achieve a better trade-off between classification accuracy and network (time and memory) complexity than state-of-the-art channel pruning techniques.

 


Le Thanh Nguyen-Meidine, Eric Granger, Marco Pedersoli, Madhu Kiran
LIVIA, Dept. of Systems Engineering
École de technologie supérieure
Louis-Antoine Blais-Morin
Genetec Inc.
Montreal, Canada

Abstract

Although deep neural networks (NNs) have achieved state-of-the-art accuracy in many visual recognition tasks, the growing computational complexity and energy consumption of networks remain an issue, especially for applications on platforms with limited resources and requiring real-time processing. Filter pruning techniques have recently shown promising results for the compression of convolutional NNs (CNNs). However, these techniques involve numerous steps and complex optimisations because some only prune after training CNNs, while others prune from scratch during training by integrating sparsity constraints or modifying the loss function. The progressive soft filter pruning (PSFP) technique provides greater training efficiency, but its soft pruning strategy does not handle the backward pass, i.e., momentum pruning, which is needed for better optimization. We propose a new Progressive Gradient Pruning (PGP) technique for iterative filter pruning during training. To improve on PSFP, it relies on a novel filter selection criterion that measures the change in filter weights, and on new hard and soft pruning strategies to effectively adapt momentum tensors during the backward propagation pass. Experimental results obtained after training various CNNs on image data for classification and object detection benchmarks indicate that the PGP technique can achieve a better trade-off between classification accuracy and network (time and memory) complexity than PSFP and other state-of-the-art filter pruning techniques.

1 Introduction

Convolutional neural networks (CNNs) learn discriminant feature representations from labeled training data, and have achieved state-of-the-art accuracy across a wide range of visual recognition tasks, e.g., image classification, object detection, and assisted medical diagnosis. Since the breakthrough results achieved with AlexNet for the 2012 ImageNet Challenge [26], CNN’s accuracy has been continually improved with architectures like VGG [48], ResNet [12] and DenseNet [22], at the expense of growing complexity (deeper and wider networks) that require more training samples and computational resources [23]. In particular, the speed of the CNNs can significantly degrade with such increased complexity.

In order to deploy these powerful CNN architectures on compact platforms with limited resources (e.g., embedded systems, mobile phones, portable devices) and for real-time processing (e.g., video surveillance and monitoring, virtual reality), the time and memory complexity and energy consumption of CNNs should be reduced. For instance, the application of CNN-based architectures to real-time face detection in video surveillance remains a challenging task [39] – while the more accurate detectors such as region proposal networks are too slow for real-time applications [45, 8], faster detectors such as single-shot detectors are less accurate [32, 43]. Consequently, effective methods to accelerate and compress deep networks, in general, and CNNs in particular, are required to provide a reasonable trade-off between accuracy and efficiency.

Several techniques have recently been proposed to reduce the complexity of CNNs, ranging from the design of specialized compact architectures like MobileNet [19], to the distillation of knowledge from larger architectures to smaller ones [17]. Among these, pruning techniques provide an automated approach to remove insignificant network elements, e.g., filters, input channels, etc. This paper focuses on filter-level pruning techniques. While they do not provide the compression level of unstructured pruning, the reduction in parameters can be converted into a real speed-up while preserving network accuracy [29, 38]. These techniques attempt to remove the filters and input channels at each convolution layer using various criteria based on, e.g., the L1 norm [29], or the product of feature maps and gradients computed from a validation dataset [38].

Pruning techniques can be applied under two different scenarios: either (1) from a pre-trained network, or (2) from scratch. In the first scenario, pruning is applied as a post-processing procedure, once the network has already been trained, through a one-time pruning (followed by fine-tuning) [29] or a complex iterative process [38] using a validation dataset [29, 34], or by minimizing the reconstruction error [35]. In the second scenario, pruning is applied from scratch by introducing sparsity constraints and/or modifying the loss function to train the network [33, 52, 59]. The latter scenario can have more difficulty converging to accurate network solutions (due to the modified loss function), and thereby increases the computational complexity required for the optimisation process. For greater training efficiency, the progressive soft filter pruning (PSFP) method was recently introduced [13], allowing for iterative pruning from scratch, where filters are set to zero (instead of being removed) so that the network can preserve a greater learning capacity. This method, however, does not account for the optimization of soft-pruned weights, which can have a negative impact on accuracy, because pruned weights are still being optimized with old momentum values accumulated from previous epochs.

In this paper, a new Progressive Gradient-based Pruning (PGP) technique is proposed for iterative filter pruning to provide a better trade-off between accuracy and complexity. To this end, the filters are efficiently pruned in a progressive fashion while training a network from scratch, and accuracy is maintained without requiring validation data or additional optimisation constraints. In particular, PGP improves on PSFP by integrating hard and soft pruning strategies to effectively adapt the momentum tensor during the backward propagation pass. It also integrates an improved version of the Taylor selection criterion [38] that relies on the gradient w.r.t. the weights (instead of the output feature maps), and is more suitable for progressive filter-based pruning. For performance evaluation, the accuracy and complexity of the proposed and state-of-the-art filter pruning techniques are compared using ResNet, LeNet and VGG networks trained to address benchmark image classification (MNIST and CIFAR10 datasets) and object detection (PASCAL VOC dataset) problems.

2 Compression and Acceleration of CNNs

In general, the time complexity of a CNN depends mostly on the convolutional layers, while the fully connected layers contain most of the parameters. Therefore, CNN acceleration methods typically target the complexity of the convolutional layers, while compression methods usually target the complexity of the fully connected layers [10, 11]. This section provides an overview of recent acceleration and compression approaches for CNNs, namely, quantization, low-rank approximation, knowledge distillation, compact network design and pruning. Finally, a brief survey of filter pruning methods and challenges is presented.

2.1 Overview of methods:

Quantization:

A deep neural network can be accelerated by reducing the precision of its parameters. Such techniques are often used on general embedded systems, where a low-precision representation, e.g., 8-bit integer, provides faster processing than a higher-precision representation, e.g., 32-bit floating point. There are two main approaches to quantizing a neural network: the first quantizes only the weights [10, 58], and the second quantizes both weights and activations [9, 6]. These techniques can be either scalable [10, 58] or non-scalable [3, 9, 6, 42], where scalable means that an already quantized network can be further compressed.
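
As a minimal illustration of weight-only quantization, the sketch below maps floating-point weights to 8-bit signed integer codes with a single symmetric scale. The function names and the symmetric scheme are assumptions chosen for illustration; they are not the specific methods of [10, 58].

```python
def quantize_int8(weights):
    """Symmetric uniform quantization of a list of floats to int8 codes."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.003, 1.0]        # hypothetical weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Each restored weight differs from the original by at most half a quantization step, which is the precision traded for the smaller integer representation.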

Low-rank decomposition:

Low-rank approximation (LRA) can accelerate CNNs by decomposing a tensor into lower-rank approximations expressed as vector products [24, 49, 27]. There are different ways of decomposing a convolution tensor. Techniques like [24, 49] focus on approximating a tensor by a low-rank tensor that can be obtained either in a layer-by-layer fashion [24] or by scanning the whole network [49]. [53] proposes to force filters to coordinate more information into a lower-rank space during training, and then decomposes them once the model is trained. Another technique employs the CP-decomposition (Canonical Polyadic Decomposition), which achieves a good trade-off between accuracy and efficiency [27].

Knowledge distillation:

This family of techniques focuses on training a small network, called the student, using a larger model, called the teacher [18]. Unlike traditional supervised learning, the student is trained by the teacher. These methods can obtain considerable improvements in terms of sparsity and generalization of the produced networks. Most distillation techniques use large pre-trained models as teachers [18, 46]. More recently, there has been interest in training student-teacher models online, on the fly [54, 57], or in using GANs to increase training speed and accuracy [51]. Knowledge distillation has been applied to multiple problems, including object detection [4], NLP [25] and differential privacy [50].

Compact network design:

Compact model design is an alternative way to produce fast deep neural networks. The aim of these techniques is to produce light models for high-speed processing. Different methods have been applied to produce compact models. For instance, MobileNet [19], MobileNetV2 [47] and Xception [5] can achieve real-time speed using depth-wise convolutions to reduce computation. Other architectures like ShuffleNet [56, 36] and CondenseNet [21] use group convolutions, where connections are local to each group of channels, to reduce computation.
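
To make the saving from depth-wise convolutions concrete, the sketch below counts the weights of a standard convolution and of a depth-wise separable one (a depth-wise step followed by a 1x1 point-wise step). The layer sizes are hypothetical and biases are ignored; this is an illustration of the general idea rather than the exact layer layout of any cited architecture.

```python
def standard_conv_params(c_in, c_out, k):
    # Each of the c_out filters has c_in * k * k weights (biases ignored).
    return c_out * c_in * k * k


def depthwise_separable_params(c_in, c_out, k):
    # Depth-wise step: one k x k kernel per input channel.
    # Point-wise step: a 1 x 1 convolution mixing c_in channels into c_out.
    return c_in * k * k + c_in * c_out


std = standard_conv_params(64, 128, 3)          # 73728 weights
sep = depthwise_separable_params(64, 128, 3)    # 8768 weights
```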

Pruning:

Pruning is a family of techniques that removes non-useful parameters from a neural network. There are several approaches to pruning deep neural networks. The first is weight pruning, where individual weights are pruned. This approach has been shown to significantly compress and accelerate deep neural networks [10, 55, 11]. Weight pruning techniques usually employ sparse convolution algorithms [30, 44]. The other approach is output channel or filter pruning, where complete output channels or filters are pruned [29, 35, 13, 59]. Since this paper proposes a method for filter pruning, we provide more details on this approach in the next section.

2.2 Filter pruning:

Filter-level pruning techniques attempt to remove the output and input channels at each convolution layer using various criteria, such as the L1 norm [29], entropy [34], the L2 norm, APoZ [20], or a combination of feature maps and gradients [38]. These pruning methods have the advantage of being independent of any sparse convolution algorithm since the convolution remains dense, which provides a platform-independent speed-up. In contrast, a sparse algorithm cannot be easily optimized on parallel computing devices, e.g., GPUs.
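
As a rough sketch of how a norm-based criterion selects filters while keeping the layer dense, the example below ranks the filters of a toy layer by L1 norm and drops the weakest ones. The nested-list layout, helper names and values are assumptions for illustration only.

```python
def l1_norm(filter_weights):
    """Sum of absolute values over a nested list of any depth."""
    if isinstance(filter_weights, (int, float)):
        return abs(filter_weights)
    return sum(l1_norm(w) for w in filter_weights)


def weakest_filters(layer, n_prune):
    """Indexes of the n_prune filters with the smallest L1 norm."""
    order = sorted(range(len(layer)), key=lambda i: l1_norm(layer[i]))
    return sorted(order[:n_prune])


# Toy layer: 4 filters, 1 input channel, 2x2 kernels (hypothetical values).
layer = [
    [[[0.9, -0.8], [0.7, 0.6]]],
    [[[0.01, 0.02], [-0.01, 0.0]]],
    [[[0.5, 0.5], [0.5, 0.5]]],
    [[[0.1, -0.1], [0.05, 0.0]]],
]
weak = weakest_filters(layer, 2)
# The remaining layer is still a dense tensor, only with fewer filters.
kept = [f for i, f in enumerate(layer) if i not in weak]
```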

Following the work on Optimal Brain Damage [7], one of the first papers that showed the efficiency of filter-level pruning was [29], where the weight norm is used to identify and prune weak filters, i.e., filters that do not contribute much to the network. Afterwards, several works proposed pruning procedures and filter importance metrics. These methods can be organized into five pruning approaches: 1) pruning as a one-time post-processing step followed by fine-tuning, which is simple and easy to apply [29]; 2) pruning iteratively once the model has been trained, where alternating between pruning and fine-tuning increases the chance of recovering the accuracy lost directly after a filter is pruned [38, 37]; 3) pruning by minimizing the reconstruction error at each layer, which allows the model to approximate the original performance [35, 16, 59]; 4) pruning with sparse constraints and a modified objective function, which lets the network consider pruning during optimization [33, 2, 1, 28]; 5) pruning progressively while training from scratch or from a pre-trained model, where soft pruning [14, 15] sets filters to zero instead of actually removing them (hard pruning), which leaves the network with more capacity to learn [13].

While the first three approaches are capable of reducing the complexity of a model, they are only applied once the model is already trained; it would be more beneficial to start pruning from scratch during training. The fourth approach can start pruning from scratch by adding sparse constraints and modifying the optimization objective, but this makes the loss harder to optimize and more sensitive. This is especially problematic when the original loss function is already hard to optimize, since modifying it can make it harder for the model to converge to a good solution. The fifth approach eases this process by not removing filters and by using the original loss function. However, we think that this approach can be improved since, currently, it does not handle pruning in the backward pass and only sets the weak filters to zero. Also, the current approach calculates the L2 criterion separately from the parameter updates, i.e., not while iterating inside an epoch. In our approach, we want to compute the criterion directly during the updates, i.e., while iterating inside an epoch and updating parameters.

Another important part of filter pruning is the ability to evaluate the importance of a filter. In the literature, many criteria have been used to evaluate the importance of filters, e.g., L1 [29], APoZ [20], entropy [34], L2 [13] and Taylor [38]. Among these, we think that the Taylor criterion [38] has the most potential for pruning during training, since it results from trying to minimize the impact of pruning a filter, although we argue that it can be improved for progressive pruning.

3 Progressive Gradient Pruning

3.1 Pruning strategy with momentum:

In a regular CNN, the weight tensor of a convolutional layer $l$ can be defined as $\mathbf{W} \in \mathbb{R}^{n_{out} \times n_{in} \times k \times k}$, where $n_{in}$ and $n_{out}$ are the number of input and output channels (filters), respectively. The weight tensor of filter $i$ can then be defined as $\mathbf{W}_i \in \mathbb{R}^{n_{in} \times k \times k}$. In order to select the weak filters of a layer, we evaluate the importance of a filter using a criterion function $c$, usually defined as $c(\mathbf{W}_i): \mathbb{R}^{n_{in} \times k \times k} \to \mathbb{R}$. Given a filter, it yields a scalar that represents its rank, e.g., the L1 norm [29] or the gradient norm in our case.

In order to prune convolution layers progressively, an exponential decay function is defined such that there is always a solution in $\mathbb{R}$. (It is slightly different from the one in [13], where the decay function can have solutions in $\mathbb{C}$.) This decay function determines the number of weak filters to select at each epoch. It is defined as the ratio of filters remaining after training on epoch $t$:

$$p_t = \exp\left(\frac{\log(1 - t_{prune})}{T}\, t\right), \tag{1}$$

where $t_{prune}$ is a hyper-parameter that defines the ratio of filters to be pruned, and $t \in \{1, 2, \dots, T\}$ is the epoch. Since we progressively prune layer by layer and epoch by epoch, we calculate the number of weak filters (or, equivalently, the number of remaining filters) at each layer, $n_{wc}$. Given the ratio $p_t$ at epoch $t$, the number of weak filters for any layer is defined as:

$$n_{wc} = n_c\,(1 - p_t), \tag{2}$$
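
The schedule defined by Equ. 1 and Equ. 2 can be sketched as follows. The function names, the example values and the rounding to an integer number of filters are assumptions; the source does not state how a fractional filter count is handled.

```python
import math


def remaining_ratio(t, total_epochs, t_prune):
    """Equ. 1: ratio of filters remaining after training on epoch t."""
    return math.exp(math.log(1.0 - t_prune) / total_epochs * t)


def num_weak_filters(n_c, p_t):
    """Equ. 2, rounded to the nearest integer (the rounding is an assumption)."""
    return int(round(n_c * (1.0 - p_t)))


T, t_prune, n_c = 40, 0.5, 64     # hypothetical settings
schedule = [num_weak_filters(n_c, remaining_ratio(t, T, t_prune))
            for t in range(1, T + 1)]
```

With these values the number of weak filters grows gradually with the epoch and reaches `n_c * t_prune` (here 32 of 64 filters) at the last epoch, since $p_T = 1 - t_{prune}$.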

where $n_c$ is the original number of filters of any given layer. Using the number of weak filters $n_{wc}$ and a pruning criterion function $c$, we obtain a subset of filters $\mathbf{W}_{weak} \in \mathbb{R}^{n_{wc} \times n_{in} \times k \times k}$ with the smallest criterion values. This subset is further divided into two subsets, using a hyper-parameter $r$ that defines the ratio of filters to be hard removed. The subset $\mathbf{W}_{rh} \in \mathbb{R}^{(n_{wc} r) \times n_{in} \times k \times k}$ is removed completely, while the subset $\mathbf{W}_{rs} \in \mathbb{R}^{n_{wc}(1-r) \times n_{in} \times k \times k}$ is reset to zero. The indexes $R_h$ and $R_s$ of these two subsets are kept for the backward pass. Additionally, hard pruning is performed on the input channels of the next layer using $R_h$.
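
A minimal sketch of this partition is given below. It assumes that the lowest-scoring weak filters are the ones that are hard removed and that a fractional count is rounded down; the source specifies neither choice, and the scores are hypothetical.

```python
def partition_weak_filters(scores, n_wc, r):
    """Split the n_wc lowest-scoring filters into hard (R_h) and soft (R_s) indexes."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    weak = order[:n_wc]
    n_hard = int(n_wc * r)           # rounding down is an assumption
    r_h = sorted(weak[:n_hard])      # removed entirely from the tensor
    r_s = sorted(weak[n_hard:])      # reset to zero but kept in the tensor
    return r_h, r_s


# One criterion value per filter of a layer with 8 filters.
scores = [0.9, 0.1, 0.5, 0.05, 0.7, 0.2, 0.3, 0.8]
r_h, r_s = partition_weak_filters(scores, n_wc=4, r=0.5)
```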

Figure 1 illustrates the hard and soft pruning strategy of the PGP technique, with the momentum tensor defined as $\mathbf{M} \in \mathbb{R}^{n_{out} \times n_{in} \times k \times k}$, i.e., with the same dimensions as the weight tensor. Using the indexes $R_s$, we set to zero the subset $\mathbf{M}_{rs} \in \mathbb{R}^{\dim(R_s) \times n_{in} \times k \times k}$, and using the indexes $R_h$, we hard prune the subset $\mathbf{M}_{rh} \in \mathbb{R}^{\dim(R_h) \times n_{in} \times k \times k}$. Currently, progressive pruning techniques like [13] only set the weights to zero during training, without handling the previously accumulated momentum, which is critical for the optimization. As illustrated in Figure 2, momentum pruning is important for the optimization process.

Let us take a closer look at the typical equations for update of weight and momentum:

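$$\mathbf{W}_{t+1} = \mathbf{W}_t - \alpha \, \mathbf{M}_t \tag{3}$$
$$\mathbf{M}_t = \beta \, \mathbf{M}_{t-1} + (1 - \beta) \, \nabla\mathbf{W}_t \tag{4}$$

where $\mathbf{W}_t$ and $\mathbf{M}_t$ are respectively the weight and momentum tensors at iteration $t$, $\nabla\mathbf{W}_t$ is the gradient with respect to the weights, and $\alpha$ and $\beta$ are the learning rate and momentum hyper-parameters, respectively. By expanding $\mathbf{M}_{t-1}$ in Equ. 4:

$$\begin{aligned} \mathbf{M}_t &= \beta \, \mathbf{M}_{t-1} + (1 - \beta) \, \nabla\mathbf{W}_t \\ &= \beta \, \big(\beta \, \mathbf{M}_{t-2} + (1 - \beta) \, \nabla\mathbf{W}_{t-1}\big) + (1 - \beta) \, \nabla\mathbf{W}_t \end{aligned} \tag{5}$$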

The tensor $\mathbf{M}_{t-1}$ depends on the gradient of the weights at iteration $t-1$. With a soft pruning technique (like PSFP), the momentum $\mathbf{M}_{t-1}$ accumulated from $\mathbf{W}_{t-1}$ becomes meaningless if $\mathbf{W}$ is soft pruned at $t$: the weight is reset, so the optimization point is no longer the same. It is therefore important to adapt the momentum tensor during soft pruning. Our solution is to soft prune the momentum as well, so that the weight tensor is correctly optimized.
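
The effect of stale momentum can be seen on a single scalar weight. The sketch below applies one update of Equ. 3 and Equ. 4 right after soft pruning, once with the old momentum kept (as in PSFP) and once with the momentum reset to zero (as in PGP). All numerical values are hypothetical.

```python
def sgd_momentum_step(w, m, grad, alpha=0.1, beta=0.9):
    """One update: Equ. 4 for the momentum, then Equ. 3 for the weight."""
    m = beta * m + (1.0 - beta) * grad
    w = w - alpha * m
    return w, m


# Momentum accumulated over previous iterations, before soft pruning.
old_momentum = 5.0

# After soft pruning the weight is reset to zero. The gradient at this new
# point is negative, so plain gradient descent would increase the weight.
grad = -0.2

w_stale, _ = sgd_momentum_step(0.0, old_momentum, grad)   # momentum kept
w_reset, _ = sgd_momentum_step(0.0, 0.0, grad)            # momentum soft pruned
```

With the stale momentum, the weight moves to about -0.448, against the direction indicated by the current gradient, whereas with the momentum reset it moves to about 0.002 in the gradient descent direction.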

Figure 1: Illustration of the PGP pruning strategy between two successive convolutional layers.
Figure 2: An illustration of the optimization process of a weight tensor $\mathbf{W}_t$ during progressive pruning with soft and momentum pruning. The dotted green line indicates the direction of the momentum, while the solid red line indicates the direction of the gradient. At iteration $t$, the weight tensor $\mathbf{W}_t$ is soft pruned. If the momentum tensor $\mathbf{M}_t$ is not soft pruned, even if the gradient direction of $\mathbf{W}_t$ is correct, the old momentum would force it to follow another direction.

3.2 Selection criteria:

Molchanov et al. [38] proposed the following criterion $|\Delta\mathcal{L}(\mathbf{H}_i)|$ to measure the importance of a feature map $\mathbf{H}_i$ produced by a filter $\mathbf{W}_i$, computed at each layer, and for each filter:

$$|\Delta\mathcal{L}(\mathbf{H}_i)| = \big|\mathcal{L}(\mathcal{D} \mid \mathbf{H}_i = 0) - \mathcal{L}(\mathcal{D} \mid \mathbf{H}_i)\big| \approx \left|\frac{\partial \mathcal{L}}{\partial \mathbf{H}_i} \odot \mathbf{H}_i\right| \tag{6}$$

The term $\mathcal{L}(\mathcal{D} \mid \mathbf{H}_i = 0)$ refers to the loss of a model when a labeled dataset $\mathcal{D}$ is given with a pruned feature map $\mathbf{H}_i = 0$, and $\mathcal{L}(\mathcal{D} \mid \mathbf{H}_i)$ is the original loss before the model has been pruned. In summary, the criterion of Equ. 6 is the difference between the loss of a pruned model and that of the original model. The criterion grows with the impact of the feature map. This criterion has been shown to work well on trained networks. However, in the scenario where the network is pruned from scratch, we argue that the feature map $\mathbf{H}_i$ is not informative since the model is not yet trained. Empirical results in Section 4 also support that the criterion of Equ. 6 is not as effective as other criteria for progressive pruning.

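Instead of using $\mathbf{H}_i = 0$ to prune a feature map [31] or filter, we can replace $\mathbf{H}_i$ with $\mathbf{W}_i$, since setting a filter to zero is the same as pruning it [13]. The same Taylor expansion from [38] can then be applied with $\mathbf{W}_i$, resulting in:

$$T_W = \big|\mathcal{L}(\mathcal{D} \mid \mathbf{W}_i = 0) - \mathcal{L}(\mathcal{D} \mid \mathbf{W}_i)\big| \approx \left|\frac{\partial \mathcal{L}}{\partial \mathbf{W}_i} \odot \mathbf{W}_i\right| \tag{7}$$

Equ. 7 can be further simplified when taking into account the nature of soft pruning. We can decompose the right-hand term because $\frac{\partial \mathcal{L}}{\partial \mathbf{W}_i} \odot \mathbf{W}_i$ is an element-wise multiplication:

$$\left|\frac{\partial \mathcal{L}}{\partial \mathbf{W}_i} \odot \mathbf{W}_i\right| = \left|\frac{\partial \mathcal{L}}{\partial \mathbf{W}_i}\right| \odot |\mathbf{W}_i| \tag{8}$$

where $|\mathbf{W}_i|$ is the absolute value of the weights of filter $i$. This means that $|\mathbf{W}_i|$ is zero, or very close to zero, if $\mathbf{W}_i$ is one of the filters that were soft pruned. In this case, $\mathbf{W}_i$ has little chance to recover, since it will likely be pruned again. In order to encourage the recovery of soft-pruned filters, we propose to remove the $|\mathbf{W}_i|$ term:

$$GN_i = \left|\frac{\partial \mathcal{L}}{\partial \mathbf{W}_i}\right| \tag{9}$$

where $GN_i$ is our criterion for filter $i$. There are two ways of calculating our criterion: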

  • PGP: performs a training epoch without updating the model, and computes the pruning criterion. This amounts to a batch gradient descent without updating the parameters at the end, and can provide better performance since the optimization is less noisy than with SGD.

  • RPGP: computes the pruning criterion directly during the forward-backward passes of training (while updating). This approach uses an SGD optimizer and calculates the criterion directly during the optimization and update of the model.
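
The motivation for dropping the weight term in Equ. 9 can be checked on a toy filter. In the sketch below, a soft-pruned filter (all weights zero) receives a score of zero under the weight-based Taylor criterion of Equ. 7 and Equ. 8, whatever its gradients, while the gradient-only criterion still reflects how strongly the loss reacts to it. Reducing each tensor to a scalar by summing absolute values is an assumption, and the gradient values are hypothetical.

```python
def taylor_weight_score(weights, grads):
    """Equ. 7 and Equ. 8: sum over the filter of |gradient * weight|."""
    return sum(abs(g * w) for g, w in zip(grads, weights))


def gradient_norm_score(grads):
    """Equ. 9: sum over the filter of |gradient| (weight term removed)."""
    return sum(abs(g) for g in grads)


soft_pruned_weights = [0.0, 0.0, 0.0, 0.0]   # flattened filter reset to zero
grads = [0.4, -0.3, 0.2, -0.1]               # hypothetical gradients

taylor = taylor_weight_score(soft_pruned_weights, grads)
gn = gradient_norm_score(grads)
```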

In either case, the criterion is applied over several iterations, so there are two ways of interpreting Equ. 9. One natural interpretation is to accumulate gradients, where the gradients are summed into the total gradient of a filter. Since PGP goes through the entire epoch without updates, we can use an L1 norm to sum up the variation inside a filter using the criterion:

$$GNG_i = \left\| \sum_{j=1}^{N} \mathbf{G}_i^j \right\|_1 \tag{10}$$

where $\mathbf{G}_i^j$ is the gradient tensor of filter $i$ at iteration $j$ inside an epoch. Equ. 10 measures the amount of global change for a filter at the end of an epoch, which makes it most suitable for PGP. The second interpretation is to accumulate the actual changes of a filter at each update, using the criterion:

$$GNS_i = \sum_{j=1}^{N} \left\| \mathbf{G}_i^j \right\|_1 \tag{11}$$

Equ. 11 calculates the L1 norm of the gradient tensor of a filter at each iteration during an epoch. Thus, instead of measuring the global change only at the end like Equ. 10, it measures the gradual changes during an epoch. This criterion is most suitable for RPGP since the weights are updated at the same time as the gradients are accumulated. PGP is summarized in Algo. 1. The algorithm for RPGP is similar, but the criterion is calculated directly during the training step.
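
The difference between the two criteria can be sketched on the flattened gradients of a single filter. The helper names and gradient values below are hypothetical.

```python
def gng(grads_per_iter):
    """Equ. 10: L1 norm of the gradients summed over the iterations of an epoch."""
    total = [sum(column) for column in zip(*grads_per_iter)]
    return sum(abs(x) for x in total)


def gns(grads_per_iter):
    """Equ. 11: sum over the iterations of the L1 norm of each gradient."""
    return sum(sum(abs(x) for x in g) for g in grads_per_iter)


# Flattened gradients of one filter over N = 3 iterations.
grads_per_iter = [
    [0.5, -0.2],
    [-0.5, 0.1],
    [0.1, 0.1],
]
score_gng = gng(grads_per_iter)
score_gns = gns(grads_per_iter)
```

Here gradients of opposite sign cancel inside the sum of Equ. 10, so `score_gng` (about 0.1) is much smaller than `score_gns` (about 1.5). By the triangle inequality, the value of Equ. 10 never exceeds that of Equ. 11.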

Algorithm 1: Progressive Gradient Pruning method.
Input: a non-trained model $m$, a target pruning ratio $t_{prune}$, a remove ratio $r$, a number of epochs $T$
Output: pruned trained model
for $t \leftarrow 1$ to $T$ do
    Train the model for one epoch
    for each convolution layer $C_l$ do
        Calculate the number of weak filters $n_{wc}$ (Equ. 2)
        Calculate the pruning criterion using GNG (Equ. 10) or GNS (Equ. 11)
        Partition $\mathbf{W}_{weak}$ into indexes $R_h$ (hard-removed filters) and $R_s$ (soft-removed filters) using $r$
        Remove the subset $\mathbf{W}_{rh}$ and set $\mathbf{W}_{rs}$ to zero
        Remove the filters of the momentum tensor $\mathbf{M}$ using the same indexes as $R_h$
        Set the filters of the momentum tensor $\mathbf{M}$ to zero using the same indexes as $R_s$
    end
    Evaluate the model
end
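
The per-layer pruning step of the algorithm can be sketched on plain nested lists as shown below. This is an illustration under stated assumptions, not the authors' implementation: the data layout (a list of filters, each a list of flattened input-channel kernels) and the function name are hypothetical, and pruning the input channels of the next layer's momentum is assumed so that tensor shapes stay consistent.

```python
def prune_layer(weights, momentum, next_weights, next_momentum, r_h, r_s):
    """Apply hard (r_h) and soft (r_s) pruning to one layer and its momentum.

    weights / momentum: list of filters, each a list of input-channel kernels.
    next_weights / next_momentum: same layout for the following layer, whose
    input channels correspond to the filters of this layer.
    """
    hard, soft = set(r_h), set(r_s)

    def zero_like(f):
        return [[0.0] * len(channel) for channel in f]

    def prune_outputs(tensor):
        out = []
        for i, f in enumerate(tensor):
            if i in hard:
                continue                          # hard pruning: drop the filter
            out.append(zero_like(f) if i in soft else f)
        return out

    def prune_inputs(tensor):
        # Hard pruning also removes the matching input channels downstream.
        return [[ch for j, ch in enumerate(f) if j not in hard] for f in tensor]

    return (prune_outputs(weights), prune_outputs(momentum),
            prune_inputs(next_weights), prune_inputs(next_momentum))


# Toy example: 4 filters with 1 input channel, followed by 2 filters with 4.
W = [[[float(i + 1)] * 2] for i in range(4)]
M = [[[0.5] * 2] for _ in range(4)]
W_next = [[[1.0] * 2 for _ in range(4)] for _ in range(2)]
M_next = [[[0.1] * 2 for _ in range(4)] for _ in range(2)]

W2, M2, W_next2, M_next2 = prune_layer(W, M, W_next, M_next, r_h=[0], r_s=[2])
```

After this step the layer keeps three filters, one of which is zeroed together with its momentum, and the next layer has three input channels.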

4 Experiments

In this section, we compare the experimental results obtained using the proposed PGP and RPGP techniques against state-of-the-art filter pruning techniques that are representative of each family described in Section 2.2: L1-norm pruning (prunes once), Taylor pruning (prunes iteratively), DCP (specialised loss function and minimization of the reconstruction error) and PSFP (progressive pruning). Performance is measured in terms of accuracy, and in terms of time and memory complexity (number of parameters and number of FLOPS). For techniques like our PGP, PSFP, DCP and L1, it is possible to set a target pruning rate hyper-parameter $t_{prune}$. For a fixed pruning rate, the complexity (number of FLOPS and parameters) is identical for these techniques, so we can compare them in terms of accuracy for a given complexity. In contrast, techniques like Taylor prune until the end, and then select the proportion of filters to be pruned. Our experiments consider two visual recognition tasks: (1) image classification (using the MNIST and CIFAR10 datasets), and (2) object detection (using the PASCAL VOC dataset). Techniques are compared using ResNet, LeNet and VGG networks trained to address these benchmark problems. Since pruning ResNet needs a special strategy, we follow the popular pruning strategy proposed in [29]: pruning the downsampling layer and then using the same indexes to prune the last convolution of the residual. For the pruning of Faster R-CNN [40], we skip the pruning of the last layer since it would mean pruning the input of the RPN layer, which we found empirically to result in a significant performance reduction. The source code for our paper will be available at https://github.com/Anon6627/Pruning-PGP.

4.1 Experimental Results

Performance on MNIST classification data:

In this case, we use the same hyper-parameters as in the original papers. The same settings were used for LeNet5 and ResNet20. With PGP and RPGP, we use a learning rate of 0.01, a momentum of 0.9 and 40 epochs, with a remove rate of 50%. For PSFP, we use the same settings except for the remove rate. For Taylor [38], we iteratively remove 5 filters each time, and then fine-tune for 5 epochs. This varies slightly from the original procedure because this configuration does not collapse and returns the best result. For L1 pruning, we fine-tune for 20 epochs after pruning. For DCP, we ran the authors' code for MNIST over 40 epochs, with 20 epochs for the filter pruning and 20 epochs for fine-tuning.

Table 1: Performance of pruning methods for training LeNet5 on the MNIST classification dataset.

Methods          | $t_{pruned}$ | Params | FLOPS | Error % (± gap)
Baseline LeNet5  | 0%           | 61K    | 446K  | 0.84 (0)
L1 [29]          | 30%          | 34.1K  | 304K  | 0.9 (+0.06)
L1 [29]          | 50%          | 18K    | 152K  | 1.05 (+0.21)
L1 [29]          | 70%          | 84K    | 82K   | 2.22 (+1.38)
Taylor [38]      | 30%          | 38K    | 286K  | 0.9 (+0.06)
Taylor [38]      | 50%          | 24K    | 76K   | 1.05 (+0.21)
Taylor [38]      | 70%          | 13K    | 34K   | 1.22 (+0.38)
DCP [59]         | 30%          | 42.7K  | 325K  | 2.75 (+1.91)
DCP [59]         | 50%          | 30.5K  | 232K  | 4.18 (+3.34)
DCP [59]         | 70%¹         | 30.5K  | 232K  | 6.28 (+5.44)
PSFP [13]        | 30%          | 34.1K  | 304K  | 1.32 (+0.48)
PSFP [13]        | 50%          | 18K    | 152K  | 2.27 (+1.43)
PSFP [13]        | 70%          | 84K    | 82K   | 2.99 (+2.15)
PGP_GNG (ours)   | 30%          | 34.1K  | 304K  | 0.87 (+0.03)
PGP_GNG (ours)   | 50%          | 18K    | 152K  | 1.08 (+0.24)
PGP_GNG (ours)   | 70%          | 84K    | 82K   | 1.74 (+0.9)
RPGP_GNS (ours)  | 30%          | 34.1K  | 304K  | 0.9 (+0.06)
RPGP_GNS (ours)  | 50%          | 18K    | 152K  | 1.25 (+0.41)
RPGP_GNS (ours)  | 70%          | 84K    | 82K   | 1.75 (+0.91)

¹ Since DCP's code, provided by the authors, did not handle non-residual architectures, we had to modify the original code. Pruning rates above 50% are stuck on LeNet and VGG19.

Table 2: Performance of pruning methods for training ResNet20 on the MNIST classification dataset.

Methods            | $t_{pruned}$ | Params | FLOPS | Error % (± gap)
Baseline ResNet20  | 0%           | 272K   | 41M   | 0.74 (0)
L1 [29]            | 30%          | 137K   | 22M   | 0.75 (+0.01)
L1 [29]            | 50%          | 68K    | 10M   | 1.09 (+0.35)
L1 [29]            | 70%          | 27K    | 4.2M  | 2.02 (+1.28)
Taylor [38]        | 30%          | 149K   | 17.7M | 0.87 (+0.13)
Taylor [38]        | 50%          | 87K    | 7.8M  | 0.95 (+0.21)
Taylor [38]        | 70%          | 36K    | 2.6M  | 1.04 (+0.30)
DCP [59]           | 30%          | 193K   | 30.3M | 1.11 (+0.37)
DCP [59]           | 50%          | 138K   | 21.1M | 0.62 (-0.12)
DCP [59]           | 70%          | 87.7K  | 13.5M | 1.19 (+0.45)
PSFP [13]          | 30%          | 137K   | 22M   | 0.5 (-0.24)
PSFP [13]          | 50%          | 68K    | 10M   | 0.61 (-0.13)
PSFP [13]          | 70%          | 27K    | 4.2M  | 0.72 (-0.02)
PGP_GNG (ours)     | 30%          | 137K   | 22M   | 0.4 (-0.34)
PGP_GNG (ours)     | 50%          | 68K    | 10M   | 0.51 (-0.23)
PGP_GNG (ours)     | 70%          | 27K    | 4.2M  | 0.57 (-0.17)
RPGP_GNS (ours)    | 30%          | 137K   | 22M   | 0.4 (-0.34)
RPGP_GNS (ours)    | 50%          | 68K    | 10M   | 0.48 (-0.29)
RPGP_GNS (ours)    | 70%          | 27K    | 4.2M  | 0.5 (-0.24)

Results in Tab. 1 show that our PGP methods compare favorably against state-of-the-art techniques like L1, Taylor and PSFP. Similar tendencies are seen in Tab. 2. We also see that PGP performs slightly better than DCP in some cases. Finally, since PGP_GNG and RPGP_GNS rely on the same underlying criterion, the differences in their results stem from their procedures. The slightly better performance of PGP_GNG can be explained by the fact that the pruning criterion is calculated using batch gradient descent instead of stochastic gradient descent.

Performance on CIFAR10 classification data:

In this case, we use a VGG19 for CIFAR10, with a learning rate of 0.1, a momentum of 0.9 and 400 epochs, and we decrease the learning rate by a factor of 10 at epochs 160 and 240. We also use a ResNet56 adapted to CIFAR10 with the same settings, except with 500 epochs. For PGP and RPGP, we set the remove rate hyper-parameter $r$ to 0.5 (50%), fine-tune for 100 epochs after pruning, and store the best score. We use the same settings for PSFP except for the remove rate $r$. For Taylor, 5 filters are iteratively pruned each time, followed by fine-tuning for 5 epochs. We slightly changed the procedure compared to the original paper because the original procedure prunes one feature map per iteration, which is inefficient on a large model. Empirically, we found that pruning 5 feature maps at a time gives the best accuracy. For L1 pruning, 100 epochs of fine-tuning are used after pruning to find the best score. With DCP, the settings provided by the original authors are found to have the best performance.

Table 3: Performance of pruning methods for training VGG19 on the CIFAR10 classification dataset.

Methods          | $t_{pruned}$ | Params | FLOPS | Error % (± gap)
Baseline VGG19   | 0%           | 20M    | 400M  | 6.23 (0)
Li [29]          | 30%          | 9M     | 198M  | 16.94 (+8.41)
Li [29]          | 50%          | 5M     | 100M  | 16.51 (+7.98)
Li [29]          | 70%          | 1M     | 37M   | 16.17 (+7.64)
Taylor [38]      | 30%          | 10M    | 156M  | 9.82 (+2.29)
Taylor [38]      | 50%          | 5M     | 72M   | 11.94 (+3.41)
Taylor [38]      | 70%          | 1.9M   | 24M   | 16.85 (+8.32)
DCP [59]         | 30%          | 10M    | 221M  | 5.8 (-0.65)
DCP [59]         | 50%          | 6M     | 158M  | 7.76 (+1.53)
DCP [59]         | 70%¹         | 6M     | 158M  | 7.86 (+1.63)
PSFP [13]        | 30%          | 9M     | 198M  | 8.98 (+2.75)
PSFP [13]        | 50%          | 5M     | 100M  | 11.2 (+4.97)
PSFP [13]        | 70%          | 1M     | 37M   | 12.06 (+5.83)
PGP_GNG (ours)   | 30%          | 9M     | 198M  | 7.37 (+1.14)
PGP_GNG (ours)   | 50%          | 5M     | 100M  | 8.38 (+2.15)
PGP_GNG (ours)   | 70%          | 1M     | 37M   | 9.7 (+3.47)
RPGP_GNS (ours)  | 30%          | 9M     | 198M  | 7.65 (+1.42)
RPGP_GNS (ours)  | 50%          | 5M     | 100M  | 8.79 (+2.56)
RPGP_GNS (ours)  | 70%          | 1M     | 37M   | 10.56 (+4.33)

Table 4: Performance of pruning methods for training ResNet56 on the CIFAR10 classification dataset.

Methods            | $t_{pruned}$ | Params | FLOPS | Error % (± gap)
Baseline ResNet56  | 0%           | 855K   | 128M  | 6.02 (0)
L1 [29]            | 30%          | 431K   | 67M   | 13.34 (+7.32)
L1 [29]            | 50%          | 215K   | 32M   | 15.57 (+9.55)
L1 [29]            | 70%          | 84K    | 13M   | 17.89 (+11.87)
Taylor [38]        | 40%          | 491K   | 51M   | 13.9 (+7.88)
Taylor [38]        | 50%          | 268K   | 23M   | 15.34 (+9.32)
Taylor [38]        | 70%          | 100K   | 8M    | 22.1 (+16.08)
DCP [59]           | 30%          | 600K   | 90M   | 5.67 (-0.35)
DCP [59]           | 50%          | 430K   | 65M   | 6.43 (+0.41)
DCP [59]           | 70%          | 270K   | 41M   | 7.18 (+1.16)
PSFP [13]          | 30%          | 431K   | 67M   | 8.94 (+2.92)
PSFP [13]          | 50%          | 215K   | 32M   | 10.93 (+4.91)
PSFP [13]          | 70%          | 84K    | 13M   | 14.18 (+8.16)
PGP_GNG (ours)     | 30%          | 431K   | 67M   | 8.95 (+2.93)
PGP_GNG (ours)     | 50%          | 215K   | 32M   | 10.59 (+4.57)
PGP_GNG (ours)     | 70%          | 84K    | 13M   | 13.02 (+7)
RPGP_GNS (ours)    | 30%          | 431K   | 67M   | 9.37 (+3.35)
RPGP_GNS (ours)    | 50%          | 215K   | 32M   | 10.46 (+4.44)
RPGP_GNS (ours)    | 70%          | 84K    | 13M   | 14.16 (+8.14)

From Tabs. 3 and 4, our techniques consistently perform better than the state-of-the-art techniques L1, Taylor and PSFP on VGGNet. On ResNet, PSFP uses a different pruning strategy: it does not prune the down-sampling layer, and therefore does not prune the last convolutional layer of the residual. This translates into a slightly better accuracy in some settings. Our ablation study also provides a comparison of techniques using the same pruning strategy on ResNet, and shows the importance of momentum pruning. DCP performs better than our techniques on this dataset, mainly because of the additional losses that help select discriminative filters. However, it is difficult to compare directly since the techniques do not yield the same number of FLOPS and parameters, and DCP starts from a trained model and requires more computational power.

Performance on PASCAL VOC detection data: In this case, the PGP, RPGP and PSFP techniques are adapted to an object detection problem. We progressively prune a Faster R-CNN with a VGG16 backbone using a learning rate of 0.001 and a momentum of 0.9, with 10 epochs of progressive pruning followed by fine-tuning over a few epochs with early stopping. For L1 pruning, a trained model is pruned by 50% and then fine-tuned on Pascal VOC. For this experiment, we set the pruning rate hyper-parameter r to 0.5 (50%), and report the mean average precision (mAP) for comparison. In Tab. 5, PGP and RPGP perform better than PSFP, the current state-of-the-art progressive pruning technique. However, PGP needs more time to prune because the criterion is computed in a separate epoch. RPGP provides slightly better performance (possibly due to stochasticity) with much less pruning time. The difference in accuracy between RPGP and PSFP highlights the importance of momentum pruning with these approaches. The significant difference in training time between RPGP and PSFP (281m versus 428m in Tab. 5) also suggests that adding hard pruning to existing soft pruning during training can reduce training time. Overall, our proposed techniques work best on smaller architectures. They achieve a better trade-off in terms of pruning time, compression and accuracy than state-of-the-art progressive pruning techniques. They also achieve performance comparable to state-of-the-art techniques that start from a trained model, like DCP, while starting from scratch. Although DCP has better performance, it would be very costly to deploy DCP in a production environment with limited computational power; our algorithm therefore offers a better trade-off in terms of pruning time and accuracy.

Table 5: Performance of pruning methods for training Faster R-CNN with VGG16 backbone on the Pascal VOC detection dataset with tpruned=50%.
Methods Params FLOPS mAP Training Time
Baseline VGG16 137M 250G 69.6% 428m
L1 [29] 125M 174G 62.3% (428) + 31m
PSFP [13] 125M 174G 63.5% 428m
PGP_GNG (ours) 125M 174G 65.5% 769m
RPGP_GNS (ours) 125M 174G 66.0% 281m

4.2 Ablation study:

The training and pruning time of a model are important factors for a technique, for instance when deploying or adapting a model in an operational environment. One advantage of progressive pruning techniques is the reduction of processing time at each epoch, since filters are removed while training. Tab. 6 presents the training and pruning times for the evaluated techniques. For progressive pruning techniques, values represent both pruning and training times, while for DCP, L1 and iterative pruning, values represent (training time) + pruning and retraining times. Experiments are conducted on the CIFAR10 dataset with the same settings as above, running on an isolated computer (Intel Xeon Gold 5118 @ 2.3GHz) with an Nvidia Tesla P100 GPU card.

Table 6: Training and pruning time for pruning techniques with tpruned=0.5 and 0.9.
Methods VGG19 Resnet56
  tpruned 0.5 0.9 0.5 0.9
Baseline 219m 219m 307m 307m
L1 [29] (219)   + 32m (219)   + 32m (307)   + 48m (307)   + 48m
Taylor [38] (219)   + 254m (219)   + 457m (307)   + 488m (307)   + 878m
DCP [59] - - (307) + 489m (307) + 443m
PSFP [13] 219m 219m 307m 307m
PGP (ours) 329m 329m 441m 441m
RPGP (ours) 211m 168m 263m 241m

From Tab. 6, the fastest pruning method (without considering training time) is currently L1. However, it should be noted that the original training of the model takes around 219 minutes for VGG19 and 307 minutes for ResNet56. Once training time is also taken into account, L1 is slower than RPGP (e.g., 219 + 32 = 251 minutes versus 211 minutes for VGG19 at tpruned = 0.5). Other techniques like Taylor prune iteratively, alternating between pruning feature maps and fine-tuning; this can be very slow, depending on the number of filters pruned at each iteration. DCP is particularly slow since it needs to start from an already trained model, and then perform both the filter pruning optimization and the fine-tuning after pruning. PSFP takes a similar time to the original training since it does not technically change the size of the model during training. Between PGP and RPGP, the difference is that PGP uses an entire epoch to compute the pruning criterion, whereas RPGP computes the criterion directly during a training epoch. Also, since we hard-prune filters at each epoch, each epoch becomes faster as the model is pruned and trained. Overall, progressive pruning methods generally train and prune in less time than methods that prune after training, with RPGP the fastest in every setting of Tab. 6.
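The end-to-end comparison above can be reproduced with a few lines of arithmetic. The sketch below only uses the VGG19, tpruned = 0.5 column of Tab. 6; the dictionary names are illustrative and not part of the original experiments.

```python
# Times in minutes, copied from Tab. 6 (VGG19, tpruned = 0.5).
# Post-hoc methods report (training from scratch, pruning + retraining).
post_hoc = {"L1": (219, 32), "Taylor": (219, 254)}
# Progressive methods report a single figure covering training and pruning.
progressive = {"PSFP": 219, "PGP": 329, "RPGP": 211}

# End-to-end cost: add the original training time for post-hoc methods.
totals = {name: train + prune for name, (train, prune) in post_hoc.items()}
totals.update(progressive)

# List the methods from fastest to slowest end-to-end.
for name, minutes in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{name:7s} {minutes:4d} min")
```

Under this accounting, L1 costs 251 minutes end to end, against 211 minutes for RPGP and 329 minutes for PGP, whose separate criterion epoch makes it the slower of the two proposed variants.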

To compare selection criteria, we use the same configuration as in the general comparison for RPGP on CIFAR10, except that we vary the criterion and set a pruning rate of 50%.

Table 7: Error rate for RPGP with different pruning criteria.
Networks L2 Taylor TW GN_G GN_S
VGG19  8.47%  9.27%  8.78%  8.47%  8.79%
ResNet56 10.30% 10.97% 10.46% 10.24% 10.28%

In Tab. 7, we can see that our criterion performs better than the others in the context of progressive pruning, and similarly to the L2 norm. The comparison between Taylor Weight (TW) and Gradient Norm (GN) shows that a small gradient norm during training may be a good indicator of the importance of a filter. From the table we can also see that Taylor Weight performs better than the original Taylor criterion. Overall, GN_G, which uses the batch gradient to capture changes, seems to work best with progressive pruning. The similarity between L2 and GN is explained in the Supplementary Material.

In this experiment on momentum pruning, the same strategy, hyper-parameters and L2 criterion are used for both RPGP and PSFP. The only difference is that RPGP performs momentum pruning.

Table 8: Error rates for RPGP and PSFP with L2.
Method VGG19 ResNet56
PSFP 11.20% 10.93%
RPGP  8.47% 10.09%

From Tab. 8, in both cases (VGG19 and ResNet56), our proposed method performs better than the state-of-the-art PSFP method. Since everything is the same in this setting except the momentum pruning, this shows the advantage of pruning momentum during progressive pruning.

As described, PSFP does not prune the downsampling layer of ResNet56 and thus does not prune the last layer of the residual connection. The performance of PSFP and RPGP is compared using the same strategy on ResNet56, i.e., the downsampling layer and the last layer of the residual connection are not pruned, on the CIFAR10 dataset and with the same hyper-parameters as in previous experiments. The results in Tab. 9 indicate that the RPGP approach performs better than PSFP at every pruning rate, with the gap widening as the pruning rate increases (13.94% versus 28.09% error at 90%). Interestingly, when no pruning is performed on the downsampling layer and the last layer of the residual connection, our method performs much better. The residual connection is sensitive to pruning, and may require a different pruning strategy.

Table 9: Error rates of PSFP and RPGP with different pruning rates, when downsampling and last layers of residual connection are not pruned.
tpruned × 100%
Methods 30% 50% 70% 90%
PSFP 8.94 10.93 14.18 28.09
RPGP(GN_S) 8.87 10.09 11.02 13.94

5 Conclusion

In this paper, we show that it is possible to efficiently prune a deep learning model from scratch with the PGP technique while improving the trade-off between compression, accuracy and training time. PGP is a new progressive pruning technique that relies on the change in filter weights to apply hard and soft pruning strategies that allow for pruning along the back-propagation path. The filter selection criterion is well adapted for progressive pruning from scratch when the norm of the gradient is considered. Results obtained from pruning various CNNs on image data for classification and object detection problems show that the proposed PGP maintains a high level of accuracy with compact networks. Results also show that PGP can achieve better CNN optimisation than PSFP, often translating into higher accuracy at the same pruning rate than PSFP and other state-of-the-art techniques. Future research will involve analyzing the performance of different CNNs pruned using the proposed method on larger datasets from real-world visual recognition problems (e.g., tracking and recognition of persons in video surveillance).

References

  • [1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2270–2278. Curran Associates, Inc., 2016.
  • [2] J. M. Alvarez and M. Salzmann. Compression-aware training of deep networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 856–867. Curran Associates, Inc., 2017.
  • [3] Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. CoRR, abs/1702.00953, 2017.
  • [4] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker. Learning efficient object detection models with knowledge distillation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 742–751. Curran Associates, Inc., 2017.
  • [5] F. Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016.
  • [6] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.
  • [7] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605. Morgan Kaufmann, 1990.
  • [8] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 379–387. Curran Associates, Inc., 2016.
  • [9] J. Faraone, N. J. Fraser, M. Blott, and P. H. W. Leong. SYQ: learning symmetric quantization for efficient deep neural networks. CoRR, abs/1807.00301, 2018.
  • [10] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
  • [11] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135–1143. Curran Associates, Inc., 2015.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • [13] Y. He, X. Dong, G. Kang, Y. Fu, and Y. Yang. Progressive deep neural networks acceleration via soft filter pruning. CoRR, abs/1808.07471, 2018.
  • [14] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang. Soft filter pruning for accelerating deep convolutional neural networks. CoRR, abs/1808.06866, 2018.
  • [15] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [16] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. CoRR, abs/1707.06168, 2017.
  • [17] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
  • [18] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
  • [19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
  • [20] H. Hu, R. Peng, Y. Tai, and C. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. CoRR, abs/1607.03250, 2016.
  • [21] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. CoRR, abs/1711.09224, 2017.
  • [22] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
  • [23] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
  • [24] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. CoRR, abs/1405.3866, 2014.
  • [25] Y. Kim and A. M. Rush. Sequence-level knowledge distillation. CoRR, abs/1606.07947, 2016.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, May 2017.
  • [27] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. CoRR, abs/1412.6553, 2014.
  • [28] C. Lemaire, A. Achkar, and P.-M. Jodoin. Structured pruning of neural networks with budget-aware regularization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [29] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016.
  • [30] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [31] C. Liu and H. Wu. Channel pruning based on mean gradient for accelerating convolutional neural networks. Signal Processing, 156:84–91, 2019.
  • [32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. In ECCV (1), volume 9905 of Lecture Notes in Computer Science, pages 21–37. Springer, 2016.
  • [33] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. CoRR, abs/1708.06519, 2017.
  • [34] J. Luo and J. Wu. An entropy-based pruning method for CNN compression. CoRR, abs/1706.05791, 2017.
  • [35] J. Luo, J. Wu, and W. Lin. Thinet: A filter level pruning method for deep neural network compression. CoRR, abs/1707.06342, 2017.
  • [36] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In The European Conference on Computer Vision (ECCV), September 2018.
  • [37] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz. Importance estimation for neural network pruning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [38] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient transfer learning. CoRR, abs/1611.06440, 2016.
  • [39] L. T. Nguyen-Meidine, E. Granger, M. Kiran, and L. Blais-Morin. A comparison of cnn-based face and head detectors for real-time video surveillance applications. In 2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA), pages 1–7, Nov 2017.
  • [40] J. Park, S. R. Li, W. Wen, H. Li, Y. Chen, and P. Dubey. Holistic sparsecnn: Forging the trident of accuracy, speed, and size. CoRR, abs/1608.01409, 2016.
  • [41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
  • [42] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016.
  • [43] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. CoRR, abs/1804.02767, 2018.
  • [44] M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun. Sbnet: Sparse blocks network for fast inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [45] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
  • [46] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.
  • [47] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, abs/1801.04381, 2018.
  • [48] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [49] C. Tai, T. Xiao, X. Wang, and W. E. Convolutional neural networks with low-rank regularization. CoRR, abs/1511.06067, 2015.
  • [50] J. Wang, W. Bao, L. Sun, X. Zhu, B. Cao, and P. S. Yu. Private model compression via knowledge distillation. CoRR, abs/1811.05072, 2018.
  • [51] X. Wang, R. Zhang, Y. Sun, and J. Qi. Kdgan: Knowledge distillation with generative adversarial networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 775–786. Curran Associates, Inc., 2018.
  • [52] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. CoRR, abs/1608.03665, 2016.
  • [53] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li. Coordinating filters for faster deep neural networks. CoRR, abs/1703.09746, 2017.
  • [54] X. Lan, X. Zhu, and S. Gong. Knowledge distillation by on-the-fly native ensemble. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7517–7527. Curran Associates, Inc., 2018.
  • [55] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In The European Conference on Computer Vision (ECCV), September 2018.
  • [56] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017.
  • [57] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [58] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. CoRR, abs/1702.03044, 2017.
  • [59] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discrimination-aware channel pruning for deep neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 883–894. Curran Associates, Inc., 2018.

Supplementary Material

Appendix A Additional Experimental Results

A.1 Implementation Details

One of the problems of pruning during training is how to handle the shape of the gradient and momentum tensors during the backward pass. In PyTorch [41], the shapes of the gradient and momentum tensors are usually handled by the optimizer, which does not necessarily update their shape during the forward pass. Also, naively defining a new optimizer for the newly pruned model would lose all the values accumulated in the momentum buffer. One way to overcome this is to also prune the gradient and momentum tensors using the same indexes that were used to prune the weight tensor, and then transfer them to a newly defined optimizer.
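The index-sharing idea described above can be illustrated with a minimal sketch. It is written with NumPy arrays standing in for a convolution weight and an SGD momentum buffer, rather than with PyTorch's optimizer internals; the function names, tensor shapes, and the choice of pruned filter are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def sgd_momentum_step(weight, grad, momentum_buf, lr=0.1, mu=0.9):
    """One SGD-with-momentum update: v <- mu * v + g, then w <- w - lr * v."""
    momentum_buf = mu * momentum_buf + grad
    return weight - lr * momentum_buf, momentum_buf

def prune_filters(weight, grad, momentum_buf, keep_idx):
    """Slice weight, gradient and momentum along the filter axis (axis 0)
    with the same indices, so that all three keep matching shapes."""
    return weight[keep_idx], grad[keep_idx], momentum_buf[keep_idx]

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 3, 3, 3))  # 4 filters, 3 input channels, 3x3 kernels
grad = rng.normal(size=weight.shape)
momentum_buf = np.zeros_like(weight)

# One update before pruning, so the momentum buffer holds accumulated values.
weight, momentum_buf = sgd_momentum_step(weight, grad, momentum_buf)

keep_idx = np.array([0, 2, 3])  # hypothetical: filter 1 was selected for removal
kept_momentum_before = momentum_buf[keep_idx].copy()
weight, grad, momentum_buf = prune_filters(weight, grad, momentum_buf, keep_idx)
pruned_momentum = momentum_buf.copy()

# Training continues with the accumulated momentum of the surviving filters.
weight, momentum_buf = sgd_momentum_step(weight, grad, momentum_buf)
```

Re-creating the buffer from zeros after pruning would instead discard the history of the three surviving filters, which is the loss of accumulated values that the paragraph above warns about. In PyTorch, the corresponding buffer for `torch.optim.SGD` lives in the optimizer state under the `momentum_buffer` key, so the same slicing would be applied there before building the new optimizer.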

A.2 Graphical comparison on CIFAR10 with VGG:

The results presented in this section are similar to the ones shown in Tabs. 1 to 4 of our paper. In the main paper, we could only compare the performance of methods with 4 pruning rates due to space constraints. In this section, we compare the performance of methods using the same experimental settings (as in our paper), but with 10 data points (tpruned = 0.1, 0.2, ..., 1.0) for L1 [29], Taylor [38], PSFP [13] and our approach. Since the number of remaining parameters can differ slightly from one algorithm to another, some of the values on the X-axis are rounded up for better visualization.

Results in Figure 3 show that the proposed PGP and RPGP pruning methods consistently outperform the other methods. Note that the proposed methods maintain a low level of error even with a large increase in the pruning rate.

Figure 3: Error rate versus the number of remaining parameters with the proposed and baseline pruning methods for VGG19 on the CIFAR10 dataset.

A.3 L2 vs Gradient Norm:

From the ablation study, we noticed that the performance of the L2 and Gradient Norm criteria is very similar in the case of soft pruning. This can be understood by considering the following:

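$$
\begin{aligned}
\|\mathbf{W}_i^{j}\|_2 &= \|\mathbf{W}_i^{j-1} - \alpha \nabla \mathcal{L}_{j-1}(\mathbf{W}_i^{j-1})\|_2 \\
&= \|\mathbf{W}_i^{j-2} - \alpha \nabla \mathcal{L}_{j-2}(\mathbf{W}_i^{j-2}) - \alpha \nabla \mathcal{L}_{j-1}(\mathbf{W}_i^{j-1})\|_2 \\
&= \Big\|\mathbf{W}_i^{0} - \alpha \sum_{k=0}^{j-1} \nabla \mathcal{L}_{k}(\mathbf{W}_i^{k})\Big\|_2
\end{aligned}
\tag{12}
$$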

where $\mathbf{W}_i^j$ represents the weights of filter i at iteration j in an epoch, α is the learning rate, and $\mathcal{L}_k$ denotes the loss function at iteration k. From Equ. 12, we can observe that the difference between the L2 and Gradient Norm criteria lies in the initial value $\mathbf{W}_i^0$. Given the partial soft pruning nature of our approach, $\mathbf{W}_i^0$ can be zero when the filter is soft pruned. The two criteria therefore tend to have similar values (since α is a scalar, it is not important in this context).
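The unrolled update in Equ. 12 can be checked numerically. The sketch below assumes plain SGD without momentum or weight decay, a filter flattened to three weights, and arbitrary hand-picked gradient values; none of these numbers come from the experiments.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a flat list of weights."""
    return math.sqrt(sum(x * x for x in vec))

def sgd_unroll(w0, grads, lr):
    """Apply plain SGD updates w <- w - lr * g for each gradient in turn."""
    w = list(w0)
    for g in grads:
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

lr = 0.1
# Hypothetical per-iteration gradients of one filter within an epoch.
grads = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5], [-0.75, 0.5, 1.0]]
w0 = [0.0, 0.0, 0.0]  # filter that was soft pruned (zeroed) before the epoch

w_final = sgd_unroll(w0, grads, lr)
grad_sum = [sum(column) for column in zip(*grads)]

weight_norm = l2_norm(w_final)                # L2 criterion on the weights
scaled_grad_norm = lr * l2_norm(grad_sum)     # norm of the accumulated gradient, times lr
print(weight_norm, scaled_grad_norm)
```

With a zero starting point, the two quantities agree up to floating-point error, so ranking filters by either one gives the same order. For a filter whose initial weights are not zero, the two values generally differ.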

A.4 Progressive pruning from scratch vs trained:

Tab. 10 compares the performance obtained by a model that was randomly initialized (scratch) with that of one that was pre-trained on CIFAR10, using the same settings as before (tpruned=50%, r=0.5).

Table 10: Error rate for RPGP when trained from scratch compared to a trained model.
Training Scenario VGG19 ResNet56
Scratch 8.79 % 10.46 %
Pre-trained 8.23 %  9.51 %

From Tab. 10, the difference in accuracy between a network pruned starting from scratch and a network pruned after training is small (0.56 percentage points for VGG19 and 0.95 for ResNet56) and varies with the architecture. Overall, instead of starting from a trained model and then pruning, the proposed techniques can attain similar performance starting from a randomly initialized model, with reduced training and pruning time, and are therefore more suitable for fast deployment.

A.5 Hard vs soft pruning:

RPGP is used with our gradient criterion, a target pruning rate of 50%, and the same hyper-parameters as before. The removal rate r is varied in order to see the impact of having more or less recovery.

Table 11: Error rate for RPGP for different removal rates r.
Networks r=0.3 r=0.5 r=0.7 r=1.0
VGG19 8.74% 8.79% 8.99% 8.92%
ResNet56 10.57% 10.46% 11.03% 10.78%

The results in Tab. 11 show that a removal rate of 0.3 (30%) or 0.5 (50%) gives the best balance between the amounts of hard pruning and soft pruning. It is also interesting to see that, without any soft pruning (r=1.0), the performance of the approach is still close to that obtained with the other removal rates.
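To make the role of the removal rate r concrete, the following sketch splits the filters selected at one pruning step into a hard-pruned and a soft-pruned group. It assumes that a fraction r of the selected filters is removed permanently and the rest are only zeroed (and can recover), and that the lowest-scoring filters are the ones removed; both the function `split_hard_soft` and this ordering are hypothetical and are not specified in this section.

```python
def split_hard_soft(scores, n_prune, r):
    """Select the n_prune lowest-scoring filters, then hard-prune a fraction r
    of them (lowest scores first) and soft-prune the remainder."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    selected = ranked[:n_prune]
    n_hard = round(r * n_prune)
    return selected[:n_hard], selected[n_hard:]

# Hypothetical importance scores for 8 filters of one layer.
scores = [0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 0.8, 0.2]

hard, soft = split_hard_soft(scores, n_prune=4, r=0.5)
print("hard-pruned:", hard, "soft-pruned:", soft)

# r = 1.0 corresponds to the setting without any soft pruning.
hard_only, soft_none = split_hard_soft(scores, n_prune=4, r=1.0)
```

With r = 0.5 half of the selected filters remain recoverable, whereas r = 1.0 leaves none, which corresponds to the last column of Tab. 11.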