### Abstract

Deep neural networks frequently contain far more weights, represented at ahigher precision, than are required for the specific task which they aretrained to perform. Consequently, they can often be compressed using techniquessuch as weight pruning and quantization that reduce both model size andinference time without appreciable loss in accuracy. Compressing models beforethey are deployed can therefore result in significantly more efficient systems.However, while the results are desirable, finding the best compression strategyfor a given neural network, target platform, and optimization objective oftenrequires extensive experimentation. Moreover, finding optimal hyperparametersfor a given compression strategy typically results in even more expensive,frequently manual, trial-and-error exploration. In this paper, we introduce aprogrammable system for model compression called Condensa. Usersprogrammatically compose simple operators, in Python, to build complexcompression strategies. Given a strategy and a user-provided objective, such asminimization of running time, Condensa uses a novel sample-efficientconstrained Bayesian optimization algorithm to automatically infer desirablesparsity ratios. Our experiments on three real-world image classification andlanguage modeling tasks demonstrate memory footprint reductions of up to 65xand runtime throughput improvements of up to 2.22x using at most 10 samples persearch. We have released a reference implementation of Condensa athttps://github.com/NVlabs/condensa.

### Quick Read (beta)

###### Abstract

Deep neural networks frequently contain far more weights, represented at a higher precision, than are required for the specific task which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both model size and inference time without appreciable loss in accuracy. Compressing models before they are deployed can therefore result in significantly more efficient systems. However, while the results are desirable, finding the best compression strategy for a given neural network, target platform, and optimization objective often requires extensive experimentation. Moreover, finding optimal hyperparameters for a given compression strategy typically results in even more expensive, frequently manual, trial-and-error exploration. In this paper, we introduce a programmable system for model compression called Condensa. Users programmatically compose simple operators, in Python, to build complex compression strategies. Given a strategy and a user-provided objective, such as minimization of running time, Condensa uses a novel sample-efficient constrained Bayesian optimization algorithm to automatically infer desirable sparsity ratios. Our experiments on three real-world image classification and language modeling tasks demonstrate memory footprint reductions of up to $65\times $ and runtime throughput improvements of up to 2.22x using at most 10 samples per search. We have released a reference implementation of Condensa at https://github.com/NVlabs/condensa.

oddsidemargin has been altered.

marginparsep has been altered.

topmargin has been altered.

marginparwidth has been altered.

marginparpush has been altered.

paperheight has been altered.

The page layout violates the ICML style.
Please do not change the page layout, or include packages like geometry,
savetrees, or fullpage, which change it for you.
We’re not able to reliably undo arbitrary changes to the style. Please remove
the offending package(s), or layout-changing commands and try again.

A Programmable Approach to Model Compression

Vinu Joseph ^{0 }
Saurav Muralidharan ^{0 }
Animesh Garg ^{0 }
Michael Garland ^{0 }
Ganesh Gopalakrishnan ^{0 }

^{†}

^{†}footnotetext: \forloop@affilnum1<

^{0}NVIDIA. Correspondence to: Vinu Joseph <[email protected]>.

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. Distribution Statement "A" (Approved for Public Release, Distribution Unlimited).\@xsect

Modern deep neural networks (DNNs) are complex, and often contain millions of parameters spanning dozens or even hundreds of layers (he2016deep; huang:2017). This complexity engenders substantial memory and runtime costs on hardware platforms at all scales. Recent work has demonstrated that DNNs are often over-provisioned and can be compressed without appreciable loss of accuracy. Model compression can be used to reduce both model memory footprint and inference latency using techniques such as weight pruning (han2015learning; luo2017thinet), quantization (gupta2015deep), and low-rank factorization (jaderberg2014speeding; denton2014exploiting). Unfortunately, the requirements of different compression contexts—DNN structure, target hardware platform, and the user’s optimization objective—are often in conflict. The recommended compression strategy for reducing inference latency may be different from that required to reduce total memory footprint. For example, for a Convolutional Neural Network (CNN), the former strategy may prune convolutional filters (li2016pruning), while the latter may prune individual non-zero weights. Similarly, even for the same optimization objective, say reducing inference latency, one may employ filter pruning for a CNN, while pruning 2-D blocks of non-zero weights (gray:2017) for a language modeling network such as Transformer (vaswani:2017), since the latter has no convolutional layers. Thus, it is crucial to enable convenient expression of alternative compression schemes, yet none of today’s model compression approaches help the designer tailor compression schemes to their needs.

Current approaches to model compression also require manual specification of compression hyperparameters, such as the target sparsity ratio, which is the proportion of zero-valued parameters in the compressed model vs. the original. Finding the best sparsity ratio often becomes a trial-and-error search in practice, since compression hyperparameter values vary unpredictably with changes in the compression context. This makes it difficult to provide users with a rule of thumb, much less a single number, to apply when faced with the need to select a hyperparameter value. Each trial in this approach has a huge cost (hours or days for larger models), as it requires training the compressed model to convergence, with most of these manually orchestrated trials ending up in unmet compression objectives. Thus, automation is a crucial requirement to support the needs of designers who must adapt a variety of neural networks to a broad spectrum of platforms targeting a wide range of tasks.

As an illustration of the level of automation provided by Condensa,
consider the problem of improving the
inference throughput of VGG-19 (simonyan2014very) on the CIFAR-10 image
classification task (krizhevsky2014cifar).
Since VGG-19 is a convolutional neural network, one way to improve its
inference performance on modern hardware such as GPUs is by pruning
away individual convolutional filters (he2018progressive).
Figure 1 shows the accuracy and throughput obtained
by Condensa on this task.
Here, we plot the compressed model’s top-1 test accuracy and throughput as a function of the sparsity ratio (green and red lines,
respectively).^{1}^{1}
1
Note that these curves are not known a priori and
are often extremely expensive to sample;
they are only plotted here to better place the obtained solution in context.
Condensa’s solution corresponds to a sparsity ratio of $0.72$
and is depicted as the vertical dashed line.
This result is significant for two reasons: (1) using the Condensa library,
the filter pruning strategy employed for this experiment was expressed in
less than 10 lines of Python code, and (2) the optimal sparsity ratio of
$0.72$ (shown as the vertical dashed line in the Figure) that
achieves a state-of-the-art throughput of $2130$ images/sec ($2.22\times $ improvement) and a top-1 accuracy improvement of $0.5\%$
was obtained automatically by Condensa using a sample-efficient constrained
Bayesian optimization algorithm.
For this to work, the user didn’t have to specify any
sparsity ratios manually, and instead only had to define a domain-specific
objective function to maximize (inference throughput, in this case).

Condensa supports the expression of the overall compression scheme in Python using operators provided by the Condensa library. Since each scheme is a Python function, users are able to programmatically compose elementary schemes to build much more complex and practically interesting schemes. Condensa accepts a black-box objective function (also expressed in Python) on the target compressed model that is maximized or minimized to automatically find corresponding compression hyperparameters such as sparsity ratios. This programmable approach to model compression enables users to experiment and rapidly converge to an ideal scheme for a given compression context, avoiding manual trial-and-error search. Given Condensa’s ability to support the expression of meaningful high-level objective functions—for example, the throughput (images/sec) of a convolutional neural network—users are freed from the burden of having to specify compression hyperparameters manually.

This paper makes the following contributions: (1) it introduces Condensa, a novel programming system for model compression and demonstrates its ease-of-use for expressing complex compression schemes, (2) it presents the first sample-efficient constrained Bayesian optimization-based method for automatically inferring optimal sparsity ratios based on a user-provided objective function, and (3) it demonstrates the effectiveness of Condensa on three image classification and language modeling tasks, resulting in memory footprint reductions of up to $65\times $ and runtime throughput improvements of up to $2.22\times $ using at most 10 samples per search.

For a given task such as image classification, assume we have trained a large reference model $\overline{\mathbf{w}}=\underset{\mathbf{w}}{argmin}L(\mathbf{w})$, where $L()$ denotes a loss function (e.g., cross-entropy on a given training set), and $\mathbf{w}\in {\mathbb{R}}^{P}$. Model compression refers to finding a smaller model $\mathrm{\Theta}$ that can be applied to the same task and ideally achieves the same accuracy as $\overline{\mathbf{w}}$. Model compression can be performed in various ways, and Condensa currently supports two commonly used techniques: pruning and quantization. In pruning, non-zero values from $\overline{\mathbf{w}}$ are eliminated or “pruned” to obtain $\mathrm{\Theta}$. Pruning is usually performed using some kind of thresholding (for eg., magnitude-based) and can be unstructured (prune any non-zero value) or structured (prune only blocks of non-zeros). On the other hand, quantization retains the number of parameters in $\mathrm{\Theta}$ but assigns parameters in $\overline{\mathbf{w}}$ one of K codebook values, where the codebook may be fixed or adaptive. Condensa supports low-precision approximation, which refers to assigning each parameter in $\overline{\mathbf{w}}$ a corresponding lower-precision representation (for example, converting from 32-bit to 16-bit floating-point) and is equivalent to quantization using a fixed codebook.

There is considerable prior work on accelerating neural networks using structured weight pruning (frankle2018lottery; han2015learning; han2015deep; luo2017thinet; han2017ese; dong2017more; han2016eie; polyak2015channel; hu2016network; anwar2016compact; molchanov2016pruning), quantization (zhu2016trained; gong2014compressing) and low-rank tensor factorization (lebedev2014speeding; xue2013restructuring; denton2014exploiting; girshick2015fast). Most of these individual compression schemes for pruning and quantization and their combinations can be expressed in Condensa. Two common problems with these existing methods are: (1) determining optimal sparsity ratios at a global (network) level, and (2) distributing global sparsity into a particular sparsity ratio for each layer. We tackle these problems efficiently and systematically using our Bayesian and L-C optimizers, respectively, as described in Section id1.

Bayesian optimization has previously been demonstrated to work well for general hyperparameter optimization in machine learning and neural architecture search (snoek2012practical; dai2019chamnet). To the best of our knowledge, we are the first to use sample-efficient search via Bayesian optimization for obtaining compression hyperparameters. Automation in model compression is currently achieved either through reinforcement learning (RL) algorithms (he2018amc) or simulated annealing (liu2019autoslim). In particular, the automation procedure for AMC (he2018amc) uses four arbitrary stages of pruning and re-training for RL training; additionally, the reward function is difficult to design, and even given a good reward, local optima can be hard to escape. It is also difficult to determine when such methods may just be overfitting to irrelevant patterns in the environment. Even disregarding generalization issues, AMC’s agent (DDPG) uses trial and error, which is characterized to have an underlying incompatibility with the target pruning problem (liu2019autoslim). AutoSlim (liu2019autoslim) proposes an automated approach based on simulated annealing, and uses the ADMM algorithm for accuracy recovery, which is an AL-based method very similar to the L-C algorithm; AutoSlim, however, only supports weight pruning and does not support general compression schemes as Condensa does.

General accuracy recovery algorithms capable of handling a wide variety of compression techniques provide the foundation for systems like Condensa. Apart from the L-C algorithm (carreira2017model) which Condensa uses, other recent accuracy recovery algorithms have been proposed. ADAM-ADMM (zhang2018adam) proposes a unified framework for structured weight pruning based on ADMM that performs dynamic regularization in which the regularization target is updated in each iteration. DCP (zhuang2018discrimination) introduces additional losses into the network to increase the discriminative power of intermediate layers and select the most discriminative channels for each layer by considering the additional loss and the reconstruction error. Condensa can readily support such algorithms as additional optimizers as described in Section id1. Neural network distiller (neta_zmora_2018_1297430) and TensorFlow model optimization toolkit (tftoolkit) are two recent open-source model compression frameworks that support multiple compression schemes. While these projects share a number of common goals with Condensa, they differ in two important ways: first, they do not support the expression of schemes as imperative programs containing control-flow, iteration, recursion, etc. (Distiller requires a declarative compression specification in YAML, while the TensorFlow model optimization toolkit operates by modifying the DNN computation graph directly); second, these frameworks do not support automatic compression hyperparameter optimization for black-box objective functions.

Figure 2 provides a high-level overview of the Condensa framework. As shown on the left side of the figure, a user compresses a pre-trained model $\overline{\mathbf{w}}$ by specifying a compression scheme and an objective function $f$. Both the scheme and objective are specified in Python using operators from the Condensa library; alternatively, users may choose from a selection of commonly used built-in schemes and objectives. The Condensa library is described in more detail in Section id1. Apart from the operator library, the core framework, shown in the middle of the figure, consists primarily of two components: (1) the constrained Bayesian optimizer for inferring optimal sparsity ratios, and (2) the L-C optimizer for accuracy recovery. These components interact with each other as follows: at each iteration, the Bayesian optimizer samples a sparsity ratio $s$, which is fed into the L-C optimizer. The L-C optimizer distributes this global sparsity across all the layers of the network and performs accuracy recovery (this process is described in more detail in Section id1), passing the final obtained accuracy $A(s)$ back to the Bayesian optimizer. The compressed model $w$ obtained by the L-C optimizer is also used to evaluate the user-provided objective function $f$, the result of which is fed into the Bayesian optimizer. Based on these inputs ($A(s)$ and $f(w)$), the Bayesian optimizer decides the next point to sample. The sparsity ratio that satisfies both the accuracy and objective constraints (${s}^{*}$) is used to obtain the final compressed model (denoted as $\mathrm{\Theta}$ in the figure). The L-C and Bayesian optimizers are described in more detail in Sections id1 and id1, respectively, and the sparsity inference algorithm is presented in Algorithm 1.

Listing 1 provides a concrete example of invoking Condensa to compress a model. Here, we first train the reference models (lines 2-3) and instantiate the pre-built Prune scheme for unstructured pruning (line 6; see Table 1 for a full list of pre-built schemes). We also define our objective function to be throughput (line 8) and specify that it must be maximized (line 10); note that while users may define their own objective functions, Condensa also comes bundled with some common objective functions such as model memory footprint and throughput. Next, we instantiate the L-C optimizer (line 12) and the model compressor (lines 14-24). The model compressor (Compressor class in Listing) automatically samples and evaluates global sparsity ratios as described in Section id1 and returns the final compressed model.

The Condensa Library provides a set of operators for constructing complex compression schemes programmatically in Python. Three sets of operators are currently supported: (1) the quantize and dequantize operators for converting network parameters from a 32-bit floating-point representation to a lower-precision one such as 16-bit floating-point, and in the opposite direction, respectively; (2) the prune operator for unstructured magnitude-based pruning, and (3) the filter_prune, neuron_prune, and blockprune operators for pruning blocks of nonzeros (structure pruning). Each operator can be applied on a per-layer basis. A decompression scheme needs to be specified only when at least one of the operators in the corresponding compression scheme performs quantization, as described in Section id1.

Condensa’s tight integration with the Python ecosystem makes the expression of common compression patterns more natural. For example, operators can be combined with conditional statements to selectively compress layers based on properties of the input DNN and/or target hardware platform, as shown below:

Similarly, the use of iteration statements obviates the need for applying compression operators individually for each layer, resulting in more concise and readable schemes. This is in contrast to frameworks such as Distiller (neta_zmora_2018_1297430) which require a per-layer declarative compression specification.

In addition to the layer-wise operators described above, the Condensa Library also includes a set of pre-built compression schemes that operate on the full model. Condensa includes schemes for unstructured and structured pruning, quantization, and composition of individual schemes. These schemes handle a number of low-level details such as magnitude threshold computation from a sparsity ratio, filter/neuron/block aggregation, etc., enabling non-expert users to quickly get started with Condensa without knowledge of low-level implementation details. The current set of pre-built schemes is listed in Table 1. Line 6 in Listing 1 shows an example instantiation of the Prune scheme.

Scheme | Description |
---|---|

Quantize(dtype) | Quantizes network weights to given datatype dtype. |

Prune() | Performs unstructured pruning of network weights. |

NeuronPrune(criteria) | Aggregates and prunes neurons (1D blocks) according to criteria. |

FilterPrune(criteria) | Aggregates and prunes filters (3D blocks) according to criteria. |

StructurePrune(criteria) | Combines neuron and filter pruning. |

BlockPrune(criteria, bs) | Aggregates and prunes n-D blocks of size bs according to criteria. |

Compose(slist) | Composes together all schemes in slist. |

As described earlier in this section, given a reference model, compression scheme, and compression hyperparameter values (obtained automatically by the Bayesian hyperparameter optimization subsystem described in Section id1), Condensa tries to recover any accuracy lost due to compression. While the compressed model, denoted as $\mathrm{\Theta}$, can be obtained by directly zeroing out lower-magnitude parameters from the reference model $\overline{\mathbf{w}}$ (a technique referred to as direct compression), the resulting model $\mathrm{\Theta}$ is generally sub-optimal w.r.t. the loss since the latter is ignored in learning $\mathrm{\Theta}$. Instead, we desire an accuracy recovery algorithm that obtains an optimally compressed model with locally optimal loss. An effective accuracy recovery mechanism for Condensa must ideally have three important attributes: (1) able to handle all the compression operators supported by Condensa, (2) be efficient with relatively low overheads, and (3) provide optimality guarantees whenever possible. In this paper, we use the recently proposed L-C algorithm (carreira2018learning), since it satisfies all three of the above requirements. In L-C, model compression is formulated as a constrained optimization problem:

$${\text{min}}_{\mathbf{w},\mathbf{\Theta}}L(\mathbf{w})\mathit{\hspace{1em}\hspace{1em}}\text{s.t.}\mathit{\hspace{1em}\hspace{1em}}\mathbf{w}=\mathcal{D}(\mathrm{\Theta})$$ | (1) |

Here, the *decompression mapping* $\mathcal{D}:\mathrm{\Theta}\in {\mathbb{R}}^{Q}\to \mathbf{w}\in {\mathbb{R}}^{P}$ maps a low-dimensional parameterization to uncompressed model weights, and the *compression mapping* $\mathcal{C}(\mathbf{w})=\underset{\mathrm{\Theta}}{argmin}{\parallel \mathbf{w}-\mathcal{D}(\mathrm{\Theta})\parallel}^{2}$ behaves
similar to the inverse of $\mathcal{D}$.
This formulation naturally supports a number of well-known compression techniques. In particular, pruning is defined as $\mathbf{w}=\mathcal{D}(\mathrm{\Theta})=\mathrm{\Theta}$ where $\mathbf{w}$ is real and $\mathrm{\Theta}$ is constrained to have fewer nonzero values by removing (zeroing out) lower magnitude weights; low-precision approximation defines a constraint ${w}_{i}={\theta}_{i}$ per parameter where ${w}_{i}$ is in a higher-precision representation and ${\theta}_{i}$ is in a lower-precision one.

Eq. 1 is non-convex due to two reasons: (1) the original problem of training the reference model is already non-convex for models such as DNNs, making the objective function of Eq 1 non-convex, and (2) the decompression mapping $\mathcal{D}(\mathrm{\Theta})$ typically adds another layer of non-convexity caused by an underlying combinatorial problem. While a number of non-convex algorithms may be used to solve Eq 1, we focus on the augmented Lagrangian (AL) method (wright1999numerical) implemented in the L-C algorithm (carreira2018learning) in this paper, since it is relatively efficient and easy to implement. As its name indicates, the L-C algorithm alternates between two steps: a learning (L) step which trains the uncompressed model but with a quadratic regularization term, and a compression (C) step, which finds the best compression of $\mathbf{w}$ (the current uncompressed model) and corresponds to the definition of compression mapping $\mathcal{C}$. Due to space restrictions, we refer the reader to (carreira2018learning) for a more detailed description of the L-C algorithm. Other recent AL-based algorithms that could potentially be used include ADMM (zhang2018adam) and DCP (zhuang2018discrimination).

It is intuitive to split the problem of finding optimal sparsity ratios into two stages: (1) find the highest sparsity value that loses at most $\u03f5$ accuracy w.r.t the original uncompressed model, and (2) in a constrained sparsity regime obtained from stage I, optimize a user-provided objective function $f$ (e.g., throughput, or memory footprint) and return the solution as the final sparsity ratio.

It is worth noting that optimizing performance characteristics (accuracy, throughput, and so on) against sparsity ratios requires access to function $f$, and often assumes cheap function evaluation. However, for compression, each function evaluation may amount to optimizing the full model, which is computationally prohibitive.

Condensa leverages black box sample-efficient Bayesian optimization to optimize objective $f$ with accuracy constraints. Bayesian optimization solves for the minimum of a black-box function $f(\mathbf{x})$ on some bounded set $\mathcal{X}$, which we take to be a subset of ${\mathbb{R}}^{D}$ (mockus1978application; jones2001taxonomy). These methods construct a probabilistic model of $f$ with sequential evaluation, and then exploit this model for sequential selection of information gathering actions—the choice of $x\in \mathcal{X}$. This procedure leverages all function evaluations instead of only local gradient approximations, and hence is sample efficient even for non-convex black-box functions (brochu2010tutorial).

A Bayesian optimization algorithm requires two design choices: a prior and an acquisition function. The prior captures assumptions about smoothness and continuity of function $f$, while the acquisition function expresses a utility function over the model posterior for sequential decisions.

Gaussian Process Prior. The Gaussian Process (GP) is a computationally convenient prior distribution on functions that allows for closed-form marginal and conditional computations (rasmussen2006gaussian). The GP is defined by the property that any finite set of $N$ points ${\{{\mathbf{x}}_{n}\in \mathcal{X}\}}_{n=1}^{N}$ induces a multivariate Gaussian distribution on ${\mathbb{R}}^{N}$. We assume that the function $f(x)$ is drawn from a GP prior and that our observations are of the form ${\{{\mathbf{x}}_{n},{y}_{n}\}}_{n=1}^{N}$, where ${y}_{n}\sim \mathcal{N}(f({\mathbf{x}}_{n}),\nu )$ and $\nu $ is the variance of noise introduced into the function observations. The support and properties of the resulting distribution on functions are determined by a mean function $m:\mathcal{X}\to \mathbb{R}$ and a positive definite covariance function $K:\mathcal{X}\times \mathcal{X}\to \mathbb{R}$.

Design of Acquisition Function. The GP prior and sequential function evaluations induce a posterior over the function of interest $f$; the acquisition function, which we denote by $a:\mathcal{X}\to {\mathbb{R}}^{+}$ is the utility model that guides the next best point for function evaluation. These acquisition functions depend on the previous observations $\{{\mathbf{x}}_{n},{y}_{n}\}$ and the GP hyperparameters $\rho $; we denote this dependence as $a(\mathbf{x};\{{\mathbf{x}}_{n},{y}_{n}\},\rho )$. Under the Gaussian process prior, the acquisition function depends on the model solely through its predictive mean function $\mu (\mathbf{x};\{{\mathbf{x}}_{n},{y}_{n}\},\rho )$ and predictive variance function ${\sigma}^{2}(\mathbf{x};\{{\mathbf{x}}_{n},{y}_{n}\},\rho )$. For this discussion, we denote the best current value as ${\mathbf{x}}_{\text{next}}={\text{argmin}}_{{\mathbf{x}}_{n}}f({\mathbf{x}}_{n})$ and the cumulative distribution function of the standard normal as $\mathrm{\Phi}(\cdot )$. The choice of acquisition function depends on the overall problem objective, as illustrated following.

1. Probability of Improvement. This intuitive strategy maximizes the probability of improving over the best current value (kushner1964new). Under the GP this can be computed analytically as: ${a}_{\text{PI}}(\mathbf{x};\{{\mathbf{x}}_{n},{y}_{n}\},\rho )=\mathrm{\Phi}(\gamma (\mathbf{x}))$, where $\gamma (\mathbf{x})=\frac{f({\mathbf{x}}_{\text{best}})-\mu (\mathbf{x};\{{\mathbf{x}}_{n},{y}_{n}\},\rho )}{\sigma (\mathbf{x};\{{\mathbf{x}}_{n},{y}_{n}\},\rho )}$.

2. Expected Improvement. Alternatively, one could choose to maximize the expected improvement (EI) over the current best. This also has closed form under the Gaussian process: ${a}_{\text{EI}}(\mathbf{x};\{{\mathbf{x}}_{n},{y}_{n}\},\rho )=\sigma (x;\{{\mathbf{x}}_{n},{y}_{n}\},\rho )-\kappa \sigma (\mathbf{x};\{{\mathbf{x}}_{n},{y}_{n}\},\rho )$, with a tunable $\kappa $ to balance exploitation against exploration.

3. Upper/Lower Confidence Bound. Here, the functional approximation uncertainty is leveraged for acquisition through lower (upper) confidence bounds for functional min (max) (srinivas2009gaussian). These acquisition functions have the form ${a}_{\text{UCB}}(\mathbf{x};\{{\mathbf{x}}_{n},{y}_{n}\};\rho )=\mu (\mathbf{x};\{{\mathbf{x}}_{n},{y}_{n}\},\rho )-\kappa \sigma (\mathbf{x};\{{\mathbf{x}}_{n},{y}_{n}\},\rho )$, with a tunable $\kappa $ to balance exploitation against exploration.

[!t] \[email protected]@algorithmic[1] \STATEInput: $\overline{\mathbf{w}}$,$\u03f5$ \STATEOutput: ${s}^{*}$ \STATEAcqFn $\leftarrow $ ILS-UCB($L$ = ${\overline{\mathbf{w}}}_{acc}-\u03f5$, $s=(0,1)$) \STATE${s}_{acc}\leftarrow $ BayesOpt(${\mathcal{B}}_{f}$ = L-C, AcqFn) \STATEAcqFn $\leftarrow $ GP-UCB($s=(0,{s}_{acc})$) \STATE${s}^{*}\leftarrow $ BayesOpt(${\mathcal{B}}_{f}=f$, AcqFn) \STATE \STATEBayesOpt \STATEInput: ${\mathcal{B}}_{f}$, AcqFn \STATEOutput: $s$ \STATEGP $\leftarrow $ GP-Regressor.initialize() \FORt $\leftarrow $ 0, 1, 2, … \STATE${s}_{t}\leftarrow $ argmax${}_{\mathrm{s}}$AcqFn($s|{D}_{1:t-1})$ \STATE${y}_{t}\leftarrow f({s}_{t})$ \STATE${D}_{1:t}\leftarrow \{{D}_{1:t-1},({s}_{t},{y}_{t})\}$ \STATEGP.Update(${D}_{1:t}$) \IF$t>0$ and ${s}_{t}=={s}_{t-1}$ \RETURN${s}_{t}$ \ENDIF\ENDFOR

4. Level-Set Optimization. In addition to unconstrained optimization, to enable Condensa to achieve constraint satisfaction we build on top of level-set black-box optimization (bogunovic2016truncated; garg2016tumor; zanette2018robust). We leverage a Gaussian Process Adaptive Sampling criterion called Implicit Level Set Upper Confidence Bound (ILS-UCB) (garg2016tumor), that prioritizes sampling near a level set of the estimate. This algorithm prioritizes searching the expected L-C curve intersection with user accuracy constraints, conditional on estimated uncertainty, and does not seek to precisely learn the shape of the entire L-C curve. Intuitively, by reducing the estimation space to specifically localize the sparsity that meets user accuracy constraints, we can reduce the total number of measurements-and consequently the time required to achieve an optimal value for the sparsity. Hence, rather than prioritizing both high variance and high mean like UCB, ILS-UCB prioritizes sampling in areas near a level set of the mean represented by the Gaussian Process Implicit Surface, i.e. to minimize the implicit potential defined by $\mu (\mathbf{x})-L$, and where the confidence interval is large:

$${\mathbf{x}}_{t}=\underset{\mathbf{x}\in X}{\text{argmax}}(1-\gamma )\sigma (\mathbf{x})-\gamma |\mu (\mathbf{x})-L|$$ | (2) |

Maximizing the acquisition function. Condensa uses a combination of random sampling and the L-BFGS-B optimization method to find the maximum of the acquisition function. We first sample a few $(1e5)$ warmup points at random, and then run L-BFGS-B from $250$ random starting points. To find the point at which to sample, we still need to maximize the constrained objective $u(\mathbf{x})$. Unlike the original objective function, $u\mathit{}\mathrm{(}\mathrm{\cdot}\mathrm{)}$ can be cheaply sampled. Existing works optimize the acquisition function using DIRECT (jones1993lipschitzian), a deterministic, derivative-free optimizer. It uses the existing samples of the objective function to decide how to proceed to divide the feasible space into finer rectangles. Other methods such as Monte Carlo and multi-start have also been used, and seem to perform reasonably well (mockus1994application; lizotte2008practical). Note that the second term in Equation 2 is negative, as we are trying to sample in locations where the distance to the level set is minimized. To find the point at which to sample, we still need to maximize the constrained objective $u(\mathbf{x})$. Unlike the original objective function $f$, $u(\cdot )$ can be cheaply sampled. In Condensa we use GP-UCB (GP-LCB) for function maximization (minimization) and ILS-UCB for solving constraints, as shown in Algorithm 1.

Algorithm Summary. We describe Condensa’s two-stage optimization pipeline in Algorithm 1. Here, we first find a sparsity value ${s}_{acc}$ that constrains the accuracy function $A$ to the provided $\u03f5$. We then constrain the search space to $(0,{s}_{acc})$ while optimizing the user-provided objective function $f$. The BAYESOPT function runs a Bayesian optimization loop given a target objective function ${\mathcal{B}}_{f}$ and an acquisition function. Note that we assume that $A$ decreases monotonically w.r.t. sparsity in the region $(0,{s}_{acc})$.

The Condensa library and L-C optimizer are implemented in Python and are designed to inter-operate seamlessly with the PyTorch framework (pytorch). While we chose PyTorch for its widespread use in the machine learning community, it is worth noting that Condensa’s design is general and that its features can be implemented in other similar frameworks such as TensorFlow (abadi2016tensorflow) and MXNET (chen2015mxnet). We currently use a publicly available Python library for Bayesian global optimization with Gaussian Processes (fmfnbo).

Network Thinning Condensa comes pre-built with three structure pruning schemes: filter, neuron, and block pruning, as shown in Table 1. The application of these schemes may yield zero structures, which refer to blocks of zeros within a DNN’s parameters. Network thinning refers to the process of identifying and removing such zero structures and consequently reducing the number of floating-point operations executed by the target hardware platform. Condensa employs a three-phase network thinning algorithm for structure pruning: in the first phase, we construct an in-memory graph representation of the target DNN. PyTorch makes this non-trivial, as its eager execution semantics preclude it from ever building a full graph-based representation of the DNN. To overcome this, we trace a forward execution path of the DNN and use it to construct an in-memory representation based on the ONNX format. In the next phase, we create a thinning strategy by analyzing the dependencies between the nodes of the graph constructed in the first phase. This step primarily involves keeping track of tensor dimension changes in a node due to thinning and ensuring that the corresponding tensor dimensions of the node’s successors are appropriately adjusted. Due to the possibility of complex dependence patterns such as skip nodes in real-world DNNs (for example, deep residual networks (he2016deep)), this step is the most challenging to implement. In the final phase, we apply the thinning strategy obtained in phase 2 and physically alter tensor shapes to obtain the final thinned network. The Condensa Library provides the thin method which can be used to thin a given compressed model.

We conduct extensive experiments and fully analyze Condensa on three tasks: (1) image classification on CIFAR-10 (krizhevsky2014cifar), (2) image classification on ILSVRC (ImageNet) (deng2009imagenet), and (3) language modeling on WikiText-2 (merity2016pointer). We optimize the networks in each task for two distinct objectives: (1) minimize their memory footprint, and (2) maximize their inference throughput. We now describe the individual tasks and optimization objectives in more detail. We also describe how the Bayesian and L-C optimizers are set up.

Image Classification on CIFAR-10 The CIFAR-10 dataset (krizhevsky2014cifar) consists of $50k$ training and $10k$ testing $32\times 32$ images in 10 classes. We train the VGG-19 (simonyan2014very) and ResNet56 (he2016deep) neural networks on this dataset for $160$ epochs with batch normalization, weight decay (${10}^{-4}$), decreasing learning rate schedules (starting from $0.1$) and augmented training data.

Image Classification on ImageNet Here, we use the VGG-16 neural network (simonyan2014very) trained on the challenging ImageNet task (deng2009imagenet), specifically the ILSVRC2012 version. We use PyTorch (pytorch) and default pretrained models as a starting point.

Method | Dataset | Network | Accuracy/Log Perplexity | ${s}^{*}$ | BO-Samples | ${r}_{c}$ | ${r}_{T}/{s}_{F}$ |

Baseline | CIFAR-10 | VGG19-BN | 92.98% | ||||

Condensa P+Q ($\u03f5=2\%$) | CIFAR-10 | VGG19-BN | $93.04\%$ | $0.97$ | 8,7 | $65.25\times $ | N/A |

Condensa Filter ($\u03f5=2\%$) | CIFAR-10 | VGG19-BN | $93.51\%$ | $0.72$ | 9,8 | N/A | ${r}_{T}=2.22\times $ |

Baseline | CIFAR-10 | ResNet56 | 92.75% | ||||

AMC (he2018amc) | CIFAR-10 | ResNet56 | $90.1\%$ | N/A | N/A | N/A | ${s}_{F}=2\times $ |

Condensa P+Q ($\u03f5=2\%$) | CIFAR-10 | ResNet56 | $91.2\%$ | $0.94$ | 7,7 | $27\times $ | N/A |

Condensa Filter ($\u03f5=2\%$) | CIFAR-10 | ResNet56 | $91.29\%$ | $0.72$ | 7,7 | N/A | ${r}_{T}=1.07\times $ |

Baseline | ImageNet | VGG16-BN | 91.5% | ||||

Filter Pruning (he2017channel) | ImageNet | VGG16-BN | $89.80\%$ | N/A | N/A | $\approx 4\times $ | N/A |

AutoSlim (liu2019autoslim) | ImageNet | VGG16-BN | $90.90\%$ | N/A | N/A | $6.4\times $ | N/A |

AMC (he2018amc) | ImageNet | VGG16-BN | $90.10\%$ | N/A | N/A | N/A | ${s}_{F}=1.25\times $ |

Condensa P+Q ($\u03f5=2\%$) | ImageNet | VGG16-BN | $89.89\%$ | $0.92$ | 8,7 | $25.59\times $ | N/A |

Condensa Filter ($\u03f5=2\%$) | ImageNet | VGG16-BN | $90.25\%$ | $0.12$ | 9,7 | N/A | ${r}_{T}=1.16\times $ |

Baseline | WikiText-2 | LSTM | $4.70$ | ||||

(yu2019playing) | WikiText-2 | LSTM | $4.70$ | N/A | N/A | $\approx 10\times $ | N/A |

Condensa P+Q ($\u03f5=2\%$) | WikiText-2 | LSTM | $4.75$ | $0.92$ | 9,7 | $4.2\times $ | N/A |

Condensa Block ($\u03f5=2\%$) | WikiText-2 | LSTM | $4.77$ | $0.61$ | 8,7 | N/A | ${s}_{F}=2.2\times $ |

Language Modeling on WikiText-2 We trained a 2-layer LSTM model to perform a language modeling task on the WikiText-2 dataset (merity2016pointer). We used a hidden state size of $650$ and included a dropout layer between the two RNN layers with a dropout probability of $0.5$. The LSTM received word embeddings of size $650$. For training, we used truncated Backpropagation Through Time (truncated BPTT) with a sequence length of $50$. The training batch size was set to $30$, and models were optimized using SGD with a learning rate of $20$. This setup is similar to the one used by Yu et al. (yu2019playing).

Objective 1: Minimize Memory Footprint The memory footprint of a model is defined as the number of bytes consumed by the model’s non-zero parameters. Reducing the footprint below a threshold value is desirable, especially for memory-constrained devices such as mobile phones, and can be accomplished through either pruning or quantization, or both. While some real-world sparse matrix formats such as CSR and COO incur additional overheads in terms of storage, we believe that the footprint of a model is an effective proxy. The objective function $f$ in this case is defined as follows:

For reducing footprint, we define a compression scheme that performs unstructured pruning of each learnable layer (except batch normalization layers), and then quantizes it to half-precision floating-point, yielding an additional 2x reduction. In Condensa, this scheme can be constructed using the Compose operator as shown below (see Table 1 for the full list of schemes):

Objective 2: Maximize Throughput Inference throughput is defined as the number of input samples processed by a model per second, and is commonly used for measuring real-world performance. For CIFAR-10 and ImageNet, we measure hardware inference throughput of the compressed model in the objective function. We use an NVIDIA Titan V GPU with the TensorRT 5 framework to obtain throughput data. For WikiText-2, due to the lack of optimized block-sparse kernels for PyTorch, we measure the floating-point operations (FLOPs) of the compressed model instead as a proxy for inference performance. To improve throughput, we focus on removing entire blocks of non-zeros, such as convolutional filters, since they have been proven to improve performance on real-world hardware (he2018progressive; gray:2017). For CIFAR-10 and ImageNet, we use filter pruning, since all the networks we consider are CNNs. In WikiText-2, we employ block pruning with a block size of 5. Condensa makes it convenient to customize schemes in this manner based on the input DNN structure.

Bayesian Optimizer Settings We use a Gaussian Processes prior with the Matern kernel ($\nu =2.5$), length scale of $1.0$ and $\alpha $ value of $0.1$ with normalization of the predictions. For the GP regressor, the noise level in the covariance matrix is governed by another parameter, which we set to a very low value of ${10}^{-6}$. For the ILS-UCB acquisition function, we use a $\kappa $ value of $0.95$ for all our experiments with a bias towards sampling more in the area of level set, with the intention that the Bayesian optimizer results in a favorable sparsity level in as few samples as possible. We stop the Bayesian optimization loop according to the termination condition specified in Algorithm 1.

L-C Optimizer Settings The L-C optimizer was configured as follows: for all experiments, we use ${\mu}_{j}={\mu}_{o}{a}^{j}$, with ${\mu}_{0}={10}^{-3}$ and $a=1.1$ where $j$ is the L-C iteration. For CIFAR-10 and ImageNet, we use the SGD optimizer in the learning (L) step with a momentum value of 0.9, with the learning rate decayed from $0.1$ to ${10}^{-5}$ over each mini-batch iteration. We use the Adam optimizer in the L-step of WikiText-2 with a fixed learning rate of ${10}^{-4}$. We ran between 4000-5000 mini-batch iterations in each L-step, with a higher number of iterations in the first L-step ($30k$ for CIFAR-10 and ImageNet, and $7k$ for WikiText-2) as recommended by (carreira2018learning). We ran 5, 30, and 50 L-C iterations for WikiText-2, ImageNet, and CIFAR-10, respectively; compared to CIFAR-10, we ran relatively fewer iterations for ImageNet due to its significantly higher computational cost, and ran an extra 5 fine-tuning iterations instead. We use the same mini-batch sizes as during training for all experiments, and use validation datasets to select the best model during compression (we perform a 9:1 training:validation split for CIFAR-10 since it doesn’t include a validation dataset).

We present the memory footprint reductions and inference throughput improvements obtained by Condensa for each of the three tasks we evaluate in Table 2. For each task, we list the sparsity ratio obtained by the Condensa Bayesian optimizer (${s}^{*}$ in the table), its corresponding accuracy/perplexity (top-1 accuracy, top-5 accuracy, and log perplexity for CIFAR-10, ImageNet, and WikiText-2, respectively), memory footprint reductions using pruning and quantization (column labeled ${r}_{c}$), and inference throughput/FLOP improvements using filter/block pruning (column labeled ${r}_{T}/{s}_{F}$). We also show the number of samples required by the Bayesian optimizer for each phase of the sparsity ratio inference algorithm (shown in Algorithm 1) to arrive at the final solution. We also compare our approach with recent work on automated model compression. For CIFAR-10 and ImageNet, we compare our results with AMC (he2018amc) and AutoSlim (liu2019autoslim), and for WikiText-2, we compare with (yu2019playing). Since AMC (he2018amc) and (yu2019playing) do not report actual runtime numbers on hardware, we report the corresponding FLOP improvements instead (values marked ${s}_{F}$). We also use FLOP reduction as a metric for LSTM block pruning, as described above. Overall, we obtain memory footprint reductions of up to $65.25\times $ and inference throughput improvements of up to $2.22\times $. On CIFAR-10, we notice relatively smaller throughput improvements on ResNet56 due to inter-layer filter dependencies that prevented us from performing aggressive network thinning (we describe network thinning in more detail in Section id1).

Figures 3 and 4 illustrate how a compressed model’s accuracy, inference performance, and memory footprint vary w.r.t. sparsity ratios for the CIFAR-10 and WikiText-2 tasks. All three of these functions are assumed to be unknown in our problem formulation, but we compute them explicitly here to better understand the quality of solutions produced by Condensa. For each figure, compression accuracies (shown in green) are obtained by running the L-C algorithm to convergence for $100$ sparsity ratios ranging from $0.9$ to $1.0$ (for pruning + quantization), and from $0$ to $1$ for the filter and block pruning schemes; collecting each such point requires between $30$ minutes to $8$ hours of time on a single NVIDIA Tesla V100 GPU. We are unable to show the full profile for ImageNet due to its significantly higher computation cost: collecting each data point for compression accuracy requires over 12 hours of compute time on a node with 8 Tesla V100 GPUs. Inference throughput, FLOPs, and memory footprint data is collected for each compressed model and depicted by red lines in the figures (right-hand-side y-axis). We also show direct compression (DC) accuracies in gray for comparison (DC is described in more detail in Section id1). In each figure, the sparsity ratio found by Condensa is shown as a black vertical dashed line.

We notice three important trends in Figures 3 and 4: (1) Condensa consistently finds solutions near the ‘knee‘ of the L-C accuracy curves, signifying the effectiveness of the ILS-UCB acquisition function; (2) local minima/maxima is avoided while optimizing the objective function, demonstrating that the UCB acquisition function for objective function optimization is working as expected, and (3) the knee of the D-C accuracy curves occur at significantly lower sparsity ratios; the L-C optimizer, on the other hand is able to recover accuracies up to much higher sparsity ratios.

In this section, we analyze how improving throughput using compression translates to execution time improvements for each layer on actual hardware. For this experiment, we focus on VGG-19 on CIFAR-10, since it has a relatively simple structure and is easy to analyze on a layer-by-layer basis. We use filter pruning with a sparsity ratio of $0.72$ (found by the Bayesian optimizer, as shown in Table 2) for this experiment. We report the mean runtimes over 100 executions as obtained using TensorRT.

Table 3 shows layer-by-layer compression ratios and mean runtimes collected over 100 runs for filter pruning. Here, the columns labeled R and C represent results for the reference, and filter-pruned models, respectively. We only show data for convolutional layers as they dominate computation time for this network. We observe large inference runtime speedups in later layers of the network. This result helps us gain more insight into how the L-C algorithm distributes global sparsity ratios to each layer, resulting in actual hardware speedups.

LAYER | SHAPE | TIME(ms) | SPEEDUP | ||
---|---|---|---|---|---|

R | C | R | C | ||

conv1 | 3 x 3 x 3 x 64 | 3 x 3 x 3 x 23 | 0.07 | 0.05 | 1.4$\times $ |

conv2 | 3 x 3 x 64 x 64 | 3 x 3 x 23 x 58 | 0.23 | 0.11 | 2.09$\times $ |

conv3 | 3 x 3 x 64 x 128 | 3 x 3 x 58 x 126 | 0.12 | 0.17 | 0.71$\times $ |

conv4 | 3 x 3 x 128 x 128 | 3 x 3 x 126 x 127 | 0.22 | 0.23 | 0.95$\times $ |

conv5 | 3 x 3 x 128 x 256 | 3 x 3 x 127 x 256 | 0.22 | 0.22 | 1$\times $ |

conv6 | 3 x 3 x 256 x 256 | 3 x 3 x 256 x 255 | 0.41 | 0.41 | 1$\times $ |

conv7 | 3 x 3 x 256 x 256 | 3 x 3 x 255 x 251 | 0.41 | 0.41 | 1$\times $ |

conv8 | 3 x 3 x 256 x 256 | 3 x 3 x 251 x 241 | 0.41 | 0.35 | 1.17$\times $ |

conv9 | 3 x 3 x 256 x 512 | 3 x 3 x 241 x 214 | 0.28 | 0.16 | 1.75$\times $ |

conv10 | 3 x 3 x 512 x 512 | 3 x 3 x 214 x 71 | 0.54 | 0.03 | 18$\times $ |

conv11 | 3 x 3 x 512 x 512 | 3 x 3 x 71 x 30 | 0.53 | 0.02 | 26.5$\times $ |

conv12 | 3 x 3 x 512 x 512 | 3 x 3 x 30 x 38 | 0.53 | 0.01 | 53$\times $ |

conv13 | 3 x 3 x 512 x 512 | 3 x 3 x 38 x 48 | 0.56 | 0.03 | 18.66$\times $ |

conv14 | 3 x 3 x 512 x 512 | 3 x 3 x 48 x 38 | 0.56 | 0.02 | 28$\times $ |

conv15 | 3 x 3 x 512 x 512 | 3 x 3 x 38 x 48 | 0.56 | 0.02 | 28$\times $ |

conv16 | 3 x 3 x 512 x 512 | 3 x 3 x 28 x 102 | 0.56 | 0.03 | 18.66$\times $ |

This paper has presented Condensa, which is a programming system for model compression. Condensa enables users to programmatically compose elementary schemes to build much more complex and practically interesting schemes, and includes a novel sample-efficient constrained Bayesian optimization-based algorithm for automatically inferring desirable sparsity ratios based on a user-provided objective function. On three real-world image classification and language modeling tasks, Condensa achieves memory footprint reductions of up to $65\times $ and runtime throughput improvements of up to $2.22\times $ using at most 10 samples per search. With the initial framework in place, we envision a number of directions to expand on Condensa’s capability. For example, we plan to augment automatic sparsity ratio inference with support for additional compression hyperparameters such as block sizes in block-sparsification (gray:2017), and data types for quantization. Our long-term goal is a framework that makes model compression easier, more flexible, and accessible to a wide range of users.