A Programmable Approach to Model Compression

Abstract

Deep neural networks frequently contain far more weights, represented at a higher precision, than are required for the specific task which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both model size and inference time without appreciable loss in accuracy. Compressing models before they are deployed can therefore result in significantly more efficient systems. However, while the results are desirable, finding the best compression strategy for a given neural network, target platform, and optimization objective often requires extensive experimentation. Moreover, finding optimal hyperparameters for a given compression strategy typically results in even more expensive, frequently manual, trial-and-error exploration. In this paper, we introduce a programmable system for model compression called Condensa. Users programmatically compose simple operators, in Python, to build complex compression strategies. Given a strategy and a user-provided objective, such as minimization of running time, Condensa uses a novel sample-efficient constrained Bayesian optimization algorithm to automatically infer desirable sparsity ratios. Our experiments on three real-world image classification and language modeling tasks demonstrate memory footprint reductions of up to 65x and runtime throughput improvements of up to 2.22x using at most 10 samples per search. We have released a reference implementation of Condensa at https://github.com/NVlabs/condensa.
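
To make the idea of composing simple operators concrete, the following is a minimal sketch in plain Python. It is not Condensa's API: the names `prune`, `quantize`, `compose`, and `sparsity_of` are hypothetical, the weights are a toy list of floats, and uniform rounding stands in for whatever quantization scheme a real system would use. The `sparsity` argument is fixed by hand here; it corresponds to the kind of hyperparameter the abstract says is inferred automatically.

```python
from typing import Callable, List

Weights = List[float]
Operator = Callable[[Weights], Weights]


def prune(sparsity: float) -> Operator:
    """Hypothetical operator: zero out the smallest-magnitude weights."""
    def apply(w: Weights) -> Weights:
        k = int(sparsity * len(w))  # number of weights to zero
        # Indices ordered from smallest to largest magnitude.
        order = sorted(range(len(w)), key=lambda i: abs(w[i]))
        drop = set(order[:k])
        return [0.0 if i in drop else x for i, x in enumerate(w)]
    return apply


def quantize(step: float) -> Operator:
    """Hypothetical operator: round each weight to a multiple of `step`."""
    def apply(w: Weights) -> Weights:
        return [round(x / step) * step for x in w]
    return apply


def compose(*ops: Operator) -> Operator:
    """Chain operators left to right into one compression strategy."""
    def apply(w: Weights) -> Weights:
        for op in ops:
            w = op(w)
        return w
    return apply


def sparsity_of(w: Weights) -> float:
    """Fraction of weights that are exactly zero."""
    return sum(1 for x in w if x == 0.0) / len(w)


# Toy weights, chosen only for illustration.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6]

# A strategy: prune half of the weights, then quantize the survivors.
scheme = compose(prune(0.5), quantize(0.25))
compressed = scheme(weights)

print(compressed)
print(sparsity_of(compressed))
```

Because each operator is an ordinary function from weights to weights, a strategy built this way can be reordered or extended without changing the operators themselves, and a search procedure can vary parameters such as `sparsity` from outside the strategy.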
