Balanced Sparsity for Efficient DNN Inference on GPU

Abstract

In trained deep neural networks, unstructured pruning can reduce redundant weights to lower storage cost. However, it requires customized hardware to speed up practical inference. Another trend accelerates sparse model inference on general-purpose hardware by adopting coarse-grained sparsity to prune or regularize consecutive weights for efficient computation, but this method often sacrifices model accuracy. In this paper, we propose a novel fine-grained sparsity approach, balanced sparsity, to achieve high model accuracy efficiently on commercial hardware. Our approach adapts to the high-parallelism property of GPUs, showing incredible potential for sparsity in the wide deployment of deep learning services. Experimental results show that balanced sparsity achieves up to 3.1x practical speedup for model inference on GPU while retaining the same high model accuracy as fine-grained sparsity.
