PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning

Abstract

With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing the inference of Deep Neural Networks (DNNs) is still challenging considering the high computation and storage demands, specifically if real-time performance with high accuracy is needed. Weight pruning of DNNs has been proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained and accurate, but not hardware friendly; structured pruning is coarse-grained and hardware-efficient, but incurs higher accuracy loss. In this paper, we introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds, and is desirable across theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework to efficiently execute DNNs on mobile devices with the help of a novel model compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough architecture-aware compiler- and code generation-based optimizations (filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning). Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network, with speedups of up to 44.5x, 11.4x, and 7.1x, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved using mobile devices.
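The idea of fine-grained patterns inside coarse-grained structures can be illustrated with a minimal sketch. It is not the method described in the abstract: the abstract's pruning relies on an extended ADMM framework, whereas this sketch swaps in a one-shot magnitude projection. The 3x3 kernel size, the four-entry patterns in `PATTERNS`, and the helper names `retained_energy` and `prune_kernel` are all assumptions made for illustration.

```python
# Hypothetical pattern library: each pattern lists the (row, col) positions
# kept in a 3x3 kernel. These four patterns are illustrative assumptions,
# not the pattern set used by PatDNN.
PATTERNS = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(0, 1), (0, 2), (1, 1), (1, 2)],
    [(1, 0), (1, 1), (2, 0), (2, 1)],
    [(1, 1), (1, 2), (2, 1), (2, 2)],
]


def retained_energy(kernel, pattern):
    """Sum of squared weights that a pattern would keep."""
    return sum(kernel[r][c] ** 2 for r, c in pattern)


def prune_kernel(kernel, patterns=PATTERNS):
    """Pick the pattern that keeps the most energy and zero everything else.

    Returns (pattern_index, pruned_kernel). This is a simple magnitude
    projection, used here in place of the ADMM-based selection.
    """
    best = max(range(len(patterns)),
               key=lambda i: retained_energy(kernel, patterns[i]))
    keep = set(patterns[best])
    pruned = [[kernel[r][c] if (r, c) in keep else 0.0 for c in range(3)]
              for r in range(3)]
    return best, pruned


kernel = [
    [0.1, -0.2, 0.9],
    [0.0, 0.5, -0.8],
    [0.3, 0.1, 0.05],
]
idx, pruned = prune_kernel(kernel)
```

In this sketch every kernel ends up with the same number of non-zero weights arranged in one of a few known shapes, so kernels can be labelled by pattern index. That regularity is the kind of structure a compiler pass such as filter kernel reordering could plausibly exploit, although the abstract does not spell out how.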
