Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning

Abstract

Pretrained large-scale language models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. However, the limited weight storage and computational speed of hardware platforms have impeded the adoption of pretrained models, especially in the era of edge computing. In this work, we propose an efficient transformer-based large-scale language representation using hardware-friendly block-structured pruning. We incorporate the reweighted group Lasso into block-structured pruning for optimization. Besides significantly reducing weight storage and computation, the proposed approach achieves high compression rates. Experimental results on different models (BERT, RoBERTa, and DistilBERT) on the General Language Understanding Evaluation (GLUE) benchmark tasks show that we achieve up to 5.0x compression with zero or minor accuracy degradation on certain task(s). Our proposed method is also orthogonal to existing compact pretrained language models such as DistilBERT, which uses knowledge distillation, since a further 1.79x average compression rate can be achieved on top of DistilBERT with zero or minor accuracy degradation. The final compressed model is suitable for deployment on resource-constrained edge devices.
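
The abstract names the technique but does not spell out the procedure, so the following is only a rough sketch of what block-structured pruning with a reweighted group Lasso penalty could look like on a single weight matrix. It assumes the matrix is split column-wise into equal blocks, that each row segment inside a block forms one group, and that the reweighting factor is the inverse of the previous group norm; these choices, and all function names, are hypothetical and are not taken from the paper.

```python
import numpy as np


def block_row_norms(W, num_blocks):
    """Return the l2 norm of every row segment, shape (rows, num_blocks).

    Assumption: W is split column-wise into num_blocks equal-width blocks.
    """
    rows, cols = W.shape
    assert cols % num_blocks == 0, "columns must divide evenly into blocks"
    segments = W.reshape(rows, num_blocks, cols // num_blocks)
    return np.linalg.norm(segments, axis=2)


def reweighted_group_lasso_penalty(W, num_blocks, prev_norms, eps=1e-3):
    """Group Lasso penalty where each group is scaled by 1 / (previous norm + eps).

    Groups that were already small receive a larger weight, which pushes
    them further toward zero in the next training round (assumed form).
    """
    norms = block_row_norms(W, num_blocks)
    weights = 1.0 / (prev_norms + eps)
    return float(np.sum(weights * norms))


def prune_block_rows(W, num_blocks, keep_ratio):
    """Zero out the lowest-norm row segments inside each block.

    Every block keeps the same number of row segments, so the result
    has a regular structure. Returns (pruned matrix, boolean group mask).
    """
    rows, cols = W.shape
    norms = block_row_norms(W, num_blocks)
    keep = max(1, int(round(keep_ratio * rows)))
    order = np.argsort(-norms, axis=0)  # descending norm within each block
    mask = np.zeros(norms.shape, dtype=bool)
    mask[order[:keep, :], np.arange(num_blocks)] = True
    full_mask = np.repeat(mask, cols // num_blocks, axis=1)
    return W * full_mask, mask


# Usage: prune a random 8x8 matrix split into 2 blocks, keeping half the groups.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
penalty = reweighted_group_lasso_penalty(W, 2, block_row_norms(W, 2))
W_pruned, group_mask = prune_block_rows(W, 2, keep_ratio=0.5)
compression_rate = W.size / np.count_nonzero(W_pruned)
```

In this sketch the penalty would be added to the task loss during training, and the pruning step would be applied afterwards, followed by fine-tuning of the remaining weights. Keeping the same number of groups per block is one way to obtain the regular sparsity pattern that hardware can exploit.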
