Structured Pruning of Large Language Models

Abstract

Large language models have recently achieved state-of-the-art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a novel structured pruning approach based on low-rank factorization and augmented Lagrangian L0 norm regularization. Our structured approach achieves significant inference speedups while matching or outperforming our unstructured pruning baseline at various sparsity levels. We apply our method to state-of-the-art models on the enwiki8 dataset and obtain a 1.19 perplexity score with just 5M parameters, vastly outperforming a model of the same size trained from scratch. We also demonstrate that our method can be applied to language model fine-tuning by pruning the BERT model on several downstream classification benchmarks.
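To make the link between low-rank factorization and inference speedups concrete, here is a minimal sketch in plain Python. The abstract does not spell out the parameterization, so this assumes a weight matrix written as a product P diag(z) Q with a 0/1 gate vector z; the names `prune_factors`, `masked_product`, and `count_params` are hypothetical helpers for illustration only. It shows only the structural step (dropping gated-off rank-1 components leaves two smaller dense matrices) and does not implement the L0 regularization or the augmented Lagrangian training.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def prune_factors(p, q, z):
    """Drop the rank-1 components whose gate z[k] is 0.

    p is d_out x r, q is r x d_in, z is a 0/1 gate vector of length r.
    Returns the smaller dense factors (d_out x r', r' x d_in).
    """
    kept = [k for k, gate in enumerate(z) if gate != 0]
    p_small = [[row[k] for k in kept] for row in p]
    q_small = [q[k] for k in kept]
    return p_small, q_small


def masked_product(p, q, z):
    """Compute P diag(z) Q without removing anything (reference result)."""
    p_gated = [[value * gate for value, gate in zip(row, z)] for row in p]
    return matmul(p_gated, q)


def count_params(*matrices):
    """Count the stored entries across all given matrices."""
    return sum(len(row) for m in matrices for row in m)


random.seed(0)
d_out, d_in, r = 6, 5, 4
P = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
Q = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
z = [1, 0, 1, 0]  # assumed gate values; in training these would be learned

P_small, Q_small = prune_factors(P, Q, z)
dense = masked_product(P, Q, z)      # gated product at full size
compact = matmul(P_small, Q_small)   # same result from smaller matrices
```

Because the pruned factors are ordinary dense matrices, the saving comes from smaller matrix multiplications rather than from sparse kernels, which is the usual argument for structured over unstructured pruning.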
