Abstract
Large language models have recently achieved state-of-the-art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly, and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a generic, structured pruning approach by parameterizing each weight matrix using its low-rank factorization, and adaptively removing rank-1 components during training. On language modeling tasks, our structured approach outperforms other unstructured and block-structured pruning baselines at various compression levels, while achieving significant speedups during both training and inference. We also demonstrate that our method can be applied to pruning adaptive word embeddings in large language models, and to pruning the BERT model on several downstream fine-tuning classification benchmarks.
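To make the idea of removing rank-1 components concrete, the following is a minimal sketch rather than the method itself. It assumes a weight matrix is stored as W = P diag(g) Q, where each entry of g gates one rank-1 component; the names P, g, Q and the helper functions are hypothetical, and how the gates are learned during training is not covered here.

```python
# Illustrative sketch only (assumed parameterization, not the paper's code):
# W = sum_k g[k] * outer(P[:, k], Q[k, :]); a component with g[k] == 0 is pruned.

def reconstruct(P, g, Q):
    """Return W = P @ diag(g) @ Q for P (d_out x r), g (r,), Q (r x d_in)."""
    d_out, r, d_in = len(P), len(g), len(Q[0])
    return [[sum(P[i][k] * g[k] * Q[k][j] for k in range(r))
             for j in range(d_in)] for i in range(d_out)]

def prune(P, g, Q):
    """Physically drop the rank-1 components whose gate is zero."""
    keep = [k for k, gk in enumerate(g) if gk != 0]
    P_small = [[row[k] for k in keep] for row in P]
    g_small = [g[k] for k in keep]
    Q_small = [Q[k] for k in keep]
    return P_small, g_small, Q_small

# Hypothetical 2 x 4 weight matrix factored with rank 3.
P = [[1.0, 2.0, 0.5],
     [0.0, 1.0, 3.0]]
g = [1.0, 0.0, 1.0]  # assumed gate values; component 1 has been switched off
Q = [[1.0, 0.0, 2.0, 1.0],
     [4.0, 4.0, 4.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]

P_s, g_s, Q_s = prune(P, g, Q)
W_full = reconstruct(P, g, Q)
W_small = reconstruct(P_s, g_s, Q_s)

# Parameter counts of the two factor matrices before and after pruning.
params_before = len(P) * len(g) + len(g) * len(Q[0])
params_after = len(P_s) * len(g_s) + len(g_s) * len(Q_s[0])
```

In this sketch, dropping a gated-off component leaves the reconstructed matrix unchanged while shrinking both factors, which is why this kind of pruning is structured: the smaller factors remain ordinary dense matrices.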