LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models

Abstract

Large language models (LLMs) show excellent performance in difficult tasks, but they often require massive memory and computational resources. Reducing the parameter scale of LLMs has therefore become a research hotspot. In this study, we make an important observation that the multi-head self-attention (MHA) sub-layer of the Transformer exhibits a noticeable low-rank structure, while the feed-forward network (FFN) sub-layer does not. Accordingly, we design a mixed compression model that combines Low-Rank matrix approximation And structured Pruning (LoRAP). For the MHA sub-layer, we propose an input activation weighted singular value decomposition method to strengthen the low-rank characteristic. Furthermore, we discover that the weight matrices in the MHA sub-layer have different low-rank degrees. We therefore devise a novel parameter allocation scheme based on the discrepancy of low-rank degrees. For the FFN sub-layer, we propose a gradient-free structured channel pruning method. During pruning, we make the interesting finding that the least important 1% of parameters actually play a vital role in model performance. Extensive evaluations on zero-shot perplexity and zero-shot task classification indicate that our proposal is superior to previous structured compression rivals under multiple compression ratios.
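
To make the idea of an input activation weighted singular value decomposition concrete, the following is a minimal NumPy sketch. It is not the paper's implementation. The abstract does not specify the weighting, so the sketch assumes each input channel is scaled by the L2 norm of its calibration activations. The function names (`activation_weighted_low_rank`, `plain_low_rank`, `weighted_error`), the shapes, and the random data are all hypothetical.

```python
import numpy as np

def activation_weighted_low_rank(W, X, rank, eps=1e-8):
    """Rank-`rank` factorization W ~= A @ B that weights input channels.

    W: (d_out, d_in) weight matrix; X: (n_samples, d_in) calibration inputs.
    Assumption (not from the paper): the weight of input channel j is the
    L2 norm of column j of X.
    """
    scale = np.linalg.norm(X, axis=0) + eps           # (d_in,), strictly positive
    # Scale the columns of W, i.e. compute W @ diag(scale), then decompose.
    U, S, Vt = np.linalg.svd(W * scale, full_matrices=False)
    A = U[:, :rank] * S[:rank]                        # (d_out, rank)
    B = Vt[:rank] / scale                             # (rank, d_in), scaling undone
    return A, B

def plain_low_rank(W, rank):
    """Ordinary truncated SVD, used here only as a baseline."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]

def weighted_error(W, W_hat, X, eps=1e-8):
    """Frobenius norm of (W - W_hat) @ diag(scale), the objective assumed above."""
    scale = np.linalg.norm(X, axis=0) + eps
    return np.linalg.norm((W - W_hat) * scale)

# Hypothetical data: 32 input channels with very different activation magnitudes.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
X = rng.standard_normal((256, 32)) * rng.uniform(0.1, 5.0, size=32)

A, B = activation_weighted_low_rank(W, X, rank=4)
A0, B0 = plain_low_rank(W, rank=4)
err_weighted = weighted_error(W, A @ B, X)
err_plain = weighted_error(W, A0 @ B0, X)
```

Under this assumed objective, the weighted factorization can never do worse than the plain truncated SVD at the same rank, because the truncated SVD of the scaled matrix is the best rank-limited approximation in that norm (Eckart-Young). Storing `A` and `B` takes `rank * (d_out + d_in)` parameters instead of `d_out * d_in`.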
