Abstract
As Large Language Models (LLMs) continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.
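
The layer-removal idea can be illustrated with a minimal sketch. The abstract does not give the BI formula, so the code below assumes that a layer's score is one minus the average cosine similarity between its input and output hidden states: a layer that barely changes its input scores near zero and is a candidate for removal. The function names (`block_influence`, `select_redundant_layers`) and the toy hidden states are hypothetical and serve only as illustration.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def block_influence(inputs, outputs):
    """Assumed BI-style score: 1 - mean cosine similarity over token positions.

    `inputs` and `outputs` are the hidden states entering and leaving one layer,
    each a list of per-token vectors.
    """
    sims = [cosine(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


def select_redundant_layers(hidden_states, n_remove):
    """Return the indices of the `n_remove` lowest-scoring layers, plus all scores.

    `hidden_states[i]` is the input to layer i and `hidden_states[i + 1]` its output,
    so a model with L layers supplies L + 1 entries.
    """
    scores = [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_remove]), scores


# Toy example with three "layers" acting on two token vectors (made-up values).
h0 = [[1.0, 0.0], [0.0, 1.0]]
h1 = [[2.0, 0.0], [0.0, 2.0]]  # layer 0 only rescales: direction unchanged
h2 = [[0.0, 2.0], [2.0, 0.0]]  # layer 1 rotates each vector by 90 degrees
h3 = [[0.0, 2.1], [2.1, 0.0]]  # layer 2 only rescales again

removable, scores = select_redundant_layers([h0, h1, h2, h3], n_remove=2)
print(removable, scores)  # layers 0 and 2 change the hidden states least
```

In a real model the hidden states would be collected by running a calibration set through the network and recording each layer's input and output; the selected layers would then be deleted from the layer stack.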