MoDeGPT: Modular Decomposition for Large Language Model Compression

Abstract

Large Language Models (LLMs) have reshaped the landscape of artificial intelligence by demonstrating exceptional performance across various tasks. However, their substantial computational requirements make deployment challenging on devices with limited resources. Recently, compression methods using low-rank matrix techniques have shown promise, yet these often lead to degraded accuracy or introduce significant overhead in parameters and inference latency. This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework that does not need recovery fine-tuning while resolving the above drawbacks. MoDeGPT partitions the Transformer block into modules composed of matrix pairs and reduces the hidden dimensions by reconstructing the module-level outputs. MoDeGPT is developed based on a theoretical framework that utilizes three well-established matrix decomposition algorithms (Nyström approximation, CR decomposition, and SVD) and applies them to our redefined Transformer modules. Our comprehensive experiments show that MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods that rely on gradient information, and saves 98% of compute costs when compressing a 13B model. On Llama-2/3 and OPT models, MoDeGPT maintains 90-95% zero-shot performance with 25-30% compression rates. Moreover, the compression can be done on a single GPU within a few hours and increases the inference throughput by up to 46%.
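
To make the idea of shrinking the hidden dimension of a matrix pair concrete, the following is a minimal NumPy sketch using truncated SVD. It is an illustration only, not the MoDeGPT algorithm: the function name compress_pair_svd, the matrix shapes, and the random weights are all assumptions, and the sketch approximates the product of the pair directly rather than reconstructing module-level outputs as the abstract describes.

```python
import numpy as np


def compress_pair_svd(w_a, w_b, rank):
    """Replace the pair (w_a, w_b) by a smaller pair with inner dimension `rank`.

    Hypothetical helper for illustration. The returned product is the best
    rank-`rank` approximation of w_a @ w_b in Frobenius norm (Eckart-Young).
    """
    product = w_a @ w_b                              # shape (d_out, d_in)
    u, s, vt = np.linalg.svd(product, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])                       # split singular values evenly
    a_small = u[:, :rank] * sqrt_s                   # shape (d_out, rank)
    b_small = sqrt_s[:, None] * vt[:rank]            # shape (rank, d_in)
    return a_small, b_small


# Assumed toy sizes: a pair with hidden dimension 32, compressed to 16.
rng = np.random.default_rng(0)
w_a = rng.standard_normal((64, 32))
w_b = rng.standard_normal((32, 64))

a_small, b_small = compress_pair_svd(w_a, w_b, rank=16)

# Relative Frobenius error of the compressed product.
target = w_a @ w_b
rel_err = np.linalg.norm(target - a_small @ b_small) / np.linalg.norm(target)
print(f"hidden dim 32 -> 16, relative error {rel_err:.3f}")
```

Random Gaussian weights have a flat singular spectrum, so the error here is large; the sketch shows only the mechanics of replacing a matrix pair by a narrower one, and says nothing about the accuracy figures reported above.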
