SliceGPT: Compress Large Language Models by Deleting Rows and Columns

Abstract

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT, and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression
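
The abstract does not spell out the procedure, so the following is only a toy sketch of the general idea of replacing weight matrices with smaller dense ones. It assumes two stacked linear maps with nothing between them, a rotation obtained from an uncentered PCA of the hidden activations, and small made-up sizes. All names (`W1`, `W2`, `Q`, `k`) are hypothetical, and this is not the authors' implementation. It illustrates two points: an orthogonal rotation inserted between the two maps leaves the output unchanged, and after rotating, deleting columns of one matrix and the matching rows of the next yields smaller dense matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed): hidden width d = 16, keep k = 12 (25% of the width removed).
n, d_in, d, d_out, k = 256, 16, 16, 8, 12

X = rng.standard_normal((n, d_in))    # calibration inputs (synthetic)
W1 = rng.standard_normal((d_in, d))   # first linear map
W2 = rng.standard_normal((d, d_out))  # second linear map

# Hidden activations passed from the first map to the second.
H = X @ W1

# Uncentered PCA: eigenvectors of H^T H, reordered by descending eigenvalue.
# eigh returns eigenvalues in ascending order, so the columns are reversed.
_, eigvecs = np.linalg.eigh(H.T @ H)
Q = eigvecs[:, ::-1]                  # orthogonal: Q @ Q.T = I

# Invariance: (W1 Q)(Q^T W2) = W1 W2, so the output is unchanged.
W1_rot = W1 @ Q
W2_rot = Q.T @ W2
dense_out = X @ W1 @ W2
invariance_gap = np.max(np.abs(dense_out - X @ W1_rot @ W2_rot))

# Slicing: delete the last d - k columns of W1_rot and rows of W2_rot.
W1_sliced = W1_rot[:, :k]             # shape (d_in, k), still dense
W2_sliced = W2_rot[:k, :]             # shape (k, d_out), still dense
sliced_out = X @ W1_sliced @ W2_sliced

# Hidden-signal error: slicing after rotation vs. slicing without rotation.
pca_err = np.linalg.norm(H - (H @ Q[:, :k]) @ Q[:, :k].T)
naive_err = np.linalg.norm(H[:, k:])

print("invariance gap:", invariance_gap)
print("hidden error, rotated then sliced:", pca_err)
print("hidden error, sliced without rotation:", naive_err)
```

In this sketch the sliced matrices are ordinary dense arrays, so no sparse data structure is needed to run them. The rotation matters because projecting onto the top-`k` principal directions discards no more of the hidden signal than dropping `d - k` raw coordinates would.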
