SliceGPT: Compress Large Language Models by Deleting Rows and Columns

  • 2024-02-09 17:59:40
  • Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman

Abstract

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression
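The abstract names the idea without spelling out the mechanics, so the following is only a minimal NumPy sketch and not the authors' implementation. It assumes that "computational invariance" means a layer's output is unchanged when the signal is rotated by an orthogonal matrix and the next weight matrix is counter-rotated, and that slicing then deletes the least important columns of the rotated signal together with the matching rows of the weight matrix. The helper names (rms_norm, pca_rotation), the toy dimensions, and the synthetic data are all hypothetical.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned scale: divide each row by its root-mean-square.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def pca_rotation(x):
    # Orthogonal matrix whose columns are eigenvectors of x^T x,
    # ordered from largest to smallest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(x.T @ x)
    return eigvecs[:, np.argsort(eigvals)[::-1]]

rng = np.random.default_rng(0)
d, d_out, n_tokens = 8, 5, 64  # toy sizes, chosen only for illustration

# Synthetic signal whose energy lies mostly in 4 of the 8 directions.
basis = rng.normal(size=(4, d))
x = rng.normal(size=(n_tokens, 4)) @ basis + 0.01 * rng.normal(size=(n_tokens, d))
w = rng.normal(size=(d, d_out))

q = pca_rotation(x)

# Invariance: rotating the signal and counter-rotating the weights
# leaves the output unchanged, because an orthogonal q preserves row norms.
dense_out = rms_norm(x) @ w
rotated_out = rms_norm(x @ q) @ (q.T @ w)

# Slicing: keep the k leading directions, i.e. delete columns of the
# rotated signal and the matching rows of the rotated weight matrix.
k = 6  # removes 25% of the rows of w
x_sliced = (x @ q)[:, :k]
w_sliced = (q.T @ w)[:k, :]

# Compare the unnormalized linear map before and after slicing.
exact = x @ w
approx = x_sliced @ w_sliced
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)

print("rotation preserves output:", np.allclose(dense_out, rotated_out))
print("weight shape:", w.shape, "->", w_sliced.shape)
print("relative error after slicing:", rel_err)
```

The sliced weight matrix stays dense, just smaller, which is why no sparse data structure is needed in this toy setting. The error after slicing is small here only because the synthetic signal was built to be nearly low-rank; real activations need not behave this way.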

 
