High-throughput Generative Inference of Large Language Models with a Single GPU

  • 2023-03-13 06:19:28
  • Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
  • 30

Abstract

The high computational and memory requirements of large language model (LLM) inference traditionally make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for efficient patterns to store and access tensors. FlexGen further compresses these weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen
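
As a rough illustration of what compressing tensors to 4 bits involves, the following is a minimal sketch of min-max quantization over one small group of values. It is not taken from FlexGen: the function names `quantize_4bit` and `dequantize_4bit`, the per-group min-max scheme, and the sample values are all assumptions made for illustration.

```python
from typing import List, Tuple

LEVELS = 15  # 4 bits give 16 integer codes, 0..15


def quantize_4bit(values: List[float]) -> Tuple[List[int], float, float]:
    """Map floats to integer codes in [0, 15] using the group's min and max.

    Hypothetical helper for illustration; not FlexGen's API.
    """
    lo, hi = min(values), max(values)
    # Guard against a constant group, where hi == lo would give a zero scale.
    scale = (hi - lo) / LEVELS if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_4bit(codes: List[int], lo: float, scale: float) -> List[float]:
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


# Illustrative values only; real weights or cache entries would be tensors.
group = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize_4bit(group)
restored = dequantize_4bit(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

In this sketch each value is stored as a code between 0 and 15 plus a shared offset and scale for the group, and the reconstruction error of any value is at most about half of `scale`.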

 
