High-throughput Generative Inference of Large Language Models with a Single GPU

Abstract

The high computational and memory requirements of large language model (LLM) inference traditionally make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for efficient patterns to store and access tensors. FlexGen further compresses these weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen
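
The abstract states that weights and the attention cache are compressed to 4 bits but does not describe the scheme. Purely as an illustration, the following is a minimal sketch of group-wise min-max quantization, one common way to store floating-point tensors in 4 bits. The function names `quantize_4bit` and `dequantize_4bit`, the group size, and the rounding rule are hypothetical choices made for this sketch and are not taken from the FlexGen code base.

```python
from typing import List, Tuple

# One quantized group: (minimum value, step size, list of 4-bit codes in 0..15).
Group = Tuple[float, float, List[int]]


def quantize_4bit(values: List[float], group_size: int = 4) -> List[Group]:
    """Quantize a flat list of floats to 4-bit codes, one (min, scale) per group.

    Assumption: plain min-max quantization with 16 levels per group.
    """
    groups: List[Group] = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        # 16 levels give 15 steps; fall back to 1.0 when the group is constant.
        scale = (hi - lo) / 15 or 1.0
        # Clamp to guard against floating-point rounding just outside 0..15.
        codes = [min(15, max(0, round((v - lo) / scale))) for v in chunk]
        groups.append((lo, scale, codes))
    return groups


def dequantize_4bit(groups: List[Group]) -> List[float]:
    """Reconstruct approximate floats from the quantized groups."""
    restored: List[float] = []
    for lo, scale, codes in groups:
        restored.extend(lo + code * scale for code in codes)
    return restored


if __name__ == "__main__":
    weights = [0.0, 0.5, 1.0, -1.0, 2.0, 2.0, 2.0]
    packed = quantize_4bit(weights, group_size=4)
    approx = dequantize_4bit(packed)
    worst = max(abs(a - b) for a, b in zip(weights, approx))
    print("largest reconstruction error:", worst)
```

In this sketch each value costs 4 bits plus a small per-group overhead for the minimum and the step size, so larger groups save more memory at the price of a coarser step. The reconstruction error of any value is at most half the step size of its group.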
