Abstract
Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library, which achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in the QoQ algorithm, we introduce progressive quantization, which enables low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, compared to TensorRT-LLM, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and that of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S. Remarkably, QServe on an L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x. Code is available at https://github.com/mit-han-lab/qserve.
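The idea behind progressive quantization can be illustrated with a small sketch. The abstract does not spell out the procedure, so the code below makes its own assumptions: weights are first quantized to INT8 with one floating-point scale per channel, and the INT8 values are then quantized to unsigned 4-bit codes with an integer scale and zero point per group, so that the 4-bit to 8-bit dequantization needs only integer arithmetic. The function names, the group size, and the narrowed INT8 range of ±119 (chosen here so the level-2 rounding error cannot push a dequantized value outside INT8) are all hypothetical and are not taken from the QServe code base.

```python
import math

# Assumption: narrowed INT8 range that leaves headroom for level-2 rounding error.
INT8_LIMIT = 119


def quantize_progressive(weights, group_size=4):
    """Two-level quantization of one weight channel (illustrative sketch only)."""
    # Level 1: symmetric INT8 quantization with one floating-point scale.
    max_abs = max(abs(w) for w in weights) or 1.0
    s1 = max_abs / INT8_LIMIT
    q8 = [max(-INT8_LIMIT, min(INT8_LIMIT, round(w / s1))) for w in weights]

    # Level 2: asymmetric UINT4 quantization per group, with an integer scale
    # and an integer zero point. The range is widened to include 0 so the
    # zero point always fits in 4 bits.
    groups = []
    for start in range(0, len(q8), group_size):
        chunk = q8[start:start + group_size]
        lo, hi = min(min(chunk), 0), max(max(chunk), 0)
        s2 = max(1, math.ceil((hi - lo) / 15))
        zero = max(0, min(15, round(-lo / s2)))
        q4 = [max(0, min(15, round(v / s2) + zero)) for v in chunk]
        groups.append((q4, s2, zero))
    return groups, s1


def dequantize_to_int8(groups):
    """Recover INT8 values from 4-bit codes using integer subtract and multiply."""
    return [(q - zero) * s2 for q4, s2, zero in groups for q in q4]


weights = [0.12, -0.5, 0.33, 0.9, -0.07, 0.01, 0.25, -0.81]
groups, s1 = quantize_progressive(weights)
int8_values = dequantize_to_int8(groups)
approx = [v * s1 for v in int8_values]  # floating-point scale applied last
```

In this sketch the only floating-point step is the final multiplication by `s1`, which a W4A8 GEMM could defer until after the integer matrix multiplication; the per-group work is limited to an integer subtract and multiply.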