MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Abstract

Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but it is challenging to serve long-context requests with low latency and high throughput. Speculative decoding (SD) is a widely used technique to reduce latency without sacrificing performance but the conventional wisdom suggests that its efficacy is limited to small batch sizes. In MagicDec, we show that surprisingly SD can achieve speedup even for a high throughput inference regime for moderate to long sequences. More interestingly, an intelligent drafting strategy can achieve better speedup with increasing batch size based on our rigorous analysis. MagicDec first identifies the bottleneck shifts with increasing batch size and sequence length, and uses these insights to deploy speculative decoding more effectively for high throughput inference. Then, it leverages draft models with sparse KV cache to address the KV bottleneck that scales with both sequence length and batch size. This finding underscores the broad applicability of speculative decoding in long-context serving, as it can enhance throughput and reduce latency without compromising accuracy. For moderate to long sequences, we demonstrate up to 2x speedup for LLaMA-2-7B-32K and 1.84x speedup for LLaMA-3.1-8B when serving batch sizes ranging from 32 to 256 on 8 NVIDIA A100 GPUs. The code is available at https://github.com/Infini-AI-Lab/MagicDec/.
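
The claim that the KV bottleneck scales with both sequence length and batch size can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the MagicDec code base. It assumes a LLaMA-2-7B-like shape (32 layers, 32 KV heads, head dimension 128), 16-bit storage, and a rounded count of 7 billion parameters, and the function names are hypothetical. It compares the bytes of KV cache that a decoding step must read against the bytes of model weights.

```python
# Illustrative arithmetic only: the model shape and the rounded parameter
# count are assumptions, not values reported in the abstract.

def kv_cache_bytes(batch_size, seq_len, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed to store keys and values for every cached token."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


def weight_bytes(num_params=7_000_000_000, bytes_per_elem=2):
    """Bytes needed to store the model weights (parameter count is rounded)."""
    return num_params * bytes_per_elem


def kv_to_weight_ratio(batch_size, seq_len):
    """How many times larger the KV cache is than the weights."""
    return kv_cache_bytes(batch_size, seq_len) / weight_bytes()


if __name__ == "__main__":
    for batch_size in (1, 32, 256):
        for seq_len in (1024, 32768):
            ratio = kv_to_weight_ratio(batch_size, seq_len)
            print(f"batch={batch_size:4d} seq_len={seq_len:6d} "
                  f"KV/weights={ratio:8.2f}")
```

Under these assumptions the KV cache is a small fraction of the weights for a single short sequence, but it outgrows them by a wide margin once batch size and sequence length increase together. That is the regime in which a draft model with a sparse KV cache has much less memory to read per step than the target model.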
