EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Abstract

Auto-regressive decoding makes the inference of Large Language Models (LLMs) time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), for lossless acceleration. Unlike traditional speculative sampling methods, EAGLE operates the drafting process auto-regressively at the more regular (second-top-layer) feature level and addresses the sampling uncertainty issues in the next-feature prediction problems by integrating tokens from one time step ahead. The acceleration provided by EAGLE is lossless: it involves no fine-tuning of the target LLM, and the generated text maintains the same distribution as that of vanilla auto-regressive decoding. As of the submission of this paper, EAGLE is the fastest known framework within the speculative sampling family. On MT-bench, EAGLE is 3x faster than vanilla decoding, 2x faster than Lookahead, and 1.6x faster than Medusa. Using gpt-fast, EAGLE attains on average 160 tokens/s with LLaMA2-Chat 13B on a single RTX 3090 GPU, compared to 24 tokens/s of Huggingface's implementations.
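
The losslessness claim rests on the accept-or-resample rule common to the speculative sampling family. The sketch below illustrates that generic rule for a single token position. It is not EAGLE's feature-level drafting, and it is not taken from the authors' code. The function names and the toy distributions p (target model) and q (draft model) are assumptions made for illustration.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:  # p == q, so a rejection can never happen; fall back to p
        return list(p)
    return [r / total for r in raw]

def speculative_step(p, q, rng=random):
    """Draw a draft token from q, then accept or resample so the result follows p."""
    tokens = range(len(p))
    draft = rng.choices(tokens, weights=q)[0]
    # Accept the draft with probability min(1, p[draft] / q[draft]).
    accept_prob = min(1.0, p[draft] / q[draft])
    if rng.random() < accept_prob:
        return draft, True
    # On rejection, resample from the residual distribution.
    return rng.choices(tokens, weights=residual_distribution(p, q))[0], False

def exact_output_distribution(p, q):
    """Closed-form distribution of speculative_step's output, used to check losslessness."""
    # q[x] * min(1, p[x] / q[x]) simplifies to min(p[x], q[x]).
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accept_mass)
    res = residual_distribution(p, q)
    return [a + reject_prob * r for a, r in zip(accept_mass, res)]

# Hypothetical toy distributions over a 3-token vocabulary.
p = [0.5, 0.3, 0.2]  # target model
q = [0.2, 0.5, 0.3]  # draft model
out = exact_output_distribution(p, q)
```

Under these assumptions, `out` matches `p` up to floating-point error: the draft distribution affects only how often drafts are accepted, not the distribution of the emitted token.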
