Abstract
Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods, such as EAGLE, use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. Interestingly, we found that the acceptance rate of draft tokens is also context-dependent. In this paper, building upon EAGLE, we propose EAGLE-2, which introduces a context-aware dynamic draft tree into draft modeling. This improvement leverages the fact that the draft model of EAGLE is well-calibrated: the confidence scores from the draft model approximate acceptance rates with small errors. We conducted extensive evaluations on three series of LLMs and six tasks, where EAGLE-2 achieves speedup ratios of 3.05x-4.26x, which is 20%-40% faster than EAGLE-1. EAGLE-2 also ensures that the distribution of the generated text remains unchanged, making it a lossless acceleration algorithm.
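To make the idea of a context-aware draft tree concrete, the following is a minimal sketch rather than the EAGLE-2 implementation. It assumes that the draft model's confidence for each token can stand in for its acceptance rate, so the product of confidences along a path estimates the chance that the whole path is accepted. The names `DraftNode`, `path_values`, and `select_top_m`, the toy confidences, and the choice to keep the `m` highest-scoring nodes are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DraftNode:
    token: str
    confidence: float       # draft-model probability of this token given its prefix
    parent: Optional[int]   # index of the parent node; None if it follows the context directly


def path_values(nodes: List[DraftNode]) -> List[float]:
    """Product of confidences from the context to each node.

    Assumes every parent appears before its children in `nodes`.
    """
    values: List[float] = []
    for node in nodes:
        parent_value = 1.0 if node.parent is None else values[node.parent]
        values.append(parent_value * node.confidence)
    return values


def select_top_m(nodes: List[DraftNode], m: int) -> List[int]:
    """Return the indices of the m nodes with the highest path value.

    A child's value never exceeds its parent's (confidences are <= 1), and
    ties are broken in favour of shallower nodes, so the selection is
    always a connected subtree.
    """
    values = path_values(nodes)
    depth: List[int] = []
    for node in nodes:
        depth.append(0 if node.parent is None else depth[node.parent] + 1)
    order = sorted(range(len(nodes)), key=lambda i: (-values[i], depth[i], i))
    return sorted(order[:m])


# Toy candidate tree with made-up confidences (not measured data).
nodes = [
    DraftNode("the", 0.60, None),  # 0
    DraftNode("a",   0.30, None),  # 1
    DraftNode("cat", 0.90, 0),     # 2
    DraftNode("dog", 0.05, 0),     # 3
    DraftNode("big", 0.50, 1),     # 4
    DraftNode("sat", 0.80, 2),     # 5
]

selected = select_top_m(nodes, 4)
print([nodes[i].token for i in selected])
```

In this toy input the budget of four nodes goes mostly to the confident branch "the cat sat" instead of being spread evenly by position, which is the behaviour a static tree cannot provide.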