EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Abstract

The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.
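
For readers unfamiliar with the vanilla speculative sampling baseline named in the abstract, the following is a minimal sketch of its standard accept/reject rule for a single drafted token. It is not EAGLE-3's implementation. The function name `speculative_step`, its argument layout, and the toy distributions in the usage note are illustrative assumptions.

```python
import random


def speculative_step(p, q, draft_token, rng):
    """One accept/reject step of vanilla speculative sampling (illustrative).

    p: target-model distribution over the vocabulary (list of probabilities)
    q: draft-model distribution over the same vocabulary
    draft_token: index that the draft model sampled from q
    rng: a random.Random instance
    Returns (token, accepted).
    """
    # Accept the drafted token with probability min(1, p(x) / q(x)).
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True

    # On rejection, resample from the residual max(0, p - q), renormalized.
    # A rejection implies p(x) < q(x), so the residual has positive mass.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    weights = [r / total for r in residual]
    token = rng.choices(range(len(p)), weights=weights, k=1)[0]
    return token, False
```

Under this rule the emitted token follows the target distribution `p` exactly, whatever `q` is; a draft distribution closer to `p` only raises the acceptance rate. For example, with the assumed toy values `p = [0.7, 0.2, 0.1]` and `q = [0.2, 0.5, 0.3]`, repeatedly drawing `draft_token` from `q` and calling `speculative_step` yields tokens whose empirical frequencies approach `p`.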
