Abstract
Speculative decoding enhances the efficiency of large language models (LLMs) by using a draft model to propose tokens that a larger target model then reviews. However, drafting in speculative decoding involves slow autoregressive generation and spends the same amount of time on every token regardless of its importance. These two inefficiencies lead to suboptimal performance. To address this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a novel approach that employs two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models. The Horizontal Cascade allocates drafting time efficiently, with its optimality supported by our theoretical analysis. Combining both cascades, our CS. Drafting algorithm achieves up to 72 percent additional speedup over speculative decoding in our experiments while preserving the same output distribution.
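To make the "same output distribution" property concrete, the following minimal sketch illustrates the accept-or-resample rule commonly used in speculative decoding for a single token. It is not the CS. Drafting algorithm itself: the function names `speculative_step` and `empirical_distribution` are hypothetical, and the fixed toy distributions `p` (target) and `q` (draft) stand in for real model outputs.

```python
import random


def speculative_step(p, q, rng):
    """Sample one token using a draft proposal and a target review.

    p: target distribution over the vocabulary (list of probabilities).
    q: draft distribution over the same vocabulary.
    Returns (token, accepted); the token is distributed according to p.
    """
    vocab = list(range(len(p)))
    # The draft model proposes a token from q.
    x = rng.choices(vocab, weights=q, k=1)[0]
    # The target model accepts it with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual max(0, p - q);
    # rng.choices normalizes the weights internally.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual, k=1)[0], False


def empirical_distribution(p, q, n, seed=0):
    """Run n independent steps; return token frequencies and acceptance rate."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(n):
        token, ok = speculative_step(p, q, rng)
        counts[token] += 1
        accepted += ok
    return [c / n for c in counts], accepted / n
```

Under these assumptions, the token frequencies returned by `empirical_distribution` should approach `p` no matter how poor the draft distribution `q` is, while the acceptance rate approaches the sum over tokens of `min(p[i], q[i])`. A draft that agrees more closely with the target is accepted more often, which is where the speedup comes from.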