Block Verification Accelerates Speculative Decoding

Abstract

Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens which are then verified in parallel by the target model, and provides a guarantee that the output is distributed identically to a sample from the target model. In prior works, draft verification is performed independently token-by-token. Surprisingly, we show that this approach is not optimal. We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly and provides additional wall-clock speedup. We prove that the proposed mechanism is optimal in the expected number of tokens produced each iteration and specifically is never worse than the standard token-level verification. Empirically, block verification provides modest but consistent wall-clock speedups over the standard token verification algorithm of 5%-8% in a range of tasks and datasets. Given that block verification does not increase code complexity, maintains the strong lossless guarantee of the standard speculative decoding verification algorithm, cannot deteriorate performance, and, in fact, consistently improves it, it can be used as a good default in speculative decoding implementations.
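
For orientation, the token-by-token baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the standard token-level verification rule as it is commonly described for speculative decoding (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(p - q, 0)). It is not the Block Verification algorithm, which the abstract does not specify. The function name `token_verify` and its argument layout are assumptions made for this example only.

```python
import random


def token_verify(draft_tokens, q_probs, p_probs, rng=random):
    """Token-by-token draft verification (baseline sketch, hypothetical API).

    draft_tokens: token ids proposed by the draft model (length gamma).
    q_probs:      gamma draft-model distributions, one per drafted position.
    p_probs:      gamma + 1 target-model distributions (the last one is used
                  for the bonus token when every drafted token is accepted).
    Returns the accepted prefix plus exactly one extra token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        # q[tok] > 0 because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(p - q, 0); rng.choices
        # normalizes the weights. Then stop verifying this block.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every drafted token was accepted: sample one bonus token from the target.
    bonus = p_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out
```

Each position is decided independently of the positions after it, which is the property the abstract identifies as suboptimal: a block-level rule makes the accept/reject decision jointly over the drafted block while keeping the output distribution equal to the target model's.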
