Abstract
Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls and can fail when the correct answer is rare. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models. CGES adaptively halts sampling once the posterior mass of a candidate exceeds a threshold. We provide theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals. Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69 percent (for example, from 16.0 to 4.9) while matching the accuracy of self-consistency within 0.06 percentage points.
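The stopping rule described above can be illustrated with a minimal sketch. The abstract does not specify the exact posterior update, so the likelihood model here is an assumption: each sample with confidence c contributes a factor c to the candidate it names and 1 - c to every other candidate, and a catch-all "unseen answer" hypothesis is added so that a single sample cannot reach posterior mass 1. The names `cges_sketch`, `sample_fn`, `threshold`, and `max_calls` are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Hashable, Tuple

_UNSEEN = object()  # hypothetical catch-all for answers not sampled yet


def cges_sketch(
    sample_fn: Callable[[], Tuple[Hashable, float]],
    threshold: float = 0.95,
    max_calls: int = 16,
    eps: float = 1e-6,
) -> Tuple[Hashable, int, float]:
    """Sample until one observed answer's posterior mass reaches `threshold`.

    `sample_fn` returns (answer, confidence in [0, 1]) for one model call.
    Returns (best answer, number of calls used, posterior mass of best answer).
    """
    if max_calls < 1:
        raise ValueError("max_calls must be at least 1")

    observations = []
    best, best_mass = None, 0.0
    for calls in range(1, max_calls + 1):
        answer, conf = sample_fn()
        conf = min(max(conf, eps), 1.0 - eps)  # keep the logs finite
        observations.append((answer, conf))

        # Assumed likelihood: c if the sample names the candidate, else 1 - c.
        candidates = {a for a, _ in observations}
        log_scores = {
            k: sum(math.log(c) if a == k else math.log(1.0 - c)
                   for a, c in observations)
            for k in candidates
        }
        # The catch-all hypothesis disagrees with every sample seen so far.
        log_scores[_UNSEEN] = sum(math.log(1.0 - c) for _, c in observations)

        # Normalize in log space (uniform prior over hypotheses, an assumption).
        top = max(log_scores.values())
        z = sum(math.exp(s - top) for s in log_scores.values())
        posterior = {k: math.exp(s - top) / z for k, s in log_scores.items()}

        # Only answers that were actually sampled can be returned.
        best = max(candidates, key=lambda k: posterior[k])
        best_mass = posterior[best]
        if best_mass >= threshold:
            return best, calls, best_mass

    # Budget exhausted: fall back to the highest-posterior observed answer.
    return best, max_calls, best_mass
```

Under these assumptions, two agreeing samples with confidence 0.9 each give a posterior mass of 0.81 / 0.82 for that answer, which exceeds a threshold of 0.95, so sampling stops after two calls rather than using the full budget. When the samples disagree or the confidences are low, the loop runs to `max_calls` and returns the highest-posterior observed answer.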