Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers

Abstract

Recent research has generated hope that inference scaling could allow weaker language models to match or exceed the accuracy of stronger models, such as by repeatedly sampling solutions to a coding problem until one passes unit tests. The central thesis of this paper is that there is no free lunch for inference scaling: indefinite accuracy improvement through resampling can only be realized if the "verifier" (in this case, a set of unit tests) is perfect. When the verifier is imperfect, as it almost always is in domains such as reasoning or coding (for example, unit tests have imperfect coverage), there is a nonzero probability of false positives: incorrect solutions that pass the verifier. Resampling cannot decrease this probability, so it imposes an upper bound on the accuracy of resampling-based inference scaling even with an infinite compute budget. We find that there is a very strong correlation between the model's single-sample accuracy (i.e., accuracy without unit tests) and its false positive rate on the coding benchmarks HumanEval and MBPP, whose unit tests have limited coverage. Therefore, no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model (Fig. 1a). When we consider that false positives have a negative utility compared to abstaining from producing a solution, the inference scaling curve bends further downward. Empirically, we find that the optimal number of samples can be less than 10 under realistic assumptions (Fig. 1b). Finally, we show that beyond accuracy, false positives may have other undesirable qualities, such as poor adherence to coding style conventions.
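
The upper bound described above can be illustrated with a minimal sketch. It is not the paper's model or code: it assumes a single problem, independent and identically distributed samples, a verifier that never rejects a correct solution, and a procedure that returns the first sample that passes. The function names and the probabilities used are hypothetical, chosen only for illustration.

```python
def resampling_accuracy(p_correct: float, p_false_pos: float, k: int) -> float:
    """Probability that the first verifier-passing sample among k draws is correct.

    p_correct   -- per-sample probability of a correct solution (assumed to always pass)
    p_false_pos -- per-sample probability of an incorrect solution that still passes
    k           -- sampling budget
    """
    p_pass = p_correct + p_false_pos
    if p_pass == 0.0:
        return 0.0  # nothing ever passes, so nothing is ever returned
    p_some_pass = 1.0 - (1.0 - p_pass) ** k   # at least one of k samples passes
    return (p_correct / p_pass) * p_some_pass  # the first passing sample is correct


def accuracy_ceiling(p_correct: float, p_false_pos: float) -> float:
    """Limit of resampling_accuracy as k grows without bound."""
    p_pass = p_correct + p_false_pos
    return p_correct / p_pass if p_pass > 0.0 else 0.0


if __name__ == "__main__":
    # Hypothetical numbers: 20% correct, 5% false positives per sample.
    for k in (1, 2, 5, 10, 100):
        print(k, round(resampling_accuracy(0.20, 0.05, k), 4))
    print("ceiling:", accuracy_ceiling(0.20, 0.05))
```

Under these assumptions, accuracy rises with the budget but converges to p_correct / (p_correct + p_false_pos), which reaches 1 only when the false positive probability is zero. This single-problem sketch does not reproduce the finite optimal sample count reported in the abstract, which depends on the utility assigned to false positives and on the paper's own empirical setup.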
