Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Abstract

Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage, the fraction of problems solved by any attempt, scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43%, which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. Finally, we find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers. When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.
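
To make the coverage metric concrete, the following is a minimal sketch of how coverage at k samples could be estimated from n recorded attempts per problem. It uses the standard unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), averaged over problems; the abstract does not say which estimator the authors use, so this choice is an assumption, and the function name `coverage_at_k` and the example counts are hypothetical.

```python
from math import comb


def coverage_at_k(correct_counts, n, k):
    """Estimate coverage at k samples (illustrative sketch, not the paper's code).

    correct_counts: for each problem, how many of its n samples were correct.
    n: number of samples drawn per problem.
    k: sample budget at which to estimate coverage (1 <= k <= n).
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    total = 0.0
    for c in correct_counts:
        if not 0 <= c <= n:
            raise ValueError("each correct count must lie in [0, n]")
        # Probability that a random size-k subset of the n samples
        # contains at least one correct sample.
        total += 1.0 - comb(n - c, k) / comb(n, k)
    return total / len(correct_counts)


# Hypothetical data: 4 problems, 10 samples each.
counts = [0, 1, 5, 10]
coverage_1 = coverage_at_k(counts, n=10, k=1)
coverage_10 = coverage_at_k(counts, n=10, k=10)
```

With these made-up counts, a problem that is solved in only 1 of 10 samples contributes little to coverage at k = 1 but counts as fully covered at k = 10, which is the effect repeated sampling relies on. A problem with no correct samples never contributes, whatever the budget.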
