e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

Abstract

Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test-time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging "negative" gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chain additional asymmetries; and (3) coupling task difficulty with training token budget during training via a specifically designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.
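
For readers unfamiliar with the pass@1 and pass@k metrics mentioned above, the following is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed from n sampled solutions of which c are correct. The abstract does not state which estimator the authors use, so the function name `pass_at_k` and its parameters are assumptions for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total number of sampled solutions for a problem (hypothetical input)
    c: number of those samples that are correct
    k: number of attempts allowed, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # necessarily contains at least one correct sample.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets that contain only incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    for k in (1, 4, 8):
        print(k, pass_at_k(n=16, c=4, k=k))
```

With k = 1 the estimator reduces to the fraction of correct samples, c / n, which is why pass@1 is often read as average single-attempt accuracy, while larger k rewards a model for covering a correct answer anywhere among its samples.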
