s1: Simple test-time scaling

Abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
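
The control flow of budget forcing described above can be illustrated with a minimal sketch. This is not the released implementation: the function `budget_force`, the `step` callback, the `<end_think>` delimiter, and the toy model are all hypothetical names introduced here for illustration, and a real system would operate on tokenizer IDs from an actual language model rather than strings.

```python
from typing import Callable, List

END = "<end_think>"  # hypothetical end-of-thinking delimiter (assumption)
WAIT = "Wait"        # string appended to lengthen thinking, as in the abstract


def budget_force(
    step: Callable[[List[str]], str],
    prompt: List[str],
    max_thinking_tokens: int,
    num_waits: int,
) -> List[str]:
    """Return a thinking trace produced under a simple budget-forcing policy.

    step: hypothetical next-token function (context -> next token).
    max_thinking_tokens: upper bound; thinking is cut off once it is reached.
    num_waits: how many times an attempted end of thinking is replaced by "Wait".
    """
    trace: List[str] = []
    waits_left = num_waits
    while len(trace) < max_thinking_tokens:
        token = step(prompt + trace)
        if token == END:
            if waits_left == 0:
                break  # the model may stop thinking now
            # Suppress the end of thinking and nudge the model to continue.
            waits_left -= 1
            trace.append(WAIT)
            continue
        trace.append(token)
    # Close the thinking phase, whether the model ended it or the budget ran out.
    trace.append(END)
    return trace


def toy_step(context: List[str]) -> str:
    """Toy stand-in for a model: emits two "step" tokens, then tries to stop."""
    recent = 0
    for tok in reversed(context):
        if tok != "step":
            break
        recent += 1
    return "step" if recent < 2 else END


if __name__ == "__main__":
    print(budget_force(toy_step, ["Q"], max_thinking_tokens=100, num_waits=0))
    print(budget_force(toy_step, ["Q"], max_thinking_tokens=100, num_waits=1))
    print(budget_force(toy_step, ["Q"], max_thinking_tokens=1, num_waits=0))
```

In this sketch, raising `num_waits` extends the trace each time the toy model tries to stop, while lowering `max_thinking_tokens` truncates it; these two knobs correspond to the lengthening and terminating interventions named in the abstract.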
