START: Self-taught Reasoner with Tools

Abstract

Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the use of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies because they rely solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by an LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.
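
To make the Hint-infer idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say where in the generation the hint is placed, so this sketch assumes it is appended at the point where the model would otherwise stop, after which generation resumes. The names `hint_infer`, `generate`, `max_hints`, and `toy_generate` are hypothetical; `generate` stands in for any model call that maps the text so far to a continuation. Only the hint string itself is taken from the abstract.

```python
from typing import Callable

# Hint text quoted in the abstract.
HINT = "Wait, maybe using Python here is a good idea."


def hint_infer(
    generate: Callable[[str], str],
    prompt: str,
    hint: str = HINT,
    max_hints: int = 1,
) -> str:
    """Generate a reasoning trace, appending `hint` each time generation stops.

    `generate` is a hypothetical stand-in for a model call: it receives the
    text so far and returns a continuation. Inserting the hint at the
    stopping point is an assumption of this sketch.
    """
    text = prompt + generate(prompt)
    for _ in range(max_hints):
        # Append the hint, then let the model continue from the extended text.
        text += "\n" + hint + "\n"
        text += generate(text)
    return text


def toy_generate(context: str) -> str:
    """Toy model used only to make the sketch runnable."""
    if context.rstrip().endswith(HINT):
        # After a hint, the toy model "decides" to write code.
        return "Let me verify with code: print(2 + 2)"
    return "I think the answer is 4."


trace = hint_infer(toy_generate, "Q: What is 2 + 2?\n")
```

Under these assumptions, raising `max_hints` lengthens the trace one hint at a time, which is one way to read the abstract's remark that Hint-infer can act as a sequential test-time scaling method.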
