Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

Abstract

Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning (ZeroTIR), training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is to demonstrate that, as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations between the number of training steps and the spontaneous code execution frequency, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between the computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show that ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at https://github.com/Anonymize-Author/AgentRL.
