SSRL: Self-Search Reinforcement Learning

Abstract

We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
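
For readers unfamiliar with the pass@k metric mentioned above, the following is a minimal sketch of how it is commonly estimated from repeated sampling. The abstract does not state which estimator the authors use; this sketch assumes the widely used unbiased combinatorial estimator, and the function name `pass_at_k` and its parameters are illustrative rather than taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one question (assumed estimator, not from the paper).

    n: total number of sampled answers
    c: number of those samples that are correct
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct answer.
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this value over all benchmark questions gives a dataset-level pass@k. Under this estimator the value cannot decrease as k grows, which is one way to read "scaling with the inference budget".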
