Search Self-play: Pushing the Frontier of Agent Capability without Supervision

  • 2025-10-21 17:19:35
  • Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Haotian Xu, Jiaqi Guo, Chutian Wang, Haonan Chen, Xiaoxi Jiang, Guanjun Jiang

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR depends heavily on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires massive human effort and hinders RL scaling, especially in agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of the generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM makes multi-turn search engine calls and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output correct answer predictions. To ensure that each generated search query has an accurate ground truth, we collect all the search results from the proposer's trajectory as external knowledge, then conduct retrieval-augmented generation (RAG) to test whether the proposed query can be correctly answered when all necessary search documents are provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents' performance uniformly on various benchmarks without any supervision under both from-scratch and continuous RL training setups. The code is at https://github.com/Alibaba-Quark/SSP.
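
The round structure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Proposal`, `ssp_round`, `exact_match`, and the stub agents are hypothetical, and the reward rule (solver gets 1 for a correct answer, proposer gets 1 when a valid task goes unsolved) is an assumption, since the abstract does not state the reward formulas.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Proposal:
    question: str         # deep search query written by the proposer
    answer: str           # ground truth claimed by the proposer
    documents: List[str]  # all search results from the proposer's trajectory


def exact_match(prediction: str, gold: str) -> bool:
    """Toy verifier (assumed): case-insensitive exact match."""
    return prediction.strip().lower() == gold.strip().lower()


def ssp_round(
    propose: Callable[[], Proposal],
    solve_with_docs: Callable[[str, List[str]], str],
    solve_with_search: Callable[[str], str],
    is_correct: Callable[[str, str], bool] = exact_match,
) -> Tuple[bool, float, float]:
    """Run one self-play round; return (valid, proposer_reward, solver_reward)."""
    task = propose()

    # RAG check: with the proposer's own documents supplied, the question
    # must be answerable; otherwise the task is treated as invalid.
    rag_prediction = solve_with_docs(task.question, task.documents)
    if not is_correct(rag_prediction, task.answer):
        return False, 0.0, 0.0

    # The solver then answers using its own multi-turn search.
    solved = is_correct(solve_with_search(task.question), task.answer)

    # Assumed zero-sum rewards, for illustration only.
    solver_reward = 1.0 if solved else 0.0
    proposer_reward = 1.0 - solver_reward
    return True, proposer_reward, solver_reward


# Toy stubs standing in for the LLM agent in its two roles.
def toy_propose() -> Proposal:
    return Proposal(
        question="In which city is museum Y, which holds painting X?",
        answer="City Z",
        documents=["Painting X is held in museum Y.", "Museum Y is in City Z."],
    )


def toy_rag(question: str, documents: List[str]) -> str:
    # Pretend the model reads the answer off the supplied documents.
    return "City Z" if any("City Z" in d for d in documents) else "unknown"


def toy_solver(question: str) -> str:
    # Pretend the solver's own search fails on this task.
    return "unknown"


result = ssp_round(toy_propose, toy_rag, toy_solver)
```

In this toy run the task passes the RAG check but the solver fails, so under the assumed rule the proposer is rewarded. A task that fails the RAG check earns neither role a reward, which reflects the abstract's requirement that proposed queries have a verifiable ground truth.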

 
