Repurposing Synthetic Data for Fine-grained Search Agent Supervision

Abstract

LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
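As a rough illustration of the reward described above, the sketch below gives a correct sample full reward and an incorrect sample a partial reward proportional to its entity match rate, then computes group-relative advantages. The abstract does not state the exact formula, so everything here is an assumption made for illustration: the scaling weight `alpha`, the case-insensitive substring matching, the mean/standard-deviation normalization, and all function names are hypothetical rather than the authors' implementation.

```python
from statistics import mean, pstdev


def entity_match_rate(trajectory_text, entities):
    """Fraction of ground-truth entities mentioned in the agent's reasoning.

    Assumption: matching is a case-insensitive substring check.
    """
    if not entities:
        return 0.0
    text = trajectory_text.lower()
    hits = sum(1 for entity in entities if entity.lower() in text)
    return hits / len(entities)


def entity_aware_reward(is_correct, match_rate, alpha=0.3):
    """Full reward for a correct answer, partial reward otherwise.

    Assumption: alpha is a hypothetical weight in [0, 1) that keeps every
    incorrect sample strictly below a correct one.
    """
    if is_correct:
        return 1.0
    return alpha * match_rate


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of three rollouts for the same question:
# a correct answer, a near-miss (half the entities found), a complete failure.
outcome_only = group_relative_advantages([1.0, 0.0, 0.0])
entity_aware = group_relative_advantages([
    entity_aware_reward(True, 1.0),
    entity_aware_reward(False, 0.5),
    entity_aware_reward(False, 0.0),
])
```

With outcome-only rewards the near-miss and the complete failure receive the same advantage; with the entity-aware reward the near-miss is ranked above the failure, which is the extra learning signal the abstract refers to.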
