Abstract
Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3% and 90.0%, respectively, and demonstrate strong generalization to out-of-distribution datasets. The code is available at https://github.com/99hgz/WebSeer