PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning

Abstract

Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, coordination between the planner and the executor during problem-solving is a critical factor in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce AdaPlan, an adaptive global plan-based agent paradigm that aims to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model's ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model's planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing the closed-source GPT-4o by 3.60% and showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.
