Group-in-Group Policy Optimization for LLM Agent Training

Abstract

Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks, using Qwen2.5-1.5B/3B/7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals, achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over GRPO, and obtains superior performance on QA tasks (42.1% on 3B and 47.2% on 7B): all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.
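
The two-level structure described above can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract leaves several details open, so the following are assumptions: each step carries a precomputed scalar return, both levels normalize as (value - group mean) / (group std + eps), and the two advantages are combined as a weighted sum. The names `group_relative`, `two_level_advantages`, `weight`, and the trajectory dictionary layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative(values, eps=1e-8):
    """Score each value relative to its own group: (v - mean) / (std + eps).

    The normalization form is an assumption; the abstract only says
    "relative advantage".
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def two_level_advantages(trajectories, weight=1.0):
    """Combine episode-level and step-level relative advantages.

    trajectories: list of dicts (hypothetical layout) with
        "episode_return": float, total return of the trajectory
        "steps": list of (state, action, step_return) tuples, where
                 state is hashable and step_return is a precomputed scalar
    weight: assumed coefficient on the step-level term.
    Returns one list of per-step advantages per trajectory.
    """
    # Episode level: compare complete trajectories within the group.
    episode_adv = group_relative([t["episode_return"] for t in trajectories])

    # Step level: retroactively group steps that start from the same
    # environment state, across all trajectories (anchor state grouping).
    groups = defaultdict(list)
    for i, traj in enumerate(trajectories):
        for j, (state, _action, step_return) in enumerate(traj["steps"]):
            groups[state].append((i, j, step_return))

    step_adv = {}
    for members in groups.values():
        # A state seen only once forms a singleton group and scores 0.
        scores = group_relative([r for _, _, r in members])
        for (i, j, _), score in zip(members, scores):
            step_adv[(i, j)] = score

    return [
        [episode_adv[i] + weight * step_adv[(i, j)]
         for j in range(len(traj["steps"]))]
        for i, traj in enumerate(trajectories)
    ]


# Toy example: both trajectories visit state "s0" but take different actions.
toy = [
    {"episode_return": 1.0, "steps": [("s0", "open", 1.0), ("s1", "take", 1.0)]},
    {"episode_return": 0.0, "steps": [("s0", "wait", 0.0), ("s2", "look", 0.0)]},
]
advantages = two_level_advantages(toy)
```

In the toy example, the two actions taken from the shared state "s0" receive different step-level scores, while the steps from "s1" and "s2" fall into singleton groups and are credited only through the episode-level term. No extra rollouts or auxiliary models are involved, since the step-level groups are built from the trajectories already collected.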
