Group-in-Group Policy Optimization for LLM Agent Training

Abstract

Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to long-horizon LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: a critic-free design, low memory use, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) at the episode level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) at the step level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on two challenging agent benchmarks, ALFWorld and WebShop, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals and achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over the GRPO baseline, all while maintaining the same GPU memory overhead and identical LLM rollout, and incurring little to no additional time cost.
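The two-level structure described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the mean/standard-deviation normalization, the use of discounted returns as step-level scores, the discount `gamma`, the mixing weight `omega`, and all function names. Trajectories are modeled as lists of hypothetical `(state, action, reward)` tuples with hashable states.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Group-relative scores: (x - mean) / (std + eps). Assumed normalization."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def episode_advantages(trajectories):
    """Macro level: one relative score per complete trajectory (total return)."""
    totals = [sum(reward for _, _, reward in traj) for traj in trajectories]
    return normalize(totals)


def step_advantages(trajectories, gamma=0.95):
    """Micro level: group steps that share the same state (anchor state grouping)."""
    groups = defaultdict(list)  # state -> [(traj_idx, step_idx, discounted_return)]
    for i, traj in enumerate(trajectories):
        ret = 0.0
        # Walk backwards so each step sees the discounted return from that step on.
        for t in range(len(traj) - 1, -1, -1):
            state, _, reward = traj[t]
            ret = reward + gamma * ret
            groups[state].append((i, t, ret))

    adv = [[0.0] * len(traj) for traj in trajectories]
    for members in groups.values():
        if len(members) < 2:
            continue  # a state seen only once carries no relative signal
        scores = normalize([ret for _, _, ret in members])
        for (i, t, _), score in zip(members, scores):
            adv[i][t] = score
    return adv


def combined_advantages(trajectories, omega=1.0, gamma=0.95):
    """Assumed combination: macro score plus omega times micro score, per step."""
    macro = episode_advantages(trajectories)
    micro = step_advantages(trajectories, gamma)
    return [[macro[i] + omega * a for a in micro[i]]
            for i in range(len(trajectories))]


# Toy group of two rollouts that start from the same state "s0".
demo = [
    [("s0", "left", 0.0), ("s1", "go", 1.0)],   # succeeds
    [("s0", "right", 0.0), ("s2", "go", 0.0)],  # fails
]
demo_adv = combined_advantages(demo)
print(demo_adv)
```

In this toy group, the shared state "s0" forms a step-level group, so "left" receives a positive micro score and "right" a negative one, while the states "s1" and "s2" appear only once and contribute only the episode-level term. No extra rollouts or auxiliary models are needed: the grouping is built after the fact from trajectories that were already collected.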
