Abstract
Group-based reinforcement learning (RL) has shown impressive results on complex reasoning and mathematical tasks. Yet, when used to train multi-turn, interactive LLM agents, these methods often suffer from structural blindness: the inability to exploit the underlying connectivity of the environment. This manifests in three critical challenges: (1) inefficient, unguided exploration, (2) imprecise credit assignment due to overlooking pivotal states, and (3) myopic planning caused by static reward discounting. We address these issues with Graph-Enhanced Policy Optimization (GEPO), which dynamically constructs a state-transition graph from agent experience and employs graph-theoretic centrality to provide three synergistic learning signals: (1) structured intrinsic rewards that guide exploration toward high-impact states, (2) a graph-enhanced advantage function for topology-aware credit assignment, and (3) a dynamic discount factor adapted to each state's strategic value. On ALFWorld, WebShop, and a proprietary Workbench benchmark, GEPO demonstrates strong performance, achieving absolute success rate gains of +4.1%, +5.3%, and +10.9% over competitive baselines. These results highlight that explicitly modeling environmental structure is a robust, generalizable strategy for advancing LLM agent training.
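To make the idea of centrality-derived learning signals concrete, the following is a minimal illustrative sketch, not the GEPO implementation. It builds a state-transition graph from observed trajectories, scores each state with normalized degree centrality, and maps that score to an intrinsic reward bonus and a state-dependent discount factor. The choice of degree centrality, the function names, and the parameters `beta`, `gamma_min`, and `gamma_max` are all assumptions made for illustration; the abstract does not specify the centrality measure or the functional forms. The graph-enhanced advantage function is omitted here.

```python
def build_graph(trajectories):
    """Build a directed state-transition graph from lists of visited states."""
    graph = {}
    for traj in trajectories:
        for state in traj:
            graph.setdefault(state, set())
        for s, s_next in zip(traj, traj[1:]):
            if s != s_next:  # ignore self-loops in this sketch
                graph[s].add(s_next)
    return graph


def degree_centrality(graph):
    """Normalized (in-degree + out-degree) centrality, in [0, 1].

    Assumption: degree centrality stands in for whichever
    graph-theoretic centrality the method actually uses.
    """
    n = len(graph)
    if n <= 1:
        return {s: 0.0 for s in graph}
    in_deg = {s: 0 for s in graph}
    for succs in graph.values():
        for t in succs:
            in_deg[t] += 1
    return {s: (len(graph[s]) + in_deg[s]) / (2 * (n - 1)) for s in graph}


def intrinsic_reward(centrality, state, beta=0.1):
    """Hypothetical exploration bonus proportional to state centrality."""
    return beta * centrality.get(state, 0.0)


def dynamic_discount(centrality, state, gamma_min=0.9, gamma_max=0.99):
    """Hypothetical discount: interpolate linearly between two bounds,
    so that more central states receive a longer effective horizon."""
    c = centrality.get(state, 0.0)
    return gamma_min + (gamma_max - gamma_min) * c


# Toy usage: two short trajectories that share a pivotal "hall" state.
trajectories = [
    ["start", "hall", "kitchen"],
    ["start", "hall", "bedroom"],
]
graph = build_graph(trajectories)
centrality = degree_centrality(graph)
most_central = max(centrality, key=centrality.get)
```

In this toy example, the shared `hall` state has the highest centrality, so under these assumed forms it would receive the largest exploration bonus and the largest discount factor.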