Online Process Reward Learning for Agentic Reinforcement Learning

Abstract

Large language models (LLMs) are increasingly trained with reinforcement learning (RL) as autonomous agents that reason and act over long horizons in interactive environments. However, sparse and sometimes unverifiable rewards make temporal credit assignment extremely challenging. Recent work attempts to integrate process supervision into agent learning but suffers from biased annotation, reward hacking, high variance from overly fine-grained signals, or failures when state overlap is rare. We therefore introduce Online Process Reward Learning (OPRL), a general credit-assignment strategy for agentic RL that integrates seamlessly with standard on-policy algorithms without relying on additional rollouts or explicit step labels. In OPRL, we optimize an implicit process reward model (PRM) alternately with the agent's policy to transform trajectory preferences into implicit step rewards through a trajectory-based DPO objective. These step rewards are then used to compute step-level advantages, which are combined with episode-level advantages from outcome rewards for the policy update, creating a self-reinforcing loop. Theoretical findings guarantee that the learned step rewards are consistent with trajectory preferences and act as potential-based shaping rewards, providing bounded gradients to stabilize training. Empirically, we evaluate OPRL on three distinct agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA. Crucially, OPRL shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample efficiency and lower variance during training. Further analysis also demonstrates that OPRL explores efficiently, using fewer actions, underscoring its potential for agentic learning in real-world scenarios.
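The loop described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the implicit step reward is taken to be a scaled log-ratio between the PRM and a fixed reference model (a form common in DPO-style implicit reward models), both advantage terms are obtained by simple mean/std normalization, and they are mixed with a hypothetical weight `alpha`. The function names, `beta`, and `alpha` are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def implicit_step_rewards(logp_prm, logp_ref, beta=0.1):
    """Assumed form: r_t = beta * (log pi_prm(a_t|s_t) - log pi_ref(a_t|s_t))."""
    return [beta * (lp - lr) for lp, lr in zip(logp_prm, logp_ref)]


def trajectory_dpo_loss(rewards_preferred, rewards_rejected):
    """Trajectory-level DPO loss: -log sigmoid(sum r(preferred) - sum r(rejected))."""
    margin = sum(rewards_preferred) - sum(rewards_rejected)
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def normalize(values, eps=1e-8):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combined_advantages(outcome_rewards, step_rewards, alpha=1.0):
    """Assumed combination: A[i][t] = A_episode[i] + alpha * A_step[i][t].

    outcome_rewards: one scalar per trajectory in the rollout group.
    step_rewards: one list of implicit step rewards per trajectory.
    """
    episode_adv = normalize(outcome_rewards)
    # Normalize step rewards jointly over every step in the group.
    flat_step_adv = normalize([r for traj in step_rewards for r in traj])
    combined, offset = [], 0
    for ep_a, traj in zip(episode_adv, step_rewards):
        step_adv = flat_step_adv[offset:offset + len(traj)]
        combined.append([ep_a + alpha * a for a in step_adv])
        offset += len(traj)
    return combined
```

In this sketch, the PRM would be updated by minimizing `trajectory_dpo_loss` on pairs of trajectories ranked by outcome, and the policy would then be updated with the output of `combined_advantages` in a standard on-policy objective; how the paper actually ranks trajectories and weights the two advantage terms is not stated in the abstract.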
