From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning

Abstract

The outstanding capabilities of large language models (LLMs) render them a crucial component in various autonomous agent systems. While traditional methods depend on the inherent knowledge of LLMs without fine-tuning, more recent approaches have shifted toward the reinforcement learning strategy to further enhance agents' ability to solve complex interactive tasks with environments and tools. However, previous approaches are constrained by the sparse reward issue, where existing datasets solely provide a final scalar reward for each multi-step reasoning chain, potentially leading to ineffectiveness and inefficiency in policy learning. In this paper, we introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. Inheriting the spirit of novice-to-expert theory, we first compare the actions of the expert and the agent to automatically generate intermediate rewards for fine-grained optimization. Additionally, we propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment. Further theoretical analysis demonstrates that the action distribution of the agent can converge toward the expert action distribution over multiple training cycles. Experimental results across various datasets indicate that StepAgent outperforms existing baseline methods.
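To make the contrast between a single final reward and step-wise rewards concrete, the following is a minimal illustrative sketch in Python. It is not StepAgent's actual reward formulation, which the abstract describes only at a high level. The function names, the exact-match comparison rule, and the example action strings are all assumptions introduced here for illustration.

```python
from typing import List


def sparse_rewards(num_steps: int, final_reward: float) -> List[float]:
    """Sparse setting: only the last step of a trajectory carries a reward."""
    if num_steps <= 0:
        return []
    return [0.0] * (num_steps - 1) + [final_reward]


def stepwise_rewards(agent_actions: List[str], expert_actions: List[str]) -> List[float]:
    """Hypothetical step-wise signal (an assumption, not the paper's rule):
    1.0 where the agent's action matches the expert's action at the same
    step after trimming whitespace and lowercasing, else 0.0."""
    return [
        1.0 if agent.strip().lower() == expert.strip().lower() else 0.0
        for agent, expert in zip(agent_actions, expert_actions)
    ]


# Made-up trajectories for illustration only.
expert = ["search[red shoes]", "click[item 3]", "click[buy now]"]
agent = ["search[red shoes]", "click[item 7]", "click[buy now]"]

print(sparse_rewards(len(agent), final_reward=1.0))
print(stepwise_rewards(agent, expert))
```

In the sparse case the agent receives no indication of which step deviated from the expert, whereas the per-step comparison localizes the mismatch to the second action. A learned or implicit reward, as the abstract mentions, would replace the exact-match rule assumed here.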
