Scaling Autonomous Agents via Automatic Reward Modeling And Planning

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, LLMs still struggle with problems requiring multi-step decision-making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem-solving. Unlike pure text data, collecting large-scale decision-making data is challenging. Moreover, many powerful LLMs are only accessible through APIs, which hinders their fine-tuning for agent tasks due to cost and complexity. To address LLM agents' limitations, we propose a framework that can automatically learn a reward model from the environment without human annotations. This model can be used to evaluate the action trajectories of LLM agents and provide heuristics for task planning. Specifically, our approach involves employing one LLM-based agent to navigate an environment randomly, generating diverse action trajectories. Subsequently, a separate LLM is leveraged to assign a task intent and synthesize a negative response alongside the correct response for each trajectory. These triplets (task intent, positive response, and negative response) are then utilized as training data to optimize a reward model capable of scoring action trajectories. The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. In conclusion, our proposed framework represents a significant advancement in enhancing LLM agents' decision-making capabilities. By automating the learning of reward models, we overcome the challenges of data scarcity and API limitations, potentially revolutionizing the application of LLMs in complex and interactive environments. This research paves the way for more sophisticated AI agents capable of tackling a wide range of real-world problems requiring multi-step decision-making.
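
The triplet-based training described in the abstract can be illustrated with a minimal sketch. The abstract does not state the reward model's architecture or its training objective, so everything below is an assumption made for illustration: a toy linear scorer over word-count features, trained with a pairwise logistic (Bradley-Terry style) ranking loss so that the positive trajectory scores above the negative one for the same task intent. The names `Triplet`, `featurize`, `score`, `pairwise_loss`, and `train`, and the example trajectories, are hypothetical and not taken from the paper.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Triplet:
    intent: str    # task intent assigned to the trajectory
    positive: str  # trajectory that matches the intent
    negative: str  # synthesized trajectory that does not match


def featurize(intent: str, trajectory: str) -> Counter:
    # Toy features (assumption): word counts plus an intent-overlap count.
    words = trajectory.lower().split()
    feats = Counter(words)
    feats["__overlap__"] = len(set(words) & set(intent.lower().split()))
    return feats


def score(weights: dict, feats: Counter) -> float:
    # Linear reward: dot product of weights and features.
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())


def pairwise_loss(weights: dict, t: Triplet) -> float:
    # -log(sigmoid(r_pos - r_neg)), written as log(1 + exp(-diff)).
    diff = (score(weights, featurize(t.intent, t.positive))
            - score(weights, featurize(t.intent, t.negative)))
    return math.log1p(math.exp(-diff))


def train(triplets, epochs: int = 50, lr: float = 0.1) -> dict:
    weights: dict = {}
    for _ in range(epochs):
        for t in triplets:
            pos = featurize(t.intent, t.positive)
            neg = featurize(t.intent, t.negative)
            diff = score(weights, pos) - score(weights, neg)
            grad_scale = -1.0 / (1.0 + math.exp(diff))  # d(loss)/d(diff)
            for k in set(pos) | set(neg):
                # Counter returns 0 for missing keys, so pos[k] - neg[k] is safe.
                weights[k] = weights.get(k, 0.0) - lr * grad_scale * (pos[k] - neg[k])
    return weights


# Hypothetical triplets in the style of an online-shopping environment.
triplets = [
    Triplet("buy red running shoes",
            "search red running shoes ; click item ; buy",
            "search blue sandals ; click item ; buy"),
    Triplet("buy a steel water bottle",
            "search steel water bottle ; click item ; buy",
            "search plastic cup ; click item ; buy"),
]
weights = train(triplets)

# Using the learned reward as a planning heuristic: keep the best-scoring candidate.
intent = triplets[0].intent
candidates = [triplets[0].positive, triplets[0].negative]
best = max(candidates, key=lambda c: score(weights, featurize(intent, c)))
```

In this sketch the reward model never sees human labels: supervision comes only from the ordering inside each triplet. A practical system would presumably replace the linear scorer with a learned neural model, but the ranking objective and the "score candidates, keep the best" usage would follow the same pattern.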
