Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs

Abstract

Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks due to the difficulty in designing comprehensive and precise reward functions. This inherent difficulty curtails the broader application of RL within game environments characterized by diverse constraints. Preference-based reinforcement learning (PbRL) presents a pioneering framework that capitalizes on human preferences as pivotal reward signals, thereby circumventing the need for meticulous reward engineering. However, obtaining preference data from human experts is costly and inefficient, especially under conditions marked by complex constraints. To tackle this challenge, we propose an LLM-enabled automatic preference generation framework named LLM4PG, which harnesses the capabilities of large language models (LLMs) to abstract trajectories, rank preferences, and reconstruct reward functions to optimize conditioned policies. Experiments on tasks with complex language constraints demonstrated the effectiveness of our LLM-enabled reward functions, accelerating RL convergence and overcoming stagnation caused by slow or absent progress under original reward structures. This approach mitigates the reliance on specialized human knowledge and demonstrates the potential of LLMs to enhance RL's effectiveness in complex environments in the wild.
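
The abstract does not say how ranked preferences are turned back into a reward function. As an illustration only, the sketch below uses the Bradley-Terry preference model that is common in PbRL work; this is an assumption, not necessarily the LLM4PG method. The function names `preference_probability` and `preference_loss` are hypothetical, and the per-step rewards stand in for the outputs of a learned reward model.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that trajectory A is preferred over trajectory B.

    Assumes a Bradley-Terry model over summed per-step rewards
    (an illustrative choice, not taken from the paper).
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable logistic function of the return difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between the predicted preference and a label.

    label is 1.0 if A was ranked above B (for example by an LLM),
    and 0.0 if B was ranked above A.
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))
```

Minimizing a loss of this kind over many ranked trajectory pairs pushes the reward model to assign higher total reward to the preferred trajectory, and the resulting reward can then be used to train a policy.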
