Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning

Abstract

Reinforcement Learning (RL) suffers from sample inefficiency in sparse-reward domains, and the problem is more pronounced under stochastic transitions. Reward shaping is a well-studied approach to improving sample efficiency: it introduces intrinsic rewards that help the RL agent converge to an optimal policy faster. However, designing a useful reward shaping function for all desirable states in the Markov Decision Process (MDP) is challenging, even for domain experts. Given that Large Language Models (LLMs) have demonstrated impressive performance across a multitude of natural language tasks, we aim to answer the following question: "Can we obtain heuristics using LLMs for constructing a reward shaping function that can boost an RL agent's sample efficiency?" To this end, we leverage off-the-shelf LLMs to generate a plan for an abstraction of the underlying MDP. We then use this LLM-generated plan as a heuristic to construct the reward shaping signal for the downstream RL agent. By characterizing the type of abstraction based on the MDP horizon length, we analyze the quality of heuristics generated using an LLM, with and without a verifier in the loop. Our experiments across multiple domains with varying horizon lengths and numbers of sub-goals, drawn from the BabyAI environment suite and the Household, Mario, and Minecraft domains, show 1) the advantages and limitations of querying LLMs with and without a verifier to generate a reward shaping heuristic, and 2) a significant improvement in the sample efficiency of PPO, A2C, and Q-learning when guided by the LLM-generated heuristics.
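To make the idea of turning a plan into a shaping signal concrete, here is a minimal sketch in Python. It assumes a potential-based form of shaping, F(s, s') = gamma * phi(s') - phi(s), which is one common choice; the abstract does not state which form the paper uses. The names `plan_potential` and `shaped_reward`, the progress-fraction potential, and the example sub-goal labels are all hypothetical and used only for illustration. States are assumed to be already mapped to abstract, hashable labels.

```python
from typing import Callable, Hashable, Sequence

State = Hashable


def plan_potential(plan: Sequence[State]) -> Callable[[State], float]:
    """Build a potential phi from an ordered plan of abstract states.

    Assumption: a state that appears later in the plan is closer to the goal,
    so its potential is its (1-based) position divided by the plan length.
    If a state appears more than once, its last position is used.
    """
    progress = {s: (i + 1) / len(plan) for i, s in enumerate(plan)}
    # Abstract states that are not on the plan get potential 0.
    return lambda s: progress.get(s, 0.0)


def shaped_reward(
    reward: float,
    state: State,
    next_state: State,
    phi: Callable[[State], float],
    gamma: float = 0.99,
) -> float:
    """Environment reward plus the shaping term gamma * phi(s') - phi(s)."""
    return reward + gamma * phi(next_state) - phi(state)


# Hypothetical plan for a key-door task, as an LLM might return it.
plan = ["get_key", "open_door", "reach_goal"]
phi = plan_potential(plan)
bonus = shaped_reward(0.0, "get_key", "open_door", phi, gamma=1.0)
```

In this sketch, a transition that advances along the plan receives a positive bonus even when the environment reward is zero, and a transition that leaves the plan receives a negative one. Any learner that consumes scalar rewards (for example PPO, A2C, or Q-learning) could be trained on the output of `shaped_reward` in place of the raw reward.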
