Offline Reinforcement Learning with Imputed Rewards

Abstract

Offline Reinforcement Learning (ORL) offers a robust solution to training agents in applications where interactions with the environment must be strictly limited due to cost, safety, or lack of accurate simulation environments. Despite its potential to facilitate deployment of artificial agents in the real world, Offline Reinforcement Learning typically requires very many demonstrations annotated with ground-truth rewards. Consequently, state-of-the-art ORL algorithms can be difficult or impossible to apply in data-scarce scenarios. In this paper we propose a simple but effective Reward Model that can estimate the reward signal from a very limited sample of environment transitions annotated with rewards. Once the reward signal is modeled, we use the Reward Model to impute rewards for a large sample of reward-free transitions, thus enabling the application of ORL techniques. We demonstrate the potential of our approach on several D4RL continuous locomotion tasks. Our results show that, using only 1% of reward-labeled transitions from the original datasets, our learned reward model is able to impute rewards for the remaining 99% of the transitions, from which performant agents can be learned using Offline Reinforcement Learning.
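
The label-then-impute pipeline described above can be illustrated with a minimal sketch. The abstract does not specify the Reward Model's architecture or inputs, so everything here is an assumption made for illustration: a ridge-regularized linear regressor stands in for the Reward Model, the input is taken to be the concatenated (state, action, next state) vector, and the transitions are synthetic rather than drawn from D4RL. The function names `fit_reward_model` and `impute_rewards` are hypothetical.

```python
import numpy as np


def fit_reward_model(features, rewards, l2=1e-3):
    """Fit a ridge-regularized linear model (with bias) to labeled transitions.

    Assumption: a linear regressor stands in for the paper's Reward Model.
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ rewards)


def impute_rewards(weights, features):
    """Predict rewards for reward-free transitions."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ weights


# Synthetic stand-in for an offline dataset (not D4RL data).
rng = np.random.default_rng(0)
n, obs_dim, act_dim = 10_000, 6, 2
s = rng.normal(size=(n, obs_dim))
a = rng.normal(size=(n, act_dim))
s_next = s + 0.1 * rng.normal(size=(n, obs_dim))
features = np.hstack([s, a, s_next])

# Hidden ground-truth reward, used only to create labels and to evaluate.
true_w = rng.normal(size=features.shape[1])
true_r = features @ true_w + 0.01 * rng.normal(size=n)

# Keep reward labels for 1% of the transitions; treat the rest as reward-free.
labeled_idx = rng.choice(n, size=n // 100, replace=False)
mask = np.zeros(n, dtype=bool)
mask[labeled_idx] = True

# Train on the labeled 1%, then impute rewards for the remaining 99%.
weights = fit_reward_model(features[mask], true_r[mask])
rewards = true_r.copy()
rewards[~mask] = impute_rewards(weights, features[~mask])

mse = float(np.mean((rewards[~mask] - true_r[~mask]) ** 2))
print(f"labeled: {mask.sum()}, imputed: {(~mask).sum()}, imputation MSE: {mse:.6f}")
```

The resulting `rewards` array, which mixes the ground-truth labels with the imputed values, would then be handed to an offline RL algorithm in place of the fully labeled dataset. On real tasks the reward is generally not linear in the transition features, so a more expressive regressor would likely be needed.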
