Reinforcement Learning with Perturbed Rewards

Abstract

Recent studies have shown that reinforcement learning (RL) models are vulnerable in various noisy scenarios. For instance, the observed reward channel is often subject to noise in practice (e.g., when rewards are collected through sensors), and is therefore not credible. In addition, for applications such as robotics, a deep reinforcement learning (DRL) algorithm can be manipulated to produce arbitrary errors by receiving corrupted rewards. In this paper, we consider noisy RL problems with perturbed rewards, which can be approximated with a confusion matrix. We develop a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed. Our solution framework builds on existing RL/DRL algorithms and is the first to address the biased noisy reward setting without any assumptions on the true distribution (e.g., the zero-mean Gaussian noise assumed in previous works). The core ideas of our solution include estimating a reward confusion matrix and defining a set of unbiased surrogate rewards. We prove the convergence and sample complexity of our approach. Extensive experiments on different DRL platforms show that trained policies based on our estimated surrogate reward can achieve higher expected rewards, and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm is able to obtain 84.6% and 80.8% improvements in average score for five Atari games, with error rates of 10% and 30%, respectively.
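
To make the idea of an unbiased surrogate reward concrete, the following is a minimal sketch and not necessarily the exact construction used in the paper. It assumes a finite set of reward levels and a known, invertible confusion matrix C, where C[i, j] is the probability of observing level j when the true level is i. Under those assumptions, solving C r_hat = r yields surrogate values whose expectation under the noise equals the true reward. The function name `surrogate_rewards` and the error rates below are hypothetical, chosen only for illustration.

```python
import numpy as np

def surrogate_rewards(confusion, reward_levels):
    """Return r_hat solving C @ r_hat = r.

    Assumption: confusion[i, j] = P(observed level j | true level i),
    and the matrix is known and invertible. Then, for every true level i,
    sum_j C[i, j] * r_hat[j] = r[i], i.e. r_hat is unbiased for r.
    """
    C = np.asarray(confusion, dtype=float)
    r = np.asarray(reward_levels, dtype=float)
    return np.linalg.solve(C, r)

# Hypothetical binary reward {-1, +1} with asymmetric flip rates.
reward_levels = np.array([-1.0, 1.0])
e_minus, e_plus = 0.1, 0.3          # illustrative error rates, not from the paper
C = np.array([[1 - e_minus, e_minus],
              [e_plus, 1 - e_plus]])

r_hat = surrogate_rewards(C, reward_levels)   # approximately [-1.333, 2.0]

# Expected surrogate reward for each true level recovers the true reward.
expected = C @ r_hat
```

In use, an agent that observes noisy level j would be trained on `r_hat[j]` instead of the raw observed reward. In practice the confusion matrix is not given and has to be estimated from data, which this sketch does not cover.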
