Abstract
Recent studies have shown that reinforcement learning (RL) models arevulnerable in various noisy scenarios. For instance, the observed rewardchannel is often subject to noise in practice (e.g., when rewards are collectedthrough sensors), and is therefore not credible. In addition, for applicationssuch as robotics, a deep reinforcement learning (DRL) algorithm can bemanipulated to produce arbitrary errors by receiving corrupted rewards. In thispaper, we consider noisy RL problems with perturbed rewards, which can beapproximated with a confusion matrix. We develop a robust RL framework thatenables agents to learn in noisy environments where only perturbed rewards areobserved. Our solution framework builds on existing RL/DRL algorithms andfirstly addresses the biased noisy reward setting without any assumptions onthe true distribution (e.g., zeromean Gaussian noise as made in previousworks). The core ideas of our solution include estimating a reward confusionmatrix and defining a set of unbiased surrogate rewards. We prove theconvergence and sample complexity of our approach. Extensive experiments ondifferent DRL platforms show that trained policies based on our estimatedsurrogate reward can achieve higher expected rewards, and converge faster thanexisting baselines. For instance, the stateoftheart PPO algorithm is able toobtain 84.6% and 80.8% improvements on average score for five Atari games, witherror rates as 10% and 30% respectively.
Quick Read (beta)
Reinforcement Learning with Perturbed Rewards
Abstract
Recent studies have shown that reinforcement learning (RL) models are vulnerable in various noisy scenarios. For instance, the observed reward channel is often subject to noise in practice (e.g., when rewards are collected through sensors), and is therefore not credible. In addition, for applications such as robotics, a deep reinforcement learning (DRL) algorithm can be manipulated to produce arbitrary errors by receiving corrupted rewards. In this paper, we consider noisy RL problems with perturbed rewards, which can be approximated with a confusion matrix. We develop a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed. Our solution framework builds on existing RL/DRL algorithms and firstly addresses the biased noisy reward setting without any assumptions on the true distribution (e.g., zeromean Gaussian noise as made in previous works). The core ideas of our solution include estimating a reward confusion matrix and defining a set of unbiased surrogate rewards. We prove the convergence and sample complexity of our approach. Extensive experiments on different DRL platforms show that trained policies based on our estimated surrogate reward can achieve higher expected rewards, and converge faster than existing baselines. For instance, the stateoftheart PPO algorithm is able to obtain 84.6% and 80.8% improvements on average score for five Atari games, with error rates as 10% and 30% respectively.
Reinforcement Learning with Perturbed Rewards
Jingkang Wang University of Toronto & Vector Institute Toronto, Canada [email protected] Yang Liu University of California, Santa Cruz California, USA [email protected] Bo Li University of Illinois, Urbana–Champaign Illinois, USA [email protected]
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction
Designing a suitable reward function plays a critical role in building reinforcement learning models for realworld applications. Ideally, one would want to customize reward functions to achieve applicationspecific goals (?). In practice, however, it is difficult to design a reward function that produces credible rewards in the presence of noise. This is because the output from any reward function is subject to multiple kinds of randomness:

•
Inherent Noise. For instance, sensors on a robot will be affected by physical conditions such as temperature and lighting, and therefore will report back noisy observed rewards.

•
ApplicationSpecific Noise. In machine teaching tasks (?), when an RL agent receives feedback/instructions, different human instructors might provide drastically different feedback that leads to biased rewards for machine.

•
Adversarial Noise. ? have shown that by adding adversarial perturbation to each frame of the game, they can mislead pretrained RL policies arbitrarily.
Assuming an arbitrary noise model makes solving this noisy RL problem extremely challenging. Instead, we focus on a specific noisy reward model which we call perturbed rewards, where the observed rewards by RL agents are learnable. The perturbed rewards are generated via a confusion matrix that flips the true reward to another one according to a certain distribution. This is not a very restrictive setting (?) to start with, even considering that the noise could be adversarial: For instance, adversaries can manipulate sensors via reversing the reward value.
In this paper, we develop an unbiased reward estimator aided robust framework that enables an RL agent to learn in a noisy environment with observing only perturbed rewards. The main challenge is that the observed rewards are likely to be biased, and in RL or DRL the accumulated errors could amplify the reward estimation error over time. To the best of our knowledge, this is the first work addressing robust RL in the biased rewards setting (existing work need to assume the unbiased noise distribution). We do not require any assumption on the knowledge of true reward distribution or adversarial strategies, other than the fact that the generation of noises follows a reward confusion matrix. We address the issue of estimating the reward confusion matrices by proposing an efficient and flexible estimation module for settings with deterministic rewards.
? provided preliminary studies for this noisy reward problem and gave some general negative results. The authors proved a No Free Lunch theorem, which is, without any assumption about what the reward corruption is, all agents can be misled. Our results do not contradict with the results therein, as we consider a stochastic noise generation model (that leads to a set of perturbed rewards).
We analyze the convergence and sample complexity for the policy trained using our proposed method based on surrogate rewards, using $Q$Learning as an example. We then conduct extensive experiments on OpenAI Gym (?) and show that the proposed reward robust RL method achieves comparable performance with the policy trained using the true rewards. In some cases, our method even achieves higher cumulative reward  this is surprising to us at first, but we conjecture that the inserted noise together with our noiseremoval unbiased estimator add another layer of exploration, which proves to be beneficial in some settings.
Our contributions are summarized as follows: (1) We formulate and generalize the idea of defining a simple but effective unbiased estimator for true rewards under reinforcement learning setting. The proposed estimator helps guarantee the convergence to the optimal policy even when the RL agents only have noisy observations of the rewards. (2) We analyze the convergence to the optimal policy and the finite sample complexity of our rewardrobust RL methods, using $Q$Learning as the example. (3) Extensive experiments on OpenAI Gym show that our proposed algorithms perform robustly even at high noise rates. Code is online available: https://github.com/wangjksjtu/rlperturbedreward.
Related Work
Robust Reinforcement Learning
It is known that RL algorithms are vulnerable in noisy environments (?). Recent studies (?; ?; ?) show that learned RL policies can be easily misled with small perturbations in observations. The presence of noise is very common in realworld environments, especially in roboticsrelevant applications (?; ?). Consequently, robust RL algorithms have been widely studied, aiming to train a robust policy that is capable of withstanding perturbed observations (?; ?; ?) or transferring to unseen environments (?; ?). However, these algorithms mainly focus on noisy vision observations, instead of observed rewards. Some early works (?; ?; ?; ?) on noisy reward RL rely on the knowledge of unbiased noise distribution, which limits their applicability to more general biased rewards settings. A couple of recent works (?; ?) have looked into a parallel question of training robust RL algorithms with uncertainty in models.
Learning with Noisy Data
Learning appropriately with biased data has received quite a bit of attention in recent machine learning studies (?; ?; ?; ?; ?; ?). The idea of this line of works is to define unbiased surrogate loss functions to recover the true loss using the knowledge of the noise. Our work is the first to formally establish this extension both theoretically and empirically. Our quantitative understandings will provide practical insights when implementing reinforcement learning algorithms in noisy environments.
Problem Formulation and Preliminaries
In this section, we define our problem of learning from perturbed rewards in reinforcement learning. Throughout this paper, we will use perturbed reward and noisy reward interchangeably, considering that the noise could come from both intentional perturbation and natural randomness. In what follows, we formulate our Markov Decision Process (MDP) and reinforcement learning (RL) problem with perturbed rewards.
Reinforcement Learning: The NoiseFree Setting
Our RL agent interacts with an unknown environment and attempts to maximize the total of its collected reward. The environment is formalized as a Markov Decision Process (MDP), denoting as $\mathcal{M}=\u27e8\mathcal{S},\mathcal{A},\mathcal{R},\mathcal{P},\gamma \u27e9$. At each time $t$, the agent in state ${s}_{t}\in \mathcal{S}$ takes an action ${a}_{t}\in \mathcal{A}$, which returns a reward $r({s}_{t},{a}_{t},{s}_{t+1})\in \mathcal{R}$ (which we will also shorthand as ${r}_{t}$) ^{1}^{1} 1 We do not restrict the reward to deterministic in general, except for when we need to estimate the noises in the perturbed reward (Section 3.3)., and leads to the next state ${s}_{t+1}\in \mathcal{S}$ according to a transition probability kernel $\mathcal{P}$. $\mathcal{P}$ encodes the probability ${\mathbb{P}}_{a}({s}_{t},{s}_{t+1})$, and commonly is unknown to the agent. The agent’s goal is to learn the optimal policy, a conditional distribution $\pi (as)$ that maximizes the state’s value function. The value function calculates the cumulative reward the agent is expected to receive given it would follow the current policy $\pi $ after observing the current state ${s}_{t}$: ${V}^{\pi}(s)={\mathbb{E}}_{\pi}[{\sum}_{k=0}^{\mathrm{\infty}}{\gamma}^{k}{r}_{t+k+1}\mid {s}_{t}=s],$ where $0\le \gamma \le 1$ is a discount factor ($\gamma =1$ indicates an undiscounted MDP setting (?; ?; ?)). Intuitively, the agent evaluates how preferable each state is, given the current policy. From the Bellman Equation, the optimal value function is given by ${V}^{\ast}(s)={\mathrm{max}}_{a\in \mathcal{A}}{\sum}_{{s}_{t+1}\in \mathcal{S}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1})\left[{r}_{t}+\gamma {V}^{\ast}({s}_{t+1})\right].$ It is a standard practice for RL algorithms to learn a stateaction value function, also called the $Q$function. $Q$function denotes the expected cumulative reward if agent chooses $a$ in the current state and follows $\pi $ thereafter: ${Q}^{\pi}(s,a)={\mathbb{E}}_{\pi}[r({s}_{t},{a}_{t},{s}_{t+1})+\gamma {V}^{\pi}({s}_{t+1})\mid {s}_{t}=s,{a}_{t}=a].$
Perturbed Reward in RL
In many practical settings, the RL agent does not observe the reward feedback perfectly. We consider the following MDP with perturbed reward, denoting as $\stackrel{~}{\mathcal{M}}=\u27e8\mathcal{S},\mathcal{A},\mathcal{R},C,\mathcal{P},\gamma \u27e9$^{2}^{2} 2 The MDP with perturbed reward can equivalently be defined as a tuple $\stackrel{~}{\mathcal{M}}=\u27e8\mathcal{S},\mathcal{A},\mathcal{R},\stackrel{~}{\mathcal{R}},\mathcal{P},\gamma \u27e9$, with the perturbation function $C$ implicitly defined as the difference between $\mathcal{R}$ and $\stackrel{~}{\mathcal{R}}$.: instead of observing ${r}_{t}\in \mathcal{R}$ at each time $t$ directly (following his action), our RL agent only observes a perturbed version of ${r}_{t}$, denoting as ${\stackrel{~}{r}}_{t}\in \stackrel{~}{\mathcal{R}}$. For most of our presentations, we focus on the cases where $\mathcal{R}$, $\stackrel{~}{\mathcal{R}}$ are finite sets; but our results generalize to the continuous reward settings with discretization techinques.
The generation of $\stackrel{~}{r}$ follows a certain function $C:\mathcal{S}\times \mathcal{R}\to \stackrel{~}{\mathcal{R}}$. To let our presentation stay focused, we consider the following stateindependent flipping error rates model: if the rewards are binary (consider ${r}_{+}$ and ${r}_{}$), $\stackrel{~}{r}({s}_{t},{a}_{t},{s}_{t+1})$ (${\stackrel{~}{r}}_{t}$) can be characterized by the following noise rate parameters ${e}_{+},{e}_{}$: ${e}_{+}=\mathbb{P}(\stackrel{~}{r}({s}_{t},{a}_{t},{s}_{t+1})={r}_{}r({s}_{t},{a}_{t},{s}_{t+1})={r}_{+}),{e}_{}=\mathbb{P}(\stackrel{~}{r}({s}_{t},{a}_{t},{s}_{t+1})={r}_{+}r({s}_{t},{a}_{t},{s}_{t+1})={r}_{})$ . When the signal levels are beyond binary, suppose there are $M$ outcomes in total, denoting as $[{R}_{0},{R}_{1},\mathrm{\cdots},{R}_{M1}]$. ${\stackrel{~}{r}}_{t}$ will be generated according to the following confusion matrix ${\mathbf{C}}_{M\times M}$ where each entry ${c}_{j,k}$ indicates the flipping probability for generating a perturbed outcome: ${c}_{j,k}=\mathbb{P}({\stackrel{~}{r}}_{t}={R}_{k}{r}_{t}={R}_{j}).$ Again we’d like to note that we focus on settings with finite reward levels for most of our paper, but we provide discussions later on how to handle continuous rewards.
In the paper, we also generalize our solution to the case without knowing the noise rates (i.e., the reward confusion matrices) for settings in which the rewards for each (state, action) pair is deterministic, which is different from the assumption of knowing them as adopted in many supervised learning works (?). Instead we will estimate the confusion matrices in our framework.
Learning with Perturbed Rewards
In this section, we first introduce an unbiased estimator for binary rewards in our reinforcement learning setting when the error rates are known. This idea is inspired by (?), but we will extend the method to the multioutcome, as well as the continuous reward settings.
Unbiased Estimator for True Reward
With the knowledge of noise rates (reward confusion matrices), we are able to establish an unbiased approximation of the true reward in a similar way as done in (?). We will call such a constructed unbiased reward as a surrogate reward. To give an intuition, we start with replicating the results for binary reward $\mathcal{R}=\{{r}_{},{r}_{+}\}$ in our RL setting:
Lemma 1.
Let $r$ be bounded. Then, if we define,
$\begin{array}{c}\widehat{r}({s}_{t},{a}_{t},{s}_{t+1}):=\{\begin{array}{cc}\frac{(1{e}_{})\cdot {r}_{+}{e}_{+}\cdot {r}_{}}{1{e}_{+}{e}_{}}\hfill & (\stackrel{~}{r}({s}_{t},{a}_{t},{s}_{t+1})={r}_{+})\hfill \\ \frac{(1{e}_{+})\cdot {r}_{}{e}_{}\cdot {r}_{+}}{1{e}_{+}{e}_{}}\hfill & (\stackrel{~}{r}({s}_{t},{a}_{t},{s}_{t+1})={r}_{})\hfill \end{array}\hfill \end{array}$  (1) 
we have for any $r\mathit{}\mathrm{(}{s}_{t}\mathrm{,}{a}_{t}\mathrm{,}{s}_{t\mathrm{+}\mathrm{1}}\mathrm{)}$, ${\mathrm{E}}_{\stackrel{\mathrm{~}}{r}\mathrm{}r}\mathit{}\mathrm{[}\widehat{r}\mathit{}\mathrm{(}{s}_{t}\mathrm{,}{a}_{t}\mathrm{,}{s}_{t\mathrm{+}\mathrm{1}}\mathrm{)}\mathrm{]}\mathrm{=}r\mathit{}\mathrm{(}{s}_{t}\mathrm{,}{a}_{t}\mathrm{,}{s}_{t\mathrm{+}\mathrm{1}}\mathrm{)}\mathrm{.}$
In the standard supervised learning setting, the above property guarantees convergence  as more training data are collected, the empirical surrogate risk converges to its expectation, which is the same as the expectation of the true risk (due to unbiased estimators). This is also the intuition why we would like to replace the reward terms with surrogate rewards in our RL algorithms.
The above idea can be generalized to the multioutcome setting in a fairly straightforward way. Define $\widehat{\mathbf{R}}:=[\widehat{r}(\stackrel{~}{r}={R}_{0}),\widehat{r}(\stackrel{~}{r}={R}_{1}),\mathrm{\dots},\widehat{r}(\stackrel{~}{r}={R}_{M1})]$, where $\widehat{r}(\stackrel{~}{r}={R}_{k})$ denotes the value of the surrogate reward when the observed reward is ${R}_{k}$. Let $\mathbf{R}=[{R}_{0};{R}_{1};\mathrm{\cdots};{R}_{M1}]$ be the bounded reward matrix with $M$ values. We have the following results:
Lemma 2.
Suppose ${\mathrm{C}}_{M\mathrm{\times}M}$ is invertible. With defining:
$\widehat{\mathbf{R}}={\mathbf{C}}^{1}\cdot \mathbf{R}$  (2) 
we have for any $r\mathit{}\mathrm{(}{s}_{t}\mathrm{,}{a}_{t}\mathrm{,}{s}_{t\mathrm{+}\mathrm{1}}\mathrm{)}$, ${\mathrm{E}}_{\stackrel{\mathrm{~}}{r}\mathrm{}r}\mathit{}\mathrm{[}\widehat{r}\mathit{}\mathrm{(}{s}_{t}\mathrm{,}{a}_{t}\mathrm{,}{s}_{t\mathrm{+}\mathrm{1}}\mathrm{)}\mathrm{]}\mathrm{=}r\mathit{}\mathrm{(}{s}_{t}\mathrm{,}{a}_{t}\mathrm{,}{s}_{t\mathrm{+}\mathrm{1}}\mathrm{)}\mathrm{.}$
Continuous reward
When the reward signal is continuous, we discretize it into $M$ intervals, and view each interval as a reward level, with its value approximated by its middle point. With increasing $M$, this quantization error can be made arbitrarily small. Our method is then the same as the solution for the multioutcome setting, except for replacing rewards with discretized ones. Note that the finerdegree quantization we take, the smaller the quantization error  but we would suffer from learning a bigger reward confusion matrix. This is a tradeoff question that can be addressed empirically.
So far we have assumed knowing the confusion matrices and haven’t restricted our solution to any specific setting, but we will address this additional estimation issue focusing on determinisitc reward settings, and present our complete algorithm therein.
Convergence and Sample Complexity: $Q$Learning
We now analyze the convergence and sample complexity of our surrogate reward based RL algorithms (with assuming knowing $\mathbf{C}$), taking $Q$Learning as an example.
Convergence guarantee
First, the convergence guarantee is stated in the following theorem:
Theorem 1.
Given a finite MDP, denoting as $\widehat{\mathrm{M}}\mathrm{=}\mathrm{\u27e8}\mathrm{S}\mathrm{,}\mathrm{A}\mathrm{,}\widehat{\mathrm{R}}\mathrm{,}\mathrm{P}\mathrm{,}\gamma \mathrm{\u27e9}$, the $Q$learning algorithm with surrogate rewards, given by the update rule,
${Q}_{t+1}({s}_{t},{a}_{t})=(1{\alpha}_{t})Q({s}_{t},{a}_{t})+{\alpha}_{t}\left[{\widehat{r}}_{t}+\gamma \underset{b\in \mathcal{A}}{\mathrm{max}}Q({s}_{t+1},b)\right],$  (3) 
converges w.p.1 to the optimal $Q$function as long as ${\mathrm{\sum}}_{t}{\alpha}_{t}\mathrm{=}\mathrm{\infty}$ and $$.
Note that the term on the right hand of Eqn. (3) includes surrogate reward $\widehat{r}$ estimated using Eqn. (1) and Eqn. (2). Theorem 1 states that agents will converge to the optimal policy w.p.1 when replacing the rewards with surrogate rewards, despite of the noises in the observed rewards. This result is not surprising  though the surrogate rewards introduce larger variance, we are grateful of their unbiasedness, which grants us the convergence. In other words, the addition of the perturbed reward does not affect the convergence guarantees of $Q$Learning with surrogate rewards.
Sample complexity
To establish our sample complexity results, we first introduce a generative model following previous literature (?; ?; ?). This is a practical MDP setting to simplify the analysis.
Definition 1.
A generative model $G\mathit{}\mathrm{(}\mathrm{M}\mathrm{)}$ for an MDP $\mathrm{M}$ is a sampling model which takes a stateaction pair $\mathrm{(}{s}_{t}\mathrm{,}{a}_{t}\mathrm{)}$ as input, and outputs the corresponding reward $r\mathit{}\mathrm{(}{s}_{t}\mathrm{,}{a}_{t}\mathrm{)}$ and the next state ${s}_{t\mathrm{+}\mathrm{1}}$ randomly with the probability of ${\mathrm{P}}_{a}\mathit{}\mathrm{(}{s}_{t}\mathrm{,}{s}_{t\mathrm{+}\mathrm{1}}\mathrm{)}$, i.e., ${s}_{t\mathrm{+}\mathrm{1}}\mathrm{\sim}\mathrm{P}\mathrm{(}\mathrm{\cdot}\mathrm{}s\mathrm{,}a\mathrm{)}$.
Exact value iteration is impractical if the agents follow the generative models above exactly (?). Consequently, we introduce a phased QLearning which is similar to the ones presented in (?; ?) for the convenience of proving our sample complexity results. We briefly outline phased QLearning as follows  the complete description (Algorithm 2) can be found in Appendix A.
Definition 2.
Phased QLearning algorithm takes $m$ samples per phase by calling generative model $G\mathit{}\mathrm{(}\mathrm{M}\mathrm{)}$. It uses the collected $m$ samples to estimate the transition probability $\mathrm{P}$ and then update the estimated value function per phase. Calling generative model $G\mathit{}\mathrm{(}\widehat{\mathrm{M}}\mathrm{)}$ means that surrogate rewards $\widehat{r}$ are returned and used to update the value function.
The sample complexity of Phased $Q$Learning is given as follows:
Theorem 2.
(Upper Bound) Let $r\mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}{R}_{\mathrm{max}}\mathrm{]}$ be bounded reward, $\mathrm{C}$ be an invertible reward confusion matrix with $\mathrm{det}\mathit{}\mathrm{(}\mathrm{C}\mathrm{)}$ denoting its determinant. For an appropriate choice of $m$, the Phased $Q$Learning algorithm calls the generative model $G\mathit{}\mathrm{(}\widehat{\mathrm{M}}\mathrm{)}$ $O\mathit{}\mathrm{\left(}\frac{\mathrm{}\mathrm{S}\mathrm{}\mathit{}\mathrm{}\mathrm{A}\mathrm{}\mathit{}T}{{\u03f5}^{\mathrm{2}}\mathit{}{\mathrm{(}\mathrm{1}\mathrm{}\gamma \mathrm{)}}^{\mathrm{2}}\mathit{}\mathrm{det}\mathit{}{\mathrm{(}\mathrm{C}\mathrm{)}}^{\mathrm{2}}}\mathit{}\mathrm{log}\mathit{}\frac{\mathrm{}\mathrm{S}\mathrm{}\mathit{}\mathrm{}\mathrm{A}\mathrm{}\mathit{}T}{\delta}\mathrm{\right)}$ times in $T$ epochs, and returns a policy such that for all state $s\mathrm{\in}\mathrm{S}$, $\mathrm{\left}{V}_{\pi}\mathit{}\mathrm{(}s\mathrm{)}\mathrm{}{V}^{\mathrm{\ast}}\mathit{}\mathrm{(}s\mathrm{)}\mathrm{\right}\mathrm{\le}\u03f5\mathrm{,}\u03f5\mathrm{>}\mathrm{0}\mathrm{,}$ w.p. $$.
Theorem 2 states that, to guarantee the convergence to the optimal policy, the number of samples needed is no more than $O(1/\mathrm{det}{(\mathbf{C})}^{2})$ times of the one needed when the RL agent observes true rewards perfectly. This additional constant is the price we pay for the noise presented in our learning environment. When the noise level is high, we expect to see a much higher $1/\mathrm{det}{(\mathbf{C})}^{2}$; otherwise when we are in a lownoise regime , $Q$Learning can be very efficient with surrogate reward (?). Note that Theorem 2 gives the upper bound in discounted MDP setting; for undiscounted setting ($\gamma =1$), the upper bound is at the order of $O\left(\frac{\mathcal{S}\mathcal{A}{T}^{3}}{{\u03f5}^{2}\mathrm{det}{(\mathbf{C})}^{2}}\mathrm{log}\frac{\mathcal{S}\mathcal{A}T}{\delta}\right)$. This result is not surprising, as the phased $Q$Learning helps smooth out the noise in rewards in consecutive steps. We will experimentally test how the bias removal step performs without explicit phases.
While the surrogate reward guarantees the unbiasedness, we sacrifice the variance at each of our learning steps, and this in turn delays the convergence (as also evidenced in the sample complexity bound). It can be verified that the variance of surrogate reward is bounded when $\mathbf{C}$ is invertible, and it is always higher than the variance of true reward. This is summarized in the following theorem:
Theorem 3.
Let $r\mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}{R}_{\mathrm{max}}\mathrm{]}$ be bounded reward and confusion matrix $\mathrm{C}$ is invertible. Then, the variance of surrogate reward $\widehat{r}$ is bounded as follows: $\mathrm{Var}\mathit{}\mathrm{(}r\mathrm{)}\mathrm{\le}\mathrm{Var}\mathit{}\mathrm{(}\widehat{r}\mathrm{)}\mathrm{\le}\frac{{M}^{\mathrm{2}}}{\mathrm{det}\mathit{}{\mathrm{(}\mathrm{C}\mathrm{)}}^{\mathrm{2}}}\mathrm{\cdot}{R}_{\mathrm{max}}^{\mathrm{2}}\mathrm{.}$
To give an intuition of the bound, when we have binary reward, the variance for surrogate reward bounds as follows: $\mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}(r)\le \mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}(\widehat{r})\le \frac{4{R}_{\mathrm{max}}^{2}}{{(1{e}_{+}{e}_{})}^{2}}.$ As ${e}_{}+{e}_{+}\to 1$, the variance becomes unbounded and the proposed estimator is no longer effective, nor will it be welldefined.
Variance reduction
In practice, there is a tradeoff question between bias and variance by tuning a linear combination of $\mathbf{R}$ and $\widehat{\mathbf{R}}$, i.e., ${\mathbf{R}}_{proxy}=\eta \mathbf{R}+(1\eta )\widehat{\mathbf{R}}$, via choosing an appropriate $\eta \in [0,1]$. Other variance reduction techniques in RL with noisy environment, for instance (?), can be combined with our proposed bias removal technique too. We test them in the experiment section.
Estimation of Confusion Matrices
In previous solutions, we have assumed the knowledge of reward confusion matrices, in order to compute the surrogate reward. This knowledge is often not available in practice. Estimating these confusion matrices is challenging without knowing any ground truth reward information; but we would like to remark that efficient algorithms have been developed to estimate the confusion matrices in supervised learning settings (?; ?; ?; ?). The idea in these algorithms is to dynamically refine the error rates based on aggregated rewards. Note this approach is not different from the inference methods in aggregating crowdsourcing labels, as referred in the literature (?; ?; ?). We adapt this idea to our reinforcement learning setting, which is detailed as follows.
The estimation procedure is only for the case with deterministic reward, but not for stochastic rewards. The reason is that we will use repeated observations to refine an estimated ground truth reward, which will be leveraged to estimate the confusion matrix. With uncertainty in the true reward, it is not possible to distinguish a clean case with true reward $\mathbf{C}\cdot \mathbf{R}$ from the perturbed reward case with true reward $\mathbf{R}$ and added noise by confusion matrix $\mathbf{C}$.
At each training step, the RL agent collects the noisy reward and the current stateaction pair. Then, for each pair in $\mathcal{S}\times \mathcal{A}$, the agent predicts the true reward based on accumulated historical observations of reward for the corresponding stateaction pair via, e.g., averaging (majority voting). Finally, with the predicted true reward and the accuracy (error rate) for each stateaction pair, the estimated reward confusion matrices $\stackrel{~}{\mathbf{C}}$ are given by
$\overline{r}(s,a)=\underset{{R}_{i}\in \mathcal{R}}{\mathrm{arg}\mathrm{max}}\mathrm{\#}[\stackrel{~}{r}(s,a)={R}_{i}],$  (4)  
${\stackrel{~}{c}}_{i,j}={\displaystyle \frac{{\sum}_{(s,a)\in \mathcal{S}\times \mathcal{A}}\mathrm{\#}[\stackrel{~}{r}(s,a)={R}_{j}\overline{r}(s,a)={R}_{i}]}{{\sum}_{(s,a)\in \mathcal{S}\times \mathcal{A}}\mathrm{\#}[\overline{r}(s,a)={R}_{i}]}},$  (5) 
where in above $\mathrm{\#}[\cdot ]$ denotes the number of stateaction pair that satisfies the condition $[\cdot ]$ in the set of observed rewards $\stackrel{~}{R}(s,a)$ (see Algorithm 1 and 3); $\overline{r}(s,a)$ and $\stackrel{~}{r}(s,a)$ denote predicted true rewards (using majority voting) and observed rewards when the stateaction pair is $(s,a)$. We break potential ties in Eqn. (4) equally likely. The above procedure of updating ${\stackrel{~}{c}}_{i,j}$ continues indefinitely as more observation arrives. Our final definition of surrogate reward replaces a known reward confusion $\mathbf{C}$ in Eqn. (2) with our estimated one $\stackrel{~}{\mathbf{C}}$. We denote this estimated surrogate reward as $\dot{r}$.
We present (Reward Robust RL) in Algorithm 1^{3}^{3} 3 One complete $Q$Learning implementation (Algorithm 3) is provided in Appendix C.. Note that the algorithm is rather generic, and we can plug in any exisitng RL algorithm into our reward robust one, with only changes in replacing the rewards with our estimated surrogate rewards.
Experimental Results
In this section, we conduct extensive experiments to evaluate the noisy reward robust RL mechanism with different games, under various noise settings. Due to the space limit, more experimental results can be found in Appendix D.
Experimental Setup
Environments and RL Algorithms
To fully test the performance under different environments, we evaluate the proposed robust reward RL method on two classic control games (CartPole, Pendulum) and seven Atari 2600 games (AirRaid, Alien, Carnival, MsPacman, Pong, Phoenix, Seaquest), which encompass a large variety of environments, as well as rewards. Specifically, the rewards could be unary (CartPole), binary (most of Atari games), multivariate (Pong) and even continuous (Pendulum). A set of stateoftheart RL algorithms are experimented with, while training under different amounts of noise (See Table 3)^{4}^{4} 4 The detailed settings are accessible in Appendix B.. For each game and algorithm, unless otherwise stated, three policies are trained with different random initialization to decrease the variance.
Reward PostProcessing
For each game and RL algorithm, we test the performance for learning with true rewards, noisy rewards and surrogate rewards. Both symmetric and asymmetric noise settings with different noise levels are tested. For symmetric noise, the confusion matrices are symmetric. As for asymmetric noise, two types of random noise are tested: 1) randone, each reward level can only be perturbed into another reward; 2) randall, each reward could be perturbed to any other reward, via adding a random noise matrix. To measure the amount of noise w.r.t confusion matrices, we define the weight of noise $\omega $ in Appendix B. The larger $\omega $ is, the higher the noise rates are.
Robustness Evaluation
CartPole
The goal in CartPole is to prevent the pole from falling by controlling the cart’s direction and velocity. The reward is $+1$ for every step taken, including the termination step. When the cart or pole deviates too much or the episode length is longer than 200, the episode terminates. Due to the unary reward $\{+1\}$ in CartPole, a corrupted reward $1$ is added as the unexpected error (${e}_{}=0$). As a result, the reward space $\mathcal{R}$ is extended to $\{+1,1\}$. Five algorithms $Q$Learning (?), CEM (?), SARSA (?), DQN (?) and DDQN (?) are evaluated.
Noise Rate  Reward  $Q$Learn  CEM  SARSA  DQN  DDQN  DDPG  NAF $$ 

$\omega =0.1$  $\stackrel{~}{r}$  170.0  98.1  165.2  187.2  187.8  1.03  4.48 
$\widehat{r}$  165.8  108.9  173.6  200.0  181.4  0.87  0.89  
$\dot{\colorbox[rgb]{0.92156862745098,0.92156862745098,0.92156862745098}{$r$}}$  181.9  99.3  171.5  200.0  185.6  0.90  1.13  
$\omega =0.3$  $\stackrel{~}{r}$  134.9  28.8  144.4  173.4  168.6  1.23  4.52 
$\widehat{r}$  149.3  85.9  152.4  175.3  198.7  1.03  1.15  
$\dot{\colorbox[rgb]{0.92156862745098,0.92156862745098,0.92156862745098}{$r$}}$  161.1  82.2  159.6  186.7  200.0  1.05  1.36  
$\omega =0.7$  $\stackrel{~}{r}$  56.6  19.2  12.6  17.2  11.8  8.76  7.35 
$\widehat{r}$  177.6  87.1  151.4  185.8  195.2  1.09  2.26  
$\dot{\colorbox[rgb]{0.92156862745098,0.92156862745098,0.92156862745098}{$r$}}$  172.1  83.0  174.4  189.3  191.3  –  – 
Noise Rate  Reward  Lift ($\uparrow $)  Alien  Carnival  Phoenix  MsPacman  Seaquest 

$\omega =0.1$  $\stackrel{~}{r}$  –  1835.1  1239.3  4609.0  1709.1  849.2 
$\widehat{r}$  70.4%$\mathrm{\uparrow}$  1737.0  3966.8  7586.4  2547.3  1610.6  
$\dot{\colorbox[rgb]{0.92156862745098,0.92156862745098,0.92156862745098}{$r$}}$  84.6%$\colorbox[rgb]{0.92156862745098,0.92156862745098,0.92156862745098}{$\mathrm{\uparrow}$}$  2844.1  5515.0  5668.8  2294.5  2333.9  
$\omega =0.3$  $\stackrel{~}{r}$  –  538.2  919.9  2600.3  1109.6  408.7 
$\widehat{r}$  119.8%$\mathrm{\uparrow}$  1668.6  4220.1  4171.6  1470.3  727.8  
$\dot{\colorbox[rgb]{0.92156862745098,0.92156862745098,0.92156862745098}{$r$}}$  80.8%$\colorbox[rgb]{0.92156862745098,0.92156862745098,0.92156862745098}{$\mathrm{\uparrow}$}$  1542.9  4094.3  2589.1  1591.2  262.4  
$\omega =0.7$  $\stackrel{~}{r}$  –  495.2  380.3  126.5  491.6  0.0 
$\widehat{r}$  757.4%$\mathrm{\uparrow}$  1805.9  4088.9  4970.4  1447.8  492.5  
$\dot{\colorbox[rgb]{0.92156862745098,0.92156862745098,0.92156862745098}{$r$}}$  648.9%$\colorbox[rgb]{0.92156862745098,0.92156862745098,0.92156862745098}{$\mathrm{\uparrow}$}$  1618.0  4529.2  2792.1  1916.7  328.5  
In Figure 1, we show that our estimator successfully produces meaningful surrogate rewards that adapt the underlying RL algorithms to the noisy settings, without any assumption of the true distribution of rewards. With the noise rate increasing (from 0.1 to 0.9), the models with noisy rewards converge slower due to larger biases. However, we observe that the models (DQN and DDQN) always converge to the best score 200 with the help of surrogate rewards.
In some circumstances (slight noise  see Figure 0(b), 0(c)), the surrogate rewards even lead to faster convergence. This points out an interesting observation: learning with surrogate reward sometimes even outperforms the case with observing the true reward. We conjecture that the way of adding noise and then removing the bias (or moderate noise) introduces implicit exploration. This may also imply why some algorithms with estimated confusion matrices $\stackrel{~}{\mathbf{C}}$ leads to better results than with known $\mathbf{C}$ in some cases (Table 1).
Pendulum
The goal in Pendulum is to keep a frictionless pendulum standing up. Different from the CartPole setting, the rewards in pendulum are continuous: $r\in (16.28,0.0]$. The closer the reward is to zero, the better performance the model achieves. For simplicity, we firstly discretized $(17,0]$ into 17 intervals: $(17,16],(16,15],\mathrm{\cdots},(1,0]$, with its value approximated using its maximum point. After the quantization step, the surrogate rewards can be estimated using multioutcome extensions.
We experiment two popular algorithms, DDPG (?) and NAF (?) in this game. In Figure 2, both algorithms perform well with surrogate rewards under different amounts of noise. In most cases, the biases were corrected in the longterm, even when the amount of noise is extensive (e.g., $\omega =0.7$). The quantitative scores on CartPole and Pendulum are given in Table 1, where the scores are averaged based on the last 30 episodes. Our reward robust method is able to achieve good scores consistently.
Atari
We validate our algorithm on seven Atari 2600 games using the stateoftheart algorithm PPO (?). The games are chosen to cover a variety of environments. The rewards in the Atari games are clipped into $\{1,0,1\}$. We leave the detailed settings to Appendix B.
Results for PPO on Pongv4 in symmetric noise setting are presented in Figure 3. More results on other Atari games and noise settings are given in Appendix D. Similar to previous results, our surrogate estimator performs consistently well and helps PPO converge to the optimal policy. Table 2 shows the average scores of PPO on five selected Atari games with different amounts of noise (symmetric $\&$ asymmetric). In particular, when the noise rates ${e}_{+}={e}_{}>0.3$, agents with surrogate rewards obtain significant amounts of improvements in average scores. For the cases with unknown $\mathbf{C}$ ($\dot{r}$ in Table 2), due to the large statespace (imageinput) in confusion matrix estimation, we embed and consider the adjacent frames within a batch as the same state and set the memory size for states as 1,000. Please refer to Appendix B for details.
Compatible with Variance Reduction Techniques
As illustrated in Theorem 3, our surrogate rewards introduce larger variance while conducting unbiased estimation, which are likely to decrease the stability of RL algorithms. Apart from the linear combination idea (a linear tradeoff), some variance reduction techniques in statistics (e.g., correlated sampling) can also be applied to our method. Specially, ? proposed to use a reward estimator to compensate for stochastic corruptedreward signals. It is worthy to notice that their method is designed for variance reduction under zeromean noises, which is no longer efficacious in more general perturbedreward setting. However, it is potential to integrate their method with our robustreward RL framework because surrogate rewards provide unbiasedness guarantee.
To verify this idea, we repeated the experiments of Cartpole but included variance reduction step for estimated surrogate rewards. Following ?, we adopted sample mean as a simple approximator during the training and set sequence length as $100$. As shown in Figure 4, the models with only variance reduction technique (red lines) suffer from huge regrets, and in general do not converge to the optimal policies. Nevertheless, the variance reduction step helps surrogate rewards (purple lines) to achieve faster convergence or better performance in multiple cases. Similarly, Table 4 in Appendix C provides quantitative results which show that our surrogate reward benefits from variance reduction techniques (“ours + VRT”), especially when the noise rate is high.
Conclusions
Improving the robustness of RL in the settings with perturbed and noisy rewards is important given the fact that such noises are common when exploring a realworld scenario, such as sensor errors. In addition, in adversarial environments, perturbed reward could be leveraged Different robust RL algorithms have been proposed but they either only focus on the noisy observations or need strong assumption on the unbiased noise distribution for observed rewards. In this paper, we propose the first simple yet effective RL framework for dealing with biased noisy rewards. The convergence guarantee and finite sample complexity of $Q$Learning (or its variant) with estimated surrogate rewards are provided. To validate the effectiveness of our approach, extensive experiments are conducted on OpenAI Gym, showing that surrogate rewards successfully rescue models from misleading rewards even at high noise rates. We believe this work will further shed light on exploring robust RL approaches under different noisy rewards observations in realworld environments.
Acknowledgement
This work was supported by National Science Foundation award CCF1910100 and DARPA award ASED00009970.
References
Appendix A Proofs
Proof of Lemma 1.
For simplicity, we shorthand $\widehat{r}({s}_{t},{a}_{t},{s}_{t+1}),\stackrel{~}{r}({s}_{t},{a}_{t},{s}_{t+1}),r({s}_{t},{a}_{t},{s}_{t+1})$ as $\widehat{r},\stackrel{~}{r},r$, and let ${r}_{+},{r}_{},{\widehat{r}}_{+},{\widehat{r}}_{}$ denote the general reward levels and corresponding surrogate ones:
${\mathbb{E}}_{\stackrel{~}{r}r}(\widehat{r})={\mathbb{P}}_{\stackrel{~}{r}r}(\widehat{r}={\widehat{r}}_{}){\widehat{r}}_{}+{\mathbb{P}}_{\stackrel{~}{r}r}(\widehat{r}={\widehat{r}}_{+}){\widehat{r}}_{+}.$  (6) 
When $r={r}_{+}$, from the definition in Lemma 1:
${\mathbb{P}}_{\stackrel{~}{r}r}(\widehat{r}={\widehat{r}}_{})={e}_{+},{\mathbb{P}}_{\stackrel{~}{r}r}(\widehat{r}={\widehat{r}}_{+})=1{e}_{+}.$ 
Taking the definition of surrogate rewards Eqn. (1) into Eqn. (6), we have
${\mathbb{E}}_{\stackrel{~}{r}r}(\widehat{r})$  $={e}_{+}\cdot {\widehat{r}}_{}+(1{e}_{+})\cdot {\widehat{r}}_{+}$  
$={e}_{+}\cdot {\displaystyle \frac{(1{e}_{+}){r}_{7}{e}_{}{r}_{+}}{1{e}_{}{e}_{+}}}+(1{e}_{+})\cdot {\displaystyle \frac{(1{e}_{}){r}_{+}{e}_{+}{r}_{}}{1{e}_{}{e}_{+}}}={r}_{+}.$ 
Similarly, when $r={r}_{}$, it also verifies ${\mathbb{E}}_{\stackrel{~}{r}r}[\widehat{r}({s}_{t},{a}_{t},{s}_{t+1})]=r({s}_{t},{a}_{t},{s}_{t+1}).$ ∎
Proof of Lemma 2.
The idea of constructing unbiased estimator is easily adapted to multioutcome reward settings via writing out the conditions for the unbiasedness property (s.t. ${\mathbb{E}}_{\stackrel{~}{r}r}[\widehat{r}]=r.$). For simplicity, we shorthand $\widehat{r}(\stackrel{~}{r}={R}_{i})$ as ${\widehat{R}}_{i}$ in the following proofs. Similar to Lemma 1, we need to solve the following set of functions to obtain $\widehat{r}$:
$$\{\begin{array}{cc}\begin{array}{cc}\hfill {R}_{0}& ={c}_{0,0}\cdot {\widehat{R}}_{0}+{c}_{0,1}\cdot {\widehat{R}}_{1}+\mathrm{\cdots}+{c}_{0,M1}\cdot {\widehat{R}}_{M1}\hfill \\ \hfill {R}_{1}& ={c}_{1,0}\cdot {\widehat{R}}_{0}+{c}_{1,1}\cdot {\widehat{R}}_{1}+\mathrm{\cdots}+{c}_{1,M1}\cdot {\widehat{R}}_{M1}\hfill \\ & \mathrm{\cdots}\hfill \\ \hfill {R}_{M1}& ={c}_{M1,0}\cdot {\widehat{R}}_{0}+{c}_{M1,1}\cdot {\widehat{R}}_{1}+\mathrm{\cdots}+{c}_{M1,M1}\cdot {\widehat{R}}_{M1}\hfill \end{array}\hfill & \end{array}$$ 
where ${\widehat{R}}_{i}$ denotes the value of the surrogate reward when the observed reward is ${R}_{i}$. Define $\mathbf{R}:=[{R}_{0};{R}_{1};\mathrm{\cdots};{R}_{M1}]$, and $\widehat{\mathbf{R}}:=[{\widehat{R}}_{0},{\widehat{R}}_{1},\mathrm{\dots},{\widehat{R}}_{M1}]$, then the above equations are equivalent to: $\mathbf{R}=\mathbf{C}\cdot \widehat{\mathbf{R}}.$ If the confusion matrix $\mathbf{C}$ is invertible, we obtain the surrogate reward:
$\widehat{\mathbf{R}}={\mathbf{C}}^{1}\cdot \mathbf{R}.$ 
According to above definition, for any true reward level ${R}_{i},i=0,1,\mathrm{\cdots},M1$, we have
${\mathbb{E}}_{\stackrel{~}{r}r={R}_{i}}[\widehat{r}]={c}_{i,0}\cdot {\widehat{R}}_{0}+{c}_{i,1}\cdot {\widehat{R}}_{1}+\mathrm{\cdots}+{c}_{i,M1}\cdot {\widehat{R}}_{M1}={R}_{i}.$ 
∎
Furthermore, the probabilities for observing surrogate rewards can be written as follows:
$\widehat{\mathbf{P}}$  $=[{\widehat{p}}_{1},{\widehat{p}}_{2},\mathrm{\cdots},{\widehat{p}}_{M}]=[{\displaystyle \sum _{j}}{p}_{j}{c}_{j,1},{\displaystyle \sum _{j}}{p}_{j}{c}_{j,2},\mathrm{\cdots},{\displaystyle \sum _{j}}{p}_{j}{c}_{j,M}],$ 
where ${\widehat{p}}_{i}={\sum}_{j}{p}_{j}{c}_{j,i}$, and ${\widehat{p}}_{i}$, ${p}_{i}$ represent the probabilities of occurrence for surrogate reward ${\widehat{R}}_{i}$ and true reward ${R}_{i}$ respectively.
Corollary 1.
Let ${\widehat{p}}_{i}$ and ${p}_{i}$ denote the probabilities of occurrence for surrogate reward $\widehat{r}\mathrm{(}\stackrel{\mathrm{~}}{r}\mathrm{=}{R}_{i}\mathrm{)}$ and true reward ${R}_{i}$. Then the surrogate reward satisfies,
$\sum _{{s}_{t+1}\in \mathcal{S}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1})r({s}_{t},a,{s}_{t+1})={\displaystyle \sum _{j}}{p}_{j}{R}_{j}={\displaystyle \sum _{j}}{\widehat{p}}_{j}{\widehat{R}}_{j}.$  (7) 
Proof of Corollary 1.
From Lemma 2, we have,
$\sum _{{s}_{t+1}\in \mathcal{S}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1})r({s}_{t},a,{s}_{t+1})={\displaystyle \sum _{{s}_{t+1}\in \mathcal{S};{R}_{j}\in \mathcal{R}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1},{R}_{j}){R}_{j$  
$={\displaystyle \sum _{{R}_{j}\in \mathcal{R}}}{\displaystyle \sum _{{s}_{t+1}\in \mathcal{S}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1}){R}_{j}={\displaystyle \sum _{{R}_{j}\in \mathcal{R}}}{p}_{j}{R}_{j}={\displaystyle \sum _{j}}{p}_{j}{R}_{j}.$ 
Consequently,
$\sum _{j}}{\widehat{p}}_{j}{\widehat{R}}_{j$  $={\displaystyle \sum _{j}}{\displaystyle \sum _{k}}{p}_{k}{c}_{k,j}{\widehat{R}}_{j}={\displaystyle \sum _{k}}{p}_{k}{\displaystyle \sum _{j}}{c}_{k,j}{\widehat{R}}_{j}$  
$={\displaystyle \sum _{k}}{p}_{k}{R}_{k}={\displaystyle \sum _{{s}_{t+1}\in \mathcal{S}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1})r({s}_{t},a,{s}_{t+1}).$ 
∎
To establish Theorem 1, we need an auxiliary result (Lemma 3) from stochastic process approximation, which is widely adopted for the convergence proof for $Q$Learning (?; ?).
Lemma 3.
The random process $\mathrm{\{}{\mathrm{\Delta}}_{t}\mathrm{\}}$ taking values in ${\mathrm{R}}^{n}$ and defined as
$${\mathrm{\Delta}}_{t+1}(x)=(1{\alpha}_{t}(x)){\mathrm{\Delta}}_{t}(x)+{\alpha}_{t}(x){F}_{t}(x)$$ 
converges to zero w.p.1 under the following assumptions:

•
$0\le {\alpha}_{t}\le 1$, ${\sum}_{t}{\alpha}_{t}(x)=\mathrm{\infty}$ and $$;

•
$\mathbb{E}[{F}_{t}(x){\mathcal{F}}_{t}]{}_{W}\le \gamma {\mathrm{\Delta}}_{t}$, with $$;

•
$\text{\bm{v}\bm{a}\bm{r}}\left[{F}_{t}(x){\mathcal{F}}_{t}\right]\le C(1+{{\mathrm{\Delta}}_{t}}_{W}^{2})$, for $C>0$.
Here ${\mathrm{F}}_{t}\mathrm{=}\mathrm{\{}{\mathrm{\Delta}}_{t}\mathrm{,}{\mathrm{\Delta}}_{t\mathrm{}\mathrm{1}}\mathrm{,}\mathrm{\cdots}\mathrm{,}{F}_{t\mathrm{}\mathrm{1}}\mathit{}\mathrm{\cdots}\mathrm{,}{\alpha}_{t}\mathrm{,}\mathrm{\cdots}\mathrm{\}}$ stands for the past at step $t$, ${\alpha}_{t}\mathit{}\mathrm{(}x\mathrm{)}$ is allowed to depend on the past insofar as the above conditions remain valid. The notation $\mathrm{}\mathrm{}\mathrm{\cdot}\mathrm{}{\mathrm{}}_{W}$ refers to some weighted maximum norm.
Proof of Lemma 3.
See previous literature (?; ?). ∎
Proof of Theorem 1.
For simplicity, we abbreviate ${s}_{t}$, ${s}_{t+1}$, ${Q}_{t}$, ${Q}_{t+1}$, ${r}_{t}$, ${\widehat{r}}_{t}$ and ${\alpha}_{t}$ as $s$, ${s}^{\prime}$, $Q$, ${Q}^{\prime}$, $r$, $\widehat{r}$, and $\alpha $, respectively.
Subtracting from both sides the quantity ${Q}^{\ast}(s,a)$ in Eqn. (3):
${Q}^{\prime}(s,a){Q}^{\ast}(s,a)=$  $(1\alpha )\left(Q(s,a){Q}^{\ast}(s,a)\right)+\alpha \left[\widehat{r}+\gamma \underset{b\in \mathcal{A}}{\mathrm{max}}Q({s}^{\prime},b){Q}^{\ast}(s,a)\right].$ 
Let ${\mathrm{\Delta}}_{t}(s,a)=Q(s,a){Q}^{\ast}(s,a)$ and ${F}_{t}(s,a)=\widehat{r}+\gamma {\mathrm{max}}_{b\in \mathcal{A}}Q({s}^{\prime},b){Q}^{\ast}(s,a)$.
$${\mathrm{\Delta}}_{t+1}({s}^{\prime},a)=(1\alpha ){\mathrm{\Delta}}_{t}(s,a)+\alpha {F}_{t}(s,a).$$ 
In consequence,
$\mathbb{E}\left[{F}_{t}(x){\mathcal{F}}_{t}\right]$  $={\displaystyle \sum _{{s}^{\prime}\in \mathcal{S};\widehat{r}\in \mathcal{R}}}{\mathbb{P}}_{a}(s,{s}^{\prime},\widehat{r})\left[\widehat{r}+\gamma \underset{b\in \mathcal{A}}{\mathrm{max}}Q({s}^{\prime},b)\right]{Q}^{\ast}(s,a)$  
$={\displaystyle \sum _{{s}^{\prime}\in \mathcal{S};\widehat{r}\in \mathcal{R}}}{\mathbb{P}}_{a}(s,{s}^{\prime},\widehat{r})\widehat{r}+{\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}{\mathbb{P}}_{a}(s,{s}^{\prime})\left[\gamma \underset{b\in \mathcal{A}}{\mathrm{max}}Q({s}^{\prime},b)r\gamma \underset{b\in \mathcal{A}}{\mathrm{max}}{Q}^{\ast}({s}^{\prime},b)\right]$  
$={\displaystyle \sum _{{s}^{\prime}\in \mathcal{S};\widehat{r}\in \mathcal{R}}}{\mathbb{P}}_{a}(s,{s}^{\prime},\widehat{r})\widehat{r}{\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}{\mathbb{P}}_{a}(s,{s}^{\prime})r+{\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}{\mathbb{P}}_{a}(s,{s}^{\prime})\gamma \left[\underset{b\in \mathcal{A}}{\mathrm{max}}Q({s}^{\prime},b)\underset{b\in \mathcal{A}}{\mathrm{max}}{Q}^{\ast}({s}^{\prime},b)\right]$  
$={\displaystyle \sum _{j}}{\widehat{p}}_{j}{\widehat{r}}_{j}{\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}{\mathbb{P}}_{a}(s,{s}^{\prime})r+{\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}{\mathbb{P}}_{a}(s,{s}^{\prime})\gamma \left[\underset{b\in \mathcal{A}}{\mathrm{max}}Q({s}^{\prime},b)\underset{b\in \mathcal{A}}{\mathrm{max}}{Q}^{\ast}({s}^{\prime},b)\right]$  
$={\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}{\mathbb{P}}_{a}(s,{s}^{\prime})\gamma \left[\underset{b\in \mathcal{A}}{\mathrm{max}}Q({s}^{\prime},b)\underset{b\in \mathcal{A}}{\mathrm{max}}{Q}^{\ast}({s}^{\prime},b)\right]$  
$\le \gamma {\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}{\mathbb{P}}_{a}(s,{s}^{\prime})\underset{b\in \mathcal{A},{s}^{\prime}\in \mathcal{S}}{\mathrm{max}}\leftQ({s}^{\prime},b){Q}^{\ast}({s}^{\prime},b)\right$  
$=\gamma {\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}{\mathbb{P}}_{a}(s,{s}^{\prime}){Q{Q}^{\ast}}_{\mathrm{\infty}}=\gamma {Q{Q}^{\ast}}_{\mathrm{\infty}}=\gamma {{\mathrm{\Delta}}_{t}}_{\mathrm{\infty}}.$ 
Finally,
$\mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}\left[{F}_{t}(x){\mathcal{F}}_{t}\right]$  $=\mathbb{E}\left[{\left(\widehat{r}+\gamma \underset{b\in \mathcal{A}}{\mathrm{max}}Q({s}^{\prime},b){\displaystyle \sum _{{s}^{\prime}\in \mathcal{S};\widehat{r}\in \mathcal{R}}}{\mathbb{P}}^{\prime}(s,{s}^{\prime},\widehat{r})\left[\widehat{r}+\gamma \underset{b\in \mathcal{A}}{\mathrm{max}}Q({s}^{\prime},b)\right]\right)}^{2}\right]$  
$=\mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}\left[\widehat{r}+\gamma \underset{b\in \mathcal{A}}{\mathrm{max}}Q({s}^{\prime},b){\mathcal{F}}_{t}\right]$  
$.$ 
Because $\widehat{r}$ is bounded, it can be clearly verified that
$$\mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}\left[{F}_{t}(x){\mathcal{F}}_{t}\right]\le C(1+{{\mathrm{\Delta}}_{t}}_{W}^{2})$$ 
for some constant $C$. Then, due to the Lemma 3, ${\mathrm{\Delta}}_{t}$ converges to zero w.p.1, i.e., ${Q}^{\prime}(s,a)$ converges to ${Q}^{\ast}(s,a)$. ∎
The procedure of Phased $Q$Learning is described as Algorithm 2:
$${\widehat{\mathbb{P}}}_{a}({s}_{t},{s}_{t+1})=\frac{\mathrm{\#}[({s}_{t},{a}_{t})\to {s}_{t+1}]}{m}$$ 
$\widehat{V}({s}_{t})$  $=\underset{a\in \mathcal{A}}{\mathrm{max}}{\displaystyle \sum _{{s}_{t+1}\in \mathcal{S}}}{\widehat{\mathbb{P}}}_{a}({s}_{t},{s}_{t+1})\left[{r}_{t}+\gamma \widehat{V}({s}_{t+1})\right]$  
$\widehat{\pi}(s,t)$  $=\underset{a\in \mathcal{A}}{\mathrm{arg}\mathrm{max}}\widehat{V}({s}_{t})$ 
Note that $\widehat{\mathbb{P}}$ here is the estimated transition probability, which is different from $\mathbb{P}$ in Eqn. (7).
To obtain the sample complexity results, the range of our surrogate reward needs to be known. Assuming reward $r$ is bounded in $[0,{R}_{\mathrm{max}}]$, Lemma 4 below states that the surrogate reward is also bounded, when the confusion matrices are invertible:
Lemma 4.
Let $r\mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}{R}_{\mathrm{max}}\mathrm{]}$ be bounded, where ${R}_{\mathrm{max}}$ is a constant; suppose ${\mathrm{C}}_{M\mathrm{\times}M}$, the confusion matrix, is invertible with its determinant denoting as $\mathrm{det}\mathit{}\mathrm{(}\mathrm{C}\mathrm{)}$. Then the surrogate reward satisfies
$0\le \left\widehat{r}\right\le {\displaystyle \frac{M}{\mathrm{det}(\mathbf{C})}}{R}_{\mathrm{max}}.$  (8) 
Proof of Lemma 4.
From Eqn. (2), we have,
$\widehat{\mathbf{R}}={\mathbf{C}}^{1}\cdot \mathbf{R}={\displaystyle \frac{\mathrm{adj}(\mathbf{C})}{\mathrm{det}(\mathbf{C})}}\cdot \mathbf{R},$ 
where $\mathrm{adj}(\mathbf{C})$ is the adjugate matrix of $\mathbf{C}$; $\mathrm{det}(\mathbf{C})$ is the determinant of $\mathbf{C}$. It is known from linear algebra that,
$\mathrm{adj}{(\mathbf{C})}_{ij}={(1)}^{i+j}\cdot {\mathbf{M}}_{ji},$ 
where ${\mathbf{M}}_{ji}$ is the determinant of the $(M1)\times (M1)$ matrix that results from deleting row $j$ and column $i$ of $\mathbf{C}$. Therefore, ${\mathbf{M}}_{ji}$ is also bounded:
${\mathbf{M}}_{ji}\le {\displaystyle \sum _{\sigma \in {S}_{n}}}\left(\left\mathrm{sgn}(\sigma )\right{\displaystyle \prod _{m=1}}{c}_{m,{\sigma}_{n}}^{\prime}\right)\le {\displaystyle \prod _{m=0}^{M1}}\left({\displaystyle \sum _{n=0}^{M1}}{c}_{m,n}\right)={1}^{M}=1,$ 
where the sum is computed over all permutations $\sigma $ of the set $\{0,1,\mathrm{\cdots},M2\}$; ${c}^{\prime}$ is the element of ${\mathbf{M}}_{ji}$; $\mathrm{sgn}(\sigma )$ returns a value that is $+1$ whenever the reordering given by $\sigma $ can be achieved by successively interchanging two entries an even number of times, and $1$ whenever it can not.
Consequently,
$\left{\widehat{R}}_{i}\right={\displaystyle \frac{{\sum}_{j}\left\mathrm{adj}{(\mathbf{C})}_{ij}\right\cdot \left{R}_{j}\right}{\mathrm{det}(\mathbf{C})}}\le {\displaystyle \frac{M}{\mathrm{det}(\mathbf{C})}}\cdot {R}_{\mathrm{max}}.$ 
∎
Proof of Theorem 2.
From Hoeffding’s inequality, we obtain:
$P\left(\right{\displaystyle \sum _{{s}_{t+1}\in \mathcal{S}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1}){V}_{t+1}^{\ast}({s}_{t+1}){\displaystyle \sum _{{s}_{t+1}\in \mathcal{S}}}{\widehat{\mathbb{P}}}_{a}({s}_{t},{s}_{t+1}){V}_{t+1}^{\ast}({s}_{t+1})\ge \u03f5)\le 2\mathrm{exp}\left({\displaystyle \frac{2m{\u03f5}^{2}{(1\gamma )}^{2}}{{R}_{\mathrm{max}}^{2}}}\right),$ 
because ${V}_{t}({s}_{t})$ is bounded within $\frac{{R}_{\mathrm{max}}}{1\gamma}$. In the same way, ${\widehat{r}}_{t}$ is bounded by $\frac{M}{\mathrm{det}(\mathbf{C})}\cdot {R}_{\mathrm{max}}$ from Lemma 4. We then have,
$P\left(\right{\displaystyle \sum _{\begin{array}{c}{s}_{t+1}\in \mathcal{S}\\ {\widehat{r}}_{t}\in \widehat{\mathcal{R}}\end{array}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1},{\widehat{r}}_{t}){\widehat{r}}_{t}{\displaystyle \sum _{\begin{array}{c}{s}_{t+1}\in \mathcal{S}\\ {\widehat{r}}_{t}\in \widehat{\mathcal{R}}\end{array}}}{\widehat{\mathbb{P}}}_{a}({s}_{t},{s}_{t+1},{\widehat{r}}_{t}){\widehat{r}}_{t}\ge \u03f5)\le 2\mathrm{exp}\left({\displaystyle \frac{2m{\u03f5}^{2}\mathrm{det}{(\mathbf{C})}^{2}}{{M}^{2}{R}_{\mathrm{max}}^{2}}}\right).$ 
Further, due to the unbiasedness of surrogate rewards, we have
$\sum _{{s}_{t+1}\in \mathcal{S}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1}){r}_{t}={\displaystyle \sum _{{s}_{t+1}\in \mathcal{S};{\widehat{r}}_{t}\in \widehat{\mathcal{R}}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1},{\widehat{r}}_{t}){\widehat{r}}_{t}.$ 
As a result,
$\left{V}_{t}^{\ast}(s){\widehat{V}}_{t}(s)\right$  $=\underset{a\in \mathcal{A}}{\mathrm{max}}{\displaystyle \sum _{{s}_{t+1}\in \mathcal{S}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1})\left[{r}_{t}+\gamma {V}_{t+1}^{\ast}({s}_{t+1})\right]\underset{a\in \mathcal{A}}{\mathrm{max}}{\displaystyle \sum _{{s}_{t+1}\in \mathcal{S}}}{\widehat{\mathbb{P}}}_{a}({s}_{t},{s}_{t+1})\left[{\widehat{r}}_{t}+\gamma {V}_{t+1}^{\ast}({s}_{t+1})\right]$  
$\le {\u03f5}_{1}+\gamma \underset{a\in \mathcal{A}}{\mathrm{max}}\left{\displaystyle \sum _{{s}_{t+1}\in \mathcal{S}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1}){V}_{t+1}^{\ast}({s}_{t+1}){\displaystyle \sum _{{s}_{t+1}\in \mathcal{S}}}{\widehat{\mathbb{P}}}_{a}({s}_{t},{s}_{t+1}){V}_{t+1}^{\ast}({s}_{t+1})\right$  
$\mathrm{\hspace{1em}}+\underset{a\in \mathcal{A}}{\mathrm{max}}\left{\displaystyle \sum _{{s}_{t+1}\in \mathcal{S}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1}){r}_{t}{\displaystyle \sum _{{s}_{t+1}\in \mathcal{S};{\widehat{r}}_{t}\in \widehat{\mathcal{R}}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1},{\widehat{r}}_{t}){\widehat{r}}_{t}\right$  
$\le \gamma \underset{s\in \mathcal{S}}{\mathrm{max}}\left{V}_{t+1}^{\ast}(s){\widehat{V}}_{t+1}(s)\right+{\u03f5}_{1}+\gamma {\u03f5}_{2}$ 
In the same way,
$\left{V}_{t}(s){\widehat{V}}_{t}(s)\right\le \gamma \underset{s\in \mathcal{S}}{\mathrm{max}}\left{V}_{t+1}^{\ast}(s){\widehat{V}}_{t+1}(s)\right+{\u03f5}_{1}+\gamma {\u03f5}_{2}$ 
Recursing the two equations in two directions ($0\to T$), we get
$\underset{s\in \mathcal{S}}{\mathrm{max}}\left{V}^{\ast}(s)\widehat{V}(s)\right$  $\le ({\u03f5}_{1}+\gamma {\u03f5}_{2})+\gamma ({\u03f5}_{1}+\gamma {\u03f5}_{2})+\mathrm{\cdots}+{\gamma}^{T1}({\u03f5}_{1}+\gamma {\u03f5}_{2})$  
$={\displaystyle \frac{({\u03f5}_{1}+\gamma {\u03f5}_{2})(1{\gamma}^{T})}{1\gamma}}$  
$\underset{s\in \mathcal{S}}{\mathrm{max}}\leftV(s)\widehat{V}(s)\right$  $\le {\displaystyle \frac{({\u03f5}_{1}+\gamma {\u03f5}_{2})(1{\gamma}^{T})}{1\gamma}}$ 
Combining these two inequalities above we have:
$$\underset{s\in \mathcal{S}}{\mathrm{max}}\left{V}^{\ast}(s)V(s)\right\le 2\frac{({\u03f5}_{1}+\gamma {\u03f5}_{2})(1{\gamma}^{T})}{1\gamma}\le 2\frac{({\u03f5}_{1}+\gamma {\u03f5}_{2})}{1\gamma}.$$ 
Let ${\u03f5}_{1}={\u03f5}_{2}$, so ${\mathrm{max}}_{s\in \mathcal{S}}\left{V}^{\ast}(s)V(s)\right\le \u03f5$ as long as
$${\u03f5}_{1}={\u03f5}_{2}\le \frac{(1\gamma )\u03f5}{2(1+\gamma )}.$$ 
For arbitrarily small $\u03f5$, by choosing $m$ appropriately, there always exists ${\u03f5}_{1}={\u03f5}_{2}=\frac{(1\gamma )\u03f5}{2(1+\gamma )}$ such that the policy error is bounded within $\u03f5$. That is to say, the Phased QLearning algorithm can converge to the near optimal policy within finite steps using our proposed surrogate rewards.
Finally, there are $\mathcal{S}\mathcal{A}T$ transitions under which these conditions must hold, where $\cdot $ represent the number of elements in a specific set. Using a union bound, the probability of failure in any condition is smaller than
$$2\mathcal{S}\mathcal{A}T\cdot \mathrm{exp}\left(m\frac{{\u03f5}^{2}{(1\gamma )}^{2}}{2{(1+\gamma )}^{2}}\cdot \mathrm{min}\{{(1\gamma )}^{2},\frac{\mathrm{det}{(\mathbf{C})}^{2}}{{M}^{2}}\}\right).$$ 
We set the error rate less than $\delta $, and $m$ should satisfy that
$$m=O\left(\frac{1}{{\u03f5}^{2}{(1\gamma )}^{2}\mathrm{det}{(\mathbf{C})}^{2}}\mathrm{log}\frac{\mathcal{S}\mathcal{A}T}{\delta}\right).$$ 
In consequence, after $m\mathcal{S}\mathcal{A}T$ calls, which is, $O\left(\frac{\mathcal{S}\mathcal{A}T}{{\u03f5}^{2}{(1\gamma )}^{2}\mathrm{det}{(\mathbf{C})}^{2}}\mathrm{log}\frac{\mathcal{S}\mathcal{A}T}{\delta}\right)$, the value function converges to the optimal one for every state $s$, with probability greater than $1\delta $. ∎
The above bound is for discounted MDP setting with $$. For undiscounted setting $\gamma =1$, since the total error (for entire trajectory of $T$ timesteps) has to be bounded by $\u03f5$, therefore, the error for each time step has to be bounded by $\frac{\u03f5}{T}$. Repeating our anayslis, we obtain the following upper bound:
$$O\left(\frac{\mathcal{S}\mathcal{A}{T}^{3}}{{\u03f5}^{2}\mathrm{det}{(\mathbf{C})}^{2}}\mathrm{log}\frac{\mathcal{S}\mathcal{A}T}{\delta}\right).$$ 
Proof of Theorem 3.
$\mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}(\widehat{r})\mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}(r)$  $=\mathbb{E}\left[{(\widehat{r}\mathbb{E}[\widehat{r}])}^{2}\right]\mathbb{E}\left[{\left(r\mathbb{E}[r]\right)}^{2}\right]$  
$=\mathbb{E}[{\widehat{r}}^{2}]\mathbb{E}{[\widehat{r}]}^{2}+\mathbb{E}[{r}^{2}]\mathbb{E}{[r]}^{2}$  
$={\displaystyle \sum _{j}}{\widehat{p}}_{j}{\widehat{{R}_{j}}}^{2}{\left({\displaystyle \sum _{j}}{\widehat{p}}_{j}{\widehat{R}}_{j}\right)}^{2}\left[{\displaystyle \sum _{j}}{p}_{j}R_{j}{}^{2}{\left({\displaystyle \sum _{j}}{p}_{j}{R}_{j}\right)}^{2}\right]$  
$={\displaystyle \sum _{j}}{\widehat{p}}_{j}{\widehat{{R}_{j}}}^{2}{\displaystyle \sum _{j}}{p}_{j}R_{j}{}^{2}$  
$={\displaystyle \sum _{j}}{\displaystyle \sum _{i}}{p}_{i}{c}_{i,j}{\widehat{{R}_{j}}}^{2}{\displaystyle \sum _{j}}{p}_{j}{\left({\displaystyle \sum _{i}}{c}_{j,i}\widehat{{R}_{i}}\right)}^{2}$  
$={\displaystyle \sum _{j}}{p}_{j}\left({\displaystyle \sum _{i}}{c}_{j,i}{\widehat{{R}_{i}}}^{2}{\left({\displaystyle \sum _{i}}{c}_{j,i}\widehat{{R}_{i}}\right)}^{2}\right).$ 
Using the Cauchy–Schwarz inequality,
$\sum _{i}}{c}_{j,i}{\widehat{{R}_{i}}}^{2}={\displaystyle \sum _{i}}{\sqrt{{c}_{j,i}}}^{2}\cdot {\displaystyle \sum _{i}}{\left(\sqrt{{c}_{j,i}}\widehat{{R}_{i}}\right)}^{2}\ge {\left({\displaystyle \sum _{i}}{c}_{j,i}\widehat{{R}_{i}}\right)}^{2}.$ 
So we get,
$\mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}(\widehat{r})\mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}(r)\ge 0.$ 
In addition,
$\mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}(\widehat{r})$  $={\displaystyle \sum _{j}}{\widehat{p}}_{j}{\widehat{{R}_{j}}}^{2}{\left({\displaystyle \sum _{j}}{\widehat{p}}_{j}{\widehat{R}}_{j}\right)}^{2}\le {\displaystyle \sum _{j}}{\widehat{p}}_{j}{\widehat{{R}_{j}}}^{2}$  
$\le {\displaystyle \sum _{j}}{\widehat{p}}_{j}{\displaystyle \frac{{M}^{2}}{\mathrm{det}{(\mathbf{C})}^{2}}}\cdot {R}_{\mathrm{max}}^{2}={\displaystyle \frac{{M}^{2}}{\mathrm{det}{(\mathbf{C})}^{2}}}\cdot {R}_{\mathrm{max}}^{2}.$ 
∎
Appendix B Experimental Setup
We set up our experiments within the popular OpenAI baselines (?) and kerasrl (?) framework. Specifically, we integrate the algorithms and interact with OpenAI Gym (?) environments (Table 3).
Environment  RL Algorithm 

CartPole  $Q$Learning (?) 
CEM (?)  
SARSA (?)  
DQN (?; ?)  
DDQN (?)  
Pendulum  DDPG (?) 
NAF (?)  
Atari Games  PPO (?) 
RL Algorithms
A set of stateoftheart reinforcement learning algorithms are experimented with while training under different amounts of noise, including $Q$Learning (?; ?), CrossEntropy Method (CEM) (?), Deep SARSA (?), Deep $Q$Network (DQN) (?; ?; ?), Dueling DQN (DDQN) (?), Deep Deterministic Policy Gradient (DDPG) (?), Continuous DQN (NAF) (?) and Proximal Policy Optimization (PPO) (?) algorithms. For each game and algorithm, three policies are trained based on different random initialization to decrease the variance in experiments.
PostProcessing Rewards
We explore both symmetric and asymmetric noise of different noise levels. For symmetric noise, the confusion matrices are symmetric, which means the probabilities of corruption for each reward choice are equivalent. For instance, a confusion matrix
$$\mathbf{C}=\left[\begin{array}{cc}\hfill 0.8\hfill & \hfill 0.2\hfill \\ \hfill 0.2\hfill & \hfill 0.8\hfill \end{array}\right]$$ 
says that ${r}_{1}$ could be corrupted into ${r}_{2}$ with a probability of 0.2 and so does ${r}_{2}$ (weight = 0.2).
As for asymmetric noise, two types of random noise are tested: 1) randone, each reward level can only be perturbed into another reward; 2) randall, each reward could be perturbed to any other reward. To measure the amount of noise w.r.t confusion matrices, we define the weight of noise as follows:
$$\mathbf{C}=(1\omega )\cdot \mathbf{I}+\omega \cdot \mathbf{N},\omega \in [0,1],$$ 
where $\omega $ controls the weight of noise; $\mathbf{I}$ and $\mathbf{N}$ denote the identity and noise matrix respectively. Suppose there are $M$ outcomes for true rewards, $\mathbf{N}$ writes as:
$$\mathbf{N}=\left[\begin{array}{cccc}\hfill {n}_{0,0}\hfill & \hfill {n}_{0,1}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill {n}_{0,M1}\hfill \\ \hfill \mathrm{\cdots}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill \mathrm{\cdots}\hfill \\ \hfill {n}_{M1,0}\hfill & \hfill {n}_{M1,1}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill {n}_{M1,M1}\hfill \end{array}\right],$$ 
where for each row $i$, 1) randone: randomly choose $j$, s.t ${n}_{i,j}=1$ and ${n}_{i,k}\ne 0$ if $k\ne j$; 2) randall: generate $M$ random numbers that sum to 1, i.e., ${\sum}_{j}{n}_{i,j}=1$. For the simplicity, for symmetric noise, we choose $\mathbf{N}$ as an antiidentity matrix. As a result, ${c}_{i,j}=0$, if $i\ne j$ or $i+j\ne M$.
PerturbedReward MDP Example
To obtain an intuitive view of the reward perturbation model, where the observed rewards are generated based on a reward confusion matrix, and meanwhile evaluate our estimation algorithm’s robustness to timevariant noise, we constructed a simple MDP and evaluated the performance of robust reward $Q$Learning (Algorithm 1) on different noise ratios (both symmetric and asymmetric). The finite MDP is formulated as Figure 4(a): when the agent reaches state 5, it gets an instant reward of ${r}_{+}=1$, otherwise a zero reward ${r}_{}=0$. During the explorations, the rewards are perturbed according to the confusion matrix ${\mathbf{C}}_{2\times 2}=[1{e}_{},{e}_{};{e}_{+},1{e}_{+}]$.
For timevariant noise, we generated varying amount of noise at different training stages: 1) ${e}_{}=0.1,{e}_{+}=0.3$ ($0$ to $1{e}^{4}$ steps); 2) ${e}_{}=0.2,{e}_{+}=0.1$ ($1{e}^{4}$ to $3{e}^{4}$ steps); 3) ${e}_{}=0.3,{e}_{+}=0.2$ ($3{e}^{4}$ to $5{e}^{4}$ steps); 4) ${e}_{}=0.1,{e}_{+}=0.2$ ($5{e}^{4}$ to $7{e}^{4}$ steps). In Figure 4(b), we show that Algorithm 1 is robust against timevariant noise, which dynamically adjusts the estimated $\stackrel{~}{\mathbf{C}}$ after the noise distribution changes. Note that we set a maximum memory size for collected noisy rewards to let the agents only learn with recent observations.
Training Details
CartPole and Pendulum
The policies use the default network from kerasrl framework. which is a fivelayer fully connected network^{5}^{5} 5 https://github.com/kerasrl/kerasrl/examples. There are three hidden layers, each of which has 16 units and followed by a rectified nonlinearity. The last output layer is activated by the linear function. For CartPole, We trained the models using Adam optimizer with the learning rate of $1{e}^{3}$ for 10,000 steps. The exploration strategy is Boltzmann policy. For DQN and DuelingDQN, the update rate of target model and the memory size are $1{e}^{2}$ and $50,000$. For Pendulum, We trained DDPG and NAF using Adam optimizer with the learning rate of $5{e}^{4}$ for $150,000$ steps. the update rate of target model and the memory size are $1{e}^{3}$ and $100,000$.
Atari Games
We adopt the preprocessing steps as well as the network architecture from (?). Specifically, the input to the network is $84\times 84\times 4$, which is a concatenation of the last 4 frames and converted into $84\times 84$ grayscale. The network comprises three convolutional layers and two fully connected layers^{6}^{6} 6 https://github.com/openai/baselines/tree/master/baselines/common. The kernel size of three convolutional layer are $8\times 8$ with stride 4 (32 filters), $4\times 4$ with stride 2 (64 filters) and $3\times 3$ with stride 1 (64 filters), respectively. Each hidden layer is followed by a rectified nonlinearity. Except for Pong where we train the policies for $3{e}^{7}$ steps, all the games are trained for $5{e}^{7}$ steps with the learning rate of $3{e}^{4}$. Note that the rewards in the Atari games are discrete and clipped into $\{1,0,1\}$. Except for Pong game, in which $r=1$ means missing the ball hit by the adversary, the agents in other games attempt to get higher scores in the episode with binary rewards $0$ and $1$.
Discretization for Continuous States
To apply proposed estimation algorithm to continuousstate MDPs, we adopt a discretization procedure similar to the preprocessing of continuous rewards As stated before, there is a also tradeoff between the quantization error as well as the estimation complexity. However, in practice, we found that the estimation step is highly robust to the quantization level.
For Cartpole, the observations (states) are speed and velocity of the cart, and we discretized them into $8\text{(speeds)}\times 10\text{(velocity)}=80$ independent states for collecting noisy rewards for each stateaction pair. In inverted Pendulum swingup problem, the states ($\mathrm{cos}\theta \in [1.0,1.0];\mathrm{sin}\theta \in [1.0,1.0];d\theta /dt\in [8.0,+8.0]$, $\theta $ denotes the rotation degree of pendulum) are discretized into $20(\mathrm{cos}\theta )\times 20(sin\theta )\times 40(d\theta /dt)=16,000$ states. When the statespace is highdimensional (e.g., the image inputs for Atari games), we propose a batchbased adjacency embedding policy. In particular, we embedded a batch (32) of adjacent image observations as one single state. For the consideration of time dependency and efficiency, we set a “state queue” which only records the noisy rewards for the latest 1,000 states. The confusion matrices are reestimated based on current collections of observed noisy rewards every 100 steps.
Appendix C Estimation of Confusion Matrices
Reward Robust RL Algorithms
As stated in proposed reward robust RL framework, the confusion matrix can be estimated dynamically based on the aggregated answers, similar to previous literature in supervised learning (?). To get a concrete view, we take $Q$Learning for an example, and the algorithm is called Reward Robust $Q$Learning (Algorithm 3). Note that is can be extended to other RL algorithms by plugging confusion matrix estimation steps and the computed surrogate rewards, as shown in the experiments (Figure 6).
StateDependent Perturbed Reward
In previous sections, to let our presentation stay focused, we consider the stateindependent perturbed reward environments, which share the same confusion matrix for all states. In other words, the noise for different states is generated within the same distribution. More generally, the generation of $\stackrel{~}{r}$ follows a certain function $C:\mathcal{S}\times \mathcal{R}\to \stackrel{~}{R}$, where different states may correspond to varied noise distributions (also varied confusion matrices). However, our algorithm is still applicable. One intuitive solution is to maintain different confusion matrices ${\mathbf{C}}_{s}$ for different states. It is worthy to notice that Theorem 1 holds because the surrogate rewards produce an unbiased estimation of true rewards for each state, i.e.,
${\mathbb{E}}_{\stackrel{~}{r}r,{s}_{t}}[\widehat{r}({s}_{t},{a}_{t},{s}_{t+1})]=r({s}_{t},{a}_{t},{s}_{t+1}).$ 
Then we have,
${\mathbb{E}}_{\stackrel{~}{r}r}[\widehat{r}({s}_{t},{a}_{t},{s}_{t+1})]$  $={\displaystyle \sum _{s\in \mathcal{S}}}{\mathbb{P}}_{a}({s}_{t},{s}_{t+1})r({s}_{t},{a}_{t},{s}_{t+1})=r({s}_{t},{a}_{t},{s}_{t+1})$ 
Theorem 4.
(Upper bound) Let $r\mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}{R}_{\mathrm{max}}\mathrm{]}$ be bounded reward, ${\mathrm{C}}_{s}$ be invertible reward confusion matrices with $\mathrm{det}\mathit{}\mathrm{(}{\mathrm{C}}_{s}\mathrm{)}$ denoting its determinant. For an appropriate choice of $m$, the Phased $Q$Learning algorithm calls the generative model $G\mathit{}\mathrm{(}\widehat{\mathrm{M}}\mathrm{)}$
$$O\left(\frac{\mathcal{S}\mathcal{A}T}{{\u03f5}^{2}{(1\gamma )}^{2}{\mathrm{min}}_{s\in \mathcal{S}}{\{\mathrm{det}({\mathbf{C}}_{s})\}}^{2}}\mathrm{log}\frac{\mathcal{S}\mathcal{A}T}{\delta}\right)$$ 
times in $T$ epochs, and returns a policy such that for all state $s\mathrm{\in}\mathrm{S}$, $\mathrm{\left}{V}_{\pi}\mathit{}\mathrm{(}s\mathrm{)}\mathrm{}{V}^{\mathrm{\ast}}\mathit{}\mathrm{(}s\mathrm{)}\mathrm{\right}\mathrm{\le}\u03f5\mathrm{,}\u03f5\mathrm{>}\mathrm{0}\mathrm{,}$ w.p. $$.
Theorem 5.
Let $r\mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}{R}_{\mathrm{max}}\mathrm{]}$ be bounded reward and all confusion matrices ${\mathrm{C}}_{s}$ are invertible. Then, the variance of surrogate reward $\widehat{r}$ is bounded as follows:
$$\mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}(r)\le \mathrm{\mathbf{V}\mathbf{a}\mathbf{r}}(\widehat{r})\le \frac{{M}^{2}}{{\mathrm{min}}_{s\in \mathcal{S}}{\{\mathrm{det}({\mathbf{C}}_{s})\}}^{2}}\cdot {R}_{\mathrm{max}}^{2}.$$ 
Let ${\stackrel{~}{c}}_{i,js}$ represents the entry of confusion matrix ${\mathbf{C}}_{s}$, indicating the flipping probability for generating a perturbed outcome for state $s$, i.e., ${\stackrel{~}{c}}_{i,js}=\mathbb{P}({\stackrel{~}{r}}_{t}={R}_{k}{r}_{t}={R}_{j},s)$. Then the estimation step (see Eqn (5)) should be replaced by
${\stackrel{~}{c}}_{i,js}={\displaystyle \frac{{\sum}_{a\in \mathcal{A}}\mathrm{\#}[\stackrel{~}{r}(s,a)={R}_{j}\overline{r}(s,a)={R}_{i}]}{{\sum}_{a\in \mathcal{A}}\mathrm{\#}[\overline{r}(s,a)={R}_{i}]}}.$ 
Experimental Results
To validate the effectiveness of robust reward algorithms (like Algorithm 3), where the noise rates are unknown to the agents, we conduct extensive experiments in CartPole. It is worthwhile to notice that the noisy rates are unknown in the explorations of RL agents. Besides, we discretize the observation (velocity, angle, etc.) to construct a set of states and implement like Algorithm 3. The $\eta $ is set $1.0$ in the experiments.
Figure 6 provides learning curves from five algorithms with different kinds of rewards. The proposed estimation algorithms successfully obtain the approximate confusion matrices, and are robust in the unknown noise environments. From Figure 7, we can observe that the estimation of confusion matrices converges very fast. The results are inspiring because we don’t assume any additional knowledge about noise or true reward distribution in the implementation.
Noise Rate  Reward  $Q$Learn  CEM  SARSA  DQN  DDQN $$ 

$\omega =0.1$  VRT  173.5  99.7  167.3  181.9  187.4 
ours ($\dot{r}$)  181.9  99.3  171.5  200.0  185.6  
ours + VRT  184.5  98.2  174.2  199.3  186.5  
$\omega =0.3$  VRT  140.4  43.9  149.8  182.7  177.6 
ours ($\dot{r}$)  161.1  81.8  159.6  186.7  200.0  
ours + VRT  161.6  82.2  159.8  188.4  198.2  
$\omega =0.7$  VRT  71.1  16.1  13.2  15.6  14.7 
ours ($\dot{r}$)  172.1  83.0  174.4  189.3  191.3  
ours + VRT  182.3  79.5  178.9  195.9  194.2 