### Abstract

We demonstrate a method for separately estimating aleatoric risk andepistemic uncertainty in deep reinforcement learning. Aleatoric risk, whicharises from inherently stochastic environments or agents, must be accounted forin the design of risk-sensitive algorithms. Epistemic uncertainty, which stemsfrom limited data, is important both for risk-sensitivity and to efficientlyexplore an environment. We present a Bayesian framework for learning the returndistribution in reinforcement learning, which provides theoretical foundationsfor quantifying both types of uncertainty. The variance of the returndistribution yields the aleatoric uncertainty, and our Bayesian formulationprovides the epistemic uncertainty. Based on our framework, we show that thedisagreement between only two neural networks is sufficient to produce anestimate of the epistemic uncertainty on the expected return, thus providing asimple and computationally cheap uncertainty metric. We demonstrate experimentsthat illustrate our method and some applications.

### Quick Read (beta)

###### Abstract

We demonstrate a method for separately estimating aleatoric risk and epistemic uncertainty in deep reinforcement learning. Aleatoric risk, which arises from inherently stochastic environments or agents, must be accounted for in the design of risk-sensitive algorithms. Epistemic uncertainty, which stems from limited data, is important both for risk-sensitivity and to efficiently explore an environment. We present a Bayesian framework for learning the return distribution in reinforcement learning, which provides theoretical foundations for quantifying both types of uncertainty. The variance of the return distribution yields the aleatoric uncertainty, and our Bayesian formulation provides the epistemic uncertainty. Based on our framework, we show that the disagreement between only two neural networks is sufficient to produce an estimate of the epistemic uncertainty on the expected return, thus providing a simple and computationally cheap uncertainty metric. We demonstrate experiments that illustrate our method and some applications.

Estimating Risk and Uncertainty in Deep Reinforcement Learning

William R. Clements${}^{\mathrm{1}}$ Benoît-Marie Robaglia${}^{\mathrm{1}}$ Bastien Van Delft${}^{\mathrm{1}}$ Reda Bahi Slaoui${}^{\mathrm{1}\mathrm{,}\mathrm{2}}$ Sébastien Toth${}^{\mathrm{1}}$ ${}^{1}$Uncharted Technologies, Paris, France ${}^{2}$Ecole des Ponts Paristech, Champs-sur-Marne, France

Reinforcement learning algorithms [sutton2018reinforcement] are faced with two types of uncertainty: aleatoric uncertainty and epistemic uncertainty [knight2012risk]. Aleatoric uncertainty or risk is uncertainty that stems from inherent randomness in the environment, which may be characterized. For example, given an unbiased coin, we may not know what the outcome of the next flip will be, but we may be able to assign a probability to each outcome. Epistemic uncertainty, or parametric uncertainty, is uncertainty stemming from imperfect knowledge of the environment, which may be decreased with more information. For instance, given a biased coin that either always yields heads or always yields tails, the uncertainty on the outcome of the first coin flip is epistemic. However, the uncertainty vanishes after observing the outcome of the first flip.

Distinguishing between both types of uncertainty is important in reinforcement learning [osband2016risk, moerland2017efficient, nikolov2019information]. With the epistemic uncertainty, exploring a new environment can be done more efficiently, since actions can be taken to identify and better explore poorly known states. However, aleatoric risk can be irrelevant or even damaging [russo2014learning] for exploration. Conversely, when designing risk-aware policies, the aleatoric risk can be more informative than the epistemic uncertainty, since a full description of the environmental uncertainty can allow planning for all possible outcomes. Conflating both types of uncertainty can be lead to inadequate exploration or risk-awareness.

We propose a neural network method for separately estimating aleatoric risk and epistemic uncertainty in reinforcement learning, which is applicable to both stochastic and deterministic environments. We use the distributional reinforcement learning framework, which aims to learn the entire return distribution instead of only its expected value [bellemare2017distributional], to describe the aleatoric risk. Our main contributions are to show that 1) distributional reinforcement learning can be framed as a Bayesian inference problem, which allows us to introduce the notion of epistemic uncertainty in this setting, 2) the disagreement between an ensemble consisting of only two estimators of the return distribution provides a low-variance estimate of the epistemic uncertainty, and 3) these uncertainty metrics can successfully be used in complex reinforcement learning domains.

This paper is structured as follows. In sections 1 and 2, we provide an introduction to uncertainty estimation and learning the return distribution in reinforcement learning. Section 3 presents our Bayesian framework for distributional reinforcement learning, and shows how epistemic uncertainty can be estimated in this framework. Section 4 then experimentally illustrates our method for estimating uncertainties and some of its applications.

## 1 Background and related work

There has been increasing interest in developing practical methods for uncertainty estimation in reinforcement learning, with prior work mostly focusing either on aleatoric risk or epistemic uncertainty. For the estimation of epistemic uncertainty in complex environments, several methods have been proposed. First, ensemble-based methods use the disagreement between several neural networks as a measure of uncertainty [tibshirani1996comparison, osband2016deep, lakshminarayanan2017simple, pearce2018bayesian, osband2018randomized, lutjens2018safe, burda2018exploration]. Second, variational inference techniques can be used on the parameters of a neural network to approximate their posterior distribution given the observations [osband2014generalization, lipton2017bbq, azizzadenesheli2018efficient, gal2016dropout, moerland2017efficient, touati2018randomized]. Third, state visitation counts can be approximated and used as a proxy for the uncertainty on a given state [bellemare2016unifying, tang2017exploration]. Our work proposes a Bayesian framework for distributional reinforcement learning, which is compatible with methods such as [pearce2018bayesian, gal2016dropout, blundell2015weight] for sampling from approximate Bayesian posteriors in neural networks.

Prior work on aleatoric risk in reinforcement learning has focused on the stochastic nature of the returns, which can be caused either by randomness inherent to the environment or randomness in the agent’s actions. Several approaches have been developed to estimate higher-order terms of the return distribution [sobel1982variance, prashanth2013actor, tamar2016learning]. More recently, distributional reinforcement learning algorithms have been developed that aim to learn the entire distribution of the returns [morimura2010nonparametric, bellemare2017distributional]. Our work builds on this distributional approach by proposing a method for estimating epistemic uncertainty as well in this setting.

Accounting for both aleatoric risk and epistemic uncertainty has been used in model-based reinforcement learning to mitigate model bias [depeweg2018decomposition, chua2018deep, henaff2019model]. However, uncertainties on the model cannot straightforwardly be used to estimate the agent’s own uncertainty about its actions in a real environment. Both types of uncertainty on the agent were accounted for in model-free reinforcement learning in [tang2018exploration, moerland2017efficient, nikolov2019information]. In [moerland2017efficient] and [tang2018exploration], sampling from the parameters that approximate the return distribution is used to simultaneously propagate both types of uncertainty, but they cannot be separately estimated. In [nikolov2019information], separate estimates of both types of uncertainty are obtained using the distributional approach for aleatoric uncertainty and bootstrapping [osband2016deep] for the epistemic uncertainty. These uncertainties are used to drive better exploration with information-directed sampling. Our work combines both types of uncertainty into a single theoretical framework, and we experimentally show that our method can also successfully be used with information-directed sampling.

## 2 Preliminaries

### 2.1 Reinforcement learning with the return distribution

We consider a discounted Markov Decision Process (MDP) defined by $(\chi ,A,R,P,\gamma )$, in which $\chi $ and $A$ represent the state and action spaces, $R$ is the distribution of rewards associated with performing actions given the states, $P$ is the probability of transitioning between states given the actions, and $\gamma $ is the reward discount factor. At each time step $t$, an agent observes state ${s}_{t}$, performs an action ${a}_{t}$ chosen according to a policy $\pi $ that maps states to actions, and receives a (potentially stochastic) reward $R({s}_{t},{a}_{t})$ and a new state observation ${s}_{t+1}$. The agent’s objective is to find a policy that maximizes the expected discounted return $\mathbb{E}[{\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}R({s}_{t},{a}_{t})]$.

The distribution of returns ${Z}^{\pi}(s,a)$ associated with taking action $a$ in state $s$ and then following a policy $\pi $ can be learned with temporal difference learning using the Bellman operator $\mathcal{T}$ [bellemare2017distributional],

$\mathcal{T}{Z}^{\pi}(s,a)\stackrel{D}{=}R(s,a)+\gamma {Z}^{\pi}({s}^{\prime},{a}^{\prime})$ | (1) |

where $\stackrel{D}{=}$ denotes that both distributions have equal probability laws, ${s}^{\prime}$ is distributed according to $P(\cdot |s,a)$, and ${a}^{\prime}$ is chosen according to $\pi $. A Bellman optimality operator ${\mathcal{T}}^{\prime}$ for the return distribution can also be defined as

${\mathcal{T}}^{\prime}Z(s,a)\stackrel{D}{=}R(s,a)+\gamma Z({s}^{\prime},{a}^{\prime}),$ | (2) | ||

$\text{where}\mathit{\hspace{1em}}{a}^{\prime}\sim {\text{argmax}}_{{a}^{\prime}}\mathbb{E}[Z({s}^{\prime},{a}^{\prime})]$ |

where ${s}^{\prime}$ is distributed according to $P(\cdot |s,a)$. The Bellman optimality operator can be used to learn an optimal policy over an MDP, which then consists of always picking the action with the highest expected return.

Distributional reinforcement learning has several advantages compared to related reinforcement learning methods such as [mnih2015human] that aim to learn only the expectation value of the returns. By learning the return distribution, distributional reinforcement learning can be used to account for risk [morimura2010nonparametric, morimura2012parametric, dabney2018implicit]. Moreover, learning the return distribution instead of only its expectation value has been shown to lead on its own to improved performance on reinforcement learning benchmarks [bellemare2017distributional, dabney2017distributional], and can lead to greater robustness to hyperparameter choices [barth2018distributed].

Distributional reinforcement learning algorithms can parameterize the return distribution in different ways. [bellemare2017distributional] use a categorical parameterization using a fixed number of atoms, which requires knowing the support of the return distribution in advance. [dabney2017distributional] propose using a quantile parameterization using a fixed number of quantiles, and [dabney2018implicit] learn an implicit mapping from a uniform distribution to the return distribution.

### 2.2 Quantile distributional reinforcement learning

Of particular relevance to our work is the quantile parameterization of [dabney2017distributional]. In this framework, a probability distribution $Z(s,a)$ is parameterized by $N$ quantiles ${\tau}_{i}=i/(N+1)$ for $i\in [1,N]$. We aim to learn the quantile values $\bm{q}=({q}_{1},\mathrm{\dots},{q}_{N})$ such that $P(z\sim Z(s,a)\le {q}_{i})={\tau}_{i}$. Learning the quantile values proceeds by minimizing the quantile regression loss [koenker2001quantile],

${\mathcal{L}}_{q}(\bm{q})={\mathbb{E}}_{z\sim Z(s,a)}{\displaystyle \sum _{i=1}^{N}}{\rho}_{{\tau}_{i}}(z-{q}_{i}(s,a))$ | (3) | ||

$$ |

This loss can be minimized stochastically for each new value $z$ sampled from $Z(s,a)$.

For temporal difference learning, $Z(s,a)$ is replaced with the Bellman target $R(s,a)+\gamma Z({s}^{\prime},{a}^{\prime})$ as per equation 1 for policy evaluation, or equation 2 for policy optimization. The loss is thus

${\mathcal{L}}_{TD}(\bm{q})={\mathbb{E}}_{R(s,a)}\left[{\displaystyle \frac{1}{N}}{\displaystyle \sum _{i=1}^{N}}{\displaystyle \sum _{j=1}^{N}}{\rho}_{{\tau}_{i}}({\delta}_{i,j})\right]$ | (4) | ||

$$ |

In [dabney2017distributional], quantile regression DQN (QR-DQN) reinforcement learning algorithms use both the strict quantile loss of equation 4 and a modified quantile Huber loss that is smooth at 0. In the following, we focus on the strict quantile loss since the quantile Huber loss produces biased quantile estimates.

## 3 Epistemic uncertainty and the return distribution

Here, we present a practical and theoretically grounded method for estimating the epistemic uncertainty in the distributional setting, thus allowing us to separately quantify aleatoric and epistemic uncertainty. We choose to use a quantile parameterization of the return distribution based on [dabney2017distributional], since such a parameterization does not assume a specific support for the returns and has empirically been shown to outperform a categorical parameterization on several benchmarks.

### 3.1 Bayesian inference for the return distribution

We first formulate learning the return distribution as a Bayesian inference problem. We first consider a given state $s$, action $a$ taken in state $s$, a policy $\pi $, and data $D$ consisting of $K$ samples $({z}_{1},\mathrm{\dots},{z}_{K})$ from ${Z}^{\pi}(s,a)$. We also consider a neural network with parameters $\bm{\theta}$, which conditioned on $s$ and $a$ returns a value $y(\bm{\theta},s,a)$ that approximates the value of a given quantile $\tau $ of ${Z}^{\pi}(s,a)$. We interpret possible values of $\bm{\theta}$ as different hypotheses about the function relating the state-action pair to quantile $\tau $ of ${Z}^{\pi}(s,a)$ [mackay2003information]. We define a normal prior on $\bm{\theta}$ centered around $0$. Following [yu2001bayesian], we also define a likelihood based on how well the output of the network matches the data using an asymmetric Laplace distribution,

$P(D|\bm{\theta})={\displaystyle \prod _{j=1}^{K}}{f}_{\tau}({z}_{j}-y(\bm{\theta},s,a))$ | (5) | ||

$\mathrm{\hspace{1em}}\text{where}\mathit{\hspace{1em}}{f}_{\tau}(u)={\displaystyle \frac{\tau (1-\tau )}{{\sigma}_{D}}}\mathrm{exp}(-{\displaystyle \frac{{\rho}_{\tau}(u)}{{\sigma}_{D}}})$ |

where ${\sigma}_{D}$ is a characteristic length scale and ${\rho}_{\tau}$ is the same as in equation 3. With some assumptions, it can be shown that this choice of likelihood is consistent even when the actual distribution of the data is different from that assumed by this choice of likelihood [sriram2013posterior].

To estimate the entire return distribution instead of a single quantile, we extend this formalism to a network with $N$ outputs ${y}_{i}(\bm{\theta},s,a)$ and (potentially shared) parameters $\bm{\theta}$, where each output $i$ approximates the value of quantile ${\tau}_{i}$. We maintain a normal prior on the parameters, and define the likelihood

$P(D|\bm{\theta})={\displaystyle \prod _{j=1}^{K}}{\displaystyle \prod _{i=1}^{N}}{f}_{{\tau}_{i}}({z}_{j}-{y}_{i}(\bm{\theta},s,a))$ | (6) |

We note that this choice of likelihood does not consider dependencies between the quantiles. In particular, quantiles estimates that are not ordered are not penalized: the resulting estimated distribution may therefore be ill-defined. In practice, as more data is collected, ordered quantiles become more likely.

A link can now be established between the inference problem studied in this section and the neural network training procedure described in the previous section. Minimizing the loss in equation 3 is equivalent to maximizing the likelihood in equation 6. Furthermore, with a normal prior on $\bm{\theta}$ with mean 0 and standard deviation ${\sigma}_{p}$, finding maximum a posteriori (MAP) parameters for the neural network is equivalent to minimizing the regularized quantile loss

${\mathcal{L}}_{reg}(\bm{\theta})={\mathcal{L}}_{q}(\bm{y}(\bm{\theta},s,a))+{\sigma}^{2}{||\bm{\theta}||}^{2}$ | (7) |

where $\bm{y}=({y}_{1},\mathrm{\dots},{y}_{N})$ and $\sigma =\frac{{\sigma}_{D}}{2{\sigma}_{p}}$.

This theoretical framework allows us to frame learning the return distribution as a Bayesian inference problem, thus giving us the tools to separately quantify aleatoric risk and epistemic uncertainty. Specifically, the variance of the return distribution quantifies the aleatoric uncertainty, and the variance of the Bayesian posterior distribution will be used to estimate the epistemic uncertainty.

### 3.2 Bayesian ensembles of return distributions

With the Bayesian framework described above, we can now in principle use methods such as [graves2011practical, blundell2015weight, gal2016dropout, pearce2018bayesian] to draw samples from approximations of the Bayesian posterior using neural networks, and thus obtain epistemic uncertainty estimates. In the following, we choose to use the ”anchored neural networks” scheme recently proposed by [pearce2018bayesian], due to the simplicity of the scheme and good empirical results. In this scheme, each network in an ensemble is regularized around coordinates randomly drawn from the prior distribution. Then, with some assumptions, when the networks are trained on the same data the parameter values towards which they converge are drawn from the posterior distribution. In our case, for each network $m$ we draw coordinates ${\bm{\theta}}_{m}$ by sampling from the prior distribution, and train the network to minimize

${\mathcal{L}}_{anchored,reg}(\bm{\theta})={\mathcal{L}}_{q}(\bm{y}(\bm{\theta},s,a))+{\sigma}^{2}{||\bm{\theta}-{\bm{\theta}}_{m}||}^{2}$ | (8) |

Samples thus obtained can be used to yield useful epistemic uncertainty estimates. First, for any given quantile we can obtain an ensemble of estimates drawn from the posterior distribution of that quantile. This ensemble can then be used to estimate the uncertainty on that quantile, using the variance of the ensemble for example, or other parameters of interest. The per-quantile uncertainty is helpful for identifying which parts of an estimated distribution are uncertain (see supplementary information). Second, we show in the following that the epistemic uncertainty on the mean of the return distribution can be obtained in a simple manner.

### 3.3 Uncertainty measurement with a two network ensemble

The use of large ensembles of neural networks is cumbersome in practice; we show that using our framework an ensemble of only two networks is sufficient to obtain an estimate of the epistemic uncertainty on the mean of the distribution. This implies that the epistemic uncertainty can be cheaply estimated in the distributional setting. Informally, this result arises from the fact that the disagreement between the two networks provides $N$ independent error measures (one for each quantile). Although the error for each quantile is noisy, the average error provides a low variance uncertainty measure for the mean of the distribution. We formalize this intuition in the following.

Our demonstration first requires us to verify that the disagreement between the two networks does indeed provide $N$ independent error measures. Although neither our prior nor our likelihood introduce any dependence between the outputs of the network, the presence of shared parameters in the network violates this independence assumption. However, the posterior distributions of different outputs of infinite width Bayesian neural networks are known to be uncorrelated for normal priors and separable likelihoods [neal2012bayesian]. We will thus assume in the following that the posterior distributions for sufficiently large neural networks are independent. This independence assumption is experimentally studied in the supplementary information.

With this assumption, we demonstrate that the use of two networks allows us to upper bound the epistemic uncertainty on the mean of the return distribution. We consider samples $({\widehat{q}}_{1}\mathrm{\dots}{\widehat{q}}_{N})$ from the posterior distribution for the quantiles. We then estimate the mean of the return distribution as

${\mu}_{Z}={\displaystyle \frac{1}{N}}{\displaystyle \sum _{i}}{\widehat{q}}_{i}$ | (9) |

It can easily be shown that if the first and second order moments of the posterior distribution for the quantiles $i$ are bounded for all $i$ and $N$, then the mean of these samples indeed converges to the mean of the distribution.

Since for each quantile the posterior distribution is estimated using the same data, we approximate the variance of the mean with the mean of the variances of the quantiles, which we denote ${\sigma}_{Z}^{2}$,

${\sigma}_{Z}^{2}={\displaystyle \frac{1}{N}}{\displaystyle \sum _{i}}\text{var}({q}_{i})$ | (10) |

We note that ${\sigma}_{Z}^{2}$ is an upper bound of the actual epistemic uncertainty of the mean of the distribution.

As we don’t have direct access to the posterior distributions of the quantiles, we propose to use an ensemble of two neural networks A and B. Their mean is used to estimate the mean of the distribution as in equation 9, their covariance between quantiles describes the aleatoric uncertainty, and the epistemic uncertainty on the mean is estimated using the following expression ${\sigma}_{Z,N}$,

${\sigma}_{Z,N}={\left[{\displaystyle \frac{1}{2N}}{\displaystyle \sum _{i}}{({y}_{i}^{(A)}-{y}_{i}^{(B)})}^{2}\right]}^{\frac{1}{2}}$ | (11) |

where ${y}_{i}^{(A)}$ and ${y}_{i}^{(B)}$ are the predicted values of quantile $i$ produced by networks A and B. We make use of our assumption that the uncertainties on the quantiles are independent to show that, with increasing $N$, ${\sigma}_{Z,N}$ is with some reasonable assumptions a good estimate of ${\sigma}_{Z}$.

Proposition 1: We assume that the expected value of ${q}_{i}$, the variance of ${q}_{i}$, and the variance of ${q}_{i}^{2}$ are all bounded for all $i$ and large enough $N$. If ${y}_{i}^{(A)}$ and ${y}_{i}^{(A)}$ are samples drawn from the probability distribution of ${q}_{i}$ for all $i$ and $N$, then ${lim}_{N\to \mathrm{\infty}}{\sigma}_{Z,N}-{\sigma}_{Z}=0$.

Proof: See supplementary information

We consider an illustrative example in which an analytic expression for the variance of ${\sigma}_{Z,N}$ is available. We assume that the posterior distribution of the quantile estimates is given by a homoskedastic normal distribution with standard deviation $\sigma $. For each $i$ both ${q}_{i}^{(A)}$ and ${q}_{i}^{(B)}$ are now normally distributed with standard deviation $\sigma $ and the same mean. Their difference is thus normally distributed with standard deviation $\sqrt{2}\sigma $ and mean 0, so that $\frac{{\sigma}_{Z,N}^{2}}{{\sigma}^{2}}$ follows a Chi-squared distribution with $N$ degrees of freedom. The relative error between ${\sigma}_{Z,N}$ and ${\sigma}_{Z}$ thus decreases with increasing $N$. For example with $N=32$ and $N=200$, as was used in the QRTD and QR-DQN algorithms demonstrated by [dabney2017distributional], ${\sigma}_{Z,N}$ is with 95% confidence within $21\%$ and $8\%$, respectively, of ${\sigma}_{Z}$.

### 3.4 Distributional Q-learning with uncertainty

Until now, we have been mainly concerned with learning the return distribution given an ensemble of samples of this distribution. Here, we discuss how our uncertainty estimate can be adapted to temporal difference learning of the return distribution. First, we replace the empirical return distribution with the Bellman target ${\{\mathcal{T}Z(s,a)\}}_{(s,a)}$ defined in equation 1 for the estimation setting, and in equation 2 for the control setting. By interpreting each quantile value of the target distribution as a sample drawn from the target distribution, we can now write the likelihood associated with the quantile estimates,

$P(D|\bm{\theta})=C\mathrm{exp}\left(-{\mathcal{L}}_{TD}(\bm{\theta})\right)$ | (12) |

where $D$ now refers to the targets and ${\mathcal{L}}_{TD}$ is the TD loss defined in equation 4. The framework developed in the previous sections can thus be adapted to temporal difference methods. Furthermore, the anchoring scheme requires that both networks be trained on the same target distributions. Since we now have access to two estimates of the return distribution, we propose to use their averages as the common target distribution for both networks.

## 4 Experiments

We first empirically verify that our epistemic uncertainty estimates perform as expected on the simple problem of learning a static distribution, and compare them to an alternative metric using a stochastic contextual bandit problem. We then illustrate applications of these estimates for different algorithms in two reinforcement learning environments: Cartpole and the Atari suite. Code to reproduce these experiments is available online at https://github.com/uncharted-technologies/risk-and-uncertainty.

### 4.1 Variance and number of quantiles

First, we empirically verify that two networks are sufficient to provide a low variance estimate of the epistemic uncertainty, and that this low variance can be achieved with a reasonable number of quantiles. We train 20 anchored neural networks on a fixed set of samples from a standard normal distribution for different numbers of quantiles. We then use this ensemble to measure epistemic uncertainty in two different ways. The first uncertainty measure is the ensemble uncertainty, corresponding to the square root of the average variance of the quantiles across the ensemble. The second uncertainty measure is ours using two networks, which we calculate for all 190 pairs in this ensemble. Our results are shown in figure 1. We see that both uncertainty measures give the same result on average. Our epistemic uncertainty metric has higher variance, but as the number of quantiles increases the variance decreases.

### 4.2 Contextual bandits

We now empirically compare our epistemic uncertainty measure to that obtained by the Bayes by Backprop method for neural networks [blundell2015weight]. As our approach uses a novel likelihood term, we resort to an indirect comparison whereby we study the performance of the two methods on a contextual bandit problem. We use a variation of the problem presented in [guez2015sample, blundell2015weight], in which at each step an agent is shown a mushroom, characterized by a set of features, and must decide to eat it or not. The agent is penalized for eating a toxic mushroom and rewarded for eating an edible mushroom, and must learn to distinguish good from bad mushrooms from their features. Figure 2 shows the performance achieved by several agents. Two agents measure epistemic uncertainty in two different ways, with our method and with Bayes-by-Backprop [barto1983neuronlike, blundell2015weight], and pick actions via Thompson sampling (see supplementary information). One baseline agent uses an $\u03f5$-greedy policy. We see that the agent that uses our uncertainty metric performs roughly as well as the agent using Bayes-by-Backprop, which shows that both methods produce meaningful uncertainty estimates.

### 4.3 Applications in reinforcement learning

We now study applications of our uncertainty quantification method in reinforcement learning.

#### 4.3.1 Better generalization through exploration

As a first application, we show that our epistemic uncertainty estimate allows us to design agents that efficiently explore their environment, allowing them to better generalize to a wide variety of initial conditions different from what they were trained on. For this experiment, we use our epistemic uncertainty measure to drive exploration via Thompson sampling. We use the OpenAI Gym domain Cartpole [brockman2016openai], in which the agent must keep a pole upright as long as possible on a cart that the agent can either move left or right. The episode terminates either after 200 time steps, if the cart leaves the track, or if the pole falls over. We measure generalization ability by training agents on the standard Cartpole domain in which the cart is initialized at the center of the track, and test them on modified domains in which the cart is initialized at different locations along the track.

Figure 3 compares the performance of an agent that chooses actions with Thompson sampling and our epistemic uncertainty estimate to different $\u03f5$-greedy agents. For all $\u03f5$-greedy agents, $\u03f5$ is annealed during training over the first 5000 training steps; however the final value reached differs between agents. The agent that uses our epistemic uncertainty estimate is the only one that both generalizes well and converges to near-perfect scores during training. We notice that training is slower than for greedy agents, which combined with the higher generalization scores is consistent with this agent spending more time exploring its environment. We note that this link between enhanced exploration and generalization has been observed before in non-distributional agents for example in [arumugam2018mitigating, witty2018measuring].

#### 4.3.2 Information directed sampling

A second application of our method for estimating both types of uncertainty is information-directed sampling (IDS), as proposed in [nikolov2019information]. In this scheme, the agent takes actions based on the ratio between the regret associated with the action and the information to be gained. Calculating this ratio requires estimates of both aleatoric and epistemic uncertainties. [nikolov2019information] demonstrate that information-directed sampling can achieve good results on Atari games from the Arcade Learning Suite [bellemare2013arcade].

Whereas they estimate both types of uncertainty using Bootstrapped DQN [osband2016deep] on the one hand and distributional RL with a categorical parameterization on the other, we propose to use instead a quantile parameterization and the uncertainties estimated using our framework. As a proof of principle experiment, we train both a QR-DQN agent as in [dabney2017distributional] and an agent that uses our uncertainties to drive information-directed sampling on 4 Atari games over 50 million training frames. Training curves are shown in figure 4. We see that our uncertainty estimates can successfully be used for information-directed exploration on these domains, and that the resulting algorithm appears competitive with the $\u03f5$-greedy QR-DQN algorithm of [dabney2017distributional].

#### 4.3.3 Tracking the epistemic uncertainty in distributional RL

Lastly, we show that our uncertainty metrics can be used to identify novel situations encountered by the agent. For this experiment, we train an agent with the QR-DQN algorithm as in [dabney2017distributional] with an $\u03f5$-greedy exploration policy on the Atari game Breakout, except that we now use a network architecture that allows us to implement the anchored network scheme to measure the epistemic uncertainty (see supplementary information). Since the trained agent consistently achieves very high scores, we monitor the agent’s epistemic uncertainty over a test episode in which it follows a 5% $\u03f5$-greedy policy, which forces it to sometimes make mistakes.

The trained agent’s epistemic uncertainty over the course of an episode is shown in figure. 5. The uncertainty behaves in the way we would expect it to; in addition to the specific spikes in uncertainty pointed out in the figure that we can easily interpret, we also observe that the agent’s uncertainty generally increases during an episode. This is because the agent encounters a wider variety of possible states at the end of an episode than at its start. Monitoring an agent’s uncertainty in this way in complex domains would be valuable for real-life reinforcement learning agents, for example to detect and prevent potential accidents.

## 5 Conclusion

Estimating both aleatoric and epistemic uncertainty is important for developing real-world learning strategies that can both explore efficiently and account for risk in an agent’s actions. We propose a method for estimating both types of uncertainty in deep reinforcement learning. The distributional approach is used to estimate aleatoric risk, and our Bayesian formulation is used to estimate epistemic uncertainty. We also show that the epistemic uncertainty on the mean of the distribution can be estimated in a computationally cheap way using an ensemble of two networks. Our experimental results in the Cartpole and Atari domains illustrate applications of our method.

## References

## Appendix A Thompson sampling using our epistemic uncertainty metric

Here we provide our method for selecting an action with Thompson sampling using our epistemic uncertainty estimate, which we use for our bandit experiment, our Cartpole experiment, and the Atari experiment presented later in the supplementary information. Thompson sampling in these contexts should be used with the epistemic uncertainty and not the aleatoric uncertainty; our method allows us to separate both contributions. Since we do not have access to the exact shape of the epistemic posterior distribution for the mean of the returns, we approximate it with a normal distribution.

## Appendix B Proof of Proposition 1

Let ${\mu}_{i}$ and ${\sigma}_{i}$ be the mean and standard deviation for quantile ${q}_{i}$ according to the posterior distribution. Since ${q}_{i}^{(A)}$ and ${q}_{i}^{(B)}$ are random variables both drawn according to the posterior distribution for ${q}_{i}$, we can write ${({q}_{i}^{(A)}-{q}_{i}^{(B)})}^{2}/2={\sigma}_{i}^{2}+{\u03f5}_{i}$, where ${\u03f5}_{i}$ is a random variable with mean 0. We thus have that:

${\sigma}_{Z,N}^{2}-{\sigma}_{Z}^{2}={\displaystyle \frac{1}{N}}{\displaystyle \sum _{i=1}^{N}}{\u03f5}_{i}$ | (13) |

Moreover, the ${\u03f5}_{i}$ satisfy:

$\text{var}({\u03f5}_{i})$ | $={\displaystyle \frac{1}{4}}\text{var}({({q}_{i}^{(A)}-{q}_{i}^{(B)})}^{2})$ | (14) | ||

$={\displaystyle \frac{1}{4}}\text{var}({q}_{i}^{{(A)}^{2}}+{q}_{i}^{{(B)}^{2}}-2{q}_{i}^{(A)}{q}_{i}^{(B)})$ | (15) | |||

$\le {\displaystyle \frac{1}{2}}(\text{var}({q}_{i}^{{(A)}^{2}})+\text{var}({q}_{i}^{{(B)}^{2}})+2\text{var}({q}_{i}^{(A)}{q}_{i}^{(B)}))$ | (16) | |||

$\le \text{var}({q}_{i}^{2})+2{\mu}_{i}^{2}{\sigma}_{i}+{\sigma}_{i}^{2}$ | (17) |

With our assumption that ${\mu}_{i}$, ${\sigma}_{i}$, and the variance of ${q}_{i}^{2}$ are bounded for all $i$ and large enough $N$, the variance of ${\u03f5}_{i}$ is also bounded by some constant $C$ for large enough $N$. Since the estimates of ${q}_{i}^{(A)}$ and ${q}_{i}^{(B)}$ are assumed to be independent across all $i$, the ${\u03f5}_{i}$ are independent random variables, so that:

$\text{var}({\sigma}_{Z,N}^{2}-{\sigma}_{Z}^{2})$ | $={\displaystyle \frac{1}{{N}^{2}}}{\displaystyle \sum _{i=1}^{N}}\text{var}({\u03f5}_{i})\le {\displaystyle \frac{C}{N}}$ | (18) |

where the inequality holds for large enough $N$. Since ${\sigma}_{Z,N}^{2}-{\sigma}_{Z}^{2}$ is 0 in expectation and its variance converges to 0, ${\sigma}_{Z,N}^{2}-{\sigma}_{Z}^{2}$ converges to 0.

## Appendix C What type of epistemic uncertainty are we measuring?

It is important in reinforcement learning to distinguish between local and global uncertainties [kearns2002near]. Global uncertainty estimates propagate uncertainties through multiple time steps, whereas local uncertainties consider only the uncertainty at the current step. Since we consider do not consider the uncertainty on the Bellman targets during training, our uncertainty estimate measures the local uncertainty. The global uncertainty is usually a more useful quantity in the reinforcement learning problem; however, estimates of the local uncertainty can be converted into an estimate of the global uncertainty using an uncertainty Bellman equation [odonoghue2017uncertainty]. Our metric provides a simple estimate of the local uncertainty, which can thus if necessary be used to quantify the global uncertainty.

## Appendix D Further information and results on the contextual bandit problem

Here, we provide more information and results on the contextual bandit experiment.

### D.1 Experiment setup

We use the UCI mushroom data set [bache2013uci], where each entry contains features about different mushrooms and whether they are edible or not. We convert this dataset into a contextual bandit problem in which at each step an agent must choose between eating a given mushroom or not eating it. The agent receives stochastic rewards drawn from normal distributions with standard deviation 1, and means -3 and 1 for respectively eating a toxic mushroom and eating an edible mushroom. The agent receives a deterministic reward of 0 for not eating the mushroom.

Our agents all use neural networks to predict the distribution of rewards corresponding to eating the mushroom, as a function of the mushroom’s features. Each neural network uses two hidden layers of 100 neurons each. The agent that uses our uncertainty measure and the $\u03f5$-greedy agent both estimate the distribution using 50 quantiles and our expression for the likelihood. The Bayes-by-Backprop (BBB) agent has a single output and learns the mean of the distribution using a Gaussian likelihood. Both the BBB agent and our agent have the same prior, and other hyperparameters were separately tuned for both agents to achieve the best results. Both of these agents select actions using Thompson sampling; the agent that uses our uncertainty metric uses the anchored network scheme [pearce2018bayesian] to produce two networks approximately sampled from the posterior over the weights, and the BBB agent performs inference over a single network with weights drawn from its variational posterior.

Learning proceeds as follows. Each time an agent acts, its action as well as the corresponding reward is stored in memory. After every action, the neural network is updated with one batch of 128 feature-reward pairs randomly drawn from memory.

### D.2 Results

Next, we examine the reward distributions learned by our agents, and use a larger ensemble of networks to also study the per-quantile uncertainty. Specifically, in figure 6 we plot the reward distribution predicted by our agent for an edible mushroom. We see that the agent has correctly learned the distribution. We also plot the per-quantile uncertainty on this distribution, obtained from multiple samples from the approximated posterior for each quantile. We observe that the per-quantile uncertainty provides important information: the agent is a lot less certain about the lower quantiles than the rest of the distribution. This is because the possibility of receiving a very negative reward if the model is wrong affects the lower quantiles much more than the upper quantiles.

## Appendix E Locating the epistemic uncertainty in the distribution

We further demonstrate that our Bayesian formulation of learning a distribution allows us to locate the uncertainty in the estimated distribution. We produce 20 samples of a fixed distribution to be learned by our network, ten of which have a value of 1 and ten of which have a value of -1. With these samples, the epistemic uncertainty on the middle quantiles of the estimated distribution that produced these samples should be higher than that on the lower quantiles. For example, there are not enough data points to decide whether the median value should be 1 or -1. We train an ensemble of 20 anchored neural networks on these samples, and measure the per-quantile epistemic uncertainty using the observed standard deviation in the predictions.

Our experimental results are shown in figure 7. The epistemic uncertainty is indeed higher on the middle quantiles than on the other quantiles. This is reflected in the larger amount of noise in the predictions of the ensemble for those quantile values.

## Appendix F Further results on Cartpole

We provide a visual representation of the predictions given by our two networks in the Cartpole domain in figure 8. The anchoring scheme causes each network’s predictions to be noisy, and the noise can be interpreted as an indication of the agent’s uncertainty. As we can expect, the uncertainty on the sub-optimal action is higher than for the better action, since that action is selected less often in that state.

We note that for our Cartpole experiment, we use a smooth quantile loss with $\kappa =10$ (see [dabney2017distributional]) instead of a strict quantile loss as in the rest of our paper. This is because of the density of positive rewards in this environment which can cause the value estimates to increase exponentially. Using $\kappa =10$ makes the loss more sensitive to the termination condition, which is what grounds all Q-value estimates. Note that divergence in Q-value estimates on this domain can also easily be observed with non-distributional DQN [mnih2015human] for a smoothed l1 loss.

## Appendix G Correlations between the outputs of a Bayesian neural network

In this section, we provide experimental evidence to support our assumption that the outputs of a Bayesian neural network are uncorrelated for large enough neural networks. We compare the epistemic uncertainties produced by an anchored ensemble of neural networks to that produced by a single anchored neural network with several outputs on a toy regression problem. Both the problem formulation and the code for this experiment draw from the work of [pearce2018bayesian].

Representative samples from these experiments are shown in figure 9. For a small neural network with only 10 neurons per layer the different outputs of the multioutput neural network are strongly correlated, which leads to poor uncertainty estimates (top left). The anchored ensemble produces significantly better uncertainty estimates for the same network width (bottom left). However, as we increase the width of the neural network to 100 (top right) the uncertainty estimates of the network with multiple outputs improve and become close to those obtained by the anchored ensemble of networks of the same width (bottom right).

## Appendix H Further results on the Atari suite

### H.1 Tracking the epistemic uncertainty: experimental setup

Our experiment reproduces the training procedure of [dabney2017distributional], except that we use a network architecture allowing us to measure the epistemic uncertainty using our method and we train our agent for 50 million steps instead of the usual 200 million. Our network architecture includes the following modifications. Following [osband2016deep], instead of having two separate networks, we have both networks share common parameters within a ”body” consisting of convolutional layers, on top of which lie two ”heads” with separate parameters, consisting of two linear layers, which each correspond to one of the networks. As in [hafner2018reliable] we only define a prior on the two heads, and we use the anchoring scheme of [pearce2018bayesian] to enforce this prior. The outputs from both heads are averaged to produce the value estimates used by the policy. This different architecture did not lead to any noticeable difference in training compared to a baseline QR-DQN agent, except that training time increased by about 10% using a GPU.

### H.2 Information Directed Exploration: experimental setup

To implement information directed exploration as in [nikolov2019information], we use the same network architecture as described above. Labelling the two networks A and B, we obtain estimates of the mean value of the distribution and the epistemic and aleatoric uncertainties as follows:

$\mu $ | $={\displaystyle \frac{1}{2N}}{\displaystyle \sum _{i=1}^{N}}({y}_{i}^{(A)}+{y}_{i}^{(B)})$ | (19) | ||

${\sigma}_{\text{epistemic}}^{2}$ | $={\sigma}_{Z,N}^{2}={\displaystyle \frac{1}{2N}}{\displaystyle \sum _{i=1}^{N}}{({y}_{i}^{(A)}-{y}_{i}^{(B)})}^{2}$ | (20) | ||

${\sigma}_{\text{aleatoric}}^{2}$ | $=\text{cov}({\mathbf{y}}^{(A)},{\mathbf{y}}^{(B)})$ | (21) | ||

$={\displaystyle \frac{\text{var}({\mathbf{y}}^{(A)})+\text{var}({\mathbf{y}}^{(B)})}{2}}-{\displaystyle \frac{1}{2}}\text{var}({\mathbf{y}}^{(A)}-{\mathbf{y}}^{(B)})$ | (22) | |||

$\approx {\displaystyle \frac{\text{var}({\mathbf{y}}^{(A)})+\text{var}({\mathbf{y}}^{(B)})}{2}}-{\sigma}_{\text{epistemic}}^{2}$ | (23) |

Using these uncertainty estimates, the agent selects actions using algorithm 2. In algorithm 2, $\lambda $ is a hyperparameter that we set to 0.1 as in [nikolov2019information], and $\u03f5$ is set to ${10}^{-5}$ to prevent division by zero.

Evaluation scores during training are shown in the main text. Table 1 shows the best evaluation results achieved during training over all three seeds. Human scores are obtained from [mnih2015human]. Note that due to computational limitations we did not perform an extensive hyperparameter search for our implementation of information directed sampling; in particular, the chosen values for $\lambda $ and for the noise scale in our expression for the likelihood may be suboptimal.

Game | Human | QR-DQN | IDS |
---|---|---|---|

Alien | 7128 | 1728 | 1962 |

Amidar | 1719 | 1036 | 796 |

Assault | 742 | 19286 | 10548 |

Asterix | 8503 | 43590 | 41317 |

### H.3 Exploration via Thompson sampling

Game | Human | QR-DQN | QR-DQN |
---|---|---|---|

($\u03f5$-greedy) | (Thompson | ||

sampling) | |||

Alien | 7128 | 1728 | 1899 |

Amidar | 1719 | 1036 | 442 |

Assault | 742 | 19286 | 14377 |

Asterix | 8503 | 43590 | 20518 |

Breakout | 31 | 565 | 515 |

We also report preliminary results of an experiment in which we train agents that select actions using Thompson sampling (in the same way as in the Cartpole environment) on the following Atari games: Alien, Amidar, Assault, Asterix, and Breakout. We train all agents for 50 million frames. Every 1M frames, these agents are evaluated on 500k frames, as in [mnih2015human, dabney2017distributional]. Table 2 shows the best evaluation results achieved during training for both our implementation of QR-DQN with an $\u03f5$-greedy policy as in [dabney2017distributional], and QR-DQN with a Thompson sampling policy using our uncertainty metric and our modified network architecture described above. Figure 10 shows evaluation scores during training. Note that these figures correspond to three training seeds for QR-DQN with an $\u03f5$-greedy policy, but only one seed for QR-DQN with Thompson sampling due to computational limitations; as the scores shown in table 2 are the best scores across all seeds this penalizes the agent that uses Thompson sampling.

We observe that agents that use Thompson sampling with our uncertainty metric all learn successful policies on these games. However, on some games agents that use Thompson sampling to select actions learn more slowly than the corresponding agent that uses an $\u03f5$-greedy policy. This result is in line with what we observe on Cartpole: the agents that use Thompson sampling spend more time exploring less rewarding parts of the MDP, and at the end of training lag behind the $\u03f5$-greedy agents that have spent more time exploiting simple but successful policies (such as moving the paddle towards the ball in Breakout). We hypothesize that reducing the weight of the prior during learning will lead to more exploitative and thus more successful policies; we leave this to future work.