# Estimating Risk and Uncertainty in Deep Reinforcement Learning

###### Abstract

Reinforcement learning agents are faced with two types of uncertainty. Epistemic uncertainty stems from limited data and is useful for exploration, whereas aleatoric uncertainty arises from stochastic environments and must be accounted for in risk-sensitive applications. We highlight the challenges involved in simultaneously estimating both of them, and propose a framework for disentangling and estimating these uncertainties on learned Q-values. We derive unbiased estimators of these uncertainties and introduce an uncertainty-aware DQN algorithm, which we show exhibits safe learning behavior and outperforms other DQN variants on the MinAtar testbed.

Distinguishing between epistemic uncertainty, which stems from limited data, and aleatoric uncertainty, which is caused by intrinsic stochasticity in the environment, is important in reinforcement learning for both exploration and risk-sensitivity (osband2016risk; moerland2017efficient; nikolov2019information). However, while prior work has developed separate methods for estimating each uncertainty, difficulties appear when trying to estimate both simultaneously. For example, distributional reinforcement learning (morimura2010nonparametric; morimura2012parametric; bellemare2017distributional), which aims to learn the distribution of returns rather than only its mean, has been suggested as a way of measuring aleatoric uncertainty (nikolov2019information). Metrics such as the variance of the learned distribution can be good indicators of aleatoric uncertainty for in-distribution data. However, as highlighted for example in (chua2018deep), for out-of-distribution data, where the epistemic uncertainty is high, such a metric conflates both uncertainties and is no longer a good indicator of aleatoric uncertainty.

To address this issue, we construct a framework for disentangling both types of uncertainties which is applicable to both stochastic and deterministic environments. Our method builds on the distributional reinforcement learning framework, which aims to learn the entire return distribution instead of only its expected value (bellemare2017distributional), and methods for approximate Bayesian deep learning. Our main contributions are 1) a theoretical framework within which epistemic and aleatoric uncertainties can be separately estimated, 2) practical, unbiased estimators for both types of uncertainty, and 3) a demonstration that these uncertainties can successfully be used within an uncertainty-aware Deep Q Networks (mnih2015human) algorithm.

## 1 Background

We consider a discounted Markov Decision Process (MDP) defined by $(\chi ,A,R,P,\gamma )$, in which $\chi $ and $A$ represent the state and action spaces, $R$ is the distribution of rewards associated with performing actions given the states, $P$ is the transition probability, and $\gamma $ is the discount factor.

Distributional reinforcement learning aims to learn the distribution of returns ${Z}^{\pi}(s,a)$ associated with taking action $a$ in state $s$ and then following a policy $\pi $. To learn this distribution, (dabney2017distributional) propose a quantile parameterization. In this framework, a probability distribution $Z(s,a)$ is parameterized by $N$ quantiles ${\tau}_{i}=i/(N+1)$ for $i\in [1,N]$, with values $\bm{q}=({q}_{1},\mathrm{\dots},{q}_{N})$. Learning the quantile values proceeds by minimizing the quantile regression loss (koenker2001quantile),

$$\mathcal{L}_{q}(\bm{q})=\mathbb{E}_{z\sim Z(s,a)}\sum_{i=1}^{N}\rho_{\tau_{i}}(z-q_{i}(s,a)), \tag{1}$$

where $\rho_{\tau}(u)=u\,(\tau-\mathbf{1}_{\{u<0\}})$ is the standard quantile regression loss.

This loss can be minimized stochastically for each new value $z$ sampled from $Z(s,a)$. For temporal difference learning of the optimal value function, $Z(s,a)$ is replaced with the Bellman target $R(s,a)+\gamma Z({s}^{\prime},{a}^{\prime})$, where ${a}^{\prime}\sim {\text{argmax}}_{{a}^{\prime}}\mathbb{E}[Z({s}^{\prime},{a}^{\prime})]$ and ${s}^{\prime}\sim P(\cdot |s,a)$, yielding the QR-DQN algorithm of (dabney2017distributional).
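As a rough illustration of the quantile regression loss in equation 1, the sketch below estimates it by Monte Carlo from samples of a return distribution. It assumes the standard definition $\rho_{\tau}(u)=u(\tau-\mathbf{1}_{\{u<0\}})$; the function names, the Gaussian stand-in for $Z(s,a)$, and the shift of 0.5 are illustrative assumptions, not part of the original method.

```python
import numpy as np

def quantile_levels(n):
    """Quantile levels tau_i = i / (n + 1) for i = 1..n, as in the text."""
    return np.arange(1, n + 1) / (n + 1)

def pinball(u, tau):
    """rho_tau(u) = u * (tau - 1[u < 0]), the standard quantile regression loss."""
    return u * (tau - (u < 0).astype(float))

def quantile_loss(q, z):
    """Monte Carlo estimate of equation 1 from samples z of Z(s, a)."""
    tau = quantile_levels(len(q))
    u = z[:, None] - q[None, :]              # shape (num_samples, N)
    return pinball(u, tau[None, :]).sum(axis=1).mean()

rng = np.random.default_rng(0)
z = rng.normal(loc=1.0, scale=2.0, size=20000)   # hypothetical stand-in for return samples
q_true = np.quantile(z, quantile_levels(5))      # empirical quantiles of the samples
q_off = q_true + 0.5                             # deliberately shifted quantile values

loss_true = quantile_loss(q_true, z)
loss_off = quantile_loss(q_off, z)
```

Under these assumptions the loss at the empirical quantiles is lower than at the shifted values, which is the property that makes the loss usable as a training signal for the quantile outputs.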

An intuitive way of estimating the aleatoric uncertainty on the return distribution would be to use the variance of the quantiles. However, the variance of the quantiles is not a good estimator of the aleatoric uncertainty, because for out-of-distribution data the epistemic uncertainty on the value of the quantiles also affects this variance.

## 2 Estimating both uncertainties

Here, we construct a theoretical framework that allows us to disentangle epistemic and aleatoric uncertainties, and we derive unbiased estimators for both.

### 2.1 Theoretical framework

We start by framing learning the quantiles of the return distribution as a Bayesian inference problem. We consider state $s$, action $a$ taken in state $s$, policy $\pi $, and data $D$ consisting of $K$ samples $({z}_{1},\mathrm{\dots},{z}_{K})$ from ${Z}^{\pi}(s,a)$. To learn the value of a given quantile $\tau $ of ${Z}^{\pi}(s,a)$, we consider a neural network with parameters $\bm{\theta}$, which returns a value $y(\bm{\theta},s,a)$. We interpret possible values of $\bm{\theta}$ as different hypotheses about the function relating the state-action pair to the value of quantile $\tau $ of ${Z}^{\pi}(s,a)$ (mackay2003information). Following (yu2001bayesian), we define a likelihood based on how well the output of the network matches the data using an asymmetric Laplace distribution,

$$P(D|\bm{\theta})=\prod_{j=1}^{K}f_{\tau}(z_{j}-y(\bm{\theta},s,a)),\quad\text{where}\quad f_{\tau}(u)=\frac{\tau(1-\tau)}{\sigma_{D}}\exp\left(-\frac{\rho_{\tau}(u)}{\sigma_{D}}\right). \tag{2}$$

Here ${\sigma}_{D}$ is a characteristic length scale and ${\rho}_{\tau}$ is the same as in equation 1.

To estimate the entire return distribution instead of a single quantile, we extend this formalism to a network with $N$ outputs ${y}_{i}(\bm{\theta},s,a)$, where each output $i$ is trained to learn the value of quantile ${\tau}_{i}$. We thus define the likelihood

$$P(D|\bm{\theta})=\prod_{j=1}^{K}\prod_{i=1}^{N}f_{\tau_{i}}(z_{j}-y_{i}(\bm{\theta},s,a)). \tag{3}$$

Minimizing the loss in equation 1 is equivalent to maximizing the likelihood in equation 3. If we now consider a normal prior on parameters $\bm{\theta}$ centered around $0$, we can use any one of several methods for approximately sampling from the posterior distribution $P(\bm{\theta}|D)$ (blundell2015weight; gal2016dropout; pearce2018bayesian).
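The equivalence can be made explicit by taking the negative logarithm of equation 3. The following is a short sketch, in which $C$ collects the terms that do not depend on $\bm{\theta}$:

$$-\log P(D|\bm{\theta})=\frac{1}{\sigma_{D}}\sum_{j=1}^{K}\sum_{i=1}^{N}\rho_{\tau_{i}}(z_{j}-y_{i}(\bm{\theta},s,a))+C,\qquad C=-K\sum_{i=1}^{N}\log\frac{\tau_{i}(1-\tau_{i})}{\sigma_{D}}.$$

Up to the positive factor $K/\sigma_{D}$ and the constant $C$, the right-hand side is the empirical average over the $K$ samples of the loss in equation 1, with $q_{i}(s,a)$ replaced by the network output $y_{i}(\bm{\theta},s,a)$. Both objectives therefore share the same minimizer in $\bm{\theta}$.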

### 2.2 Uncertainty estimates

Using the framework described above, we now propose expressions for both aleatoric and epistemic uncertainties.

#### 2.2.1 Epistemic uncertainty

To obtain a single aggregate measure of the epistemic uncertainty on the return distribution, we propose taking the average of the epistemic uncertainty on the quantiles, defined by their variance over $\bm{\theta}$,

$$\sigma_{\text{epistemic}}^{2}=\mathbb{E}_{i\sim\mathcal{U}\{1,N\}}\left[\text{var}_{\bm{\theta}\sim P(\bm{\theta}|D)}(y_{i}(\bm{\theta},s,a))\right], \tag{4}$$

where $\mathcal{U}\{1,N\}$ is the uniform distribution over $\{1,\dots,N\}$.

#### 2.2.2 Aleatoric uncertainty

An intuitive measure of the aleatoric uncertainty is the variance of the quantile values. However, this variance is also affected by epistemic uncertainty in the form of the distribution over $\bm{\theta}$. To decouple aleatoric uncertainty from epistemic uncertainty, we define the aleatoric uncertainty as the variance of the expected value of the quantiles according to the posterior distribution over $\bm{\theta}$,

$$\sigma_{\text{aleatoric}}^{2}=\text{var}_{i\sim\mathcal{U}\{1,N\}}\left[\mathbb{E}_{\bm{\theta}\sim P(\bm{\theta}|D)}\,y_{i}(\bm{\theta},s,a)\right]. \tag{5}$$

When the posterior is concentrated around a single value, we recover the intuitive definition of aleatoric uncertainty as the variance of the quantiles. However, when the posterior is not concentrated, the variance of a single set of quantiles is a biased estimator of ${\sigma}_{\text{aleatoric}}^{2}$:

###### Proposition 2.1.

(Proof in the appendix) Consider $\widehat{\bm{\theta}}$ drawn from the posterior distribution over $\bm{\theta}$. Then ${\text{var}}_{i\sim \mathcal{U}\{1,N\}}[{y}_{i}(\widehat{\bm{\theta}},s,a)]$ is a biased estimator of ${\sigma}_{\text{aleatoric}}^{2}$.

#### 2.2.3 Decomposition of Uncertainties

We require that the total uncertainty on the return distribution can be decomposed as the sum of these two uncertainties. We consider the total variance of the return distribution ${\text{var}}_{\bm{\theta}\sim P(\bm{\theta}|D),i\sim \mathcal{U}\{1,N\}}({y}_{i}(\bm{\theta},s,a))$, which for notational simplicity we write ${\text{var}}_{\bm{\theta},i}({y}_{i}(\bm{\theta},s,a))$.

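###### Proposition 2.2.

(Proof in the appendix) The total variance decomposes as

$$\text{var}_{\bm{\theta},i}(y_{i}(\bm{\theta},s,a))=\sigma_{\text{epistemic}}^{2}+\sigma_{\text{aleatoric}}^{2}. \tag{6}$$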

We also consider two limit cases as sanity checks. First, in the absence of data, when all the uncertainty should be epistemic, we do find ${\text{var}}_{\bm{\theta},i}({y}_{i}(\bm{\theta},s,a))={\sigma}_{\text{epistemic}}^{2}$. In the limit of infinite data, when all the uncertainty should be aleatoric, we also find ${\text{var}}_{\bm{\theta},i}({y}_{i}(\bm{\theta},s,a))={\sigma}_{\text{aleatoric}}^{2}$.
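The definitions in equations 4 and 5, the bias described in proposition 2.1, and the sum decomposition proved in appendix B.2 can be checked numerically. The sketch below is only an illustration: it replaces the network outputs $y_{i}(\bm{\theta},s,a)$ with a synthetic matrix of hypothetical posterior samples (rows) by quantiles (columns), and all sizes and noise levels are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 2000, 50                      # posterior samples, quantiles (illustrative sizes)

# Hypothetical stand-in for y_i(theta): a fixed quantile profile plus
# sample-dependent noise that mimics disagreement between posterior samples.
profile = np.linspace(-1.0, 1.0, N)
y = profile[None, :] + 0.3 * rng.normal(size=(M, N))   # shape (M, N)

epistemic = y.var(axis=0).mean()     # E_i[ var_theta(y_i) ], equation 4
aleatoric = y.mean(axis=0).var()     # var_i[ E_theta(y_i) ], equation 5
total = y.var()                      # variance taken jointly over theta and i

# Average variance of a single set of quantiles, E_theta[ var_i(y_i(theta)) ].
naive = y.var(axis=1).mean()
```

With population variances (NumPy's default `ddof=0`), `total` equals `epistemic + aleatoric` up to floating-point error, while `naive` exceeds `aleatoric` because it also absorbs part of the disagreement between samples.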

### 2.3 Approximate Uncertainties Using Two Networks

Estimating the variance and expectation over $\bm{\theta}$ in the previous expressions for both uncertainties requires in principle a large number of samples of $\bm{\theta}$, which is impractical. Instead, we propose the following approximations of ${\sigma}_{\text{epistemic}}^{2}$ and ${\sigma}_{\text{aleatoric}}^{2}$ using only two samples ${\bm{\theta}}_{A}$ and ${\bm{\theta}}_{B}$ from the posterior distribution over $\bm{\theta}$,

$$\begin{aligned}
\tilde{\sigma}_{\text{epistemic}}^{2}&=\frac{1}{2}\mathbb{E}_{i\sim\mathcal{U}\{1,N\}}\left[(y_{i}(\bm{\theta}_{A},s,a)-y_{i}(\bm{\theta}_{B},s,a))^{2}\right]\\
\tilde{\sigma}_{\text{aleatoric}}^{2}&=\text{cov}_{i\sim\mathcal{U}\{1,N\}}(y_{i}(\bm{\theta}_{A},s,a),y_{i}(\bm{\theta}_{B},s,a))
\end{aligned} \tag{7}$$

###### Proposition 2.3.

(Proof in the appendix) $\tilde{\sigma}_{\text{epistemic}}^{2}$ and $\tilde{\sigma}_{\text{aleatoric}}^{2}$ are unbiased estimators of $\sigma_{\text{epistemic}}^{2}$ and $\sigma_{\text{aleatoric}}^{2}$. Moreover, assuming that the network outputs are uncorrelated, the variance of these estimators converges towards 0 as the number of quantiles increases.

In figure 1, we provide an illustration of the uncertainties measured with ${\stackrel{~}{\sigma}}_{\text{epistemic}}$ and ${\stackrel{~}{\sigma}}_{\text{aleatoric}}$ on a toy dataset. We consider a neural network that estimates 50 quantiles from the target distribution, and we draw two samples of $\bm{\theta}$ using approximate MAP sampling (pearce2018bayesian). As expected, ${\stackrel{~}{\sigma}}_{\text{epistemic}}$ is small close to the data but large far from it, while ${\stackrel{~}{\sigma}}_{\text{aleatoric}}$ correctly captures the noise in the data.
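The two-sample estimators can be sketched in a few lines of NumPy. The example below is a hedged illustration rather than the authors' implementation: the two "networks" are simulated as a shared quantile profile plus independent Gaussian perturbations, and the names `profile`, `noise`, and `trials` are assumptions made only for this check of unbiasedness.

```python
import numpy as np

def two_sample_uncertainties(y_a, y_b):
    """Two-network estimates from quantile outputs y_a, y_b of shape (N,)."""
    epistemic = 0.5 * np.mean((y_a - y_b) ** 2)
    # Population covariance over the quantile index i (divides by N).
    aleatoric = np.mean((y_a - y_a.mean()) * (y_b - y_b.mean()))
    return epistemic, aleatoric

rng = np.random.default_rng(1)
N, trials = 50, 20000
profile = np.linspace(-1.0, 1.0, N)   # hypothetical posterior-mean quantiles
noise = 0.3                           # hypothetical posterior spread of each output

# Repeatedly draw two independent "posterior samples" and average the estimates.
est = np.array([
    two_sample_uncertainties(profile + noise * rng.normal(size=N),
                             profile + noise * rng.normal(size=N))
    for _ in range(trials)
])
mean_epistemic, mean_aleatoric = est.mean(axis=0)

# Ground truth for this synthetic posterior.
true_epistemic, true_aleatoric = noise ** 2, profile.var()
```

In this synthetic setting the averaged estimates land close to `true_epistemic` and `true_aleatoric`. A single covariance estimate can be negative, which is worth keeping in mind before taking a square root.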

### 2.4 Uncertainty-Aware Deep Q Networks

Until now, we have been mainly concerned with learning the return distribution given an ensemble of samples of this distribution. For temporal difference learning, we replace these samples with the Bellman target. Although this implies measuring the “one-step” epistemic uncertainty on the bootstrapped target instead of that on the total return, this uncertainty is nonetheless useful for exploration as it allows for the identification of less-visited state-action pairs.

There are several ways these uncertainty estimates could be included into a reinforcement learning algorithm, for example to drive information-directed exploration (nikolov2019information). To better contrast the different roles played by both uncertainties, we propose a simple uncertainty-aware Deep Q Networks algorithm (UA-DQN), which is based on the QR-DQN algorithm of (dabney2017distributional) but includes the following modifications, presented in Algorithm 1:

Auxiliary networks for uncertainty estimation. To disentangle value learning and uncertainty estimation, we consider two auxiliary networks ${\bm{\theta}}_{A}$ and ${\bm{\theta}}_{B}$ both trained on the targets used in QR-DQN and approximately sampled from the posterior distribution over $\bm{\theta}$. These networks are used to derive ${\stackrel{~}{\sigma}}_{\text{epistemic}}$ and ${\stackrel{~}{\sigma}}_{\text{aleatoric}}$.

Uncertainty-Aware Action Selection. Instead of the $\epsilon$-greedy policy used by QR-DQN, we use our uncertainty estimates to separately drive risk-awareness and exploration. We use $\tilde{\sigma}_{\text{aleatoric}}$ to penalize high-variance actions, while $\tilde{\sigma}_{\text{epistemic}}$ drives exploration using Thompson sampling.
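One possible reading of this action-selection rule is sketched below. Algorithm 1 is not reproduced in this text, so the exact rule here is an assumption: the mean Q-value is penalized by $\lambda$ times the aleatoric standard deviation, and the action is chosen by sampling from a Gaussian whose variance is $\beta$ times the epistemic variance. The function name `select_action`, the clipping of negative covariance estimates to zero, and the Gaussian form of the Thompson sample are all hypothetical choices.

```python
import numpy as np

def select_action(q_mean, y_a, y_b, lam=0.5, beta=0.2, rng=None):
    """Assumed uncertainty-aware action selection (not the paper's Algorithm 1).

    q_mean: shape (num_actions,), mean Q-values from the main network.
    y_a, y_b: shape (num_actions, N), quantile outputs of the auxiliary networks.
    """
    rng = np.random.default_rng() if rng is None else rng
    epistemic = 0.5 * np.mean((y_a - y_b) ** 2, axis=1)
    centered_a = y_a - y_a.mean(axis=1, keepdims=True)
    centered_b = y_b - y_b.mean(axis=1, keepdims=True)
    # Assumption: clip negative covariance estimates before the square root.
    aleatoric = np.clip(np.mean(centered_a * centered_b, axis=1), 0.0, None)
    penalized = q_mean - lam * np.sqrt(aleatoric)           # risk penalty
    sampled = rng.normal(penalized, np.sqrt(beta * epistemic))  # Thompson-style draw
    return int(np.argmax(sampled))

# Two actions: a deterministic one and a slightly better but high-variance one.
q_mean = np.array([1.0, 1.2])
y = np.array([[1.0, 1.0, 1.0], [-1.8, 1.2, 4.2]])
risk_neutral = select_action(q_mean, y, y, lam=0.0, rng=np.random.default_rng(0))
risk_averse = select_action(q_mean, y, y, lam=0.5, rng=np.random.default_rng(0))
```

Because both auxiliary outputs coincide in this toy example, the epistemic term vanishes and the choice is deterministic: the risk-neutral setting picks the higher-mean action, while the risk-averse setting prefers the zero-variance one.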

## 3 Experiments

### 3.1 Safe Learning

We first empirically study the behavior of our uncertainty-aware DQN on an environment inspired by the AI Safety Gridworlds (leike2017ai). We consider a simple $2\times 5$ gridworld represented in figure 2, in which the agent must navigate to a goal without falling off a cliff. The agent receives -1 point at each timestep, +10 points for reaching the goal, and if the agent falls off the cliff the environment restarts. We introduce a stochastic wind in this environment, which with probability $5\%$ knocks the agent off the cliff if the agent is on the ledge. The expectation value of the returns associated with the risky trajectory along the ledge is 4.8, while for the safe trajectory the return is deterministically 4.

We compare the learning behavior of our uncertainty-aware DQN agent to other comparable algorithms. We consider the $\epsilon$-greedy QR-DQN algorithm of (dabney2017distributional), a risk-neutral version of UA-DQN (with $\lambda =0$), and two risk-averse variants of UA-DQN (with $\lambda =0.5$). Variant 1 uses the variance of the learned quantiles to estimate aleatoric uncertainty, while variant 2 uses our $\tilde{\sigma}_{\text{aleatoric}}$ estimator. All algorithms differ only in action selection.

Experimental results are shown in figure 2. While all algorithms quickly learn to solve this simple task, there are marked differences in behavior. QR-DQN falls off the most due to both taking the risky trajectory and its $\epsilon$-greedy policy. Risk-neutral UA-DQN only falls off the cliff due to its use of the risky trajectory. Both risk-averse variants of UA-DQN learn to use the safe trajectory. However, variant 1 overestimates aleatoric uncertainty due to its use of a biased estimator, takes longer to identify the safe trajectory, and thus accumulates more falls during learning than variant 2, which uses our unbiased estimate $\tilde{\sigma}_{\text{aleatoric}}$. Variant 2, which uses our unbiased estimators of both uncertainties, is the least likely to fall off the cliff during learning.

### 3.2 Evaluation on MinAtar

We now evaluate our UA-DQN algorithm on the MinAtar testbed (young2019minatar), which contains simplified implementations of 5 Atari games. Compared to the Arcade Learning Environment (bellemare2013arcade), MinAtar has similar underlying game dynamics but involves lower-dimensional observations. This helps to decouple representation learning from behavioral learning and allows us to focus on the latter. It also encourages reproducibility, as the reduced computational overhead allows for more thorough comparisons involving more training seeds.

We compare risk-neutral UA-DQN ($\lambda =0$) with DQN (mnih2015human), QR-DQN, and Bootstrapped DQN (osband2016deep). In contrast to UA-DQN, which uses Thompson sampling to explore, Bootstrapped DQN uses an ensemble of bootstrapped DQN heads to achieve diverse behaviors. All algorithms are implemented within the same code base using the hyperparameters of (young2019minatar), except that instead of RMSProp we use the Adam optimizer (kingma2014adam) with learning rate ${10}^{-4}$ and $\epsilon={10}^{-8}$. We also optimized the exploration hyperparameters: we use a final $\epsilon$ of 0.03 for the $\epsilon$-greedy policies of DQN, QR-DQN, and Bootstrapped DQN, and for UA-DQN we use $\beta =0.2$. We selected these values on the game Breakout and then kept them fixed for all the other games.

Our results are shown in figure 3. As reported in (dabney2017distributional), we find that QR-DQN outperforms the other DQN variants, and that UA-DQN in turn significantly outperforms QR-DQN. To understand why this is, we inspect the behavior of UA-DQN and find that even at the end of training roughly 10-20% of the actions selected by UA-DQN are non-greedy. As UA-DQN only differs from QR-DQN in action selection, and QR-DQN’s performance decreases with higher levels of $\epsilon$-greedy exploration, this result indicates that UA-DQN successfully uses $\tilde{\sigma}_{\text{epistemic}}$ to appropriately decide when best to perform exploratory or greedy actions.

## 4 Conclusion

Estimating both uncertainties is important for developing agents that can both explore efficiently and account for risk in their actions. We propose a scheme whereby both types of uncertainty on the expected return of a policy can be estimated in deep reinforcement learning. We show that unbiased estimators for these uncertainties can be obtained using only two networks, and that these estimators can be efficiently harnessed by an uncertainty-aware DQN algorithm for improved risk-sensitivity and exploration. We find that this UA-DQN algorithm significantly outperforms other DQN variants on the MinAtar testbed.

## Appendix A Related Work

Our work focuses on the problem of estimating the uncertainty of the expected return of a policy in model-free reinforcement learning. The epistemic uncertainty on the expected return has been shown to be useful for exploration (osband2016deep; azizzadenesheli2018efficient; touati2018randomized). On the other hand, the aleatoric uncertainty of the expected return is useful for designing risk-averse policies (howard1972risk; tamar2016learning; dabney2018implicit). Whereas most prior work considers both uncertainties separately, (tang2018exploration; moerland2017efficient) are interested in both, but their methods yield only an aggregate uncertainty. (nikolov2019information) do make use of both uncertainties to drive information-directed sampling (russo2014learning), but their uncertainty estimates derive from two different frameworks. Moreover, we argue that the variance of the learned return distribution, which (nikolov2019information) use for aleatoric uncertainty estimation, conflates both uncertainties for out of distribution data. Our work aims to provide a single framework for simultaneously estimating both uncertainties for the return distribution.

Estimating both types of uncertainty is also important in model-based reinforcement learning, where uncertainties affect the predictions of a learned dynamics model of the environment. Uncertainty estimates can be used in planning, either for better exploration (schmidhuber1991possibility; sun2011planning) or to avoid risky or unknown sections of the environment (garcia2015comprehensive). Model-based algorithms that explicitly account for both aleatoric and epistemic uncertainties have also recently been developed (depeweg2018decomposition; chua2018deep; henaff2019model).

An approach that combines model-free and model-based techniques consists of using the uncertainties derived from a learned dynamics model to inform the policy of a model-free agent. The epistemic uncertainty associated with the learned model can for example be used as an intrinsic motivation bonus (stadie2015incentivizing; pathak2017curiosity; burda2018exploration). However, uncertainties on the transition model do not typically convey information about the uncertainty of the expected return of a policy, which is a quantity of fundamental interest in reinforcement learning.

## Appendix B Proofs

In the following, for notational simplicity we will omit the dependence of ${y}_{i}(\bm{\theta},s,a)$ on $s$ and $a$. Moreover, subscripts used in variances/expectation values should be interpreted as the variance/expectation value taken over the distribution of the variables in the subscript, so that for example ${\mathbb{E}}_{\bm{\theta}}={\mathbb{E}}_{\bm{\theta}\sim P(\bm{\theta}|D)}$ and ${\mathbb{E}}_{i}={\mathbb{E}}_{i\sim \mathcal{U}\{1,N\}}$. We will also assume that the following integrals over $P(\bm{\theta}|D)$ are well defined, which, considering in particular the Gaussian prior over the weights, is a reasonable assumption.

### B.1 Proof of proposition 2.1

Here, we show that, considering a sample $\widehat{\bm{\theta}}$ drawn from the posterior distribution over $\bm{\theta}$, ${\text{var}}_{i}[{y}_{i}(\widehat{\bm{\theta}})]$ is a biased estimator of ${\sigma}_{\text{aleatoric}}^{2}$. We do so by showing that ${\mathbb{E}}_{\bm{\theta}}[{\text{var}}_{i}[{y}_{i}(\bm{\theta})]]$ is greater than ${\sigma}_{\text{aleatoric}}^{2}$.

$$\begin{aligned}
\mathbb{E}_{\bm{\theta}}\left[\text{var}_{i}[y_{i}(\bm{\theta})]\right]&=\mathbb{E}_{\bm{\theta}}\left[\frac{1}{N}\sum_{j=1}^{N}(y_{j}(\bm{\theta})-\mathbb{E}_{i}[y_{i}(\bm{\theta})])^{2}\right]\\
&=\frac{1}{N}\sum_{j=1}^{N}\mathbb{E}_{\bm{\theta}}\left[(y_{j}(\bm{\theta})-\mathbb{E}_{i}[y_{i}(\bm{\theta})])^{2}\right]
\end{aligned}$$

By definition of the variance, we also have

$$\text{var}_{\bm{\theta}}\left[y_{j}(\bm{\theta})-\mathbb{E}_{i}[y_{i}(\bm{\theta})]\right]=\mathbb{E}_{\bm{\theta}}\left[(y_{j}(\bm{\theta})-\mathbb{E}_{i}[y_{i}(\bm{\theta})])^{2}\right]-(\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})]-\mathbb{E}_{\bm{\theta},i}[y_{i}(\bm{\theta})])^{2}$$

Therefore, when the posterior over $\bm{\theta}$ is not concentrated and $\text{var}_{\bm{\theta}}\left[y_{j}(\bm{\theta})-\mathbb{E}_{i}[y_{i}(\bm{\theta})]\right]>0$ for at least one $j$,

$$\begin{aligned}
\mathbb{E}_{\bm{\theta}}\left[\text{var}_{i}[y_{i}(\bm{\theta})]\right]&>\frac{1}{N}\sum_{j=1}^{N}(\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})]-\mathbb{E}_{\bm{\theta},i}[y_{i}(\bm{\theta})])^{2}\\
&=\text{var}_{i}\left[\mathbb{E}_{\bm{\theta}}[y_{i}(\bm{\theta})]\right]\\
&=\sigma_{\text{aleatoric}}^{2}
\end{aligned}$$

### B.2 Proof of proposition 2.2

Here, we show that ${\text{var}}_{\bm{\theta},i}({y}_{i}(\bm{\theta},s,a))={\sigma}_{\text{epistemic}}^{2}+{\sigma}_{\text{aleatoric}}^{2}$.

$$\begin{aligned}
\text{var}_{\bm{\theta},i}(y_{i}(\bm{\theta}))&=\int_{\bm{\theta}}\frac{1}{N}\sum_{j=1}^{N}\left(y_{j}(\bm{\theta})-\mathbb{E}_{\bm{\theta},i}[y_{i}(\bm{\theta})]\right)^{2}P(\bm{\theta}|D)\,d\bm{\theta}\\
&=\int_{\bm{\theta}}\frac{1}{N}\sum_{j=1}^{N}\left(y_{j}(\bm{\theta})-\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})]+\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})]-\mathbb{E}_{\bm{\theta},i}[y_{i}(\bm{\theta})]\right)^{2}P(\bm{\theta}|D)\,d\bm{\theta}\\
&=\int_{\bm{\theta}}\frac{1}{N}\sum_{j=1}^{N}\Big((y_{j}(\bm{\theta})-\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})])^{2}\\
&\qquad+(\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})]-\mathbb{E}_{\bm{\theta},i}[y_{i}(\bm{\theta})])^{2}\\
&\qquad+2(\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})]-\mathbb{E}_{\bm{\theta},i}[y_{i}(\bm{\theta})])(y_{j}(\bm{\theta})-\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})])\Big)P(\bm{\theta}|D)\,d\bm{\theta}
\end{aligned}$$

The integral over $\bm{\theta}$ of the last line is 0, which leaves us with

$$\begin{aligned}
\text{var}_{\bm{\theta},i}(y_{i}(\bm{\theta}))&=\int_{\bm{\theta}}\frac{1}{N}\sum_{j=1}^{N}\left(y_{j}(\bm{\theta})-\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})]\right)^{2}P(\bm{\theta}|D)\,d\bm{\theta}+\int_{\bm{\theta}}\frac{1}{N}\sum_{j=1}^{N}(\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})]-\mathbb{E}_{\bm{\theta},i}[y_{i}(\bm{\theta})])^{2}P(\bm{\theta}|D)\,d\bm{\theta}\\
&=\frac{1}{N}\sum_{j=1}^{N}\int_{\bm{\theta}}\left(y_{j}(\bm{\theta})-\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})]\right)^{2}P(\bm{\theta}|D)\,d\bm{\theta}+\frac{1}{N}\sum_{j=1}^{N}(\mathbb{E}_{\bm{\theta}}[y_{j}(\bm{\theta})]-\mathbb{E}_{\bm{\theta},i}[y_{i}(\bm{\theta})])^{2}\\
&=\mathbb{E}_{i}(\text{var}_{\bm{\theta}}(y_{i}(\bm{\theta})))+\text{var}_{i}(\mathbb{E}_{\bm{\theta}}\,y_{i}(\bm{\theta}))\\
&=\sigma_{\text{epistemic}}^{2}+\sigma_{\text{aleatoric}}^{2}
\end{aligned}$$

### B.3 Proof of proposition 2.3

### B.4 Expectation of the estimators

Here, we show that $\tilde{\sigma}_{\text{epistemic}}^{2}$ and $\tilde{\sigma}_{\text{aleatoric}}^{2}$ are unbiased estimators of $\sigma_{\text{epistemic}}^{2}$ and $\sigma_{\text{aleatoric}}^{2}$. In the following, $\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}$ indicates the expectation value when $\bm{\theta}_{A}$ and $\bm{\theta}_{B}$ are drawn independently from the posterior distribution over $\bm{\theta}$. Moreover, expectations over $\bm{\theta}$ and over $i$ are interchangeable in what follows, because the expectation over $i$ is a finite sum.

$$\begin{aligned}
\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}[\tilde{\sigma}_{\text{epistemic}}^{2}]&=\frac{1}{2}\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}\mathbb{E}_{i}\left[(y_{i}(\bm{\theta}_{A})-y_{i}(\bm{\theta}_{B}))^{2}\right]\\
&=\frac{1}{2}\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}\mathbb{E}_{i}\left[(y_{i}(\bm{\theta}_{A})-\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta}))+\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta}))-y_{i}(\bm{\theta}_{B}))^{2}\right]\\
&=\frac{1}{2}\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}\Big[\mathbb{E}_{i}\left[(y_{i}(\bm{\theta}_{A})-\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))^{2}\right]+\mathbb{E}_{i}\left[(\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta}))-y_{i}(\bm{\theta}_{B}))^{2}\right]\\
&\qquad+2\,\mathbb{E}_{i}\left[(\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta}))-y_{i}(\bm{\theta}_{B}))(y_{i}(\bm{\theta}_{A})-\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))\right]\Big]
\end{aligned}$$

Averaging the last line over either $\bm{\theta}_{A}$ or $\bm{\theta}_{B}$ gives zero, because the two samples are independent and $\mathbb{E}_{\bm{\theta}_{A}}[y_{i}(\bm{\theta}_{A})]=\mathbb{E}_{\bm{\theta}_{B}}[y_{i}(\bm{\theta}_{B})]=\mathbb{E}_{\bm{\theta}}[y_{i}(\bm{\theta})]$. The remaining two terms each depend on a single sample, which leaves us with

$$\begin{aligned}
\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}[\tilde{\sigma}_{\text{epistemic}}^{2}]&=\frac{1}{2}\left(\mathbb{E}_{\bm{\theta}}\left[\mathbb{E}_{i}(y_{i}(\bm{\theta})-\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))^{2}+\mathbb{E}_{i}(\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta}))-y_{i}(\bm{\theta}))^{2}\right]\right)\\
&=\mathbb{E}_{\bm{\theta}}\left[\mathbb{E}_{i}(y_{i}(\bm{\theta})-\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))^{2}\right]\\
&=\mathbb{E}_{i}\left[\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})-\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))^{2}\right]\\
&=\mathbb{E}_{i}\left[\text{var}_{\bm{\theta}}(y_{i}(\bm{\theta}))\right]\\
&=\sigma_{\text{epistemic}}^{2}
\end{aligned}$$

so $\tilde{\sigma}_{\text{epistemic}}^{2}$ is indeed an unbiased estimator of $\sigma_{\text{epistemic}}^{2}$.

Similarly, for $\tilde{\sigma}_{\text{aleatoric}}^{2}$, and introducing $\epsilon_{i}(\bm{\theta}_{A})=y_{i}(\bm{\theta}_{A})-\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta}))$ and $\epsilon_{i}(\bm{\theta}_{B})=y_{i}(\bm{\theta}_{B})-\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta}))$,

$$\begin{aligned}
\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}[\tilde{\sigma}_{\text{aleatoric}}^{2}]&=\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}\,\text{cov}_{i}(y_{i}(\bm{\theta}_{A}),y_{i}(\bm{\theta}_{B}))\\
&=\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}\,\text{cov}_{i}(\epsilon_{i}(\bm{\theta}_{A})+\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})),\epsilon_{i}(\bm{\theta}_{B})+\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))\\
&=\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}\Big[\text{cov}_{i}(\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})),\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))+\text{cov}_{i}(\epsilon_{i}(\bm{\theta}_{A}),\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))\\
&\qquad+\text{cov}_{i}(\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})),\epsilon_{i}(\bm{\theta}_{B}))+\text{cov}_{i}(\epsilon_{i}(\bm{\theta}_{A}),\epsilon_{i}(\bm{\theta}_{B}))\Big]
\end{aligned}$$

Looking at these terms individually, we have

$$\begin{aligned}
\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}\left[\text{cov}_{i}(\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})),\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))\right]&=\text{var}_{i}(\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))\\
&=\sigma_{\text{aleatoric}}^{2}
\end{aligned}$$

$$\begin{aligned}
\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}\left[\text{cov}_{i}(\epsilon_{i}(\bm{\theta}_{A}),\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))\right]&=\mathbb{E}_{\bm{\theta}_{A}}\left[\text{cov}_{i}(\epsilon_{i}(\bm{\theta}_{A}),\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))\right]\\
&=\mathbb{E}_{\bm{\theta}_{A}}\left[\frac{1}{N}\sum_{j=1}^{N}(\epsilon_{j}(\bm{\theta}_{A})-\mathbb{E}_{i}(\epsilon_{i}(\bm{\theta}_{A})))(\mathbb{E}_{\bm{\theta}}(y_{j}(\bm{\theta}))-\mathbb{E}_{i}\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))\right]\\
&=\frac{1}{N}\sum_{j=1}^{N}(\mathbb{E}_{\bm{\theta}}(y_{j}(\bm{\theta}))-\mathbb{E}_{i}\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})))(\mathbb{E}_{\bm{\theta}_{A}}(\epsilon_{j}(\bm{\theta}_{A}))-\mathbb{E}_{i}(\mathbb{E}_{\bm{\theta}_{A}}(\epsilon_{i}(\bm{\theta}_{A}))))\\
&=0\quad\text{since}\quad\mathbb{E}_{\bm{\theta}_{A}}(\epsilon_{i}(\bm{\theta}_{A}))=0\quad\text{for all } i
\end{aligned}$$

$$\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}\left[\text{cov}_{i}(\mathbb{E}_{\bm{\theta}}(y_{i}(\bm{\theta})),\epsilon_{i}(\bm{\theta}_{B}))\right]=0\quad\text{[same derivation as the previous expression]}$$

$$\begin{aligned}
\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}\left[\text{cov}_{i}(\epsilon_{i}(\bm{\theta}_{A}),\epsilon_{i}(\bm{\theta}_{B}))\right]&=\mathbb{E}_{\bm{\theta}_{A},\bm{\theta}_{B}}\left[\frac{1}{N}\sum_{j=1}^{N}(\epsilon_{j}(\bm{\theta}_{A})-\mathbb{E}_{i}(\epsilon_{i}(\bm{\theta}_{A})))(\epsilon_{j}(\bm{\theta}_{B})-\mathbb{E}_{i}(\epsilon_{i}(\bm{\theta}_{B})))\right]\\
&=\mathbb{E}_{\bm{\theta}_{A}}\left[\frac{1}{N}\sum_{j=1}^{N}(\epsilon_{j}(\bm{\theta}_{A})-\mathbb{E}_{i}(\epsilon_{i}(\bm{\theta}_{A})))(\mathbb{E}_{\bm{\theta}_{B}}(\epsilon_{j}(\bm{\theta}_{B}))-\mathbb{E}_{i}(\mathbb{E}_{\bm{\theta}_{B}}(\epsilon_{i}(\bm{\theta}_{B}))))\right]\\
&=0
\end{aligned}$$

As desired, we end up with

$$
\mathbb{E}_{\bm{\theta}_A,\bm{\theta}_B}\left[\tilde{\sigma}_{\text{aleatoric}}^2\right] = \sigma_{\text{aleatoric}}^2
$$
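The result above can be checked numerically. The following is a minimal sketch under a toy model that is our own assumption and not part of the paper's experiments: each output is $y_i(\bm{\theta}) = m_i + \epsilon_i(\bm{\theta})$ with fixed means $m_i$ and independent zero-mean Gaussian noise. The function names, the grid of means, and the noise scale are all hypothetical choices made for illustration.

```python
import numpy as np


def epistemic_estimate(y_a, y_b):
    """0.5 * E_i[(y_i(theta_A) - y_i(theta_B))^2], averaged over the last axis."""
    return 0.5 * np.mean((y_a - y_b) ** 2, axis=-1)


def aleatoric_estimate(y_a, y_b):
    """cov_i(y_i(theta_A), y_i(theta_B)) with 1/N normalization over the last axis."""
    dev_a = y_a - y_a.mean(axis=-1, keepdims=True)
    dev_b = y_b - y_b.mean(axis=-1, keepdims=True)
    return np.mean(dev_a * dev_b, axis=-1)


def simulate(n_quantiles=50, n_pairs=20_000, noise_std=0.5, seed=0):
    """Average both estimators over many independent (theta_A, theta_B) pairs."""
    rng = np.random.default_rng(seed)
    # Assumed toy model: E_theta[y_i] lies on a fixed grid (hypothetical choice).
    mean_quantiles = np.linspace(-1.0, 1.0, n_quantiles)
    y_a = mean_quantiles + noise_std * rng.standard_normal((n_pairs, n_quantiles))
    y_b = mean_quantiles + noise_std * rng.standard_normal((n_pairs, n_quantiles))
    return {
        "aleatoric_true": mean_quantiles.var(),  # var_i(E_theta[y_i])
        "aleatoric_est": aleatoric_estimate(y_a, y_b).mean(),
        "epistemic_true": noise_std ** 2,  # E_i[var_theta(y_i)] in this toy model
        "epistemic_est": epistemic_estimate(y_a, y_b).mean(),
    }


result = simulate()
```

Under these assumptions, the averaged estimates should agree with the true values up to Monte Carlo error.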

### B.5 Variance of the estimators

Using the same notation as in the previous section, we can write

$$
\begin{aligned}
\text{var}_{\bm{\theta}_A,\bm{\theta}_B}\left[\tilde{\sigma}_{\text{epistemic}}^2\right] &= \frac{1}{4}\text{var}_{\bm{\theta}_A,\bm{\theta}_B}\,\mathbb{E}_i\left[\left(y_i(\bm{\theta}_A)-y_i(\bm{\theta}_B)\right)^2\right] \\
&= \frac{1}{4}\text{var}_{\bm{\theta}_A,\bm{\theta}_B}\left[\frac{1}{N}\sum_{i=1}^{N}\left(y_i(\bm{\theta}_A)-y_i(\bm{\theta}_B)\right)^2\right] \\
&= \frac{1}{4N^2}\text{var}_{\bm{\theta}_A,\bm{\theta}_B}\left[\sum_{i=1}^{N}\left(y_i(\bm{\theta}_A)-y_i(\bm{\theta}_B)\right)^2\right]
\end{aligned}
$$

We now invoke our assumption that all outputs of the neural networks are uncorrelated to write

$$
\begin{aligned}
\text{var}_{\bm{\theta}_A,\bm{\theta}_B}\left[\tilde{\sigma}_{\text{epistemic}}^2\right] &= \frac{1}{4N^2}\sum_{i=1}^{N}\text{var}_{\bm{\theta}_A,\bm{\theta}_B}\left[\left(y_i(\bm{\theta}_A)-y_i(\bm{\theta}_B)\right)^2\right] \\
&= \frac{1}{4N^2}\sum_{i=1}^{N}\text{var}_{\bm{\theta}_A,\bm{\theta}_B}\left[y_i(\bm{\theta}_A)^2+y_i(\bm{\theta}_B)^2-2y_i(\bm{\theta}_A)y_i(\bm{\theta}_B)\right] \\
&\le \frac{3}{4N^2}\sum_{i=1}^{N}\left[\text{var}_{\bm{\theta}_A,\bm{\theta}_B}[y_i(\bm{\theta}_A)^2]+\text{var}_{\bm{\theta}_A,\bm{\theta}_B}[y_i(\bm{\theta}_B)^2]+4\,\text{var}_{\bm{\theta}_A,\bm{\theta}_B}[y_i(\bm{\theta}_A)y_i(\bm{\theta}_B)]\right] \quad \text{[by the Cauchy-Schwarz inequality]} \\
&= \frac{3}{4N^2}\sum_{i=1}^{N}\left[2\,\text{var}_{\bm{\theta}}[y_i(\bm{\theta})^2]+8\left(\mathbb{E}_{\bm{\theta}}[y_i(\bm{\theta})]\right)^2\text{var}_{\bm{\theta}}[y_i(\bm{\theta})]+4\left(\text{var}_{\bm{\theta}}[y_i(\bm{\theta})]\right)^2\right]
\end{aligned}
$$

The last line uses the fact that $\bm{\theta}_A$ and $\bm{\theta}_B$ are independent and identically distributed, so that $\text{var}_{\bm{\theta}_A,\bm{\theta}_B}[y_i(\bm{\theta}_A)y_i(\bm{\theta}_B)] = \left(\text{var}_{\bm{\theta}}[y_i(\bm{\theta})]\right)^2 + 2\left(\mathbb{E}_{\bm{\theta}}[y_i(\bm{\theta})]\right)^2\text{var}_{\bm{\theta}}[y_i(\bm{\theta})]$.

We now further assume that $\mathbb{E}_{\bm{\theta}}[y_i(\bm{\theta})]$, $\text{var}_{\bm{\theta}}[y_i(\bm{\theta})]$, and $\text{var}_{\bm{\theta}}[y_i^2(\bm{\theta})]$ are bounded for all $i$ and $N$. Then, there is a constant $C$ such that, for all $i$ and $N$,

$$
2\,\text{var}_{\bm{\theta}}[y_i(\bm{\theta})^2]+8\left(\mathbb{E}_{\bm{\theta}}[y_i(\bm{\theta})]\right)^2\text{var}_{\bm{\theta}}[y_i(\bm{\theta})]+4\left(\text{var}_{\bm{\theta}}[y_i(\bm{\theta})]\right)^2 \le C
$$

We then obtain

$$
\begin{aligned}
\text{var}_{\bm{\theta}_A,\bm{\theta}_B}\left[\tilde{\sigma}_{\text{epistemic}}^2\right] &\le \frac{3}{4N^2}\sum_{i=1}^{N}C \\
&= \frac{3C}{4N}
\end{aligned}
$$

The variance of $\tilde{\sigma}_{\text{epistemic}}^2$ (and thus that of $\tilde{\sigma}_{\text{epistemic}}$) therefore decreases towards 0 as the number of quantiles $N$ increases.
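The $1/N$ decay can be illustrated with a short simulation. This is only a sketch under an assumed toy model (independent Gaussian outputs with a common standard deviation `noise_std`), which is not taken from the paper; in that special case the variance of the estimator works out to $2\,\texttt{noise\_std}^4/N$. The function names are hypothetical.

```python
import numpy as np


def epistemic_estimate(y_a, y_b):
    """0.5 * E_i[(y_i(theta_A) - y_i(theta_B))^2], averaged over the last axis."""
    return 0.5 * np.mean((y_a - y_b) ** 2, axis=-1)


def estimator_variance(n_quantiles, n_pairs=20_000, noise_std=0.5, seed=0):
    """Empirical variance of the epistemic estimator over (theta_A, theta_B) pairs."""
    rng = np.random.default_rng(seed)
    # Assumed toy model: uncorrelated Gaussian outputs around fixed means.
    means = np.linspace(-1.0, 1.0, n_quantiles)
    y_a = means + noise_std * rng.standard_normal((n_pairs, n_quantiles))
    y_b = means + noise_std * rng.standard_normal((n_pairs, n_quantiles))
    return epistemic_estimate(y_a, y_b).var()


# Quadrupling the number of quantiles should divide the variance by about four.
variances = {n: estimator_variance(n) for n in (10, 40, 160)}
```

The Gaussian closed form is specific to this toy model; the bound in the text only requires bounded moments and uncorrelated outputs.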

As for the aleatoric uncertainty, a similar bound can be derived by rewriting $\text{var}_{\bm{\theta}_A,\bm{\theta}_B}[\tilde{\sigma}_{\text{aleatoric}}^2]$ as follows.

$$
\begin{aligned}
\text{var}_{\bm{\theta}_A,\bm{\theta}_B}\left[\tilde{\sigma}_{\text{aleatoric}}^2\right] &= \text{var}_{\bm{\theta}_A,\bm{\theta}_B}\left[\text{cov}_i\left(y_i(\bm{\theta}_A),y_i(\bm{\theta}_B)\right)\right] \\
&= \text{var}_{\bm{\theta}_A,\bm{\theta}_B}\left[\frac{1}{N}\sum_{j=1}^{N}\left(y_j(\bm{\theta}_A)-\mathbb{E}_i y_i(\bm{\theta}_A)\right)\left(y_j(\bm{\theta}_B)-\mathbb{E}_i y_i(\bm{\theta}_B)\right)\right] \\
&= \frac{1}{N^2}\text{var}_{\bm{\theta}_A,\bm{\theta}_B}\left[\sum_{j=1}^{N}\left(y_j(\bm{\theta}_A)-\mathbb{E}_i y_i(\bm{\theta}_A)\right)\left(y_j(\bm{\theta}_B)-\mathbb{E}_i y_i(\bm{\theta}_B)\right)\right]
\end{aligned}
$$

Proceeding as in the derivation of the bound on the variance of $\tilde{\sigma}_{\text{epistemic}}^2$, and assuming that the network outputs are uncorrelated and that the first moments of $y_i(\bm{\theta})$ are bounded, we can also derive a bound for $\text{var}_{\bm{\theta}_A,\bm{\theta}_B}[\tilde{\sigma}_{\text{aleatoric}}^2]$ that converges to 0 with increasing $N$.

## Appendix C Correlations between the outputs of a Bayesian neural network

Proposition 2.3 assumes that the network outputs are uncorrelated. This assumption matters because correlations between outputs could, for example, cause a network to overestimate all the quantiles at once. If both networks A and B produce such overestimations, then $\tilde{\sigma}_{\text{epistemic}}$ would probably underestimate $\sigma_{\text{epistemic}}$. However, in the limit of infinite width, the outputs of Bayesian neural networks are uncorrelated for normal priors and separable likelihoods (neal2012bayesian). In the following, we explore experimentally in which cases this applies to finite-width neural networks and to approximate Bayesian techniques such as the randomized MAP sampling technique (pearce2018bayesian) used in our work.
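The effect of correlated outputs can be illustrated with a toy simulation. The model below is an assumption made purely for illustration and is not the experiment reported in this appendix: each network's outputs share a single Gaussian offset, so that any two outputs have correlation `rho`, while the per-output variance stays fixed at `noise_std ** 2`. All names are hypothetical.

```python
import numpy as np


def epistemic_estimate(y_a, y_b):
    """0.5 * E_i[(y_i(theta_A) - y_i(theta_B))^2], averaged over the last axis."""
    return 0.5 * np.mean((y_a - y_b) ** 2, axis=-1)


def sample_deviations(rng, n_pairs, n_quantiles, noise_std, rho):
    """Deviations of the outputs from their means, with pairwise correlation rho."""
    shared = rng.standard_normal((n_pairs, 1))  # offset common to all outputs
    private = rng.standard_normal((n_pairs, n_quantiles))  # per-output noise
    return noise_std * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private)


def summarize(rho, n_quantiles=50, n_pairs=20_000, noise_std=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # The output means cancel in y_A - y_B, so only the deviations are sampled.
    y_a = sample_deviations(rng, n_pairs, n_quantiles, noise_std, rho)
    y_b = sample_deviations(rng, n_pairs, n_quantiles, noise_std, rho)
    estimates = epistemic_estimate(y_a, y_b)
    true_value = noise_std ** 2
    return {
        "mean": estimates.mean(),
        "variance": estimates.var(),
        "fraction_below_truth": np.mean(estimates < true_value),
    }


uncorrelated = summarize(rho=0.0)
correlated = summarize(rho=0.9)
```

In this toy model the estimator keeps the correct mean, but with strong correlations its variance no longer shrinks with $N$ and a clear majority of individual network pairs fall below the true value, which is one way to read the statement that the estimate would probably be too low.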

### C.1 Uncertainties for different network widths

First, we compare the epistemic uncertainties produced by an ensemble of neural networks trained with the "anchoring" approximate MAP sampling technique of (pearce2018bayesian) to those produced by a single neural network with several outputs (also trained with approximate MAP sampling) on a toy regression problem. Both the problem formulation and the code for this experiment draw from the work of (pearce2018bayesian).

Representative samples from these experiments are shown in Figure 4. For a small neural network with only 10 neurons per layer, the different outputs of the multi-output neural network are indeed strongly correlated, which leads to poor uncertainty estimates (top left). The ensemble produces significantly better uncertainty estimates for the same network width (bottom left). However, as we increase the width of the neural network to 100 (top right), the uncertainty estimates of the network with multiple outputs improve and become close to those obtained by the larger ensemble of networks of the same width (bottom right).

## Appendix D Further information on the MinAtar experiment

Our MinAtar experiments used the same network structure as (young2019minatar) and, apart from the optimized exploration hyperparameters and our use of the Adam optimizer described in the main text, the same hyperparameters, which are listed in Table 1. We searched among $\{10^{-4},2.5\times 10^{-4}\}$ for the Adam learning rate, among $\{10^{-8},0.01/32\}$ for Adam $\epsilon$, and among $\{0.1,0.03,0.01\}$ for the final exploration $\epsilon$, using QR-DQN on Breakout. We found that whereas 0.01 and 0.03 led to similar average scores, a value of 0.03 led to smaller variance in the results. For UA-DQN, we searched among $\{0.5,0.2,0.1\}$ for $\beta$ on Breakout.

To approximately sample from the posterior over $\bm{\theta}$ for the auxiliary networks used in UA-DQN, we use the approximate MAP sampling scheme of (pearce2018bayesian). For this scheme, we set the scale of the noise to a realistic value of 1, and the scale of the prior to the standard deviation of the network weights at initialization.
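As a rough illustration of this sampling scheme, the sketch below shows one common form of an anchored loss, in which each network is regularized towards its own randomly drawn anchor parameters. The squared-error data term, the weighting of the regularizer by the ratio of noise and prior variances, and all function and argument names are assumptions made for this example and are not taken from the UA-DQN implementation.

```python
import numpy as np


def anchored_loss(predictions, targets, params, anchors,
                  noise_scale=1.0, prior_scale=1.0):
    """Squared error plus a penalty pulling params towards fixed random anchors.

    noise_scale and prior_scale play the roles of the noise and prior scales
    mentioned in the text; this exact weighting is an assumed form.
    """
    n = targets.shape[0]
    data_term = np.sum((targets - predictions) ** 2) / n
    weight = noise_scale ** 2 / prior_scale ** 2
    anchor_term = weight * np.sum((params - anchors) ** 2) / n
    return data_term + anchor_term


# Hypothetical usage: one ensemble member with two parameters and two data points.
rng = np.random.default_rng(0)
prior_scale = 0.5
anchors = prior_scale * rng.standard_normal(2)  # drawn once, then kept fixed
targets = np.array([1.0, 2.0])

loss_at_anchor = anchored_loss(targets, targets, anchors, anchors,
                               prior_scale=prior_scale)
loss_shifted = anchored_loss(targets, targets, anchors + 1.0, anchors,
                             prior_scale=prior_scale)
```

Each ensemble member would draw its own anchors, so that minimizing the loss from different anchors yields different networks.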

| Hyperparameter | Value |
| --- | --- |
| minibatch size | 32 |
| replay buffer size | 100000 |
| target network update frequency | 1000 |
| discount factor | 0.99 |
| number of steps | 5000000 |
| Adam learning rate | $10^{-4}$ |
| Adam $\epsilon$ | $10^{-8}$ |
| replay start size | 5000 |
| update frequency | 1 |
| initial $\epsilon$ (DQN, QR-DQN, Bootstrapped DQN) | 1 |
| final $\epsilon$ (DQN, QR-DQN, Bootstrapped DQN) | 0.03 |
| final exploration step (DQN, QR-DQN, Bootstrapped DQN) | 100000 |
| Bootstrapped heads (Bootstrapped DQN) | 10 |
| Number of quantiles (QR-DQN, UA-DQN) | 50 |
| $\beta$ (UA-DQN) | 0.2 |
| $\lambda$ (UA-DQN) | 0 |
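For convenience, the hyperparameters listed above can be collected in a small configuration object. The sketch below is illustrative only: the class and field names are hypothetical, and the linear annealing of the exploration $\epsilon$ starting from step 0 is an assumption, since the table only gives the initial value, the final value, and the final exploration step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinAtarConfig:
    """Hyperparameter values copied from the table; field names are hypothetical."""
    minibatch_size: int = 32
    replay_buffer_size: int = 100_000
    target_update_frequency: int = 1_000
    discount_factor: float = 0.99
    num_steps: int = 5_000_000
    adam_learning_rate: float = 1e-4
    adam_epsilon: float = 1e-8
    replay_start_size: int = 5_000
    update_frequency: int = 1
    initial_epsilon: float = 1.0  # DQN, QR-DQN, Bootstrapped DQN
    final_epsilon: float = 0.03  # DQN, QR-DQN, Bootstrapped DQN
    final_exploration_step: int = 100_000  # DQN, QR-DQN, Bootstrapped DQN
    bootstrapped_heads: int = 10  # Bootstrapped DQN
    num_quantiles: int = 50  # QR-DQN, UA-DQN
    beta: float = 0.2  # UA-DQN
    lam: float = 0.0  # UA-DQN (lambda)


def exploration_epsilon(step, cfg):
    """Exploration rate at a given step, assuming linear annealing from step 0."""
    fraction = min(max(step, 0) / cfg.final_exploration_step, 1.0)
    return cfg.initial_epsilon + fraction * (cfg.final_epsilon - cfg.initial_epsilon)


cfg = MinAtarConfig()
epsilon_midway = exploration_epsilon(50_000, cfg)
```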