Estimating Risk and Uncertainty in Deep Reinforcement Learning

  • 2020-09-09 15:44:27
  • William R. Clements, Bastien Van Delft, Benoît-Marie Robaglia, Reda Bahi Slaoui, Sébastien Toth

Abstract

Reinforcement learning agents are faced with two types of uncertainty. Epistemic uncertainty stems from limited data and is useful for exploration, whereas aleatoric uncertainty arises from stochastic environments and must be accounted for in risk-sensitive applications. We highlight the challenges involved in simultaneously estimating both of them, and propose a framework for disentangling and estimating these uncertainties on learned Q-values. We derive unbiased estimators of these uncertainties and introduce an uncertainty-aware DQN algorithm, which we show exhibits safe learning behavior and outperforms other DQN variants on the MinAtar testbed.

 


Machine Learning, Reinforcement Learning, Uncertainty Estimation

Distinguishing between epistemic uncertainty, which stems from limited data, and aleatoric uncertainty, which is caused by intrinsic stochasticity in the environment, is important in reinforcement learning for both exploration and risk-sensitivity (osband2016risk; moerland2017efficient; nikolov2019information). However, while prior work has developed independent methods for estimating each uncertainty, difficulties arise when trying to estimate both simultaneously. For example, distributional reinforcement learning (morimura2010nonparametric; morimura2012parametric; bellemare2017distributional), which aims to learn the distribution of returns instead of only the mean value, has been suggested as a way of measuring aleatoric uncertainty (nikolov2019information). However, although metrics such as the variance of the learned distribution can be good indicators of the aleatoric uncertainty for in-distribution data, (chua2018deep) highlighted that for out-of-distribution data, where the epistemic uncertainty is high, such a metric is not a good indicator of aleatoric uncertainty because it conflates both uncertainties.

To address this issue, we construct a framework for disentangling both types of uncertainties which is applicable to both stochastic and deterministic environments. Our method builds on the distributional reinforcement learning framework, which aims to learn the entire return distribution instead of only its expected value (bellemare2017distributional), and methods for approximate Bayesian deep learning. Our main contributions are 1) a theoretical framework within which epistemic and aleatoric uncertainties can be separately estimated, 2) practical, unbiased estimators for both types of uncertainty, and 3) a demonstration that these uncertainties can successfully be used within an uncertainty-aware Deep Q Networks (mnih2015human) algorithm.

1 Background

We consider a discounted Markov Decision Process (MDP) defined by (χ,A,R,P,γ), in which χ and A represent the state and action spaces, R is the distribution of rewards associated with performing actions given the states, P is the transition probability, and γ is the discount factor.

Distributional reinforcement learning aims to learn the distribution of returns $Z^\pi(s,a)$ associated with taking action $a$ in state $s$ and then following a policy $\pi$. To learn this distribution, (dabney2017distributional) propose a quantile parameterization. In this framework, a probability distribution $Z(s,a)$ is parameterized by $N$ quantiles $\tau_i = i/(N+1)$ for $i \in [1,N]$, with values $\boldsymbol{q} = (q_1, \dots, q_N)$. Learning the quantile values proceeds by minimizing the quantile regression loss (koenker2001quantile),

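\[
\mathcal{L}_q(\boldsymbol{q}) = \mathbb{E}_{z \sim Z(s,a)} \sum_{i=1}^{N} \rho_{\tau_i}\big(z - q_i(s,a)\big), \qquad \text{where } \rho_{\tau_i}(u) = u \times \big(\tau_i - \mathbb{1}_{u<0}\big). \tag{1}
\]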

This loss can be minimized stochastically for each new value $z$ sampled from $Z(s,a)$. For temporal difference learning of the optimal value function, $Z(s,a)$ is replaced with the Bellman target $R(s,a) + \gamma Z(s',a')$, where $a' = \arg\max_{a'} \mathbb{E}[Z(s',a')]$ and $s' \sim P(\cdot \mid s,a)$, yielding the QR-DQN algorithm of (dabney2017distributional).
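As a rough illustration of the quantile regression loss, the following sketch fits $N$ quantile values to samples from a fixed distribution by subgradient descent. The Gaussian sample distribution, the step size `lr`, the iteration count, and the helper names `pinball` and `quantile_loss` are assumptions made for this example and are not taken from the paper; no Bellman target or neural network is involved.

```python
import numpy as np

def pinball(u, tau):
    """rho_tau(u) = u * (tau - 1[u < 0]), the pinball function of equation 1."""
    return u * (tau - (u < 0).astype(float))

def quantile_loss(q, z, taus):
    """Sample estimate of equation 1: mean over z of sum_i rho_{tau_i}(z - q_i)."""
    u = z[:, None] - q[None, :]                  # shape (num_samples, N)
    return pinball(u, taus[None, :]).sum(axis=1).mean()

N = 9
taus = np.arange(1, N + 1) / (N + 1)             # tau_i = i / (N + 1)
rng = np.random.default_rng(0)
z = rng.normal(loc=1.0, scale=2.0, size=5000)    # hypothetical return samples

q = np.zeros(N)                                  # initial quantile values
initial_loss = quantile_loss(q, z, taus)
lr = 0.05                                        # illustrative step size
for _ in range(2000):
    # Subgradient of the loss w.r.t. q_i is -(tau_i - 1[z < q_i]), averaged over z.
    below = (z[:, None] < q[None, :]).astype(float)
    grad = -(taus[None, :] - below).mean(axis=0)
    q -= lr * grad

final_loss = quantile_loss(q, z, taus)
empirical = np.quantile(z, taus)                 # reference empirical quantiles
print("learned quantiles:  ", np.round(q, 2))
print("empirical quantiles:", np.round(empirical, 2))
```

Under these assumptions the learned values should end up close to the empirical quantiles of the samples, which is the property that QR-DQN relies on when the samples are replaced by Bellman targets.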

An intuitive way of estimating the aleatoric uncertainty on the return distribution would be to use the variance of the quantiles. However, the variance of the quantiles is not a good estimator of the aleatoric uncertainty, because for out of distribution data the epistemic uncertainty on the value of the quantiles can also affect the variance.

2 Estimating both uncertainties

Here, we construct a theoretical framework that allows us to disentangle epistemic and aleatoric uncertainties, and derive unbiased estimators for both.

2.1 Theoretical framework

We start by framing learning the quantiles of the return distribution as a Bayesian inference problem. We consider state $s$, action $a$ taken in state $s$, policy $\pi$, and data $D$ consisting of $K$ samples $(z_1, \dots, z_K)$ from $Z^\pi(s,a)$. To learn the value of a given quantile $\tau$ of $Z^\pi(s,a)$, we consider a neural network with parameters $\boldsymbol{\theta}$, which returns a value $y(\boldsymbol{\theta},s,a)$. We interpret possible values of $\boldsymbol{\theta}$ as different hypotheses about the function relating the state-action pair to the value of quantile $\tau$ of $Z^\pi(s,a)$ (mackay2003information). Following (yu2001bayesian), we define a likelihood based on how well the output of the network matches the data using an asymmetric Laplace distribution,

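\[
P(D \mid \boldsymbol{\theta}) = \prod_{j=1}^{K} f_{\tau}\big(z_j - y(\boldsymbol{\theta},s,a)\big), \qquad \text{where } f_{\tau}(u) = \frac{\tau(1-\tau)}{\sigma_D} \exp\!\left(-\frac{\rho_{\tau}(u)}{\sigma_D}\right), \tag{2}
\]

where $\sigma_D$ is a characteristic length scale and $\rho_\tau$ is the same as in equation 1.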

To estimate the entire return distribution instead of a single quantile, we extend this formalism to a network with N outputs yi(𝜽,s,a), where each output i is trained to learn the value of quantile τi. We thus define the likelihood

\[
P(D \mid \boldsymbol{\theta}) = \prod_{j=1}^{K} \prod_{i=1}^{N} f_{\tau_i}\big(z_j - y_i(\boldsymbol{\theta},s,a)\big). \tag{3}
\]

Minimizing the loss in equation 1 is equivalent to maximizing the likelihood in equation 3. If we now consider a normal prior on parameters 𝜽 centered around 0, we can use any one of several methods for approximately sampling from the posterior distribution P(𝜽|D) (blundell2015weight; gal2016dropout; pearce2018bayesian).
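The equivalence between the loss and the likelihood can be checked numerically: taking the negative logarithm of equation 3 gives the summed pinball loss divided by $\sigma_D$, plus a term that does not depend on the network outputs. The sketch below compares two arbitrary candidate output vectors; the data, the value of `sigma_d`, and the function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def pinball(u, tau):
    """rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (u < 0).astype(float))

def summed_quantile_loss(y, z, taus):
    """sum_j sum_i rho_{tau_i}(z_j - y_i): the loss of equation 1, summed over samples."""
    u = z[:, None] - y[None, :]
    return pinball(u, taus[None, :]).sum()

def neg_log_likelihood(y, z, taus, sigma_d):
    """-log P(D | theta) for the asymmetric Laplace likelihood of equation 3."""
    u = z[:, None] - y[None, :]
    log_norm = np.log(taus * (1.0 - taus) / sigma_d)[None, :]
    log_f = log_norm - pinball(u, taus[None, :]) / sigma_d
    return -log_f.sum()

N = 5
taus = np.arange(1, N + 1) / (N + 1)
rng = np.random.default_rng(0)
z = rng.normal(size=200)                   # hypothetical data D
sigma_d = 0.7                              # illustrative length scale

y1 = np.linspace(-1.0, 1.0, N)             # two arbitrary candidate network outputs
y2 = np.linspace(-0.3, 2.0, N)

nll_gap = neg_log_likelihood(y1, z, taus, sigma_d) - neg_log_likelihood(y2, z, taus, sigma_d)
loss_gap = summed_quantile_loss(y1, z, taus) - summed_quantile_loss(y2, z, taus)
print(nll_gap, loss_gap / sigma_d)         # the two differences coincide
```

Because the two objectives differ only by a positive scale and an additive constant, they rank any pair of outputs identically and therefore share their minimizers.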

2.2 Uncertainty estimates

Using the framework described above, we now propose expressions for both aleatoric and epistemic uncertainties.

2.2.1 Epistemic uncertainty

To obtain a single aggregate measure of the epistemic uncertainty on the return distribution, we propose taking the average of the epistemic uncertainty on the quantiles, defined by their variance over 𝜽,

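\[
\sigma^2_{\text{epistemic}} = \mathbb{E}_{i \sim \mathcal{U}\{1,N\}}\Big[\operatorname{var}_{\boldsymbol{\theta} \sim P(\boldsymbol{\theta} \mid D)}\big(y_i(\boldsymbol{\theta},s,a)\big)\Big] \tag{4}
\]

where $\mathcal{U}\{1,N\}$ is the uniform distribution over $\{1, \dots, N\}$.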

2.2.2 Aleatoric uncertainty

An intuitive measure of the aleatoric uncertainty is the variance of the quantile values. However, this variance is also affected by epistemic uncertainty in the form of the distribution over 𝜽. To decouple aleatoric uncertainty from epistemic uncertainty, we define the aleatoric uncertainty as the variance of the expected value of the quantiles according to the posterior distribution over 𝜽,

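\[
\sigma^2_{\text{aleatoric}} = \operatorname{var}_{i \sim \mathcal{U}\{1,N\}}\Big[\mathbb{E}_{\boldsymbol{\theta} \sim P(\boldsymbol{\theta} \mid D)}\, y_i(\boldsymbol{\theta},s,a)\Big] \tag{5}
\]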

When the posterior is concentrated around a single value, we recover the intuitive definition of aleatoric uncertainty as the variance of the quantiles. However, when the posterior is not concentrated, the variance of a single set of quantiles is a biased estimator of $\sigma^2_{\text{aleatoric}}$:

Proposition 2.1.

(Proof in the appendix) Consider $\hat{\boldsymbol{\theta}}$ drawn from the posterior distribution over $\boldsymbol{\theta}$. Then $\operatorname{var}_{i \sim \mathcal{U}\{1,N\}}\big[y_i(\hat{\boldsymbol{\theta}},s,a)\big]$ is a biased estimator of $\sigma^2_{\text{aleatoric}}$.
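Proposition 2.1 can be illustrated with a small Monte Carlo experiment. The sketch below assumes a synthetic posterior, which is not taken from the paper: each output $y_i$ is Gaussian around a fixed mean `m[i]` with standard deviation `s`, independently across outputs and posterior samples.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 20000                        # number of quantile outputs, posterior samples
m = np.linspace(0.0, 2.0, N)            # hypothetical posterior means E_theta[y_i]
s = 1.0                                 # hypothetical posterior std of every output

# Row k holds the N outputs y_i(theta_k) of one posterior sample theta_k.
Y = m[None, :] + s * rng.normal(size=(M, N))

# Equation 5: variance over i of the posterior-mean quantiles.
sigma2_aleatoric = np.var(Y.mean(axis=0))

# Naive estimator: variance over i of a single set of quantiles, averaged over theta.
mean_single_sample_var = np.var(Y, axis=1).mean()

print(f"sigma^2_aleatoric         = {sigma2_aleatoric:.3f}")
print(f"E_theta[var_i y_i(theta)] = {mean_single_sample_var:.3f}")
```

In this synthetic setting the naive estimator is inflated by roughly the posterior variance of the outputs, which is the conflation of the two uncertainties that the proposition describes.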

2.2.3 Decomposition of Uncertainties

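We require that the total uncertainty on the return distribution can be decomposed as the sum of these two uncertainties. We consider the total variance of the return distribution $\operatorname{var}_{\boldsymbol{\theta} \sim P(\boldsymbol{\theta} \mid D),\, i \sim \mathcal{U}\{1,N\}}\big(y_i(\boldsymbol{\theta},s,a)\big)$, which for notational simplicity we write $\operatorname{var}_{\boldsymbol{\theta},i}\big(y_i(\boldsymbol{\theta},s,a)\big)$.

Proposition 2.2.

(Proof in the appendix) Considering the expressions for $\sigma_{\text{epistemic}}$ and $\sigma_{\text{aleatoric}}$ in equations 4 and 5,

\[
\operatorname{var}_{\boldsymbol{\theta},i}\big(y_i(\boldsymbol{\theta},s,a)\big) = \sigma^2_{\text{epistemic}} + \sigma^2_{\text{aleatoric}} \tag{6}
\]

We also consider two limit cases as sanity checks. First, in the absence of data, when all the uncertainty should be epistemic, we do find $\operatorname{var}_{\boldsymbol{\theta},i}\big(y_i(\boldsymbol{\theta},s,a)\big) = \sigma^2_{\text{epistemic}}$. In the limit of infinite data, when all the uncertainty should be aleatoric, we also find $\operatorname{var}_{\boldsymbol{\theta},i}\big(y_i(\boldsymbol{\theta},s,a)\big) = \sigma^2_{\text{aleatoric}}$.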
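The decomposition in equation 6 is an instance of the law of total variance, and it holds exactly for any finite set of posterior samples when population variances are used. A minimal numerical check, assuming an arbitrary synthetic matrix `Y` of network outputs (rows index posterior samples, columns index quantiles; the values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 5000                           # quantile outputs, posterior samples
m = np.sin(np.linspace(0.0, 3.0, N))      # hypothetical posterior means per output
s = np.linspace(0.1, 0.8, N)              # hypothetical posterior stds per output
Y = m[None, :] + s[None, :] * rng.normal(size=(M, N))

sigma2_epistemic = np.var(Y, axis=0).mean()   # equation 4: E_i[var_theta(y_i)]
sigma2_aleatoric = np.var(Y.mean(axis=0))     # equation 5: var_i[E_theta(y_i)]
total_variance = np.var(Y)                    # variance over theta and i jointly

print(total_variance, sigma2_epistemic + sigma2_aleatoric)
```

Setting all entries of `s` to zero (a fully concentrated posterior) makes the epistemic term vanish, while making `m` constant removes the aleatoric term, mirroring the two limit cases above.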

2.3 Approximate Uncertainties Using Two Networks

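Figure 1: Illustration of uncertainty estimates provided by $\tilde{\sigma}_{\text{epistemic}}$ and $\tilde{\sigma}_{\text{aleatoric}}$ on a toy dataset (black dots). Intervals represent $\pm\sigma$ for all uncertainties, and the estimated total uncertainty $\tilde{\sigma}_{\text{total}}$ is defined as $\tilde{\sigma}^2_{\text{total}} = \tilde{\sigma}^2_{\text{aleatoric}} + \tilde{\sigma}^2_{\text{epistemic}}$.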

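Estimating the variance and expectation over $\boldsymbol{\theta}$ in the previous expressions for both uncertainties requires in principle a large number of samples of $\boldsymbol{\theta}$, which is impractical. Instead, we propose the following approximations of $\sigma^2_{\text{epistemic}}$ and $\sigma^2_{\text{aleatoric}}$ using only two samples $\boldsymbol{\theta}_A$ and $\boldsymbol{\theta}_B$ from the posterior distribution over $\boldsymbol{\theta}$,

\[
\begin{aligned}
\tilde{\sigma}^2_{\text{epistemic}} &= \frac{1}{2}\, \mathbb{E}_{i \sim \mathcal{U}\{1,N\}}\Big[\big(y_i(\boldsymbol{\theta}_A,s,a) - y_i(\boldsymbol{\theta}_B,s,a)\big)^2\Big] \\
\tilde{\sigma}^2_{\text{aleatoric}} &= \operatorname{cov}_{i \sim \mathcal{U}\{1,N\}}\big(y_i(\boldsymbol{\theta}_A,s,a),\, y_i(\boldsymbol{\theta}_B,s,a)\big)
\end{aligned} \tag{7}
\]

Proposition 2.3.

(Proof in the appendix) $\tilde{\sigma}_{\text{epistemic}}$ and $\tilde{\sigma}_{\text{aleatoric}}$ are unbiased estimators of $\sigma_{\text{epistemic}}$ and $\sigma_{\text{aleatoric}}$. Moreover, assuming that the network outputs are uncorrelated, the variance of these estimators converges towards 0 as the number of quantiles increases.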

In figure 1, we provide an illustration of the uncertainties measured with σ~epistemic and σ~aleatoric on a toy dataset. We consider a neural network that estimates 50 quantiles from the target distribution, and we draw two samples of 𝜽 using approximate MAP sampling (pearce2018bayesian). As expected, σ~epistemic is small close to the data but large far from it, while σ~aleatoric correctly captures the noise in the data.
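The two-network estimators of equation 7 are simple to compute from the two vectors of quantile outputs. The sketch below implements them and checks unbiasedness by averaging over many independent pairs drawn from a synthetic Gaussian posterior; that posterior, the function name `two_network_estimates`, and all numerical values are assumptions for illustration and do not reproduce the toy experiment of figure 1.

```python
import numpy as np

def two_network_estimates(y_a, y_b):
    """Equation 7, for arrays whose last axis indexes the N quantile outputs."""
    eps2 = 0.5 * np.mean((y_a - y_b) ** 2, axis=-1)          # epistemic estimate
    dev_a = y_a - y_a.mean(axis=-1, keepdims=True)
    dev_b = y_b - y_b.mean(axis=-1, keepdims=True)
    ale2 = np.mean(dev_a * dev_b, axis=-1)                   # covariance over i
    return eps2, ale2

rng = np.random.default_rng(2)
N, pairs = 50, 20000
m = np.linspace(0.0, 2.0, N)     # hypothetical posterior means of the quantiles
s = 0.5                          # hypothetical posterior std of every output

# Each row is one independent pair (theta_A, theta_B) of posterior samples.
y_a = m + s * rng.normal(size=(pairs, N))
y_b = m + s * rng.normal(size=(pairs, N))
eps2, ale2 = two_network_estimates(y_a, y_b)

true_eps2 = s ** 2               # equation 4 for this synthetic posterior
true_ale2 = np.var(m)            # equation 5 for this synthetic posterior
print(f"epistemic: mean estimate {eps2.mean():.3f}, target {true_eps2:.3f}")
print(f"aleatoric: mean estimate {ale2.mean():.3f}, target {true_ale2:.3f}")
```

Note that a single covariance estimate can be negative even though its expectation is not, so any code that takes its square root needs to handle that case explicitly.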

2.4 Uncertainty-Aware Deep Q Networks

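Algorithm 1 UA-DQN action selection

Requires: Action set $\mathcal{A}$, hyperparameters $\lambda$ and $\beta$, value network $\boldsymbol{\theta}_v$, and two auxiliary networks $\boldsymbol{\theta}_A$ and $\boldsymbol{\theta}_B$ approximately sampled from the posterior distribution over $\boldsymbol{\theta}$ with randomized MAP sampling (pearce2018bayesian).
  for $a$ in $\mathcal{A}$ do
    Calculate action mean $\mu = \mathbb{E}_i[y_i(\boldsymbol{\theta}_v)]$
    Calculate uncertainties $\tilde{\sigma}^2_{\text{epistemic}}$ and $\tilde{\sigma}^2_{\text{aleatoric}}$ using networks $\boldsymbol{\theta}_A$ and $\boldsymbol{\theta}_B$
    Adjust for risk-aversion: $\mu \leftarrow \mu - \lambda \tilde{\sigma}_{\text{aleatoric}}$
    Draw a sample $\hat{Q}_a$ from $\mathcal{N}(\mu, \beta \tilde{\sigma}^2_{\text{epistemic}})$
  end for
  Output: $\arg\max_a [\hat{Q}_a]$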

Until now, we have been mainly concerned with learning the return distribution given an ensemble of samples of this distribution. For temporal difference learning, we replace these samples with the Bellman target. Although this implies measuring the “one-step” epistemic uncertainty on the bootstrapped target instead of that on the total return, this uncertainty is nonetheless useful for exploration as it allows for the identification of less-visited state-action pairs.

There are several ways these uncertainty estimates could be included into a reinforcement learning algorithm, for example to drive information-directed exploration (nikolov2019information). To better contrast the different roles played by both uncertainties, we propose a simple uncertainty-aware Deep Q Networks algorithm (UA-DQN), which is based on the QR-DQN algorithm of (dabney2017distributional) but includes the following modifications, presented in Algorithm 1:

Auxiliary networks for uncertainty estimation. To disentangle value learning and uncertainty estimation, we consider two auxiliary networks 𝜽A and 𝜽B both trained on the targets used in QR-DQN and approximately sampled from the posterior distribution over 𝜽. These networks are used to derive σ~epistemic and σ~aleatoric.

Uncertainty-Aware Action Selection. Instead of the ϵ-greedy policy used by QR-DQN, we use our uncertainty estimates to separately drive risk-awareness and exploration. We use σ~aleatoric to penalize high-variance actions, while σ~epistemic drives exploration using Thompson sampling.
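A minimal sketch of the action-selection step of Algorithm 1 is given below. It assumes the quantile outputs of the three networks have already been computed for the current state and are passed in as arrays of shape `(num_actions, N)`. The function name, the array layout, and the clipping of a negative covariance estimate to zero before taking its square root are choices made for this example and are not specified in the paper.

```python
import numpy as np

def ua_dqn_select_action(q_v, q_a, q_b, lam, beta, rng):
    """Pick an action from quantile outputs of the value net (q_v) and auxiliary nets (q_a, q_b)."""
    mu = q_v.mean(axis=1)                                   # action means
    eps2 = 0.5 * np.mean((q_a - q_b) ** 2, axis=1)          # epistemic estimate (eq. 7)
    dev_a = q_a - q_a.mean(axis=1, keepdims=True)
    dev_b = q_b - q_b.mean(axis=1, keepdims=True)
    ale2 = np.mean(dev_a * dev_b, axis=1)                   # aleatoric estimate (eq. 7)
    ale = np.sqrt(np.clip(ale2, 0.0, None))                 # assumption: clip negatives to 0
    mu_risk = mu - lam * ale                                # risk-aversion penalty
    q_hat = rng.normal(mu_risk, np.sqrt(beta * eps2))       # Thompson-style sample
    return int(np.argmax(q_hat))

# Hypothetical two-action example: a risky action (higher mean, wide return
# distribution) and a safe action (lower mean, deterministic return).
N = 51
risky = np.linspace(-5.2, 14.8, N)      # mean 4.8
safe = np.full(N, 4.0)                  # mean 4.0
q = np.stack([risky, safe])

rng = np.random.default_rng(0)
# Identical auxiliary networks give zero epistemic uncertainty, so the choice is deterministic.
choice_neutral = ua_dqn_select_action(q, q, q, lam=0.0, beta=0.2, rng=rng)
choice_averse = ua_dqn_select_action(q, q, q, lam=0.5, beta=0.2, rng=rng)
print(choice_neutral, choice_averse)
```

With no epistemic uncertainty, the risk-neutral setting picks the higher-mean action while the risk-averse setting prefers the deterministic one; when the auxiliary networks disagree, the sampled values inject exploration in proportion to that disagreement.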

3 Experiments

3.1 Safe Learning

Figure 2: Top: grid environment. The risky trajectory has the highest expected reward but involves going through windy tiles where the agent may fall off the cliff. Bottom: Cumulative falls for different agents during training. Shaded areas indicate the 95% confidence interval of the mean obtained from 30 training seeds.
Figure 3: Learning curves over 5 million steps for different agents on the MinAtar testbed. Shaded areas correspond to the 95% confidence interval of the mean obtained from 10 training seeds.

We first empirically study the behavior of our uncertainty-aware DQN on an environment inspired by the AI Safety Gridworlds (leike2017ai). We consider a simple 2×5 gridworld represented in figure 2, in which the agent must navigate to a goal without falling off a cliff. The agent receives -1 point at each timestep, +10 points for reaching the goal, and if the agent falls off the cliff the environment restarts. We introduce a stochastic wind in this environment, which with probability 5% knocks the agent off the cliff if the agent is on the ledge. The expectation value of the returns associated with the risky trajectory along the ledge is 4.8, while for the safe trajectory the return is deterministically 4.

We compare the learning behavior of our uncertainty-aware DQN agent to other comparable algorithms. We consider the ϵ-greedy QR-DQN algorithm of (dabney2017distributional), a risk-neutral version of UA-DQN (with λ=0), and two risk-averse variants of UA-DQN (with λ=0.5). Variant 1 uses the variance of the learned quantiles to estimate aleatoric uncertainty, while variant 2 uses our σ~aleatoric estimator. All algorithms differ only in action selection.

Experimental results are shown in figure 2. While all algorithms quickly learn to solve this simple task, there are marked differences in behavior. QR-DQN falls off the most due to both taking the risky trajectory and its ϵ-greedy policy. Risk-neutral UA-DQN only falls off the cliff due to its use of the risky trajectory. Both risk-averse variants of UA-DQN learn to use the safe trajectory. However, variant 1 overestimates aleatoric uncertainty due to its use of a biased estimator, takes longer to identify the safe trajectory, and thus accumulates more falls during learning than variant 2 that uses our unbiased estimate σ~aleatoric. Variant 2, which uses our unbiased estimators of both uncertainties, is the least likely to fall off the cliff during learning.

3.2 Evaluation on MinAtar

We now evaluate our UA-DQN algorithm on the MinAtar testbed (young2019minatar), which contains simplified implementations of 5 Atari games. Compared to the Arcade Learning Environment (bellemare2013arcade), MinAtar has similar underlying game dynamics but involves lower-dimensional observations. This helps to decouple representation learning from behavioral learning, allowing us to focus on the latter, and also encourages reproducibility, as the reduced computational overhead allows for more thorough comparisons involving more training seeds.

We compare risk-neutral UA-DQN ($\lambda=0$) with DQN (mnih2015human), QR-DQN, and Bootstrapped DQN (osband2016deep). In contrast to UA-DQN, which uses Thompson sampling to explore, Bootstrapped DQN uses an ensemble of bootstrapped DQN heads to achieve diverse behaviors. All algorithms are implemented within the same code base using the hyperparameters of (young2019minatar), except that instead of RMSProp we use the Adam optimizer (kingma2014adam) with learning rate $10^{-4}$ and $\epsilon=10^{-8}$. We also optimized the exploration hyperparameters: we use a final $\epsilon$ of 0.03 for the $\epsilon$-greedy policies of DQN, QR-DQN, and Bootstrapped DQN, and for UA-DQN we use $\beta=0.2$. We selected these values on the game Breakout and then kept them fixed for all other games.

Our results are shown in figure 3. As reported in (dabney2017distributional), we find that QR-DQN outperforms the other DQN variants, and that UA-DQN in turn significantly outperforms QR-DQN. To understand why this is, we inspect the behavior of UA-DQN and find that even at the end of training roughly 10-20% of the actions selected by UA-DQN are non-greedy. As UA-DQN only differs from QR-DQN in action selection, and QR-DQN’s performance decreases with higher levels of ϵ-greedy exploration, this result indicates that UA-DQN successfully uses σ~epistemic to appropriately decide when best to perform exploratory or greedy actions.

4 Conclusion

Estimating both uncertainties is important for developing agents that can both explore efficiently and account for risk in their actions. We propose a scheme whereby both types of uncertainty on the expected return of a policy can be estimated in deep reinforcement learning. We show that unbiased estimators for these uncertainties can be obtained using only two networks, and that these estimators can be efficiently harnessed by an uncertainty-aware DQN algorithm for improved risk-sensitivity and exploration. We find that this UA-DQN algorithm significantly outperforms other DQN variants on the MinAtar testbed.


Appendix A Related Work

Our work focuses on the problem of estimating the uncertainty of the expected return of a policy in model-free reinforcement learning. The epistemic uncertainty on the expected return has been shown to be useful for exploration (osband2016deep; azizzadenesheli2018efficient; touati2018randomized). On the other hand, the aleatoric uncertainty of the expected return is useful for designing risk-averse policies (howard1972risk; tamar2016learning; dabney2018implicit). Whereas most prior work considers both uncertainties separately, (tang2018exploration; moerland2017efficient) are interested in both, but their methods yield only an aggregate uncertainty. (nikolov2019information) do make use of both uncertainties to drive information-directed sampling (russo2014learning), but their uncertainty estimates derive from two different frameworks. Moreover, we argue that the variance of the learned return distribution, which (nikolov2019information) use for aleatoric uncertainty estimation, conflates both uncertainties for out of distribution data. Our work aims to provide a single framework for simultaneously estimating both uncertainties for the return distribution.

Estimating both types of uncertainty is also important in model-based reinforcement learning, where uncertainties affect the predictions of a learned dynamics model of the environment. Uncertainty estimates can be used in planning, either for better exploration (schmidhuber1991possibility; sun2011planning) or to avoid risky or unknown sections of the environment (garcia2015comprehensive). Model based algorithms that explicitly account for both aleatoric and epistemic uncertainties have recently also been developed (depeweg2018decomposition; chua2018deep; henaff2019model).

An approach that combines model free and model based techniques consists of using the uncertainties derived from a learned dynamics model to inform the policy of a model-free agent. The epistemic uncertainty associated with the learned model can for example be used as an intrinsic motivation bonus (stadie2015incentivizing; pathak2017curiosity; burda2018exploration). However, uncertainties on the transition model do not typically convey information about the uncertainty of the expected return of a policy, which is a quantity of fundamental interest in reinforcement learning.

Appendix B Proofs

In the following, for notational simplicity we will omit the dependence of $y_i(\boldsymbol{\theta},s,a)$ on $s$ and $a$. Moreover, subscripts used in variances/expectation values should be interpreted as the variance/expectation value taken over the distribution of the variables in the subscript, so that for example $\mathbb{E}_{\boldsymbol{\theta}} = \mathbb{E}_{\boldsymbol{\theta} \sim P(\boldsymbol{\theta} \mid D)}$ and $\mathbb{E}_i = \mathbb{E}_{i \sim \mathcal{U}\{1,N\}}$. We will also assume that the following integrals over $P(\boldsymbol{\theta} \mid D)$ are well defined, which, considering in particular the Gaussian prior over the weights, is a reasonable assumption.

B.1 Proof of proposition 2.1

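Here, we show that, for a sample $\hat{\boldsymbol{\theta}}$ drawn from the posterior distribution over $\boldsymbol{\theta}$, $\operatorname{var}_i[y_i(\hat{\boldsymbol{\theta}})]$ is a biased estimator of $\sigma^2_{\text{aleatoric}}$. We do so by showing that $\mathbb{E}_{\boldsymbol{\theta}}\big[\operatorname{var}_i[y_i(\boldsymbol{\theta})]\big]$ is greater than $\sigma^2_{\text{aleatoric}}$.

\[
\begin{aligned}
\mathbb{E}_{\boldsymbol{\theta}}\big[\operatorname{var}_i[y_i(\boldsymbol{\theta})]\big] &= \mathbb{E}_{\boldsymbol{\theta}}\Big[\frac{1}{N}\sum_{j=1}^{N}\big(y_j(\boldsymbol{\theta}) - \mathbb{E}_i[y_i(\boldsymbol{\theta})]\big)^2\Big] \\
&= \frac{1}{N}\sum_{j=1}^{N}\mathbb{E}_{\boldsymbol{\theta}}\Big[\big(y_j(\boldsymbol{\theta}) - \mathbb{E}_i[y_i(\boldsymbol{\theta})]\big)^2\Big]
\end{aligned}
\]

By definition of the variance, we also have

\[
\operatorname{var}_{\boldsymbol{\theta}}\big[y_j(\boldsymbol{\theta}) - \mathbb{E}_i[y_i(\boldsymbol{\theta})]\big] = \mathbb{E}_{\boldsymbol{\theta}}\Big[\big(y_j(\boldsymbol{\theta}) - \mathbb{E}_i[y_i(\boldsymbol{\theta})]\big)^2\Big] - \big(\mathbb{E}_{\boldsymbol{\theta}}[y_j(\boldsymbol{\theta})] - \mathbb{E}_{\boldsymbol{\theta},i}[y_i(\boldsymbol{\theta})]\big)^2
\]

Therefore, when the posterior over $\boldsymbol{\theta}$ is not concentrated and $\operatorname{var}_{\boldsymbol{\theta}}\big[y_j(\boldsymbol{\theta}) - \mathbb{E}_i[y_i(\boldsymbol{\theta})]\big] > 0$,

\[
\begin{aligned}
\mathbb{E}_{\boldsymbol{\theta}}\big[\operatorname{var}_i[y_i(\boldsymbol{\theta})]\big] &> \frac{1}{N}\sum_{j=1}^{N}\big(\mathbb{E}_{\boldsymbol{\theta}}[y_j(\boldsymbol{\theta})] - \mathbb{E}_{\boldsymbol{\theta},i}[y_i(\boldsymbol{\theta})]\big)^2 \\
&= \operatorname{var}_i\big[\mathbb{E}_{\boldsymbol{\theta}}[y_i(\boldsymbol{\theta})]\big] \\
&= \sigma^2_{\text{aleatoric}}
\end{aligned}
\]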

B.2 Proof of proposition 2.2

Here, we show that var𝜽,i(yi(𝜽,s,a))=σepistemic2+σaleatoric2.

var𝜽,i(yi(𝜽)) =𝜽1Nj=1N(yj(𝜽)-𝔼𝜽,i[yi(𝜽)])2P(𝜽|D)d𝜽
=𝜽1Nj=1N(yj(𝜽)-𝔼𝜽[yj(𝜽)]+𝔼𝜽[yj(𝜽)]-𝔼𝜽,i[yi(𝜽)])2P(𝜽|D)d𝜽
=𝜽1Nj=1N((yj(𝜽)-𝔼𝜽[yj(𝜽)])2
  +(𝔼𝜽[yj(𝜽)]-𝔼𝜽,i[yi(𝜽)])2
  +2(𝔼𝜽[yj(𝜽)]-𝔼𝜽,i[yi(𝜽)])(yj(𝜽)-𝔼𝜽[yj(𝜽)]))P(𝜽|D)d𝜽

The integral over 𝜽 of the last line is 0, which leaves us with

var𝜽,i(yi(𝜽)) =𝜽1Nj=1N(yj(𝜽)-𝔼𝜽[yj(𝜽)])2P(𝜽|D)d𝜽+𝜽1Nj=1N(𝔼𝜽[yj(𝜽)]-𝔼𝜽,i[yi(𝜽)])2P(𝜽|D)d𝜽
=1Nj=1N𝜽(yj(𝜽)-𝔼𝜽[yj(𝜽)])2P(𝜽|D)𝑑𝜽+1Nj=1N(𝔼𝜽[yj(𝜽)]-𝔼𝜽,i[yi(𝜽)])2
=𝔼i(var𝜽(yi(𝜽)))+vari(𝔼𝜽yi(𝜽))
=σepistemic2+σaleatoric2

B.3 Proof of proposition 2.3

B.4 Expectation of the estimators

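Here, we show that $\tilde{\sigma}_{\text{epistemic}}$ and $\tilde{\sigma}_{\text{aleatoric}}$ are unbiased estimators of $\sigma_{\text{epistemic}}$ and $\sigma_{\text{aleatoric}}$. In the following, $\mathbb{E}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}$ indicates the expectation value when $\boldsymbol{\theta}_A$ and $\boldsymbol{\theta}_B$ are drawn from the posterior distribution over $\boldsymbol{\theta}$. Moreover, in what follows it can easily be verified that expectations over $\boldsymbol{\theta}$ and over $i$ are interchangeable due to the discrete nature of the expectation over $i$.

\[
\begin{aligned}
\mathbb{E}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\big[\tilde{\sigma}^2_{\text{epistemic}}\big] &= \frac{1}{2}\,\mathbb{E}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\mathbb{E}_i\Big[\big(y_i(\boldsymbol{\theta}_A) - y_i(\boldsymbol{\theta}_B)\big)^2\Big] \\
&= \frac{1}{2}\,\mathbb{E}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\mathbb{E}_i\Big[\big(y_i(\boldsymbol{\theta}_A) - \mathbb{E}_{\boldsymbol{\theta}}(y_i(\boldsymbol{\theta})) + \mathbb{E}_{\boldsymbol{\theta}}(y_i(\boldsymbol{\theta})) - y_i(\boldsymbol{\theta}_B)\big)^2\Big] \\
&= \frac{1}{2}\,\mathbb{E}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\Big[\mathbb{E}_i\big[\big(y_i(\boldsymbol{\theta}_A) - \mathbb{E}_{\boldsymbol{\theta}}(y_i(\boldsymbol{\theta}))\big)^2\big] + \mathbb{E}_i\big[\big(\mathbb{E}_{\boldsymbol{\theta}}(y_i(\boldsymbol{\theta})) - y_i(\boldsymbol{\theta}_B)\big)^2\big] \\
&\qquad + 2\,\mathbb{E}_i\big[\big(\mathbb{E}_{\boldsymbol{\theta}}(y_i(\boldsymbol{\theta})) - y_i(\boldsymbol{\theta}_B)\big)\big(y_i(\boldsymbol{\theta}_A) - \mathbb{E}_{\boldsymbol{\theta}}(y_i(\boldsymbol{\theta}))\big)\big]\Big]
\end{aligned}
\]

The average over either $\boldsymbol{\theta}_A$ or $\boldsymbol{\theta}_B$ of the last line is zero, which, after noticing that $\boldsymbol{\theta}_A$ and $\boldsymbol{\theta}_B$ are now separable such that we can use the equality $\mathbb{E}_{\boldsymbol{\theta}_A}[y_i(\boldsymbol{\theta}_A)] = \mathbb{E}_{\boldsymbol{\theta}_B}[y_i(\boldsymbol{\theta}_B)] = \mathbb{E}_{\boldsymbol{\theta}}[y_i(\boldsymbol{\theta})]$, leaves us with

\[
\begin{aligned}
\mathbb{E}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\big[\tilde{\sigma}^2_{\text{epistemic}}\big] &= \frac{1}{2}\Big(\mathbb{E}_{\boldsymbol{\theta}}\Big[\mathbb{E}_i\big(y_i(\boldsymbol{\theta}) - \mathbb{E}_{\boldsymbol{\theta}}(y_i(\boldsymbol{\theta}))\big)^2 + \mathbb{E}_i\big(\mathbb{E}_{\boldsymbol{\theta}}(y_i(\boldsymbol{\theta})) - y_i(\boldsymbol{\theta})\big)^2\Big]\Big) \\
&= \mathbb{E}_{\boldsymbol{\theta}}\Big[\mathbb{E}_i\big(y_i(\boldsymbol{\theta}) - \mathbb{E}_{\boldsymbol{\theta}}(y_i(\boldsymbol{\theta}))\big)^2\Big] \\
&= \mathbb{E}_i\Big[\mathbb{E}_{\boldsymbol{\theta}}\big(y_i(\boldsymbol{\theta}) - \mathbb{E}_{\boldsymbol{\theta}}(y_i(\boldsymbol{\theta}))\big)^2\Big] \\
&= \mathbb{E}_i\big[\operatorname{var}_{\boldsymbol{\theta}}(y_i(\boldsymbol{\theta}))\big] \\
&= \sigma^2_{\text{epistemic}}
\end{aligned}
\]

so $\tilde{\sigma}_{\text{epistemic}}$ is indeed an unbiased estimator of $\sigma_{\text{epistemic}}$.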

Similarly, for σ~aleatoric, and introducing ϵi(𝜽A)=yi(𝜽A)-𝔼𝜽(yi(𝜽)) and ϵi(𝜽B)=yi(𝜽B)-𝔼𝜽(yi(𝜽)),

𝔼𝜽𝑨,𝜽𝑩[σ~aleatoric2] =𝔼𝜽𝑨,𝜽𝑩covi(yi(𝜽A),yi(𝜽B))
=𝔼𝜽𝑨,𝜽𝑩covi(ϵi(𝜽A)+𝔼𝜽(yi(𝜽)),ϵi(𝜽B)+𝔼𝜽(yi(𝜽)))
=𝔼𝜽𝑨,𝜽𝑩[covi(𝔼𝜽(yi(𝜽)),𝔼𝜽(yi(𝜽)))+covi(ϵi(𝜽A),𝔼𝜽(yi(𝜽)))
  +covi(𝔼𝜽(yi(𝜽)),ϵi(𝜽B))+covi(ϵi(𝜽A),ϵi(𝜽B))]

Looking at these terms individually, we have

𝔼𝜽𝑨,𝜽𝑩[covi(𝔼𝜽(yi(𝜽)),𝔼𝜽(yi(𝜽)))] =vari(𝔼𝜽(yi(𝜽)))
=σaleatoric2
E𝜽𝑨,𝜽𝑩[covi(ϵi(𝜽A),𝔼𝜽(yi(𝜽)))] =E𝜽𝑨[covi(ϵi(𝜽A),𝔼𝜽(yi(𝜽)))]
=E𝜽𝑨[1Nj=1N(ϵj(𝜽A)-𝔼i(ϵi(𝜽A)))(𝔼𝜽(yj(𝜽))-𝔼i𝔼𝜽(yi(𝜽)))]
=1Nj=1N(𝔼𝜽(yj(𝜽))-𝔼i𝔼𝜽(yi(𝜽)))(𝔼𝜽𝑨(ϵj(𝜽A))-𝔼i(𝔼𝜽𝑨(ϵi(𝜽A))))
=0  since𝔼𝜽𝑨(ϵi(𝜽𝑨))=0for all i
E𝜽𝑨,𝜽𝑩[covi(𝔼𝜽(yi(𝜽))),ϵi(𝜽B)] =0[Same derivation as previous expression]
E𝜽𝑨,𝜽𝑩[covi(ϵi(𝜽A),ϵi(𝜽B))] =E𝜽𝑨,𝜽𝑩[1Nj=1N(ϵj(𝜽A)-𝔼i(ϵi(𝜽A)))(ϵj(𝜽B)-𝔼i(ϵi(𝜽B)))]
=E𝜽𝑨[1Nj=1N(ϵj(𝜽A)-𝔼i(ϵi(𝜽A)))(E𝜽𝑩(ϵj(𝜽B))-Ei(𝔼𝜽𝑩(ϵi(𝜽B))))]
=0

As desired, we end up with

𝔼𝜽𝑨,𝜽𝑩[σ~aleatoric2] =σaleatoric2

B.5 Variance of the estimators

Using the same notation as in the previous section, we can write

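\[
\begin{aligned}
\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\big[\tilde{\sigma}^2_{\text{epistemic}}\big] &= \frac{1}{4}\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\,\mathbb{E}_i\Big[\big(y_i(\boldsymbol{\theta}_A) - y_i(\boldsymbol{\theta}_B)\big)^2\Big] \\
&= \frac{1}{4}\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\Big[\frac{1}{N}\sum_{i=1}^{N}\big(y_i(\boldsymbol{\theta}_A) - y_i(\boldsymbol{\theta}_B)\big)^2\Big] \\
&= \frac{1}{4N^2}\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\Big[\sum_{i=1}^{N}\big(y_i(\boldsymbol{\theta}_A) - y_i(\boldsymbol{\theta}_B)\big)^2\Big]
\end{aligned}
\]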

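We now use our assumption that all outputs of the neural networks are decorrelated to write

\[
\begin{aligned}
\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\big[\tilde{\sigma}^2_{\text{epistemic}}\big] &= \frac{1}{4N^2}\sum_{i=1}^{N}\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\Big[\big(y_i(\boldsymbol{\theta}_A) - y_i(\boldsymbol{\theta}_B)\big)^2\Big] \\
&= \frac{1}{4N^2}\sum_{i=1}^{N}\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\Big[y_i(\boldsymbol{\theta}_A)^2 + y_i(\boldsymbol{\theta}_B)^2 - 2\,y_i(\boldsymbol{\theta}_A)\,y_i(\boldsymbol{\theta}_B)\Big] \\
&\le \frac{3}{4N^2}\sum_{i=1}^{N}\Big[\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\big[y_i(\boldsymbol{\theta}_A)^2\big] + \operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\big[y_i(\boldsymbol{\theta}_B)^2\big] + 4\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\big[y_i(\boldsymbol{\theta}_A)\,y_i(\boldsymbol{\theta}_B)\big]\Big] \\
&\qquad \text{[where we used the Cauchy-Schwarz inequality]} \\
&= \frac{3}{4N^2}\sum_{i=1}^{N}\Big[2\operatorname{var}_{\boldsymbol{\theta}}\big[y_i(\boldsymbol{\theta})^2\big] + 8\big(\mathbb{E}_{\boldsymbol{\theta}}\,y_i(\boldsymbol{\theta})\big)^2\operatorname{var}_{\boldsymbol{\theta}}\big[y_i(\boldsymbol{\theta})\big] + 4\big(\operatorname{var}_{\boldsymbol{\theta}}\big[y_i(\boldsymbol{\theta})\big]\big)^2\Big]
\end{aligned}
\]

The last line uses the independence of $\boldsymbol{\theta}_A$ and $\boldsymbol{\theta}_B$, for which $\operatorname{var}[y_i(\boldsymbol{\theta}_A)\,y_i(\boldsymbol{\theta}_B)] = (\operatorname{var}_{\boldsymbol{\theta}}[y_i(\boldsymbol{\theta})])^2 + 2(\mathbb{E}_{\boldsymbol{\theta}}\,y_i(\boldsymbol{\theta}))^2\operatorname{var}_{\boldsymbol{\theta}}[y_i(\boldsymbol{\theta})]$.

We now further assume that $\mathbb{E}_{\boldsymbol{\theta}}[y_i(\boldsymbol{\theta})]$, $\operatorname{var}_{\boldsymbol{\theta}}[y_i(\boldsymbol{\theta})]$, and $\operatorname{var}_{\boldsymbol{\theta}}[y_i^2(\boldsymbol{\theta})]$ are bounded for all $i$ and $N$. Then, there is a constant $C$ such that, for all $i$ and $N$,

\[
2\operatorname{var}_{\boldsymbol{\theta}}\big[y_i(\boldsymbol{\theta})^2\big] + 8\big(\mathbb{E}_{\boldsymbol{\theta}}[y_i(\boldsymbol{\theta})]\big)^2\operatorname{var}_{\boldsymbol{\theta}}\big[y_i(\boldsymbol{\theta})\big] + 4\big(\operatorname{var}_{\boldsymbol{\theta}}\big[y_i(\boldsymbol{\theta})\big]\big)^2 \le C
\]

We then obtain

\[
\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\big[\tilde{\sigma}^2_{\text{epistemic}}\big] \le \frac{3}{4N^2}\sum_{i=1}^{N} C = \frac{3C}{4N}
\]

The variance of $\tilde{\sigma}^2_{\text{epistemic}}$ (and thus that of $\tilde{\sigma}_{\text{epistemic}}$) therefore decreases towards 0 as the number of quantiles increases.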
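The $1/N$ scaling of this bound can be observed empirically. The sketch below assumes a synthetic posterior in which the $N$ outputs are independent standard Gaussians with zero mean, which satisfies the uncorrelated-outputs assumption by construction; the values of `N`, the number of pairs, and the function name are illustrative choices.

```python
import numpy as np

def epistemic_estimate(y_a, y_b):
    """Two-network epistemic estimator: 0.5 * E_i[(y_i(theta_A) - y_i(theta_B))^2]."""
    return 0.5 * np.mean((y_a - y_b) ** 2, axis=-1)

rng = np.random.default_rng(3)
pairs, s = 4000, 1.0                 # number of (theta_A, theta_B) pairs, output std
variances = {}
for N in (10, 100, 1000):
    # Independent outputs with zero posterior mean (hypothetical posterior).
    y_a = s * rng.normal(size=(pairs, N))
    y_b = s * rng.normal(size=(pairs, N))
    variances[N] = np.var(epistemic_estimate(y_a, y_b))
    print(f"N = {N:4d}: variance of the estimator = {variances[N]:.5f}")
```

In this setting each tenfold increase in the number of quantiles should reduce the variance of the estimator by roughly a factor of ten; correlated outputs would weaken this effect, which is the concern examined in Appendix C.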

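As for the aleatoric uncertainty, a similar bound can be derived by rewriting $\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\big[\tilde{\sigma}^2_{\text{aleatoric}}\big]$ as follows.

\[
\begin{aligned}
\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\big[\tilde{\sigma}^2_{\text{aleatoric}}\big] &= \operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\Big[\operatorname{cov}_i\big(y_i(\boldsymbol{\theta}_A),\, y_i(\boldsymbol{\theta}_B)\big)\Big] \\
&= \operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\Big[\frac{1}{N}\sum_{j=1}^{N}\big(y_j(\boldsymbol{\theta}_A) - \mathbb{E}_i\,y_i(\boldsymbol{\theta}_A)\big)\big(y_j(\boldsymbol{\theta}_B) - \mathbb{E}_i\,y_i(\boldsymbol{\theta}_B)\big)\Big] \\
&= \frac{1}{N^2}\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\Big[\sum_{j=1}^{N}\big(y_j(\boldsymbol{\theta}_A) - \mathbb{E}_i\,y_i(\boldsymbol{\theta}_A)\big)\big(y_j(\boldsymbol{\theta}_B) - \mathbb{E}_i\,y_i(\boldsymbol{\theta}_B)\big)\Big]
\end{aligned}
\]

In a manner similar to the derivation of the variance of $\tilde{\sigma}^2_{\text{epistemic}}$, assuming that the network outputs are uncorrelated and that the first moments of $y_i(\boldsymbol{\theta})$ are bounded, we can also derive a bound for $\operatorname{var}_{\boldsymbol{\theta}_A,\boldsymbol{\theta}_B}\big[\tilde{\sigma}^2_{\text{aleatoric}}\big]$ that converges to 0 with increasing $N$.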

Appendix C Correlations between the outputs of a Bayesian neural network

Proposition 2.3 assumes that the network outputs are uncorrelated. Correlations between outputs could, for example, cause a network to overestimate all the quantiles. If both networks A and B produce overestimations, then $\tilde{\sigma}_{\text{epistemic}}$ would probably underestimate $\sigma_{\text{epistemic}}$. However, in the limit of infinite width, the outputs of Bayesian neural networks are uncorrelated for normal priors and separable likelihoods (neal2012bayesian). In the following, we experimentally explore in which cases this applies to finite-width neural networks and to approximate Bayesian techniques such as the randomized MAP sampling technique (pearce2018bayesian) used in our work.

C.1 Uncertainties for different network widths

Figure 4: Comparison of epistemic uncertainties obtained with the approximate MAP sampling method of (pearce2018bayesian). Top: uncertainties obtained by a single neural network with twenty outputs. Bottom: uncertainties obtained by an ensemble of 20 networks. Left: 10 neurons per hidden layer. Right: 100 neurons per hidden layer. The two colors of shading represent one and two standard deviations from the mean.

First, we compare the epistemic uncertainties obtained from an ensemble of neural networks generated by the "anchoring" approximate MAP sampling technique of (pearce2018bayesian) to those obtained from a single neural network with several outputs (also generated with approximate MAP sampling) on a toy regression problem. Both the problem formulation and the code for this experiment draw from the work of (pearce2018bayesian).

Representative samples from these experiments are shown in figure 4. For a small neural network with only 10 neurons per layer the different outputs of the multioutput neural network are indeed strongly correlated, which leads to poor uncertainty estimates (top left). The ensemble produces significantly better uncertainty estimates for the same network width (bottom left). However, as we increase the width of the neural network to 100 (top right) the uncertainty estimates of the network with multiple outputs improve and become close to those obtained by the larger ensemble of networks of the same width (bottom right).

Appendix D Further information on the MinAtar experiment

Our MinAtar experiments used the same network structure as (young2019minatar) and, apart from the optimized exploration hyperparameters and our use of the Adam optimizer described in the main text, the same hyperparameters, which are listed in table 1. We searched among $\{10^{-4}, 2.5\times10^{-4}\}$ for the Adam learning rate and $\{10^{-8}, 0.01/32\}$ for Adam $\epsilon$, and among $\{0.1, 0.03, 0.01\}$ for the final exploration $\epsilon$ using QR-DQN on Breakout. We found that whereas 0.01 and 0.03 led to similar average scores, a value of 0.03 led to smaller variance in the results. For UA-DQN, we searched among $\{0.5, 0.2, 0.1\}$ for $\beta$ on Breakout.

To approximately sample from the posterior over 𝜽 for the auxiliary networks used in UA-DQN, we use the approximate MAP sampling scheme of (pearce2018bayesian). For this scheme, we set the scale of the noise to a realistic value of 1, and the scale of the prior to the standard deviation of the network weights at initialization.

| Hyperparameter | Value |
|---|---|
| minibatch size | 32 |
| replay buffer size | 100000 |
| target network update frequency | 1000 |
| discount factor | 0.99 |
| number of steps | 5000000 |
| Adam learning rate | $10^{-4}$ |
| Adam $\epsilon$ | $10^{-8}$ |
| replay start size | 5000 |
| update frequency | 1 |
| initial $\epsilon$ (DQN, QR-DQN, Bootstrapped DQN) | 1 |
| final $\epsilon$ (DQN, QR-DQN, Bootstrapped DQN) | 0.03 |
| final exploration step (DQN, QR-DQN, Bootstrapped DQN) | 100000 |
| Bootstrapped heads (Bootstrapped DQN) | 10 |
| Number of quantiles (QR-DQN, UA-DQN) | 50 |
| $\beta$ (UA-DQN) | 0.2 |
| $\lambda$ (UA-DQN) | 0 |

Table 1: Hyperparameters used for our MinAtar experiments