Estimating Risk and Uncertainty in Deep Reinforcement Learning
Reinforcement learning agents are faced with two types of uncertainty. Epistemic uncertainty stems from limited data and is useful for exploration, whereas aleatoric uncertainty arises from stochastic environments and must be accounted for in risk-sensitive applications. We highlight the challenges involved in simultaneously estimating both of them, and propose a framework for disentangling and estimating these uncertainties on learned Q-values. We derive unbiased estimators of these uncertainties and introduce an uncertainty-aware DQN algorithm, which we show exhibits safe learning behavior and outperforms other DQN variants on the MinAtar testbed.
Distinguishing between epistemic uncertainty, which stems from limited data, and aleatoric uncertainty, which is caused by intrinsic stochasticity in the environment, is important in reinforcement learning for both exploration and risk-sensitivity (osband2016risk; moerland2017efficient; nikolov2019information). However, while prior work has developed independent methods for estimating each uncertainty, difficulties appear when trying to estimate both simultaneously. For example, distributional reinforcement learning (morimura2010nonparametric; morimura2012parametric; bellemare2017distributional), which aims to learn the distribution of returns instead of the mean value only, has been suggested as a way of measuring aleatoric uncertainty (nikolov2019information). Although metrics such as the variance of the learned distribution can be good indicators of aleatoric uncertainty for in-distribution data, (chua2018deep) highlight that for out-of-distribution data, where the epistemic uncertainty is high, such a metric is not a good indicator of aleatoric uncertainty because it conflates both uncertainties.
To address this issue, we construct a framework for disentangling both types of uncertainties which is applicable to both stochastic and deterministic environments. Our method builds on the distributional reinforcement learning framework, which aims to learn the entire return distribution instead of only its expected value (bellemare2017distributional), and methods for approximate Bayesian deep learning. Our main contributions are 1) a theoretical framework within which epistemic and aleatoric uncertainties can be separately estimated, 2) practical, unbiased estimators for both types of uncertainty, and 3) a demonstration that these uncertainties can successfully be used within an uncertainty-aware Deep Q Networks (mnih2015human) algorithm.
We consider a discounted Markov Decision Process (MDP) defined by $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, in which $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, $R$ is the distribution of rewards associated with performing actions given the states, $P$ is the transition probability, and $\gamma$ is the discount factor.
Distributional reinforcement learning aims to learn the distribution of returns $Z^\pi(s, a)$ associated with taking action $a$ in state $s$ and then following a policy $\pi$. To learn this distribution, (dabney2017distributional) propose a quantile parameterization. In this framework, a probability distribution is parameterized by $N$ quantiles $\tau_i$ for $i \in \{1, \dots, N\}$, with values $q_i$. Learning the quantile values proceeds by minimizing the quantile regression loss (koenker2001quantile),
$$\mathcal{L}(q) = \sum_{i=1}^{N} \mathbb{E}_{z \sim Z}\left[\rho_{\tau_i}(z - q_i)\right], \qquad \rho_\tau(u) = u\,(\tau - \mathbb{1}_{u < 0}). \tag{1}$$
This loss can be minimized stochastically for each new value $z$ sampled from $Z$. For temporal difference learning of the optimal value function, $z$ is replaced with the Bellman target $r + \gamma z'$, where $z' \sim Z(s', a^*)$ and $a^* = \arg\max_{a'} \mathbb{E}[Z(s', a')]$, yielding the QR-DQN algorithm of (dabney2017distributional).
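The stochastic minimization described above can be illustrated with a small NumPy sketch. It is not the QR-DQN implementation: it fits a fixed set of quantile values to samples of a one-dimensional distribution, and it assumes the midpoint quantile levels $(2i - 1)/(2N)$ used by QR-DQN. The function names, learning rate, and number of epochs are illustrative choices.

```python
import numpy as np

def quantile_regression_loss(samples, quantile_values, taus):
    """Pinball loss of `quantile_values`, averaged over samples and summed over quantiles."""
    # u[k, i] = z_k - q_i for every sample/quantile pair
    u = samples[:, None] - quantile_values[None, :]
    # rho_tau(u) = u * (tau - 1{u < 0})
    rho = u * (taus[None, :] - (u < 0).astype(float))
    return rho.mean(axis=0).sum()

def fit_quantiles(samples, n_quantiles=5, lr=0.01, n_epochs=50, seed=0):
    """Stochastic subgradient descent on the pinball loss, one sample at a time."""
    rng = np.random.default_rng(seed)
    # Assumed midpoint quantile levels: 0.1, 0.3, ... for n_quantiles = 5
    taus = (2 * np.arange(n_quantiles) + 1) / (2 * n_quantiles)
    q = np.zeros(n_quantiles)
    for _ in range(n_epochs):
        for z in rng.permutation(samples):
            # d/dq rho_tau(z - q) = -(tau - 1{z < q}), so descent adds lr * (tau - 1{z < q})
            q += lr * (taus - (z < q).astype(float))
    return taus, q

samples = np.random.default_rng(1).normal(size=1000)
taus, q = fit_quantiles(samples)
```

Each update moves a quantile value up with weight $\tau_i$ when the sample lies above it and down with weight $1 - \tau_i$ otherwise, so the values settle near the empirical quantiles of the samples.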
An intuitive way of estimating the aleatoric uncertainty on the return distribution would be to use the variance of the quantiles. However, this variance is not a good estimator of the aleatoric uncertainty because, for out-of-distribution data, the epistemic uncertainty on the value of the quantiles also affects it.
2 Estimating both uncertainties
Here, we construct a theoretical framework that allows us to disentangle epistemic and aleatoric uncertainties, and we derive unbiased estimators for both.
2.1 Theoretical framework
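We start by framing learning the quantiles of the return distribution as a Bayesian inference problem. We consider state $s$, action $a$ taken in state $s$, policy $\pi$, and data $D$ consisting of samples from $Z^\pi(s, a)$. To learn the value of a given quantile $\tau$ of $Z^\pi(s, a)$, we consider a neural network with parameters $\theta$, which returns a value $y(\theta, s, a)$. We interpret possible values of $\theta$ as different hypotheses about the function relating the state-action pair to the value of quantile $\tau$ of $Z^\pi(s, a)$ (mackay2003information). Following (yu2001bayesian), we define a likelihood based on how well the output of the network matches the data using an asymmetric Laplace distribution,
$$P(D \mid \theta) = \prod_{z \in D} \frac{\tau (1 - \tau)}{b} \exp\left(-\frac{\rho_\tau\left(z - y(\theta, s, a)\right)}{b}\right), \tag{2}$$
where $b$ is a characteristic length scale and $\rho_\tau$ is the same as in equation 1.
To estimate the entire return distribution instead of a single quantile, we extend this formalism to a network with $N$ outputs $y_i(\theta, s, a)$, where each output is trained to learn the value of quantile $\tau_i$. We thus define the likelihood
$$P(D \mid \theta) = \prod_{i=1}^{N} \prod_{z \in D} \frac{\tau_i (1 - \tau_i)}{b} \exp\left(-\frac{\rho_{\tau_i}\left(z - y_i(\theta, s, a)\right)}{b}\right). \tag{3}$$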
Minimizing the loss in equation 1 is equivalent to maximizing the likelihood in equation 3. If we now consider a normal prior on parameters centered around , we can use any one of several methods for approximately sampling from the posterior distribution (blundell2015weight; gal2016dropout; pearce2018bayesian).
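The equivalence between the two objectives can be made explicit by writing out the negative log-likelihood. The sketch below assumes the asymmetric Laplace form of (yu2001bayesian) with a scale written $b$, network outputs written $y_i(\theta, s, a)$, and a dataset $D$ of return samples; these symbols are notational assumptions.

```latex
-\log P(D \mid \theta)
  = \frac{1}{b} \sum_{i=1}^{N} \sum_{z \in D}
      \rho_{\tau_i}\bigl(z - y_i(\theta, s, a)\bigr)
  \;-\; |D| \sum_{i=1}^{N} \log \frac{\tau_i (1 - \tau_i)}{b}
```

The second term does not depend on $\theta$, so maximizing the likelihood over $\theta$ amounts to minimizing the empirical quantile regression loss over the samples in $D$.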
2.2 Uncertainty estimates
Using the framework described above, we now propose expressions for both aleatoric and epistemic uncertainties.
2.2.1 Epistemic uncertainty
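To obtain a single aggregate measure of the epistemic uncertainty on the return distribution, we propose taking the average of the epistemic uncertainty on the quantiles, defined by their variance over $\theta$,
$$\sigma^2_{\text{epistemic}} = \mathbb{E}_{i \sim U\{1, N\}}\left[\operatorname{var}_{\theta \sim P(\theta \mid D)}\left(y_i(\theta, s, a)\right)\right],$$
where $U\{1, N\}$ is the uniform distribution over $\{1, \dots, N\}$.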
2.2.2 Aleatoric uncertainty
An intuitive measure of the aleatoric uncertainty is the variance of the quantile values. However, this variance is also affected by epistemic uncertainty in the form of the distribution over $\theta$. To decouple aleatoric uncertainty from epistemic uncertainty, we define the aleatoric uncertainty as the variance of the expected value of the quantiles according to the posterior distribution over $\theta$,
$$\sigma^2_{\text{aleatoric}} = \operatorname{var}_{i \sim U\{1, N\}}\left(\mathbb{E}_{\theta \sim P(\theta \mid D)}\left[y_i(\theta, s, a)\right]\right).$$
When the posterior is concentrated around a single value, we recover the intuitive definition of aleatoric uncertainty as the variance of the quantiles. However, when the posterior is not concentrated, the variance of a single set of quantiles is a biased estimator of $\sigma^2_{\text{aleatoric}}$:
Proposition 2.1 (Proof in the appendix). Consider $\theta$ drawn from the posterior distribution $P(\theta \mid D)$. Then $\operatorname{var}_{i \sim U\{1, N\}}\left(y_i(\theta, s, a)\right)$ is a biased estimator of $\sigma^2_{\text{aleatoric}}$.
2.2.3 Decomposition of Uncertainties
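We require that the total uncertainty on the return distribution can be decomposed as the sum of these two uncertainties. We consider the total variance of the return distribution $\operatorname{var}_{i \sim U\{1, N\},\, \theta \sim P(\theta \mid D)}\left(y_i(\theta, s, a)\right)$, which for notational simplicity we write $\sigma^2$. By the law of total variance, it satisfies $\sigma^2 = \sigma^2_{\text{epistemic}} + \sigma^2_{\text{aleatoric}}$ (Proposition 2.2, proof in the appendix).
We also consider two limit cases as sanity checks. First, in the absence of data, when all the uncertainty should be epistemic, we do find $\sigma^2_{\text{aleatoric}} = 0$. In the limit of infinite data, when all the uncertainty should be aleatoric, we also find $\sigma^2_{\text{epistemic}} = 0$.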
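The decomposition can be checked numerically. The following sketch assumes we have an array of quantile values, one row per posterior sample of the network parameters and one column per quantile. The toy posterior (true quantile values plus independent Gaussian noise) and all names are illustrative assumptions, not the setup used in the experiments.

```python
import numpy as np

def uncertainty_decomposition(y):
    """y[k, i]: value of quantile i predicted by posterior sample k."""
    epistemic = y.var(axis=0).mean()  # average over quantiles of the variance over samples
    aleatoric = y.mean(axis=0).var()  # variance over quantiles of the posterior mean
    total = y.var()                   # variance over both quantiles and samples
    return epistemic, aleatoric, total

rng = np.random.default_rng(0)
n_samples, n_quantiles = 200, 50
true_quantiles = np.linspace(-1.0, 1.0, n_quantiles)
# Toy posterior: each sample perturbs every quantile independently
y = true_quantiles + 0.3 * rng.normal(size=(n_samples, n_quantiles))
epistemic, aleatoric, total = uncertainty_decomposition(y)

# Concentrated posterior: every sample returns the same quantile values
concentrated = np.tile(true_quantiles, (n_samples, 1))
epistemic_concentrated, aleatoric_concentrated, _ = uncertainty_decomposition(concentrated)
```

With population variances, the total is exactly the sum of the two terms, and the concentrated posterior has zero epistemic uncertainty, which matches the infinite-data limit.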
2.3 Approximate Uncertainties Using Two Networks
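Estimating the variance and expectation over $\theta$ in the previous expressions for both uncertainties requires in principle a large number of samples of $\theta$, which is impractical. Instead, we propose the following approximations of $\sigma^2_{\text{epistemic}}$ and $\sigma^2_{\text{aleatoric}}$ using only two samples $\theta_A$ and $\theta_B$ from the posterior distribution over $\theta$,
$$\tilde{\sigma}^2_{\text{epistemic}} = \frac{1}{2}\, \mathbb{E}_{i \sim U\{1, N\}}\left[\left(y_i(\theta_A, s, a) - y_i(\theta_B, s, a)\right)^2\right],$$
$$\tilde{\sigma}^2_{\text{aleatoric}} = \operatorname{cov}_{i \sim U\{1, N\}}\left(y_i(\theta_A, s, a),\, y_i(\theta_B, s, a)\right).$$
Proposition 2.3 (Proof in the appendix). $\tilde{\sigma}^2_{\text{epistemic}}$ and $\tilde{\sigma}^2_{\text{aleatoric}}$ are unbiased estimators of $\sigma^2_{\text{epistemic}}$ and $\sigma^2_{\text{aleatoric}}$. Moreover, assuming that the network outputs are uncorrelated, the variance of these estimators converges towards 0 as the number of quantiles $N$ increases.
In figure 1, we provide an illustration of the uncertainties measured with $\tilde{\sigma}^2_{\text{epistemic}}$ and $\tilde{\sigma}^2_{\text{aleatoric}}$ on a toy dataset. We consider a neural network that estimates 50 quantiles from the target distribution, and we draw two samples of $\theta$ using approximate MAP sampling (pearce2018bayesian). As expected, $\tilde{\sigma}^2_{\text{epistemic}}$ is small close to the data but large far from it, while $\tilde{\sigma}^2_{\text{aleatoric}}$ correctly captures the noise in the data.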
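As a rough numerical check of the two-network estimators, the sketch below assumes they take the form of half the mean squared difference between the two sets of quantile values (epistemic) and their covariance across quantiles (aleatoric). The toy posterior with independent Gaussian noise on each output is an assumption made for illustration.

```python
import numpy as np

def two_sample_uncertainties(y_a, y_b):
    """Estimate both uncertainties from two sets of quantile values (last axis: quantiles)."""
    epistemic = 0.5 * np.mean((y_a - y_b) ** 2, axis=-1)
    aleatoric = np.mean(y_a * y_b, axis=-1) - np.mean(y_a, axis=-1) * np.mean(y_b, axis=-1)
    return epistemic, aleatoric

rng = np.random.default_rng(0)
n_trials, n_quantiles, noise_scale = 5000, 50, 0.3
means = np.linspace(-1.0, 1.0, n_quantiles)  # posterior mean of each quantile
# Each row is an independent pair of posterior samples (theta_A, theta_B)
y_a = means + noise_scale * rng.normal(size=(n_trials, n_quantiles))
y_b = means + noise_scale * rng.normal(size=(n_trials, n_quantiles))

epistemic_estimates, aleatoric_estimates = two_sample_uncertainties(y_a, y_b)
true_epistemic = noise_scale ** 2
true_aleatoric = np.var(means)
# Naive alternative: variance of the quantiles of a single network
naive_aleatoric = np.var(y_a, axis=-1)
```

Averaged over many pairs, both estimators land on the true values, whereas the single-network variance of the quantiles absorbs part of the epistemic term and overshoots the aleatoric uncertainty.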
2.4 Uncertainty-Aware Deep Q Networks
Until now, we have been mainly concerned with learning the return distribution given an ensemble of samples of this distribution. For temporal difference learning, we replace these samples with the Bellman target. Although this implies measuring the “one-step” epistemic uncertainty on the bootstrapped target instead of that on the total return, this uncertainty is nonetheless useful for exploration as it allows for the identification of less-visited state-action pairs.
There are several ways these uncertainty estimates could be included into a reinforcement learning algorithm, for example to drive information-directed exploration (nikolov2019information). To better contrast the different roles played by both uncertainties, we propose a simple uncertainty-aware Deep Q Networks algorithm (UA-DQN), which is based on the QR-DQN algorithm of (dabney2017distributional) but includes the following modifications, presented in Algorithm 1:
Auxiliary networks for uncertainty estimation. To disentangle value learning and uncertainty estimation, we consider two auxiliary networks with parameters $\theta_A$ and $\theta_B$, both trained on the targets used in QR-DQN and approximately sampled from the posterior distribution over $\theta$. These networks are used to derive the estimates $\tilde{\sigma}^2_{\text{epistemic}}$ and $\tilde{\sigma}^2_{\text{aleatoric}}$.
Uncertainty-Aware Action Selection. Instead of the $\epsilon$-greedy policy used by QR-DQN, we use our uncertainty estimates to separately drive risk-awareness and exploration. We use the aleatoric estimate $\tilde{\sigma}^2_{\text{aleatoric}}$ to penalize high-variance actions, while the epistemic estimate $\tilde{\sigma}^2_{\text{epistemic}}$ drives exploration using Thompson sampling.
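One possible reading of this action-selection rule is sketched below. The text above does not spell out the exact rule, so the Gaussian sampling form, the use of the aleatoric standard deviation as the penalty, and the hyperparameter names `risk_aversion` and `exploration_scale` are all assumptions made for illustration.

```python
import numpy as np

def select_action(y_a, y_b, mean_q, risk_aversion=0.0, exploration_scale=1.0, rng=None):
    """Pick an action from two auxiliary networks' quantiles and the main network's values.

    y_a, y_b: arrays of shape (n_actions, n_quantiles) from the two auxiliary networks.
    mean_q:   array of shape (n_actions,) with the expected returns of the main network.
    """
    rng = np.random.default_rng() if rng is None else rng
    epistemic = 0.5 * np.mean((y_a - y_b) ** 2, axis=1)
    aleatoric = np.mean(y_a * y_b, axis=1) - y_a.mean(axis=1) * y_b.mean(axis=1)
    aleatoric = np.maximum(aleatoric, 0.0)  # the covariance estimate can be slightly negative
    # Risk-awareness (assumed form): penalize actions with a wide return distribution
    penalized = mean_q - risk_aversion * np.sqrt(aleatoric)
    # Exploration (assumed form): Thompson-style sample whose spread follows the epistemic estimate
    sampled = rng.normal(penalized, np.sqrt(exploration_scale * epistemic))
    return int(np.argmax(sampled))

# Two actions: a risky one (mean 5, wide quantiles) and a safe one (always 4)
risky = np.linspace(0.0, 10.0, 5)
safe = np.full(5, 4.0)
quantiles = np.stack([risky, safe])
mean_q = quantiles.mean(axis=1)
risk_neutral_choice = select_action(quantiles, quantiles, mean_q, risk_aversion=0.0)
risk_averse_choice = select_action(quantiles, quantiles, mean_q, risk_aversion=0.5)
```

In this toy call the two auxiliary networks agree exactly, so the epistemic estimate is zero and the choice is deterministic: the risk-neutral agent takes the risky action, while the risk-averse agent prefers the safe one.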
3.1 Safe Learning
We first empirically study the behavior of our uncertainty-aware DQN on an environment inspired by the AI Safety Gridworlds (leike2017ai). We consider a simple gridworld represented in figure 2, in which the agent must navigate to a goal without falling off a cliff. The agent receives -1 point at each timestep, +10 points for reaching the goal, and if the agent falls off the cliff the environment restarts. We introduce a stochastic wind in this environment, which with probability knocks the agent off the cliff if the agent is on the ledge. The expectation value of the returns associated with the risky trajectory along the ledge is 4.8, while for the safe trajectory the return is deterministically 4.
We compare the learning behavior of our uncertainty-aware DQN agent to other comparable algorithms. We consider the $\epsilon$-greedy QR-DQN algorithm of (dabney2017distributional), a risk-neutral version of UA-DQN (with ), and two risk-averse variants of UA-DQN (with ). Variant 1 uses the variance of the learned quantiles to estimate aleatoric uncertainty, while variant 2 uses our estimator. All algorithms differ only in action selection.
Experimental results are shown in figure 2. While all algorithms quickly learn to solve this simple task, there are marked differences in behavior. QR-DQN falls off the most, due both to taking the risky trajectory and to its $\epsilon$-greedy policy. Risk-neutral UA-DQN only falls off the cliff due to its use of the risky trajectory. Both risk-averse variants of UA-DQN learn to use the safe trajectory. However, variant 1 overestimates aleatoric uncertainty due to its use of a biased estimator, takes longer to identify the safe trajectory, and thus accumulates more falls during learning than variant 2, which uses our unbiased aleatoric estimate $\tilde{\sigma}^2_{\text{aleatoric}}$. Variant 2, which uses our unbiased estimators of both uncertainties, is the least likely to fall off the cliff during learning.
3.2 Evaluation on MinAtar
We now evaluate our UA-DQN algorithm on the MinAtar testbed (young2019minatar), which contains simplified implementations of 5 Atari games. Compared to the Arcade Learning Environment (bellemare2013arcade), MinAtar has similar underlying game dynamics but involves lower-dimensional observations. This helps to decouple representation learning from behavioral learning and allows us to focus on the latter. It also encourages reproducibility, as the reduced computational overhead allows for more thorough comparisons involving more training seeds.
We compare risk-neutral UA-DQN () with DQN (mnih2015human), QR-DQN, and Bootstrapped DQN (osband2016deep). In contrast to UA-DQN, which uses Thompson sampling to explore, Bootstrapped DQN uses an ensemble of bootstrapped DQN heads to achieve diverse behaviors. All algorithms are implemented within the same code base using the hyperparameters of (young2019minatar), except that we use the Adam optimizer (kingma2014adam) instead of RMSProp with learning rate and . We also optimized the exploration hyperparameters: we use a final $\epsilon$ of 0.03 for the $\epsilon$-greedy policies of DQN, QR-DQN, and Bootstrapped DQN, and for UA-DQN we use . We selected these values on the game Breakout and then kept them fixed for all the other games.
Our results are shown in figure 3. As reported in (dabney2017distributional), we find that QR-DQN outperforms the other DQN variants, and that UA-DQN in turn significantly outperforms QR-DQN. To understand why, we inspect the behavior of UA-DQN and find that even at the end of training roughly 10-20% of the actions selected by UA-DQN are non-greedy. As UA-DQN only differs from QR-DQN in action selection, and QR-DQN's performance decreases with higher levels of $\epsilon$-greedy exploration, this result indicates that UA-DQN successfully uses its epistemic uncertainty estimate $\tilde{\sigma}^2_{\text{epistemic}}$ to decide when to perform exploratory or greedy actions.
Estimating both uncertainties is important for developing agents that can both explore efficiently and account for risk in their actions. We propose a scheme whereby both types of uncertainty on the expected return of a policy can be estimated in deep reinforcement learning. We show that unbiased estimators for these uncertainties can be obtained using only two networks, and that these estimators can be efficiently harnessed by an uncertainty-aware DQN algorithm for improved risk-sensitivity and exploration. We find that this UA-DQN algorithm significantly outperforms other DQN variants on the MinAtar testbed.
Appendix A Related Work
Our work focuses on the problem of estimating the uncertainty of the expected return of a policy in model-free reinforcement learning. The epistemic uncertainty on the expected return has been shown to be useful for exploration (osband2016deep; azizzadenesheli2018efficient; touati2018randomized). On the other hand, the aleatoric uncertainty of the expected return is useful for designing risk-averse policies (howard1972risk; tamar2016learning; dabney2018implicit). Whereas most prior work considers both uncertainties separately, (tang2018exploration; moerland2017efficient) are interested in both, but their methods yield only an aggregate uncertainty. (nikolov2019information) do make use of both uncertainties to drive information-directed sampling (russo2014learning), but their uncertainty estimates derive from two different frameworks. Moreover, we argue that the variance of the learned return distribution, which (nikolov2019information) use for aleatoric uncertainty estimation, conflates both uncertainties for out of distribution data. Our work aims to provide a single framework for simultaneously estimating both uncertainties for the return distribution.
Estimating both types of uncertainty is also important in model-based reinforcement learning, where uncertainties affect the predictions of a learned dynamics model of the environment. Uncertainty estimates can be used in planning, either for better exploration (schmidhuber1991possibility; sun2011planning) or to avoid risky or unknown sections of the environment (garcia2015comprehensive). Model based algorithms that explicitly account for both aleatoric and epistemic uncertainties have recently also been developed (depeweg2018decomposition; chua2018deep; henaff2019model).
An approach that combines model free and model based techniques consists of using the uncertainties derived from a learned dynamics model to inform the policy of a model-free agent. The epistemic uncertainty associated with the learned model can for example be used as an intrinsic motivation bonus (stadie2015incentivizing; pathak2017curiosity; burda2018exploration). However, uncertainties on the transition model do not typically convey information about the uncertainty of the expected return of a policy, which is a quantity of fundamental interest in reinforcement learning.
Appendix B Proofs
In the following, for notational simplicity we will omit the dependence of $y_i(\theta, s, a)$ on $s$ and $a$ and write $y_i(\theta)$. Moreover, subscripts used in variances/expectation values should be interpreted as the variance/expectation value taken over the distribution of the variables in the subscript, so that for example $\mathbb{E}_i[\cdot] = \mathbb{E}_{i \sim U\{1, N\}}[\cdot]$ and $\operatorname{var}_\theta(\cdot) = \operatorname{var}_{\theta \sim P(\theta \mid D)}(\cdot)$. We will also assume that the following integrals over $\theta$ are well defined, which, considering in particular the Gaussian prior over the weights, is a reasonable assumption.
B.1 Proof of proposition 2.1
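Here, we show that, considering a sample $\theta$ drawn from the posterior distribution $P(\theta \mid D)$, $\operatorname{var}_i\left(y_i(\theta)\right)$ is a biased estimator of $\sigma^2_{\text{aleatoric}}$. We do so by showing that $\mathbb{E}_\theta\left[\operatorname{var}_i\left(y_i(\theta)\right)\right]$ is greater than $\sigma^2_{\text{aleatoric}}$.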
By definition of the variance, we also have
Therefore, when the posterior over is not concentrated and ,
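One way to see the size of the bias is sketched below; it is a compact reading rather than the line-by-line proof. It assumes the notation $y_i(\theta)$ for the network outputs, with expectations and variances over $i$ taken under the uniform distribution over quantile indices and those over $\theta$ taken under the posterior.

```latex
\begin{aligned}
\mathbb{E}_\theta\bigl[\operatorname{var}_i\bigl(y_i(\theta)\bigr)\bigr]
  - \operatorname{var}_i\bigl(\mathbb{E}_\theta[y_i(\theta)]\bigr)
&= \mathbb{E}_i\bigl[\operatorname{var}_\theta\bigl(y_i(\theta)\bigr)\bigr]
  - \operatorname{var}_\theta\bigl(\mathbb{E}_i[y_i(\theta)]\bigr) \\
&= \mathbb{E}_\theta\Bigl[\operatorname{var}_i\bigl(y_i(\theta)
  - \mathbb{E}_\theta[y_i(\theta)]\bigr)\Bigr] \;\ge\; 0 .
\end{aligned}
```

The first equality follows from expanding both variances into second moments. The gap vanishes only if, for almost every $\theta$, all quantiles deviate from their posterior means by the same amount, so a non-concentrated posterior generally inflates the single-network variance.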
B.2 Proof of proposition 2.2
Here, we show that $\sigma^2 = \sigma^2_{\text{epistemic}} + \sigma^2_{\text{aleatoric}}$.
The integral over of the last line is 0, which leaves us with
B.3 Proof of proposition 2.3
B.4 Expectation of the estimators
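Here, we show that $\tilde{\sigma}^2_{\text{epistemic}}$ and $\tilde{\sigma}^2_{\text{aleatoric}}$ are unbiased estimators of $\sigma^2_{\text{epistemic}}$ and $\sigma^2_{\text{aleatoric}}$. In the following, $\mathbb{E}_{\theta_A, \theta_B}$ indicates the expectation value when $\theta_A$ and $\theta_B$ are drawn from the posterior distribution over $\theta$. Moreover, in what follows it can easily be verified that expectations over $\theta$ and over $i$ are interchangeable due to the discrete nature of the expectation over $i$.
The average over either $\theta_A$ or $\theta_B$ of the last line is zero, which, after noticing that $\theta_A$ and $\theta_B$ are now separable such that we can use the equality $\mathbb{E}_{\theta_A, \theta_B}\left[f(\theta_A)\, g(\theta_B)\right] = \mathbb{E}_{\theta_A}\left[f(\theta_A)\right] \mathbb{E}_{\theta_B}\left[g(\theta_B)\right]$, leaves us with
so $\tilde{\sigma}^2_{\text{epistemic}}$ is indeed an unbiased estimator of $\sigma^2_{\text{epistemic}}$.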
Similarly, for , and introducing and ,
Looking at these terms individually, we have
As desired, we end up with
B.5 Variance of the estimators
Using the same notation as in the previous section, we can write
We now require our assumption that all outputs of the neural networks are decorrelated to write
We now further assume that , , and are bounded for all and . Then, there is a constant such that, for all and ,
We then obtain
The variance of (and thus that of ) therefore decreases towards 0 as the number of quantiles increases.
As for the aleatoric uncertainty, a similar bound can be derived by rewriting as follows.
In a manner similar as for the derivation of the variance of , assuming that the network outputs are uncorrelated and that the first moments of are bounded, we can also derive a bound for that converges to 0 with increasing .
Appendix C Correlations between the outputs of a Bayesian neural network
Proposition 2.3 makes the assumption that the network outputs are uncorrelated. Indeed, correlations between outputs could for example cause a network to overestimate all the quantiles. If both networks A and B produce overestimations, then the epistemic estimate $\tilde{\sigma}^2_{\text{epistemic}}$ would probably underestimate $\sigma^2_{\text{epistemic}}$. However, in the limit of infinite width, the outputs of Bayesian neural networks are uncorrelated for normal priors and separable likelihoods (neal2012bayesian). In the following, we experimentally explore in which cases this applies to finite-width neural networks and to approximate Bayesian techniques such as the randomized MAP sampling technique (pearce2018bayesian) used in our work.
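The effect of correlated outputs can be illustrated with a toy simulation. It assumes the two-network epistemic estimator is half the mean squared difference between the two sets of quantile values, and it compares two extreme hypothetical posteriors: one where every output is perturbed independently, and one where all outputs of a network share a single perturbation.

```python
import numpy as np

def epistemic_estimate(y_a, y_b):
    """Two-network estimate: half the mean squared difference across quantiles."""
    return 0.5 * np.mean((y_a - y_b) ** 2, axis=-1)

def simulate(n_quantiles, shared_noise, n_trials=20000, scale=0.3, seed=0):
    """Draw many pairs of networks and return the epistemic estimate for each pair."""
    rng = np.random.default_rng(seed)
    # With shared noise, all outputs of a network are shifted by the same amount
    shape = (n_trials, 1) if shared_noise else (n_trials, n_quantiles)
    means = np.linspace(-1.0, 1.0, n_quantiles)
    y_a = means + scale * rng.normal(size=shape)
    y_b = means + scale * rng.normal(size=shape)
    return epistemic_estimate(y_a, y_b)

true_epistemic = 0.3 ** 2
independent = simulate(n_quantiles=50, shared_noise=False)
shared = simulate(n_quantiles=50, shared_noise=True)
fraction_underestimating = np.mean(shared < true_epistemic)
```

In both cases the estimator is unbiased on average, but with fully correlated outputs its spread does not shrink with the number of quantiles, and most individual pairs of networks return a value below the true epistemic uncertainty.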
C.1 Uncertainties for different network widths
First, we compare the epistemic uncertainties obtained from an ensemble of neural networks generated with the "anchoring" approximate MAP sampling technique of (pearce2018bayesian) to those obtained from a single neural network with several outputs (also generated with approximate MAP sampling) on a toy regression problem. Both the problem formulation and the code for this experiment draw from the work of (pearce2018bayesian).
Representative samples from these experiments are shown in figure 4. For a small neural network with only 10 neurons per layer the different outputs of the multioutput neural network are indeed strongly correlated, which leads to poor uncertainty estimates (top left). The ensemble produces significantly better uncertainty estimates for the same network width (bottom left). However, as we increase the width of the neural network to 100 (top right) the uncertainty estimates of the network with multiple outputs improve and become close to those obtained by the larger ensemble of networks of the same width (bottom right).
Appendix D Further information on the MinAtar experiment
Our MinAtar experiments used the same network structure as that used in (young2019minatar) and, apart from the optimized exploration hyperparameters and our use of the Adam optimizer described in the main text, also the same hyperparameters indicated in table 1. We searched among for the Adam learning rate and for Adam , and among for final exploration using QR-DQN on Breakout. We found that whereas 0.01 and 0.03 lead to similar average scores, a value of 0.03 led to smaller variance in the results. For UA-DQN, we searched among for on Breakout.
To approximately sample from the posterior over $\theta$ for the auxiliary networks used in UA-DQN, we use the approximate MAP sampling scheme of (pearce2018bayesian). For this scheme, we set the scale of the noise to a realistic value of 1, and the scale of the prior to the standard deviation of the network weights at initialization.
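The role of the two scales can be seen in a minimal sketch of anchored MAP sampling. It uses a linear model as a stand-in for the neural network so that the anchored loss has a closed-form minimizer, and it assumes a zero-mean isotropic normal prior. The function name and arguments are hypothetical.

```python
import numpy as np

def anchored_map_sample(X, y, noise_scale, prior_scale, rng):
    """One approximate posterior sample of the weights of a linear model y ~ X @ w."""
    n_features = X.shape[1]
    # Anchor drawn from the (assumed) prior N(0, prior_scale^2 I)
    anchor = prior_scale * rng.normal(size=n_features)
    # Ratio of the noise variance to the prior variance sets the pull towards the anchor
    reg = (noise_scale / prior_scale) ** 2
    # Minimizer of ||y - X w||^2 + reg * ||w - anchor||^2
    w = np.linalg.solve(X.T @ X + reg * np.eye(n_features), X.T @ y + reg * anchor)
    return w, anchor

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=20)
w, anchor = anchored_map_sample(X, y, noise_scale=1.0, prior_scale=0.5, rng=rng)

# Without informative data the sample falls back to its anchor
w_no_data, anchor_no_data = anchored_map_sample(np.zeros((20, 3)), np.zeros(20), 1.0, 0.5, rng)
```

Repeating the call with fresh anchors gives different weight vectors; they agree where the data constrain the model and spread out according to the prior scale where it does not.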
Table 1: Hyperparameters used in the MinAtar experiments.
|hyperparameter|value|
|replay buffer size|100000|
|target network update frequency|1000|
|number of steps|5000000|
|Adam learning rate||
|replay start size|5000|
|initial $\epsilon$ (DQN, QR-DQN, Bootstrapped DQN)|1|
|final $\epsilon$ (DQN, QR-DQN, Bootstrapped DQN)|0.03|
|final exploration step (DQN, QR-DQN, Bootstrapped DQN)|100000|
|Bootstrapped heads (Bootstrapped DQN)|10|
|Number of quantiles (QR-DQN, UA-DQN)|50|