Fully Bayesian Recurrent Neural Networks for Safe Reinforcement Learning
Abstract
Reinforcement Learning (RL) has demonstrated state-of-the-art results in a number of autonomous system applications; however, many of the underlying algorithms rely on black-box predictions. This results in poor explainability of the behaviour of these systems, raising concerns as to their use in safety-critical applications. Recent work has demonstrated that uncertainty-aware models exhibit more cautious behaviours through the incorporation of model uncertainty estimates. In this work, we build on Probabilistic Backpropagation to introduce a fully Bayesian Recurrent Neural Network architecture. We apply this within a Safe RL scenario, and demonstrate that the proposed method significantly outperforms a popular approach for obtaining model uncertainties in collision avoidance tasks. Furthermore, we demonstrate that the proposed approach requires less training and is far more efficient than the current leading method, both in terms of compute resource and memory footprint.
Matt Benatan, IBM Research UK, Sci-Tech Daresbury, Warrington, UK
Edward O. Pyzer-Knapp, IBM Research UK, Sci-Tech Daresbury, Warrington, UK
Preprint. Under review.
1 Introduction
Reinforcement Learning (RL) has achieved state-of-the-art results in a variety of applications, from Atari games (mnih2013playing) to autonomous vehicles (zhang2016learning; shalev2016safe). The models underlying these systems are often black-box in nature, and do not provide estimates of model uncertainty. This leads to overconfident predictions on out-of-distribution data and, in turn, to poor performance in novel scenarios (amodei2016concrete; lakshminarayanan2017simple; hendrycks2016baseline). Recent work (kahn2017uncertainty; lutjens2018safe) demonstrates that more cautious behaviours can be attained by incorporating model uncertainty estimates into the decision-making processes of RL agents. In their work, Lütjens et al. demonstrate that a Long Short-Term Memory (LSTM) ensemble using MC Dropout (gal_dropout_2015) is able to approximate model uncertainties. While this demonstrates a clear advantage over a standard LSTM with no uncertainty estimates, the accuracy of their approach’s uncertainty estimates is tied to the size of the ensemble and the number of dropout forward passes executed. Thus, obtaining high-quality model uncertainty estimates can quickly become demanding for both compute and memory resources.
An alternative to the approximate Bayesian inference provided by MC Dropout and ensemble methods is Probabilistic Backpropagation (PBP) (hernandezlobato_probabilistic_2015), a fully Bayesian method for training Bayesian Neural Networks (BNNs), which thus produces fully Bayesian model uncertainty estimates. Unlike ensemble-based methods (lakshminarayanan2017simple), PBP provides a probabilistic neural network architecture at only twice the memory footprint of its non-probabilistic equivalent. As PBP is fully Bayesian, it only requires training a single network, and inference only requires one forward pass, making it far less computationally intensive than ensemble (lakshminarayanan2017simple), MC Dropout (gal_dropout_2015), or combined (lutjens2018safe) methods.
In this paper we leverage PBP to produce a probabilistic variant of a standard Recurrent Neural Network (RNN). We apply this to a collision avoidance task, and demonstrate that the PBP-RNN exhibits competitive performance when compared with the MC Dropout Ensemble (MDE) described in (lutjens2018safe), and achieves higher-quality model uncertainty estimates.
2 Related Work
2.1 Safe Reinforcement Learning
Safe RL involves learning policies which maximize performance criteria, e.g. reward, while accounting for safety constraints (garcia2015comprehensive; berkenkamp2017safe), and is a field of study that is becoming increasingly important as more and more automated systems are being designed to operate in safety-critical situations (zhang2016learning; chen2019model; shalev2016safe). A key issue with many existing approaches is their lack of uncertainty quantification: agents are unable to quantify what they ‘don’t know’. By incorporating uncertainty, RL agents can make better decisions by opting for actions for which they are more confident of a positive outcome. Incorporating this principle in safety-constrained tasks results in agents choosing safer actions (kahn2017uncertainty; lutjens2018safe), and is crucial for making RL feasible for safety-critical scenarios.
Existing work on Safe RL has typically aimed to discover uncertainty in the environment or model (garcia2015comprehensive), with the former commonly being the goal in the case of risk-sensitive RL (RSRL) (mihatsch2002risk; shen2014risk). In this work we focus on model uncertainty, with the aim of quantifying the model’s confidence on out-of-distribution data and using this uncertainty quantification to produce safer decisions.
2.2 Probabilistic Neural Networks
There are a variety of approaches for modeling distributions and producing accurate uncertainty estimates (williams2006gaussian; ghahramani2015probabilistic); however, many of these methods do not scale well to large datasets. Over recent years neural network-based methods have demonstrated state-of-the-art performance on a wide range of tasks (cho2014learning; he2016deep; ledig2017photo), partly due to their ability to leverage large amounts of data. This has helped to drive interest in the development of a variety of probabilistic neural network approaches, which are capable of modeling uncertainty while also scaling to large datasets (gal_dropout_2015; hernandezlobato_probabilistic_2015; snoek_scalable_2015).
In probabilistic neural networks, network weights are modeled as distributions, rather than as point estimates, allowing the network to encode model uncertainty. If we consider $\mathbf{y}$ to be an $N$-dimensional vector of targets $y_n$, and $\mathbf{X}$ to be an $N\times D$ matrix of features $\mathbf{x}_n$, then we can define the likelihood for $\mathbf{y}$ given the network weights $W$, $\mathbf{X}$, and noise precision $\gamma$ as:
$$p(\mathbf{y} \mid W, \mathbf{X}, \gamma) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid f(\mathbf{x}_n; W), \gamma^{-1})$$  (1)
We specify a Gaussian prior for each weight in our network, giving us:
$$p(W \mid \lambda_p) = \prod_{l=1}^{L} \prod_{i=1}^{V_l} \prod_{j=1}^{V_{l-1}+1} \mathcal{N}(w_{ij,l} \mid 0, \lambda_p^{-1})$$  (2)
where $w_{ij,l}$ denotes a weight in weight matrix $\mathbf{W}_l$ at layer $l$, $V_l$ is the number of neurons in layer $l$, and $\lambda_p$ is a precision parameter. For additional details pertaining to $\lambda_p$ and its hyperprior, please see (hernandezlobato_probabilistic_2015). Using the above definition, we can obtain the posterior distribution for parameters $W$, $\gamma$ and $\lambda_p$ by applying Bayes’ rule:
$$p(W, \gamma, \lambda_p \mid \mathcal{D}) = \frac{p(\mathbf{y} \mid W, \mathbf{X}, \gamma)\, p(W \mid \lambda_p)\, p(\lambda_p)\, p(\gamma)}{p(\mathbf{y} \mid \mathbf{X})}$$  (3)
where $\mathcal{D} = (\mathbf{X}, \mathbf{y})$. Predictions for some output $y_\star$ can then be obtained through:
$$p(y_\star \mid \mathbf{x}_\star, \mathcal{D}) = \int p(y_\star \mid \mathbf{x}_\star, W, \gamma)\, p(W, \gamma, \lambda_p \mid \mathcal{D})\, d\gamma\, d\lambda_p\, dW$$  (4)
As this integral is computationally intractable, several approximations have been proposed (hernandezlobato_probabilistic_2015; gal_dropout_2015; ghosh_assumed_2016). In this work, we use PBP to train a fully Bayesian neural network, as opposed to the approximation provided by MC Dropout or ensemble-based methods.
2.3 Probabilistic Backpropagation
The PBP network can be viewed as a modification of a standard Multilayer Perceptron (MLP) wherein each weight $w_{ij,l} \in \mathbf{W}_l$ is defined by a one-dimensional Gaussian and correspondingly is represented by two weights: a mean $m_{ij,l}$ and a variance $v_{ij,l}$.
Similarly to standard backpropagation, PBP training consists of two phases. The first phase comprises forward propagation of the input features through the network to obtain the marginal log-likelihood, $\log Z$ (instead of loss, which is used in typical backpropagation). The gradients of $\log Z$ with respect to the mean and variance weights are then backpropagated using reverse-mode differentiation, and the resulting derivatives are used to update the mean and variance weights of the network.
The update rule for PBP obtains the parameters of the new Gaussian beliefs $q^{new}(w) = \mathcal{N}(w \mid m^{new}, v^{new})$ that minimize the Kullback-Leibler (KL) divergence between our beliefs $s$ and $q^{new}$ as a function of $m$, $v$ and the gradient of $\log Z$:
$$m^{new} = m + v\frac{\partial \log Z}{\partial m}$$  (5)
$$v^{new} = v - v^{2}\left[\left(\frac{\partial \log Z}{\partial m}\right)^{2} - 2\frac{\partial \log Z}{\partial v}\right]$$  (6)
These rules ensure moment matching between $q^{new}$ and $s$, guaranteeing that the distributions have the same mean and variance. These are the update equations used in the PBP-RNN described in this paper. For further details on PBP, please see (hernandezlobato_probabilistic_2015).
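As a sanity check on Eqs. (5) and (6), the following minimal sketch applies them to a case where the answer is known in closed form: a single weight $w$ with Gaussian belief $\mathcal{N}(m, v)$, observed through $y = w + \text{noise}$ with known noise variance. The toy model, the function name `pbp_update`, and the parameter `noise_var` are assumptions made for illustration and are not taken from the paper. In this conjugate setting $Z = \mathcal{N}(y \mid m, v + \text{noise\_var})$, so the update should reproduce the exact posterior.

```python
def pbp_update(m, v, y, noise_var):
    """Apply the PBP mean/variance update (Eqs. 5 and 6) to one Gaussian weight.

    Hypothetical toy model: y = w + noise, w ~ N(m, v), noise ~ N(0, noise_var),
    so the normaliser is Z = N(y | m, v + noise_var).
    """
    s = v + noise_var                              # predictive variance of y
    r = y - m                                      # residual
    dlogz_dm = r / s                               # d log Z / d m
    dlogz_dv = -0.5 / s + 0.5 * r * r / (s * s)    # d log Z / d v
    m_new = m + v * dlogz_dm                               # Eq. (5)
    v_new = v - v * v * (dlogz_dm ** 2 - 2.0 * dlogz_dv)   # Eq. (6)
    return m_new, v_new


m_new, v_new = pbp_update(m=0.0, v=1.0, y=2.0, noise_var=1.0)
print(m_new, v_new)  # prints: 1.0 0.5
```

For these values the update agrees with the exact conjugate posterior, whose mean is $m + v(y-m)/(v+\text{noise\_var})$ and whose variance is $v \cdot \text{noise\_var}/(v+\text{noise\_var})$. In a real network $\log Z$ has no such closed form in the weights, which is why the gradients are obtained by backpropagation instead.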
3 Probabilistic Backpropagation for Recurrent Neural Networks
Recurrent Neural Networks (RNNs), and specifically LSTMs, are the current state-of-the-art for dynamic obstacle avoidance tasks (alahi2016social; lutjens2018safe; vemula2018social). This is due to their ability to encode contextual dependencies, allowing them to accurately model the temporal patterns of dynamic obstacles. As such, we have chosen to use an RNN for our collision prediction network. In order to provide fully Bayesian model uncertainty estimates, we adapt a standard RNN architecture to use Probabilistic Backpropagation (PBP).
A standard RNN consists of two weight matrices per layer: the weight matrix $\mathbf{W}_x$, which is equivalent to the weight matrices used in a standard MLP, and the weight matrix $\mathbf{W}_h$, which holds the transition weights. For a given input $\mathbf{x}_t$ at a time step $t$, the output $\mathbf{y}_t$ and hidden state $\mathbf{h}_t$ at step $t$ are computed as:
$$\mathbf{h}_t = \mathbf{y}_t = f(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t)$$  (7)
where $f(\cdot)$ is some activation function and $\mathbf{h}_{t-1}$ is the previous hidden state.
For the PBP-RNN, we apply the PBP treatment of a standard MLP to an RNN: each of the weights $w_h \in \mathbf{W}_h$ and $w_x \in \mathbf{W}_x$ is represented by mean and variance weights, resulting in four weight matrices in place of the two standard RNN matrices: $\mathbf{W}_m$, $\mathbf{W}_v$, $\mathbf{W}_{hm}$ and $\mathbf{W}_{hv}$. This produces two outputs from the network output layer $L$ at each time step $t$: a mean $m_{L,t}$ and a variance $v_{L,t}$:
$$m_{L,t} = \mathbf{W}_{hm}\mathbf{h}_{m,t-1} + \mathbf{W}_{m}\mathbf{x}_{t}$$  (8)
$$v_{L,t} = \mathbf{W}_{hv}\mathbf{h}_{v,t-1} + \mathbf{W}_{v}\mathbf{v}_{t}$$  (9)
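The recurrence in Eqs. (8) and (9) can be transcribed literally as a sketch. It assumes linear activations (so the hidden mean and variance states equal the outputs), and it assumes that $\mathbf{v}_t$ is a variance attached to the input $\mathbf{x}_t$, which the text does not define. The helper names `matvec` and `pbp_rnn_step`, and the toy weights, are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def pbp_rnn_step(W_m, W_v, W_hm, W_hv, x_t, v_t, h_m_prev, h_v_prev):
    """One recurrent step following Eqs. (8) and (9) with a linear activation.

    Assumption: v_t is the variance attached to the input x_t.
    """
    m_out = [a + b for a, b in zip(matvec(W_hm, h_m_prev), matvec(W_m, x_t))]
    v_out = [a + b for a, b in zip(matvec(W_hv, h_v_prev), matvec(W_v, v_t))]
    return m_out, v_out


# Toy network: 1 input feature, 2 hidden units (values chosen arbitrarily).
W_m, W_v = [[1.0], [2.0]], [[0.1], [0.2]]
W_hm = W_hv = [[0.5, 0.0], [0.0, 0.5]]

h_m, h_v = [0.0, 0.0], [0.0, 0.0]
for x_t in ([1.0], [1.0], [0.5]):
    # Mean and variance states are carried forward separately.
    h_m, h_v = pbp_rnn_step(W_m, W_v, W_hm, W_hv, x_t, [1.0], h_m, h_v)
print(h_m, h_v)
```

The point of the sketch is the structure: the mean and variance paths each have their own input and transition matrices, which is where the factor of two in memory relative to a standard RNN comes from.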
Updates to the network at step $t$ are computed using Truncated Backpropagation Through Time (TBTT) (boden2002guide) in combination with the standard PBP update procedure, whereby the network is ‘unrolled’ by some $T$ time steps, and the weights are updated in reverse order, from time step $T$ back to $t=1$. First, the marginal log-likelihoods for each time step are computed (up to an additive constant):
$$\log Z_t = -0.5\left(\log v_{L,t} + \frac{(y_t - m_{L,t})^{2}}{v_{L,t}}\right)$$  (10)
These are then used to update the weights in each of the four weight matrices per layer, as per the typical PBP update rule:
$$m_{t}^{new} = m_{t} + v_{t}\frac{\partial \log Z_t}{\partial m_{t}} \quad \forall\, m_{t} \in \mathbf{W}_{m}$$  (11)
$$m_{ht}^{new} = m_{ht} + v_{ht}\frac{\partial \log Z_t}{\partial m_{ht}} \quad \forall\, m_{ht} \in \mathbf{W}_{hm}$$  (12)
$$v_{t}^{new} = v_{t} - v_{t}^{2}\left[\left(\frac{\partial \log Z_t}{\partial m_{t}}\right)^{2} - 2\frac{\partial \log Z_t}{\partial v_{t}}\right] \quad \forall\, v_{t} \in \mathbf{W}_{v}$$  (13)
$$v_{ht}^{new} = v_{ht} - v_{ht}^{2}\left[\left(\frac{\partial \log Z_t}{\partial m_{ht}}\right)^{2} - 2\frac{\partial \log Z_t}{\partial v_{ht}}\right] \quad \forall\, v_{ht} \in \mathbf{W}_{hv}$$  (14)
For incorporating prior factors into $q$, we follow the same procedure for updating the $\alpha$ and $\beta$ parameters described in (hernandezlobato_probabilistic_2015), but replace the likelihood $Z$ with the mean of $Z_t$ over all time steps:
$$Z = \frac{1}{T}\sum_{t=1}^{T} Z_t$$  (15)
The same procedure is followed for parameters $Z_1$ and $Z_2$ used in the $\alpha$ and $\beta$ updates for standard PBP, as described in (hernandezlobato_probabilistic_2015). As with standard BPTT, this process is repeated for each time step $t \in \{1, \dots, T\}$ for each minibatch within each epoch.
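Eq. (15) averages the likelihoods $Z_t$, not the log-likelihoods, while the forward pass yields $\log Z_t$. A minimal sketch of the conversion is given below; working in log space and shifting by the largest term before exponentiating is an implementation choice assumed here for numerical stability, not something the paper specifies, and the function name is hypothetical.

```python
import math


def log_mean_likelihood(log_z_steps):
    """Return log Z, where Z = (1/T) * sum_t Z_t (Eq. 15), from per-step log Z_t."""
    T = len(log_z_steps)
    peak = max(log_z_steps)
    # Shift by the largest term so that exp() cannot underflow for every term.
    total = sum(math.exp(lz - peak) for lz in log_z_steps)
    return peak + math.log(total / T)


log_z = log_mean_likelihood([math.log(0.2), math.log(0.4)])
print(math.exp(log_z))  # mean of 0.2 and 0.4
```

Averaging in this way differs from averaging the $\log Z_t$ values directly, which would give the geometric rather than the arithmetic mean of the per-step likelihoods.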
4 Methods
Many tasks used for evaluating the performance of RL methods, such as arcade learning environments, would not obviously benefit from a safe RL approach. In order to transparently and reproducibly assess the effects of our approach to uncertainty quantification, we decided that the tasks used for this analysis should have the following qualities:

• The task should reflect a real-world application of RL, rather than a trivial toy problem.
• The task should be simple enough that the results can be clearly interpreted.
• The task should facilitate a comprehensive combination of test cases for evaluating the impact of using uncertainty quantification, including exposing the agent to novel behaviours, noise, and missing data.
We felt that the task of collision avoidance, as in (lutjens2018safe), fulfilled these criteria and was a sensible choice for comprehensively evaluating our methods. As such, in this work we use the PBP-RNN architecture described above for a collision prediction network applied to a collision avoidance task. This network is used within a Model Predictive Controller (MPC) which selects a motion primitive $u$ from a set of motion primitives $U$ as in (lutjens2018safe). Similarly to (lutjens2018safe), $U$ contains 11 discrete motion primitives spanning heading angles $\alpha_h \in [-\frac{\pi}{5}, \frac{\pi}{5}]$ of length $h = 0.05$, and the MPC chooses the lowest-cost motion primitive according to the following criterion:
$$u_{t:t+h}^{*} = \operatorname{argmin}(\lambda_v V_{coll}^{i} + \lambda_c P_{coll}^{i} + \lambda_d d_{goal})$$  (16)
where $P_{coll}^{i}$ is the collision prediction for motion primitive $i$, $V_{coll}^{i}$ denotes the variance associated with the collision prediction, $d_{goal}$ is the distance to the goal, and $\lambda_v$, $\lambda_c$, $\lambda_d$ are their respective coefficients. In this work, we use an epsilon-greedy policy (mnih2015human), decreasing $\epsilon$ by $\frac{\epsilon}{50}$ after each episode, to continue adding new experience while monotonically decreasing the probability of random actions. Previous work demonstrates that starting with low $\lambda_v$ and increasing $\lambda_v$ over time is helpful for escaping local minima (lutjens2018safe). As such, we multiply $\lambda_v$ by $1-\epsilon$, so that $\lambda_v$ increases as $\epsilon$ decreases.
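The selection rule in Eq. (16), together with the $(1-\epsilon)$ scaling of $\lambda_v$, can be sketched as follows. The function names are hypothetical, the even spacing of the 11 heading angles is an assumption, and the sketch assumes that the distance to the goal is evaluated per primitive. The default coefficients are the values reported in the training procedure ($\lambda_c = 25$, $\lambda_v = 200$, $\lambda_d = 3$).

```python
import math


def heading_angles(n=11, max_angle=math.pi / 5):
    """Heading angles in [-max_angle, max_angle]; even spacing is an assumption."""
    step = 2 * max_angle / (n - 1)
    return [-max_angle + i * step for i in range(n)]


def select_primitive(p_coll, v_coll, d_goal, epsilon,
                     lam_c=25.0, lam_v=200.0, lam_d=3.0):
    """Return the index of the lowest-cost motion primitive (Eq. 16).

    lam_v is scaled by (1 - epsilon), so the variance penalty grows as
    exploration decays. d_goal is assumed to hold one distance per primitive.
    """
    scaled_lam_v = lam_v * (1.0 - epsilon)
    costs = [scaled_lam_v * v + lam_c * p + lam_d * d
             for p, v, d in zip(p_coll, v_coll, d_goal)]
    return min(range(len(costs)), key=costs.__getitem__)


# Three illustrative primitives: the first is slightly closer to the goal
# than the second but has a much more uncertain collision prediction.
p_coll = [0.1, 0.1, 0.3]
v_coll = [0.05, 0.001, 0.001]
d_goal = [1.0, 1.1, 0.9]
print(select_primitive(p_coll, v_coll, d_goal, epsilon=1.0))  # prints: 0
print(select_primitive(p_coll, v_coll, d_goal, epsilon=0.0))  # prints: 1
```

With $\epsilon = 1$ the variance term is switched off and the uncertain primitive wins on distance; with $\epsilon = 0$ its variance dominates the cost and the controller prefers the more certain alternative.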
4.1 Environment
For the collision avoidance task, we use a multi-agent OpenAI Gym (brockman2016openai) environment based on (lowe2017multi) with two agents. The agents start each episode facing each other, and each agent is assigned a goal, as illustrated in Figure 1. One agent acts as a dynamic obstacle, following a collaborative Reciprocal Velocity Obstacle (RVO) policy (van2011reciprocal), while the other uses the MPC and collision prediction network. The observation $\mathbf{o}_t$ at time $t$ is defined as:
$$\mathbf{o}_t = [a_{1p,t}, a_{1v,t}; a_{2p,t}, a_{2v,t}, u_t]$$  (17)
where $[a_{1p}, a_{1v}]$ and $[a_{2p}, a_{2v}]$ are the positions and velocities for agents 1 and 2 respectively and $u_t$ is the evaluated motion primitive. For the input to the RNN (and the LSTM, which we compare against), we feed the input $\mathbf{o}$, which comprises $l = 8$ observations, from $t-l$ to $t$. After each episode, all inputs are collated into a set of input features $\mathbf{X}$, and a set of labels $\mathbf{y}$ is populated with the collision label, $y \in \{0, 1\}$, for the episode. These are collated into an experience pool of examples, from which samples are randomly drawn to train the collision prediction network.
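A sketch of how an input sequence of $l = 8$ observations might be assembled, including the zero padding that the paper applies to the first time steps of an episode, is given below. The function name, the choice to pad at the front, and the treatment of index `t` as inclusive are assumptions for illustration.

```python
def build_window(observations, t, length=8):
    """Return the `length` most recent observations up to index t (inclusive).

    When fewer than `length` observations exist, zero vectors are added at the
    front (front padding is an assumption; the paper only says inputs are
    zero padded).
    """
    dim = len(observations[0])
    start = max(0, t - length + 1)
    window = [list(o) for o in observations[start:t + 1]]
    padding = [[0.0] * dim for _ in range(length - len(window))]
    return padding + window


# Ten dummy two-dimensional observations.
episode = [[float(i), float(-i)] for i in range(1, 11)]
early = build_window(episode, t=2)   # 5 zero vectors followed by 3 real observations
late = build_window(episode, t=9)    # the last 8 observations, no padding
print(len(early), len(late))  # prints: 8 8
```

Keeping every window at the same length means the recurrent network is always unrolled for the same number of steps, regardless of how far into the episode the agent is.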
The results reported here were obtained after $\epsilon$ reached $0.1$, at which point it was set to $0$, and the test cases were run with no further training cycles. While $\epsilon > 0$, the dynamic obstacle starts in the same location each episode, $\mathrm{pos}_{xy,o} = [0, 0.25]$, and the agent starts at $\mathrm{pos}_{xy,a} = [0, 0.25]$. Once $\epsilon$ is set to $0$, we obtain results for 20 episodes using this initialization, after which the starting position of the dynamic obstacle, $\mathrm{pos}_{y,o}$, is randomly generated from the range $-0.25$ to $0.25$, as illustrated in Figure 1. At this point we switch to a non-collaborative policy, for which the obstacle no longer tries to avoid the agent, and simply moves in a straight line towards its goal. This combination of random initialization and policy switching for the dynamic obstacle produces novel dynamic object behaviour which is used for testing.
4.2 Network Parameters
Using previous work as a guide (lutjens2018safe), the LSTM Ensemble consists of 5 single-layer LSTM networks, each with 16 units with linear activations and a Rectified Linear Unit (ReLU) activation on the output. This is trained using MSE loss and Adam optimization with an initial learning rate of 0.001. For inference (as in (lutjens2018safe)), we execute 20 forward passes per network with different dropout masks (setting dropout $p = 0.7$), and compute the sample mean and variance from the resulting distribution of 100 predictions. We use an equivalent architecture for the PBP-RNN: a single-layer network consisting of 16 units with linear activations. For both networks we set the number of time steps $T$ equal to the sequence length $l_s$. For shorter sequences (for the first 7 time steps in each episode) we zero pad the input up to $T$.
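The MDE inference procedure described above (5 networks, 20 stochastic passes each, pooled into 100 predictions) can be sketched as follows. The networks here are stand-in callables that return a noisy scalar, mimicking a forward pass with a fresh dropout mask; they are not LSTMs. The function name is hypothetical, and the use of the unbiased sample variance is an assumption.

```python
import random
import statistics


def mde_predict(networks, x, passes=20):
    """Pool `passes` stochastic forward passes from every ensemble member and
    return the sample mean and variance of the pooled predictions."""
    predictions = [net(x) for net in networks for _ in range(passes)]
    return statistics.fmean(predictions), statistics.variance(predictions)


# Stand-in ensemble: each member has its own offset plus per-call noise.
rng = random.Random(0)
networks = [lambda x, b=b: b + x + rng.gauss(0.0, 0.05)
            for b in (0.0, 0.1, 0.2, 0.3, 0.4)]
mean, var = mde_predict(networks, x=0.5)
print(mean, var)
```

The cost structure is visible in the list comprehension: one uncertainty estimate requires `len(networks) * passes` forward passes, whereas the PBP-RNN produces a mean and a variance from a single pass.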
4.3 Training Procedure
For both networks evaluated here we used TensorFlow on a single Power 8+ compute node. During the training cycles, the dynamic obstacle follows a collaborative RVO policy. We first run 100 episodes to seed the experience pool, for which the agent selects random actions. Following this we draw 500 random samples from the pool to train the collision prediction network. We then follow a standard observe-act-train procedure, repeating training after every 10 episodes. To balance the training data, we draw half of the samples at random from examples where collision labels are $y=0$ and half from a pool where collision labels are $y=1$.
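The class-balanced draw from the experience pool can be sketched as below. The pool layout (a list of `(features, label)` pairs), the function name, and sampling with replacement are assumptions; the paper states only that half of the samples come from each class.

```python
import random


def draw_balanced_batch(pool, n_samples, rng):
    """Draw n_samples (features, label) pairs, half with label 0 and half with label 1."""
    negatives = [ex for ex in pool if ex[1] == 0]
    positives = [ex for ex in pool if ex[1] == 1]
    half = n_samples // 2
    # Sampling with replacement is an assumption; it keeps the draw valid
    # when one class has fewer than `half` examples.
    batch = rng.choices(negatives, k=half) + rng.choices(positives, k=half)
    rng.shuffle(batch)
    return batch


# Illustrative pool in which collisions (label 1) are the minority class.
pool = [((i,), 0) for i in range(90)] + [((i,), 1) for i in range(10)]
batch = draw_balanced_batch(pool, n_samples=500, rng=random.Random(0))
print(len(batch), sum(label for _, label in batch))  # prints: 500 250
```

Balancing matters here because collisions are likely to be rare once the agent improves, and an unbalanced pool would bias the collision predictor towards always predicting no collision.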
For PBP, we run an initial phase of 5 epochs of training after the first 100 episodes, and 2 epochs of training after each subsequent set of 10 episodes. This relatively small number of training epochs is guided by empirical evidence and prior work demonstrating that PBP requires relatively few epochs for training (benatan2018practical; hernandezlobato_probabilistic_2015). We found that the MDE performed better with comparatively more epochs of training, and so trained this for 100 epochs after the first 100 episodes, followed by 10 epochs of training after each subsequent set of 10 episodes.
For the MPC, we use the same values for the $\lambda$ parameters as recommended in (lutjens2018safe) for both the MDE and PBP-RNN, setting $\lambda_c = 25$, $\lambda_v = 200$ and $\lambda_d = 3$.
5 Results
Table 1
| Test case | PBP-RNN $\mu$ | MDE $\mu$ | PBP-RNN $\sigma^2$ | MDE $\sigma^2$ |
|---|---|---|---|---|
| Train | 0.002 | 0.001 | 0.002 | 0.001 |
| Novel | 0.003 | 0.001 | 0.002 | 0.0002 |
| Novel + noise ($\lambda_\xi = 0.005$) | 0.006 | 0.001 | 0.005 | 0.0003 |
| Novel + dropped obs. ($n_{dropped} = 5$) | 0.039 | 0.007 | 0.044 | 0.003 |
Table 2
| Test case | PBP-RNN $\mu$ | MDE $\mu$ | PBP-RNN $\sigma^2$ | MDE $\sigma^2$ |
|---|---|---|---|---|
| Train | 0.685 | $2.1\times 10^{5}$ | 0.108 | $3.8\times 10^{5}$ |
| Novel | 0.555 | $6.2\times 10^{4}$ | 1.048 | $1.7\times 10^{5}$ |
| Novel + noise ($\lambda_\xi = 0.005$) | 2.479 | $8.6\times 10^{4}$ | 1.440 | $2.5\times 10^{5}$ |
| Novel + dropped obs. ($n_{dropped} = 5$) | 8.760 | 66.871 | 4.707 | 27.120 |
5.1 Collision Detection Network Performance
Following the completion of all training cycles, we collect a set of in-distribution (training) and a set of out-of-distribution (novel) test data, each for 20 episodes; the entire process is repeated 10 times. For the out-of-distribution data, we randomly initialize the dynamic obstacle as described in Section 4.1. We use this data to evaluate the performance of the collision prediction network through obtaining the log-likelihood, false positive rate (FPR) (the number of times a collision is predicted but not encountered) and false negative rate (FNR) (the number of times a collision occurred when one was not predicted). Additionally, for all non-collision episodes, we record the minimum distances between the agent and obstacle in order to build an impression of the caution exhibited by the agent.
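The two error measures can be computed from paired per-episode outcomes as sketched below. The paper describes them as counts; normalising by the number of actual non-collisions and collisions, as is conventional for rates, is an assumption here, as is the function name.

```python
def collision_error_rates(predicted, actual):
    """Return (false_positive_rate, false_negative_rate) over paired episode outcomes.

    predicted[i] / actual[i] are True when a collision was predicted / occurred.
    """
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    # Guard against division by zero when a class is absent from the data.
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


predicted = [True, True, False, False]
actual = [True, False, True, False]
print(collision_error_rates(predicted, actual))  # prints: (0.5, 0.5)
```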
As demonstrated in Figure 2, the PBP-RNN achieves substantially better results on the training distribution, with 0 false positives, false negatives, and collisions. As would be expected, we see these figures degrade somewhat in the case of novel data for both the PBP-RNN and the MDE; however, the PBP-RNN continues to demonstrate better overall performance, with a significantly lower FPR and fewer collisions.
The minimum distance plots in Figure 2 (far right top and bottom) demonstrate that the agent using the PBP-RNN leaves a greater margin when passing the obstacle than the agent using the MDE, indicating more cautious behaviour. This is supported by the variance results in Table 1, which show that the PBP-RNN’s variance values increase between the training and novel scenarios, whereas this is not the case for MDE. This indicates that the PBP-RNN has a comparatively higher quality model of uncertainty, as its variances increase with the degree of novelty in the observations, thus resulting in more cautious behaviour.
5.2 Collision Avoidance with Noise and Dropped Observations
Uncertainty-based methods have demonstrated advantageous performance in the presence of noise or dropped observations due to their ability to leverage model uncertainty when selecting actions (lutjens2018safe). Here, we investigate the performance of the MDE and PBP-RNN approaches with varying levels of additive noise by adding a matrix of noise terms $\boldsymbol{\xi}$ multiplied by an incrementally increasing noise coefficient $\lambda_\xi$ to our matrix of observations:
$$\mathbf{o}_{t\xi} = \mathbf{o}_t + \boldsymbol{\xi}\lambda_\xi$$  (18)
where $\boldsymbol{\xi}$ is a matrix of values each generated from the distribution $\mathcal{N}(0,1)$. For each method, we run 10 episodes for each value of $\lambda_\xi$ after the initial training phase, and repeat this process 10 times. We again use the non-collaborative policy for the dynamic obstacle. As Figure 3 demonstrates, the PBP-RNN exhibits better robustness to noise, with fewer collisions on average. This performance can again be explained by the change in the PBP-RNN’s variance output as the level of noise increases. This is demonstrated in the bottom plot of Figure 3, which shows $\sigma^2$ steadily increasing for the PBP-RNN, whereas this is not the case for MDE, again indicating that the PBP-RNN has a better model of uncertainty.
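The noise injection of Eq. (18) can be sketched as follows, with the observation matrix represented as a list of rows. The function name and the explicit random generator argument are illustrative choices, not part of the paper.

```python
import random


def add_observation_noise(obs, noise_coeff, rng):
    """Return obs + xi * noise_coeff, with xi ~ N(0, 1) drawn per entry (Eq. 18)."""
    return [[value + rng.gauss(0.0, 1.0) * noise_coeff for value in row]
            for row in obs]


obs = [[0.0, 0.25, 0.1, 0.0], [0.0, 0.24, 0.1, 0.0]]
noisy = add_observation_noise(obs, noise_coeff=0.005, rng=random.Random(0))
print(noisy)
```

Because $\boldsymbol{\xi}$ has unit variance, $\lambda_\xi$ is directly the standard deviation of the perturbation applied to each entry.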
In the final test case, we combine the non-collaborative policy with dropped observations, whereby, for increasing values of $n_{dropped}$, between 1 and 8 observations in the sequence are randomly selected and set to zero, simulating the kind of behaviour that may occur in electronic sensor systems. The PBP-RNN again exhibits better performance when compared with the MDE, as shown in Figure 4, and the same underlying theme is evident: the $\sigma^2$ values for the PBP-RNN increase with the number of dropped observations, whereas these remain very small (although do increase marginally) for the MDE.
Tables 1 and 2 further validate the model quality of the PBP-RNN with respect to the MDE. In Table 1, we see that the PBP-RNN’s mean uncertainty increases as the test scenarios deviate from the training distribution. While the MDE’s variance does eventually increase, it requires the combination of the non-collaborative policy and a significant number of dropped observations for this to occur. Crucially, Table 2 illustrates that the log-likelihood values for the PBP-RNN and MDE differ significantly, with the PBP-RNN obtaining high and consistent log-likelihood scores, while MDE achieves low and inconsistent log-likelihoods. These poor log-likelihoods are a product of MDE’s overconfident predictions due to its lack of a descriptive posterior, a known drawback of MC Dropout (lutjens2018safe; pearce2018bayesian), which is clearly documented here in the variance and log-likelihood values of our results.
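The dropped-observation perturbation can be sketched as below: `n_dropped` positions in the input sequence are chosen at random without repetition and replaced by zero vectors. The function name and the list-of-rows representation are assumptions for illustration.

```python
import random


def drop_observations(sequence, n_dropped, rng):
    """Return a copy of `sequence` with n_dropped randomly chosen observations zeroed."""
    dropped = set(rng.sample(range(len(sequence)), n_dropped))
    return [[0.0] * len(row) if i in dropped else list(row)
            for i, row in enumerate(sequence)]


sequence = [[1.0, 2.0] for _ in range(8)]
corrupted = drop_observations(sequence, n_dropped=5, rng=random.Random(0))
print(sum(row == [0.0, 0.0] for row in corrupted))  # prints: 5
```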
5.3 Computational Considerations
We analyzed compute time for inference with both the PBP-RNN and the MC Dropout Ensemble. For MDE, we ran the forward passes serially and recorded a mean inference time of $119.8 \pm 3.2\,\mathrm{ms}$. In the case of the PBP-RNN, we recorded a mean inference time of $29.7 \pm 0.7\,\mathrm{ms}$. While the inference time for MDE can be greatly reduced by running forward passes in parallel (lutjens2018safe), it will still require significantly more compute resource when compared with PBP, due to the requirement of multiple networks and multiple forward passes. Furthermore, in the case of MDE, the compute time required will increase with the quality requirements of the model uncertainty estimates.
Additionally, the PBP-RNN required far fewer epochs of training to achieve better performance than the MDE, with the PBP-RNN executing a total of 27 epochs of training, while the MDE executed 210 epochs of training.
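These totals are consistent with the schedule in Section 4.3. Let $k$ denote the number of retraining rounds after the initial one; $k$ is not stated explicitly and is inferred here from the reported totals:

$$5 + 2k = 27 \;\Rightarrow\; k = 11, \qquad 100 + 10k = 100 + 110 = 210$$

Both totals agree on $k = 11$, i.e. 110 episodes of observe-act-train after the seeding phase, and correspond to roughly an eightfold reduction in training epochs ($210 / 27 \approx 7.8$).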
The PBP-RNN is also advantageous in terms of memory footprint, as it only requires twice the memory of its non-probabilistic alternative, yet provides fully Bayesian uncertainty estimates. In contrast, the quality of the MDE’s uncertainty estimates depends on the number of networks in the ensemble, so producing uncertainty estimates of reasonable quality requires larger ensembles and a larger memory footprint (lakshminarayanan2017simple).
5.4 Variance Weights for Long Term Memory
This work demonstrates that a PBP-RNN is capable of achieving superior performance when compared with an LSTM-based approach. This is particularly interesting when considering that traditional RNNs are typically only performant on very short sequences (graves_supervised_2012), and LSTMs are the state-of-the-art for tasks such as the collision avoidance task used here (lutjens2018safe). We hypothesize that the performance obtained here is possible due to the variance weights which, through learning variances associated with different features at different time steps, may function similarly to the gates used in an LSTM; however, more rigorous investigation is required to confirm whether this is the case.
6 Conclusion
While there has been previous work on Bayesian RNNs (fortunato2017bayesian), our work is the first to demonstrate that PBP can be effectively applied to an RNN architecture to produce a recurrent network with fully Bayesian model uncertainty estimates. The resulting network is less demanding on both computation and memory resources than popular dropout and ensemble-based methods, such as the recently proposed MC Dropout Ensemble (lutjens2018safe). Crucially, our approach produces much higher quality uncertainty estimates, which are necessary for improving RL agent performance in novel scenarios, as demonstrated by the model’s competitive performance in dynamic obstacle avoidance tasks. In the case of Safe RL, this directly translates to safer behaviour on the part of the RL agent, a crucial step towards feasibly incorporating more complex machine learning algorithms in safety-critical tasks.
Acknowledgements
Thanks to Clyde Fare, Peter Fenner and Mohab Elkaref for useful discussion. This work was supported by the Science and Technology Facilities Council Hartree Centre’s Innovation Return on Research program, funded by the U.K. Department for Business, Energy, and Industrial Strategy.