Fully Bayesian Recurrent Neural Networks for Safe Reinforcement Learning
Reinforcement Learning (RL) has demonstrated state-of-the-art results in a number of autonomous system applications, however many of the underlying algorithms rely on black-box predictions. This results in poor explainability of the behaviour of these systems, raising concerns as to their use in safety-critical applications. Recent work has demonstrated that uncertainty-aware models exhibit more cautious behaviours through the incorporation of model uncertainty estimates. In this work, we build on Probabilistic Backpropagation to introduce a fully Bayesian Recurrent Neural Network architecture. We apply this within a Safe RL scenario, and demonstrate that the proposed method significantly outperforms a popular approach for obtaining model uncertainties in collision avoidance tasks. Furthermore, we demonstrate that the proposed approach requires less training and is far more efficient than the current leading method, both in terms of compute resource and memory footprint.
Preprint. Under review.
1 Introduction
Reinforcement Learning (RL) has achieved state-of-the-art results in a variety of applications, from Atari games (mnih2013playing) to autonomous vehicles (zhang2016learning; shalev2016safe). The models underlying these systems are often black-box in nature, and do not provide estimates of model uncertainty. This leads to over-confident predictions on out-of-distribution data, and hence to poor performance in novel scenarios (amodei2016concrete; lakshminarayanan2017simple; hendrycks2016baseline). Recent work (kahn2017uncertainty; lutjens2018safe) demonstrates that more cautious behaviours can be attained by incorporating model uncertainty estimates into the decision-making processes of RL agents. In their work, Lütjens et al. demonstrate that a Long Short-Term Memory (LSTM) ensemble using MC Dropout (gal_dropout_2015) is able to approximate model uncertainties. While this demonstrates a clear advantage over a standard LSTM with no uncertainty estimates, the accuracy of their approach’s uncertainty estimates is tied to the size of the ensemble and the number of dropout forward passes executed. Thus, obtaining high-quality model uncertainty estimates can quickly become demanding for both compute and memory resources.
An alternative to the approximate Bayesian inference provided by MC Dropout and ensemble methods is Probabilistic Backpropagation (PBP) (hernandez-lobato_probabilistic_2015) - a fully Bayesian method for training Bayesian Neural Networks (BNNs), which thus produces fully-Bayesian model uncertainty estimates. Unlike ensemble-based methods (lakshminarayanan2017simple), PBP provides a probabilistic neural network architecture at only twice the memory footprint of its non-probabilistic equivalent. As PBP is fully Bayesian, it only requires training a single network, and inference only requires one forward pass - making it far less computationally intensive than ensemble (lakshminarayanan2017simple), MC Dropout (gal_dropout_2015), or combined (lutjens2018safe) methods.
In this paper we leverage PBP to produce a probabilistic variant of a standard Recurrent Neural Network (RNN). We apply this to a collision avoidance task, and demonstrate that the PBP-RNN exhibits competitive performance when compared with the MC Dropout Ensemble (MDE) described in (lutjens2018safe), and achieves higher-quality model uncertainty estimates.
2 Related Work
2.1 Safe Reinforcement Learning
Safe RL involves learning policies which maximize performance criteria, e.g. reward, while accounting for safety constraints (garcia2015comprehensive; berkenkamp2017safe). It is a field of study that is becoming increasingly important as more and more automated systems are being designed to operate in safety-critical situations (zhang2016learning; chen2019model; shalev2016safe). A key issue with many existing approaches is their lack of uncertainty quantification - agents’ inability to quantify what they ‘don’t know’. By incorporating uncertainty, RL agents can make better decisions by opting for actions for which they are more confident of a positive outcome. Incorporating this principle in safety-constrained tasks results in agents choosing safer actions (kahn2017uncertainty; lutjens2018safe), and is crucial for making RL feasible for safety-critical scenarios.
Existing work on Safe RL has typically aimed to discover uncertainty in the environment or model (garcia2015comprehensive) - with the former commonly being the goal in the case of risk-sensitive RL (RSRL) (mihatsch2002risk; shen2014risk). In this work we focus on model uncertainty, with the aim of quantifying the model’s confidence on out-of-distribution data and using this uncertainty quantification to produce safer decisions.
2.2 Probabilistic Neural Networks
There are a variety of approaches for modeling distributions and producing accurate uncertainty estimates (williams2006gaussian; ghahramani2015probabilistic); however, many of these methods do not scale well to large datasets. Over recent years neural network-based methods have demonstrated state-of-the-art performance on a wide range of tasks (cho2014learning; he2016deep; ledig2017photo), partly due to their ability to leverage large amounts of data. This has helped to drive interest in the development of a variety of probabilistic neural network approaches, which are capable of modeling uncertainty while also scaling to large datasets (gal_dropout_2015; hernandez-lobato_probabilistic_2015; snoek_scalable_2015).
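In probabilistic neural networks, network weights are modeled as distributions, rather than as point estimates, allowing the network to encode model uncertainty. If we consider $\mathbf{y}$ to be an $N$-dimensional vector of targets $y_n$, and $\mathbf{X}$ to be an $N \times D$ matrix of features $\mathbf{x}_n$, then we can define the likelihood for $\mathbf{y}$ given the network weights $\mathcal{W}$, $\mathbf{X}$, and noise precision $\gamma$ as:
$$p(\mathbf{y} \mid \mathcal{W}, \mathbf{X}, \gamma) = \prod_{n=1}^{N} \mathcal{N}\left(y_n \mid f(\mathbf{x}_n; \mathcal{W}), \gamma^{-1}\right)$$
where $f(\mathbf{x}_n; \mathcal{W})$ is the network output for input $\mathbf{x}_n$. We specify a Gaussian prior for each weight in our network, giving us:
$$p(\mathcal{W} \mid \lambda) = \prod_{l=1}^{L} \prod_{i=1}^{V_l} \prod_{j=1}^{V_{l-1}+1} \mathcal{N}\left(w_{ij,l} \mid 0, \lambda^{-1}\right)$$
where $w_{ij,l}$ denotes a weight in weight matrix $\mathbf{W}_l$ at layer $l$, $V_l$ is the number of neurons in layer $l$, and $\lambda$ is a precision parameter. For additional details pertaining to $\lambda$ and its hyper-prior, please see (hernandez-lobato_probabilistic_2015). Using the above definition, we can obtain the posterior distribution for parameters $\mathcal{W}$, $\gamma$ and $\lambda$ by applying Bayes’ rule:
$$p(\mathcal{W}, \gamma, \lambda \mid \mathcal{D}) = \frac{p(\mathbf{y} \mid \mathcal{W}, \mathbf{X}, \gamma)\, p(\mathcal{W} \mid \lambda)\, p(\lambda)\, p(\gamma)}{p(\mathbf{y} \mid \mathbf{X})}$$
where $\mathcal{D} = (\mathbf{X}, \mathbf{y})$. Predictions for some output $y_\star$ given a new input $\mathbf{x}_\star$ can then be obtained through:
$$p(y_\star \mid \mathbf{x}_\star, \mathcal{D}) = \int p(y_\star \mid \mathbf{x}_\star, \mathcal{W}, \gamma)\, p(\mathcal{W}, \gamma, \lambda \mid \mathcal{D})\, d\gamma\, d\lambda\, d\mathcal{W}$$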
As this integral is computationally intractable, several approximations have been proposed (hernandez-lobato_probabilistic_2015; gal_dropout_2015; ghosh_assumed_2016). In this work, we use PBP to train a fully Bayesian neural network, as opposed to the approximation provided by MC Dropout or ensemble-based methods.
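To make the role of the predictive integral concrete, the following is a minimal sketch on a toy Bayesian linear model rather than a neural network. It assumes a factorised Gaussian posterior over the weights and a known noise precision, so the integral has a closed form that a Monte Carlo average over sampled weights can be checked against. The function names and the numbers in the usage note are hypothetical and are not taken from the paper.

```python
import numpy as np

def mc_predictive(x, m, v, gamma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the predictive mean and variance.

    Assumes y = w.x + noise, with w ~ N(m, diag(v)) and noise precision gamma.
    """
    rng = np.random.default_rng(seed)
    # Draw weight vectors from the (assumed) Gaussian posterior.
    w = rng.normal(m, np.sqrt(v), size=(n_samples, m.size))
    # Push each sample through the model and add observation noise.
    y = w @ x + rng.normal(0.0, np.sqrt(1.0 / gamma), size=n_samples)
    return y.mean(), y.var()

def exact_predictive(x, m, v, gamma):
    """Closed-form predictive moments for the same toy model."""
    mean = float(m @ x)
    var = float(v @ (x * x) + 1.0 / gamma)
    return mean, var
```

For example, with `x = np.array([1.0, 2.0])`, `m = np.array([0.5, -1.0])`, `v = np.array([0.1, 0.2])` and `gamma = 4.0`, the two functions should agree closely. For a deep network no such closed form exists, which is why approximations are needed.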
2.3 Probabilistic Backpropagation
The PBP network can be viewed as a modification of a standard Multilayer Perceptron (MLP) wherein each weight is defined by a one-dimensional Gaussian and correspondingly is represented by two weights - a mean $m$ and a variance $v$.
Similarly to standard backpropagation, PBP training consists of two phases. The first phase comprises forward propagation of the input features through the network to obtain the marginal log-likelihood, $\log Z$ (instead of the loss, which is used in typical backpropagation). The gradients of $\log Z$ with respect to the mean and variance weights are then backpropagated using reverse-mode differentiation, and the resulting derivatives are used to update the mean and variance weights of the network.
The update rule for PBP obtains the parameters of the new Gaussian beliefs $q^{new}(w) = \mathcal{N}(w \mid m^{new}, v^{new})$ that minimize the Kullback-Leibler (KL) divergence between our beliefs $s$ and $q^{new}$ as a function of $m$, $v$ and the gradient of $\log Z$:
$$m^{new} = m + v \frac{\partial \log Z}{\partial m}$$
$$v^{new} = v - v^{2} \left[ \left( \frac{\partial \log Z}{\partial m} \right)^{2} - 2 \frac{\partial \log Z}{\partial v} \right]$$
These rules ensure moment matching between $q^{new}$ and $s$, guaranteeing that the distributions have the same mean and variance. These are the update equations used in the PBP-RNN described in this paper. For further details on PBP, please see (hernandez-lobato_probabilistic_2015).
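The moment-matching update can be checked on a case where the exact posterior is known. The sketch below applies the standard PBP rule from (hernandez-lobato_probabilistic_2015) to a single Gaussian weight observed through a Gaussian likelihood, for which $Z = \mathcal{N}(y \mid m, v + \sigma^2)$. The function names and the toy likelihood are assumptions made for illustration; in a real network the gradients of $\log Z$ come from backpropagation.

```python
import numpy as np

def pbp_update(m, v, dlogz_dm, dlogz_dv):
    """PBP moment-matching update for a mean m and variance v."""
    m_new = m + v * dlogz_dm
    v_new = v - v**2 * (dlogz_dm**2 - 2.0 * dlogz_dv)
    return m_new, v_new

def gaussian_logz_grads(m, v, y, noise_var):
    """Gradients of log Z for Z = N(y | m, v + noise_var) (toy likelihood)."""
    s = v + noise_var
    d_m = (y - m) / s
    d_v = -0.5 / s + 0.5 * (y - m) ** 2 / s**2
    return d_m, d_v
```

With a prior of `m = 0.0`, `v = 1.0`, an observation `y = 2.0` and `noise_var = 1.0`, the update returns a mean of 1.0 and a variance of 0.5, which matches the exact conjugate Gaussian posterior.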
3 Probabilistic Backpropagation for Recurrent Neural Networks
Recurrent Neural Networks (RNNs), and specifically LSTMs, are the current state-of-the-art for dynamic obstacle avoidance tasks (alahi2016social; lutjens2018safe; vemula2018social). This is due to their ability to encode contextual dependencies, allowing them to accurately model the temporal patterns of dynamic obstacles. As such, we have chosen to use an RNN for our collision prediction network. In order to provide fully-Bayesian model uncertainty estimates, we adapt a standard RNN architecture to use Probabilistic Backpropagation (PBP).
A standard RNN consists of two weight matrices per layer: the weight matrix $W$, which is equivalent to the weight matrices used in a standard MLP, and the weight matrix $U$, which holds the transition weights. For a given input $x_t$ at a time step $t$, the output $y_t$ and hidden state $h_t$ at step $t$ are computed as:
$$y_t = h_t = \sigma(W x_t + U h_{t-1})$$
where $\sigma$ is some activation function and $h_{t-1}$ is the previous hidden state.
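As a point of reference for the probabilistic variant, a plain recurrent layer with an input matrix and a transition matrix can be sketched as follows. The names `W`, `U`, `rnn_step` and `rnn_forward` are illustrative choices, biases are omitted, and the initial hidden state is assumed to be zero.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, activation=np.tanh):
    """One recurrent step: combine the current input with the previous state."""
    return activation(W @ x_t + U @ h_prev)

def rnn_forward(xs, W, U, activation=np.tanh):
    """Run the layer over a sequence and return the hidden state at every step."""
    h = np.zeros(U.shape[0])  # assumed zero initial state
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W, U, activation)
        states.append(h)
    return np.stack(states)
```

Passing `activation=lambda z: z` gives the linear-activation case used for the networks in the experiments.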
For the PBP-RNN, we apply the PBP treatment of a standard MLP to an RNN: each of the input weights $W$ and transition weights $U$ is represented by mean and variance weights, resulting in four weight matrices in place of the two standard RNN matrices: $W^m$, $W^v$, $U^m$ and $U^v$. This produces two outputs from the network output layer at each time step - a mean $m_t$ and a variance $v_t$.
Updates to the network at step $t$ are computed using Truncated Backpropagation Through Time (TBTT) (boden2002guide) in combination with the standard PBP update procedure, whereby the network is ‘unrolled’ by some number of time steps $T$, and the weights are updated in reverse order - from the most recent time step back to the earliest unrolled step. First, the marginal log likelihoods $\log Z_t$ for each time step are computed.
These are then used to update the weights in each of the four weight matrices per layer, as per the typical PBP update rule described in Section 2.3.
For incorporating prior factors into $q$, we follow the same procedure for updating the $\alpha^\lambda$ and $\beta^\lambda$ parameters described in (hernandez-lobato_probabilistic_2015), but substitute the likelihood for the mean of $Z_t$ over all $T$ unrolled time steps:
$$\bar{Z} = \frac{1}{T} \sum_{t=1}^{T} Z_t$$
The same procedure is followed for parameters $Z_1$ and $Z_2$ used in the $\alpha^\lambda$ and $\beta^\lambda$ updates for standard PBP, as described in (hernandez-lobato_probabilistic_2015). As with standard BPTT, this process is repeated for each time step for each mini-batch within each epoch.
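One way to picture how a mean and a variance can be carried through a recurrent layer is the sketch below. It is not the authors' implementation: it assumes linear activations, independent Gaussian weights, a deterministic input, and it ignores both the correlations introduced by reusing the same weights across time steps and the input scaling used in PBP. The names `Wm`, `Wv`, `Um`, `Uv` for the four mean and variance matrices are hypothetical. Under those assumptions, the variance of each product of independent terms follows $\mathrm{Var}(ab) = \mu_a^2 \sigma_b^2 + \mu_b^2 \sigma_a^2 + \sigma_a^2 \sigma_b^2$.

```python
import numpy as np

def pbp_rnn_step(x_t, h_mean, h_var, Wm, Wv, Um, Uv):
    """Propagate the hidden-state mean and variance through one linear step."""
    mean = Wm @ x_t + Um @ h_mean
    var = (
        Wv @ (x_t**2)        # weight uncertainty on the (deterministic) input
        + Uv @ (h_mean**2)   # transition-weight uncertainty on the state mean
        + (Um**2) @ h_var    # state uncertainty carried through mean weights
        + Uv @ h_var         # interaction of both uncertainties
    )
    return mean, var

def pbp_rnn_forward(xs, Wm, Wv, Um, Uv):
    """Unroll over a sequence; return the final mean and variance."""
    n_units = Um.shape[0]
    h_mean, h_var = np.zeros(n_units), np.zeros(n_units)
    for x_t in xs:
        h_mean, h_var = pbp_rnn_step(x_t, h_mean, h_var, Wm, Wv, Um, Uv)
    return h_mean, h_var
```

When all variance weights are zero the output variance is zero and the mean reduces to a deterministic linear RNN, which is a useful sanity check on any implementation of this kind.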
Many tasks used for evaluating the performance of RL methods, such as arcade learning environments, would not obviously benefit from a safe RL approach. In order to transparently and reproducibly assess the effects of our approach to uncertainty quantification, we decided that the tasks which would be used for this analysis would include the following qualities:
The task should reflect a real-world application of RL, rather than a trivial toy problem.
The task should be simple enough that the results can be clearly interpreted.
The task should facilitate a comprehensive combination of test cases for evaluating the impact of using uncertainty quantification, including exposing the agent to novel behaviours, noise, and missing data.
We felt that the task of collision avoidance, as in (lutjens2018safe), fulfilled these criteria, and was a sensible choice to comprehensively evaluate our methods. As such, in this work we use the PBP-RNN architecture described above for a collision prediction network applied to a collision avoidance task. This network is used within a Model Predictive Controller (MPC) which selects a motion primitive $u$ from a set of motion primitives $U$ as in (lutjens2018safe). Similarly to (lutjens2018safe), $U$ contains 11 discrete motion primitives spanning a range of heading angles, and the MPC chooses the lowest-cost motion primitive according to the following criteria:
$$u^{*} = \arg\min_{u \in U} \left( \lambda_c P_{coll}(u) + \lambda_v \sigma^2_{coll}(u) + \lambda_g d_{goal}(u) \right)$$
where $P_{coll}(u)$ is the collision prediction for the motion primitive $u$ at time $t$, $\sigma^2_{coll}(u)$ denotes the variance associated with the collision prediction, $d_{goal}(u)$ is the distance to the goal, and $\lambda_c$, $\lambda_v$, $\lambda_g$ are their respective coefficients. In this work, we use an epsilon-greedy policy (mnih2015human), decreasing $\epsilon$ after each episode, to continue adding new experience while monotonically decreasing the probability of random actions. Previous work demonstrates that starting with a low $\lambda_v$ and increasing it over time is helpful for escaping local minima (lutjens2018safe). As such, we scale $\lambda_v$ by a factor tied to $\epsilon$, so that $\lambda_v$ increases as $\epsilon$ decreases.
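The selection step can be sketched as a weighted sum over the candidate motion primitives followed by an argmin. The function name, the argument names and the coefficient values in the usage note are hypothetical; the paper takes its coefficients from (lutjens2018safe) and they are not reproduced here.

```python
import numpy as np

def select_motion_primitive(p_coll, var_coll, d_goal, lam_c, lam_v, lam_g):
    """Return the index of the lowest-cost motion primitive and all costs.

    p_coll, var_coll and d_goal hold one value per candidate primitive.
    """
    p_coll = np.asarray(p_coll, dtype=float)
    var_coll = np.asarray(var_coll, dtype=float)
    d_goal = np.asarray(d_goal, dtype=float)
    cost = lam_c * p_coll + lam_v * var_coll + lam_g * d_goal
    return int(np.argmin(cost)), cost
```

With `p_coll = [0.9, 0.1, 0.1]`, `var_coll = [0.0, 0.5, 0.01]` and `d_goal = [1.0, 1.2, 1.5]`, setting `lam_v = 0` picks the second primitive, while `lam_v = 5` (with `lam_c = 10`, `lam_g = 1`) picks the third: the variance term steers the controller away from a prediction it is unsure about, at the price of a longer path.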
For the collision avoidance task, we use a multi-agent OpenAI Gym (brockman2016openai) environment based on (lowe2017multi) with two agents. The agents start each episode facing each other, and each agent is assigned a goal, as illustrated in Figure 1. One agent acts as a dynamic obstacle, following a collaborative Reciprocal Velocity Obstacle (RVO) policy (van2011reciprocal), while the other uses the MPC and collision prediction network. The observation at time $t$ is defined as:
$$o_t = [p_1, v_1, p_2, v_2, u]$$
where $p_1$, $v_1$ and $p_2$, $v_2$ are the positions and velocities for agents 1 and 2 respectively and $u$ is the evaluated motion primitive. For the input to the RNN (and the LSTM, which we compare against), we feed the input $\mathbf{o}$, which comprises a fixed-length sequence of the most recent observations up to time $t$. After each episode, all inputs are collated into a set of input features $\mathbf{X}$, and a set of labels $\mathbf{y}$ is populated with the collision label for the episode. These are collated into an experience pool of examples, from which samples are randomly drawn to train the collision prediction network.
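A fixed-length network input can be assembled from the running list of observations as in the sketch below. It assumes that sequences shorter than the required length are zero padded, as described for the early time steps of each episode in Section 4.2; padding at the front of the sequence is an assumption, since the paper does not say which end is padded. The function name is hypothetical.

```python
import numpy as np

def build_sequence(observations, t, seq_len):
    """Return the seq_len most recent observations up to index t.

    Rows before the start of the episode are filled with zeros (front padding).
    """
    obs = np.asarray(observations, dtype=float)
    start = max(0, t - seq_len + 1)
    window = obs[start : t + 1]
    padded = np.zeros((seq_len, obs.shape[1]))
    padded[seq_len - len(window):] = window
    return padded
```

For instance, `build_sequence(obs, t=1, seq_len=4)` on a three-step episode returns two zero rows followed by the first two observations.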
The results reported here were obtained after $\epsilon$ reached its threshold value, at which point it was fixed at its final value and the test cases were run with no further training cycles. While $\epsilon$ is still decaying, the dynamic obstacle starts in the same location each episode, as does the agent. Once $\epsilon$ is fixed, we obtain results for 20 episodes using this initialization, after which the starting position of the dynamic obstacle is randomly generated within a fixed range, as illustrated in Figure 1. At this point we switch to a non-collaborative policy, for which the obstacle no longer tries to avoid the agent, and simply moves in a straight line towards its goal. This combination of random initialization and policy switching for the dynamic obstacle produces novel dynamic obstacle behaviour which is used for testing.
4.2 Network Parameters
Using previous work as a guide (lutjens2018safe), the LSTM Ensemble consists of 5 single-layer LSTM networks, each with 16 units with linear activations and a Rectified Linear Unit (ReLU) activation on the output. This is trained using MSE loss and Adam optimization with an initial learning rate of 0.001. For inference (as in (lutjens2018safe)), we execute 20 forward passes per network with different dropout masks, and compute the sample mean and variance from the resulting distribution of 100 predictions. We use an equivalent architecture for the PBP-RNN - a single-layer network consisting of 16 units with linear activations. For both networks we set the number of unrolled time steps equal to the sequence length. For shorter sequences (for the first 7 time steps in each episode) we zero pad the input up to the full sequence length.
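The way the ensemble's 100 predictions are reduced to a mean and a variance can be sketched without a deep learning framework. Below, each "network" is a hypothetical stand-in: a callable that applies a random dropout mask to its input and returns a scalar. Only the aggregation logic (every network, several stochastic passes, sample mean and variance) reflects the procedure described above; `make_toy_dropout_net` and `mde_predict` are illustrative names.

```python
import numpy as np

def make_toy_dropout_net(weight, drop_rate):
    """Hypothetical stand-in for a trained network with dropout at inference."""
    def predict(x, rng):
        mask = rng.random(x.shape) >= drop_rate  # keep each input with prob 1 - drop_rate
        return weight * float(np.sum(x * mask)) / (1.0 - drop_rate)
    return predict

def mde_predict(networks, x, n_passes=20, seed=0):
    """Collect n_passes stochastic predictions per network, then summarise."""
    rng = np.random.default_rng(seed)
    preds = np.array([net(x, rng) for net in networks for _ in range(n_passes)])
    return preds.mean(), preds.var(), preds.size
```

With 5 networks and 20 passes this yields 100 predictions per query, so the cost of each uncertainty estimate grows linearly with both the ensemble size and the number of passes.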
4.3 Training Procedure
For both networks evaluated here we used TensorFlow on a single Power 8+ compute node. During the training cycles, the dynamic obstacle follows a collaborative RVO policy. We first run 100 episodes to seed the experience pool, for which the agent selects random actions. Following this we draw 500 random samples from the pool to train the collision prediction network. We then follow a standard observe-act-train procedure, repeating training after every 10 episodes. To balance the training data, we draw half of the samples at random from examples labelled as collisions and half from examples labelled as non-collisions.
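Balanced sampling from the experience pool can be sketched as follows. The sketch assumes binary labels with 1 for a collision and 0 otherwise, and it samples with replacement whenever one class has fewer examples than requested; both choices are assumptions, as is the function name.

```python
import numpy as np

def sample_balanced(X, y, n_samples, seed=0):
    """Draw n_samples examples, half from each class (labels assumed 0/1)."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_pos = n_samples // 2
    n_neg = n_samples - n_pos
    idx = np.concatenate([
        # Fall back to sampling with replacement if a class is too small.
        rng.choice(pos, size=n_pos, replace=len(pos) < n_pos),
        rng.choice(neg, size=n_neg, replace=len(neg) < n_neg),
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Note that `rng.choice` raises an error if one class is empty, so a pool must contain at least one example of each label before this is called.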
For PBP, we run an initial phase of 5 epochs of training after the first 100 episodes, and 2 epochs of training after each subsequent set of 10 episodes. This relatively small number of training epochs is guided by empirical evidence and prior work demonstrating that PBP requires relatively few epochs for training (benatan2018practical; hernandez-lobato_probabilistic_2015). We found that the MDE performed better with comparatively more epochs of training, and so trained this for 100 epochs after the first 100 episodes, followed by 10 epochs of training after each subsequent set of 10 episodes.
For the MPC, we use the same parameter values as recommended in (lutjens2018safe) for both the MDE and PBP-RNN.
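Table 1 (excerpt): variance of the collision predictions.
| Novel + noise | 0.006 | 0.001 | 0.005 | 0.0003 |
| Novel + dropped obs. | 0.039 | 0.007 | 0.044 | 0.003 |
Table 2 (excerpt): log-likelihood of the collision predictions.
| Novel + noise | -2.479 | -8.6 | 1.440 | 2.5 |
| Novel + dropped obs. | -8.760 | -66.871 | 4.707 | 27.120 |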
5.1 Collision Detection Network Performance
Following the completion of all training cycles, we collect a set of in-distribution (training) and a set of out-of-distribution (novel) test data, each for 20 episodes - the entire process is repeated 10 times. For the out-of-distribution data, we randomly initialize the dynamic obstacle as described in Section 4.1. We use this data to evaluate the performance of the collision prediction network through obtaining the log-likelihood, false positive rate (FPR) (the number of times a collision is predicted, but not encountered) and false negative rate (FNR) (the number of times a collision occurred when one was not predicted). Additionally, for all non-collision episodes, we record the minimum distances between the agent and obstacle in order to build an impression of the caution exhibited by the agent.
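The false positive and false negative measures can be computed from per-episode outcomes as in the sketch below. It takes boolean collision predictions as input, so whatever threshold turns the network output into a prediction is left outside the function. It returns both raw counts, matching the description above, and rates under the conventional definitions (false positives over all non-collision cases, false negatives over all collision cases); the conventional normalisation is an assumption.

```python
import numpy as np

def collision_error_rates(predicted, occurred):
    """Count false positives and negatives and convert them to rates."""
    predicted = np.asarray(predicted, dtype=bool)
    occurred = np.asarray(occurred, dtype=bool)
    fp = int(np.sum(predicted & ~occurred))   # predicted, but no collision
    fn = int(np.sum(~predicted & occurred))   # collision, but not predicted
    n_neg = int(np.sum(~occurred))
    n_pos = int(np.sum(occurred))
    fpr = fp / n_neg if n_neg else 0.0
    fnr = fn / n_pos if n_pos else 0.0
    return {"fp": fp, "fn": fn, "fpr": fpr, "fnr": fnr}
```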
As demonstrated in Figure 2, the PBP-RNN achieves substantially better results on the training distribution, with 0 false-positives, false-negatives, and collisions. As would be expected, we see these figures degrade somewhat in the case of novel data for both the PBP-RNN and the MDE; however the PBP-RNN continues to demonstrate better overall performance, with a significantly lower FPR and fewer collisions.
The minimum distance plots in Figure 2 (far right top and bottom) demonstrate that the agent using the PBP-RNN leaves a greater margin when passing the obstacle than the agent using the MDE, indicating more cautious behaviour. This is supported by the variance results in Table 1, which show that the PBP-RNN’s variance values increase between the training and novel scenarios, whereas this is not the case for MDE. This indicates that the PBP-RNN has a comparatively higher quality model of uncertainty, as its variances increase with the degree of novelty in the observations - thus resulting in more cautious behaviour.
5.2 Collision Avoidance with Noise and Dropped Observations
Uncertainty-based methods have demonstrated advantageous performance in the presence of noise or dropped observations due to their ability to leverage model uncertainty when selecting actions (lutjens2018safe). Here, we investigate the performance of the MDE and PBP-RNN approaches with varying levels of additive noise by adding a matrix of noise terms $\mathbf{E}$, multiplied by an incrementally increasing noise coefficient $\eta$, to our matrix of observations $\mathbf{X}$:
$$\tilde{\mathbf{X}} = \mathbf{X} + \eta \mathbf{E}$$
where $\mathbf{E}$ is a matrix of values each generated from the same noise distribution. For each method, we run 10 episodes for each value of $\eta$ after the initial training phase, and repeat this process 10 times. We again use the non-collaborative policy for the dynamic obstacle. As Figure 3 demonstrates, the PBP-RNN exhibits better robustness to noise, with fewer collisions on average. This performance can again be explained by the change in the PBP-RNN’s variance output as the level of noise increases. This is demonstrated in the bottom plot of Figure 3, which shows steadily increasing variance for the PBP-RNN, whereas this is not the case for MDE - again indicating that the PBP-RNN has a better model of uncertainty.
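The noise injection described above amounts to one line of array arithmetic. The sketch below assumes standard normal noise purely for illustration, since the distribution used in the experiments is not recoverable from the text; the function name is hypothetical.

```python
import numpy as np

def add_observation_noise(X, coeff, rng):
    """Return X plus a noise matrix scaled by the noise coefficient.

    Standard normal noise is an assumption made for this sketch.
    """
    X = np.asarray(X, dtype=float)
    return X + coeff * rng.standard_normal(X.shape)
```

Sweeping `coeff` over an increasing grid while holding the trained network fixed reproduces the shape of the experiment: the inputs drift further from the training distribution, and a well-behaved uncertainty model should report larger variances as they do.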
In the final test case, we combine the non-collaborative policy with dropped observations - whereby, for an increasing number of dropped observations, between 1 and 8 observations in the sequence are randomly selected and set to zero, simulating the kind of behaviour that may occur in electronic sensor systems. The PBP-RNN again exhibits better performance when compared with the MDE, as shown in Figure 4, and the same underlying theme is evident: the variance values for the PBP-RNN increase with the number of dropped observations, whereas these remain very small (although they do increase marginally) for the MDE.
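Dropping observations can be simulated by zeroing randomly chosen time steps of an input sequence, as in this minimal sketch. It assumes the sequence is stored with one observation per row; the function name is hypothetical.

```python
import numpy as np

def drop_observations(sequence, n_drop, rng):
    """Zero out n_drop randomly selected time steps (rows) of a sequence."""
    seq = np.array(sequence, dtype=float)  # copy, so the input is left intact
    idx = rng.choice(seq.shape[0], size=n_drop, replace=False)
    seq[idx] = 0.0
    return seq
```

With a sequence of 8 observations, `n_drop` can range from 1 to 8, the upper end leaving the network with no information at all.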
Tables 1 and 2 further validate the model quality of the PBP-RNN with respect to the MDE. In Table 1, we see that the PBP-RNN’s mean uncertainty increases as the test scenarios deviate from the training distribution. While the MDE’s variance does eventually increase, it requires the combination of the non-collaborative policy and a significant number of dropped observations for this to occur. Crucially, Table 2 illustrates that the log-likelihood values for the PBP-RNN and MDE differ significantly, with the PBP-RNN obtaining high and consistent log-likelihood scores, while MDE achieves low and inconsistent log-likelihoods. These poor log-likelihoods are a product of MDE’s over-confident predictions due to its lack of a descriptive posterior - a known drawback of MC Dropout (lutjens2018safe; pearce2018bayesian), which is clearly documented here in the variance and log-likelihood values of our results.
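The link between over-confidence and poor log-likelihood is easy to see numerically. The sketch below assumes a Gaussian predictive distribution, which is a common choice but is not stated explicitly for these experiments; the function name and the example numbers are illustrative.

```python
import numpy as np

def gaussian_log_likelihood(y, mean, var):
    """Log density of y under a Gaussian prediction N(mean, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)
```

For a target of `y = 1.0` and a prediction of `mean = 0.5`, a model reporting a tiny variance of `1e-4` is penalised far more heavily than one reporting `0.25`, even though both make the same error. A model that under-reports its variance therefore obtains low and erratic log-likelihoods on novel data.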
5.3 Computational Considerations
We analyzed compute time for inference with both the PBP-RNN and the MC Dropout Ensemble, recording the mean inference time for each; for MDE, the forward passes were run serially. While the inference time for MDE can be greatly reduced by running forward passes in parallel (lutjens2018safe), it will still require significantly more compute resource when compared with PBP, due to the requirement of multiple networks and multiple forward passes. Furthermore, in the case of MDE, the compute time required will increase with the quality requirements of the model uncertainty estimates.
Additionally, the PBP-RNN required far fewer epochs of training to achieve better performance than the MDE, with the PBP-RNN executing a total of 27 epochs of training, while the MDE executed 210 epochs of training.
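These totals are consistent with the training schedule in Section 4.3 if the initial training phase is followed by 11 further training rounds. The round count is inferred from the reported totals rather than stated in the text, and the helper below is only a worked check of that arithmetic.

```python
def total_epochs(initial_epochs, epochs_per_round, n_rounds):
    """Total training epochs: one initial phase plus n_rounds later rounds."""
    return initial_epochs + epochs_per_round * n_rounds

# Inferred: 11 training rounds after the initial phase (an assumption).
N_ROUNDS = 11
pbp_total = total_epochs(5, 2, N_ROUNDS)      # PBP-RNN schedule
mde_total = total_epochs(100, 10, N_ROUNDS)   # MDE schedule
```

Under this reading the PBP-RNN uses roughly an eighth of the training epochs required by the MDE.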
The PBP-RNN is also advantageous in terms of memory footprint, as it only requires twice the memory of its non-probabilistic alternative, yet provides fully Bayesian uncertainty estimates. In contrast, the quality of the MDE’s uncertainty estimates improves with the number of networks in the ensemble - requiring larger ensembles and a larger memory footprint to produce uncertainty estimates of reasonable quality (lakshminarayanan2017simple).
5.4 Variance Weights for Long Term Memory
This work demonstrates that a PBP-RNN is capable of achieving superior performance when compared with an LSTM-based approach. This is particularly interesting when considering that traditional RNNs are typically only performant on very short sequences (graves_supervised_2012), and LSTMs are the state-of-the-art for tasks such as the collision avoidance task used here (lutjens2018safe). We hypothesize that the performance obtained here is possible due to the variance weights which, through learning variances associated with different features at different time steps, may function similarly to the gates used in an LSTM; however, more rigorous investigation is required to confirm whether this is the case.
While there has been previous work on Bayesian RNNs (fortunato2017bayesian), our work is the first to demonstrate that PBP can be effectively applied to an RNN architecture to produce a recurrent network with fully-Bayesian model uncertainty estimates. The resulting network is less demanding on both computation and memory resources than popular dropout and ensemble-based methods, such as the recently proposed MC Dropout Ensemble (lutjens2018safe). Crucially, our approach produces much higher quality uncertainty estimates, which are necessary for improving RL agent performance in novel scenarios, as demonstrated by the model’s competitive performance in dynamic obstacle avoidance tasks. In the case of Safe RL, this directly translates to safer behaviour on the part of the RL agent - a crucial step towards feasibly incorporating more complex machine learning algorithms in safety-critical tasks.
Thanks to Clyde Fare, Peter Fenner and Mohab Elkaref for useful discussion. This work was supported by the Science and Technology Facilities Council Hartree Centre’s Innovation Return on Research program, funded by the U.K. Department for Business, Energy, and Industrial Strategy.