Fully Bayesian Recurrent Neural Networks for Safe Reinforcement Learning

  • 2019-11-08 15:08:51
  • Matt Benatan, Edward O. Pyzer-Knapp


Reinforcement Learning (RL) has demonstrated state-of-the-art results in a number of autonomous system applications, however many of the underlying algorithms rely on black-box predictions. This results in poor explainability of the behaviour of these systems, raising concerns as to their use in safety-critical applications. Recent work has demonstrated that uncertainty-aware models exhibit more cautious behaviours through the incorporation of model uncertainty estimates. In this work, we build on Probabilistic Backpropagation to introduce a fully Bayesian Recurrent Neural Network architecture. We apply this within a Safe RL scenario, and demonstrate that the proposed method significantly outperforms a popular approach for obtaining model uncertainties in collision avoidance tasks. Furthermore, we demonstrate that the proposed approach requires less training and is far more efficient than the current leading method, both in terms of compute resource and memory footprint.



Matt Benatan
IBM Research UK
Sci-Tech Daresbury
Warrington, UK.

Edward O. Pyzer-Knapp
IBM Research UK
Sci-Tech Daresbury
Warrington, UK.



33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Reinforcement Learning (RL) has achieved state-of-the-art results in a variety of applications, from Atari games (mnih2013playing) to autonomous vehicles (zhang2016learning; shalev2016safe). The models underlying these systems are often black-box in nature, and do not provide estimates of model uncertainty. As a consequence, they make over-confident predictions on out-of-distribution data and perform poorly in novel scenarios (amodei2016concrete; lakshminarayanan2017simple; hendrycks2016baseline). Recent work (kahn2017uncertainty; lutjens2018safe) demonstrates that more cautious behaviours can be attained by incorporating model uncertainty estimates into the decision-making processes of RL agents. In their work, Lütjens et al. demonstrate that a Long Short-Term Memory (LSTM) ensemble using MC Dropout (gal_dropout_2015) is able to approximate model uncertainties. While this demonstrates a clear advantage over a standard LSTM with no uncertainty estimates, the accuracy of the resulting uncertainty estimates is tied to the size of the ensemble and the number of dropout forward passes executed. Thus, obtaining high-quality model uncertainty estimates can quickly become demanding in both compute and memory.

An alternative to the approximate Bayesian inference provided by MC Dropout and ensemble methods is Probabilistic Backpropagation (PBP) (hernandez-lobato_probabilistic_2015) - a fully Bayesian method for training Bayesian Neural Networks (BNNs), which thus produces fully-Bayesian model uncertainty estimates. Unlike ensemble-based methods (lakshminarayanan2017simple), PBP provides a probabilistic neural network architecture at only twice the memory footprint of its non-probabilistic equivalent. As PBP is fully Bayesian, it only requires training a single network, and inference only requires one forward pass - making it far less computationally intensive than ensemble (lakshminarayanan2017simple), MC Dropout (gal_dropout_2015), or combined (lutjens2018safe) methods.

In this paper we leverage PBP to produce a probabilistic variant of a standard Recurrent Neural Network (RNN). We apply this to a collision avoidance task, and demonstrate that the PBP-RNN exhibits competitive performance when compared with the MC Dropout Ensemble (MDE) described in (lutjens2018safe), and achieves higher-quality model uncertainty estimates.

2 Related Work

2.1 Safe Reinforcement Learning

Safe RL involves learning policies which maximize performance criteria, e.g. reward, while accounting for safety constraints (garcia2015comprehensive; berkenkamp2017safe), and is a field of study that is becoming increasingly important as more and more automated systems are being designed to operate in safety-critical situations (zhang2016learning; chen2019model; shalev2016safe). A key issue with many existing approaches is their lack of uncertainty quantification - agents’ inability to quantify what they ‘don’t know’. By incorporating uncertainty, RL agents can make better decisions by opting for actions for which they are more confident of a positive outcome. Incorporating this principle in safety-constrained tasks results in agents choosing safer actions (kahn2017uncertainty; lutjens2018safe), and is crucial for making RL feasible for safety-critical scenarios.

Existing work on Safe RL has typically aimed to discover uncertainty in the environment or model (garcia2015comprehensive) - with the former commonly being the goal in the case of risk-sensitive RL (RSRL) (mihatsch2002risk; shen2014risk). In this work we focus on model uncertainty, with the aim of quantifying the model’s confidence on out-of-distribution data and using this uncertainty quantification to produce safer decisions.

2.2 Probabilistic Neural Networks

There are a variety of approaches for modeling distributions and producing accurate uncertainty estimates (williams2006gaussian; ghahramani2015probabilistic), however many of these methods do not scale well to large datasets. Over recent years neural network-based methods have demonstrated state-of-the-art performance on a wide range of tasks (cho2014learning; he2016deep; ledig2017photo), partly due to their ability to leverage large amounts of data. This has helped to drive interest in the development of a variety of probabilistic neural network approaches, which are capable of modeling uncertainty while also scaling to large datasets (gal_dropout_2015; hernandez-lobato_probabilistic_2015; snoek_scalable_2015).

In probabilistic neural networks, network weights are modeled as distributions, rather than as point estimates, allowing the network to encode model uncertainty. If we consider $\mathbf{y}$ to be an $N$-dimensional vector of targets $y_n$, and $\mathbf{X}$ to be an $N \times D$ matrix of features $\mathbf{x}_n$, then we can define the likelihood for $\mathbf{y}$ given the network weights $W$, $\mathbf{X}$, and noise precision $\gamma$ as:

$$p(\mathbf{y} \mid W, \mathbf{X}, \gamma) = \prod_{n=1}^{N} \mathcal{N}\left(y_n \mid f(\mathbf{x}_n; W), \gamma^{-1}\right) \tag{1}$$

We specify a Gaussian prior for each weight in our network, giving us:

$$p(W \mid \lambda_p) = \prod_{l=1}^{L} \prod_{i=1}^{V_l} \prod_{j=1}^{V_{l-1}+1} \mathcal{N}\left(w_{ij,l} \mid 0, \lambda_p^{-1}\right) \tag{2}$$

where $w_{ij,l}$ denotes a weight in weight matrix $\mathbf{W}_l$ at layer $l$, $V_l$ is the number of neurons in layer $l$, and $\lambda_p$ is a precision parameter. For additional details pertaining to $\lambda_p$ and its hyper-prior, please see (hernandez-lobato_probabilistic_2015). Using the above definition, we can obtain the posterior distribution for parameters $W$, $\gamma$ and $\lambda_p$ by applying Bayes’ rule:

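$$p(W, \gamma, \lambda_p \mid \mathcal{D}) = \frac{p(\mathbf{y} \mid W, \mathbf{X}, \gamma)\, p(W \mid \lambda_p)\, p(\lambda_p)\, p(\gamma)}{p(\mathbf{y} \mid \mathbf{X})} \tag{3}$$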

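where $\mathcal{D} = (\mathbf{X}, \mathbf{y})$. Predictions for some output $y$ can then be obtained through:

$$p(y \mid \mathbf{x}, \mathcal{D}) = \int p(y \mid \mathbf{x}, W, \gamma)\, p(W, \gamma, \lambda_p \mid \mathcal{D})\, d\gamma\, d\lambda_p\, dW \tag{4}$$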

As this integral is computationally intractable, several approximations have been proposed (hernandez-lobato_probabilistic_2015; gal_dropout_2015; ghosh_assumed_2016). In this work, we use PBP to train a fully Bayesian neural network, as opposed to the approximation provided by MC Dropout or ensemble-based methods.

2.3 Probabilistic Backpropagation

The PBP network can be viewed as a modification of a standard Multilayer Perceptron (MLP) wherein each weight $w_{ij,l} \in \mathbf{W}_l$ is defined by a one-dimensional Gaussian and correspondingly is represented by two weights - a mean $m_{ij,l}$ and a variance $v_{ij,l}$.

Similarly to standard backpropagation, PBP training consists of two phases. The first phase comprises forward propagation of the input features through the network to obtain the marginal log-likelihood, $\log Z$ (instead of loss, which is used in typical backpropagation). The gradients of $\log Z$ with respect to the mean and variance weights are then backpropagated using reverse-mode differentiation, and the resulting derivatives are used to update the mean and variance weights of the network.

The update rule for PBP obtains the parameters of the new Gaussian beliefs $q^{new}(w) = \mathcal{N}(w \mid m^{new}, v^{new})$ that minimize the Kullback-Leibler (KL) divergence between our beliefs $s$ and $q^{new}$ as a function of $m$, $v$ and the gradient of $\log Z$:

$$m^{new} = m + v \frac{\partial \log Z}{\partial m} \tag{5}$$
$$v^{new} = v - v^2 \left[ \left( \frac{\partial \log Z}{\partial m} \right)^2 - 2 \frac{\partial \log Z}{\partial v} \right] \tag{6}$$

These rules ensure moment matching between $q^{new}$ and $s$, guaranteeing that the distributions have the same mean and variance. These are the update equations used in the PBP-RNN described in this paper. For further details on PBP, please see (hernandez-lobato_probabilistic_2015).
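As an illustration, the update in Equations 5 and 6 can be sketched for a single Gaussian weight belief. The sketch assumes a toy setting that is not part of this work: the observation is the weight itself plus Gaussian noise of known variance, so that $\log Z = \log \mathcal{N}(y \mid m, v + \sigma^2_{noise})$ and its gradients are available in closed form. The function names are hypothetical.

```python
def pbp_update(m, v, dlogz_dm, dlogz_dv):
    """Apply the PBP moment-matching update (Eqs. 5 and 6) to one belief."""
    m_new = m + v * dlogz_dm
    v_new = v - v ** 2 * (dlogz_dm ** 2 - 2.0 * dlogz_dv)
    return m_new, v_new


def gaussian_logz_grads(y, m, v, noise_var):
    """Gradients of log N(y | m, v + noise_var) w.r.t. m and v (toy assumption)."""
    s = v + noise_var
    dlogz_dm = (y - m) / s
    dlogz_dv = -0.5 / s + 0.5 * (y - m) ** 2 / s ** 2
    return dlogz_dm, dlogz_dv


# Prior belief N(0, 1), one observation y = 2 with noise variance 0.5.
m, v, y, noise_var = 0.0, 1.0, 2.0, 0.5
m_new, v_new = pbp_update(m, v, *gaussian_logz_grads(y, m, v, noise_var))
```

In this conjugate toy case the update reproduces the exact Bayesian posterior, with mean $m + \frac{v}{v + \sigma^2_{noise}}(y - m)$ and variance $\frac{v\,\sigma^2_{noise}}{v + \sigma^2_{noise}}$, which is a convenient sanity check for an implementation.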

3 Probabilistic Backpropagation for Recurrent Neural Networks

Recurrent Neural Networks (RNNs), and specifically LSTMs, are the current state-of-the-art for dynamic obstacle avoidance tasks (alahi2016social; lutjens2018safe; vemula2018social). This is due to their ability to encode contextual dependencies, allowing them to accurately model the temporal patterns of dynamic obstacles. As such, we have chosen to use an RNN for our collision prediction network. In order to provide fully-Bayesian model uncertainty estimates, we adapt a standard RNN architecture to use Probabilistic Backpropagation (PBP).

A standard RNN consists of two weight matrices per layer, the weight matrix $\mathbf{W}_x$, which is equivalent to the weight matrices used in a standard MLP, and the weight matrix $\mathbf{W}_h$, which holds the transition weights. For a given input $\mathbf{x}_t$ at a time step $t$, the output $\mathbf{y}_t$ and hidden state $\mathbf{h}_t$ at step $t$ are computed as:

$$\mathbf{h}_t = \mathbf{y}_t = f(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t) \tag{7}$$

where $f()$ is some activation function and $\mathbf{h}_{t-1}$ is the previous hidden state.
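The recurrence in Equation 7 can be sketched in a few lines of NumPy. The layer sizes, the random weights, and the choice of `tanh` for $f$ below are illustrative assumptions only.

```python
import numpy as np


def rnn_step(W_h, W_x, h_prev, x_t, f=np.tanh):
    """One step of Eq. 7: h_t = y_t = f(W_h h_{t-1} + W_x x_t)."""
    return f(W_h @ h_prev + W_x @ x_t)


rng = np.random.default_rng(0)
n_hidden, n_in, T = 16, 4, 8          # illustrative sizes, not from the paper
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_x = rng.normal(scale=0.1, size=(n_hidden, n_in))

h = np.zeros(n_hidden)                # initial hidden state
for x_t in rng.normal(size=(T, n_in)):
    h = rnn_step(W_h, W_x, h, x_t)    # the output at each step is the hidden state
```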

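For the PBP-RNN, we apply the PBP treatment of a standard MLP to an RNN: each of the weights $w_h \in \mathbf{W}_h$ and $w_x \in \mathbf{W}_x$ is represented by mean and variance weights, resulting in four weight matrices in place of the two standard RNN matrices: $\mathbf{W}_m$, $\mathbf{W}_v$, $\mathbf{W}_{hm}$ and $\mathbf{W}_{hv}$. This produces two outputs from the network output layer $L$ at each time step $t$ - a mean $m_{L,t}$ and a variance $v_{L,t}$:

$$m_{L,t} = \mathbf{W}_{hm} \mathbf{h}_{m,t-1} + \mathbf{W}_m \mathbf{x}_t \tag{8}$$
$$v_{L,t} = \mathbf{W}_{hv} \mathbf{h}_{v,t-1} + \mathbf{W}_v \mathbf{v}_t \tag{9}$$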

Updates to the network at step t are computed using Truncated Backpropagation Through Time (TBTT) (boden2002guide) in combination with the standard PBP update procedure, whereby the network is ‘unrolled’ by some T time steps, and the weights are updated in reverse order - from time step T back to t=1. First, the marginal log likelihoods for each time step are computed:

$$\log Z_t = -0.5 \left[ \log v_t + \frac{(y_t - m_{L,t})^2}{v_t} \right] \tag{10}$$

These are then used to update the weights in each of the four weight matrices per layer, as per the typical PBP update rule:

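$$m_t^{new} = m_t + v_t \frac{\partial \log Z_t}{\partial m_t} \frac{\partial m_t}{\partial \mathbf{W}_m} \tag{11}$$
$$m_{h,t}^{new} = m_{h,t} + v_{h,t} \frac{\partial \log Z_t}{\partial m_{h,t}} \frac{\partial m_{h,t}}{\partial \mathbf{W}_{hm}} \tag{12}$$
$$v_t^{new} = v_t - v_t^2 \left[ \left( \frac{\partial \log Z_t}{\partial m_t} \right)^2 - 2 \frac{\partial \log Z_t}{\partial v_t} \right] \frac{\partial v_t}{\partial \mathbf{W}_v} \tag{13}$$
$$v_{h,t}^{new} = v_{h,t} - v_{h,t}^2 \left[ \left( \frac{\partial \log Z_t}{\partial m_{h,t}} \right)^2 - 2 \frac{\partial \log Z_t}{\partial v_{h,t}} \right] \frac{\partial v_{h,t}}{\partial \mathbf{W}_{hv}} \tag{14}$$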

For incorporating prior factors into $q$, we follow the same procedure for updating the $\alpha$ and $\beta$ parameters described in (hernandez-lobato_probabilistic_2015), but substitute the likelihood $Z$ for the mean of $Z$ over all time steps:

$$Z = \frac{\sum_{t=1}^{T} Z_t}{T} \tag{15}$$

The same procedure is followed for parameters $Z_1$ and $Z_2$ used in the $\alpha$ and $\beta$ updates for standard PBP, as described in (hernandez-lobato_probabilistic_2015). As with standard BPTT, this process is repeated for each time step $t \in T$ for each mini-batch within each epoch.
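The per-step log-likelihoods and their average over time steps (Equations 10 and 15) can be sketched as follows. This is a minimal illustration that assumes each $Z_t$ is a Gaussian density evaluated at the target, including the $2\pi$ normalising constant that Equation 10 leaves out; the predicted means and variances are made-up values.

```python
import numpy as np


def log_z_steps(y, m, v):
    """Per-time-step Gaussian log-likelihood of targets y under N(m, v)."""
    y, m, v = (np.asarray(a, dtype=float) for a in (y, m, v))
    return -0.5 * (np.log(2.0 * np.pi * v) + (y - m) ** 2 / v)


def mean_z(log_z):
    """Mean of Z_t over time steps (Eq. 15), computed stably from log Z_t."""
    log_z = np.asarray(log_z, dtype=float)
    shift = log_z.max()                      # avoid underflow in exp
    return float(np.exp(shift) * np.mean(np.exp(log_z - shift)))


# Made-up targets and predicted means/variances for T = 4 unrolled steps.
y = np.array([0.0, 0.0, 1.0, 1.0])
m = np.array([0.1, 0.2, 0.7, 0.9])
v = np.array([0.05, 0.05, 0.10, 0.02])

log_z = log_z_steps(y, m, v)
z_bar = mean_z(log_z)
```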

4 Methods

Many tasks used for evaluating the performance of RL methods, such as arcade learning environments, would not obviously benefit from a safe RL approach. In order to transparently and reproducibly assess the effects of our approach to uncertainty quantification, we decided that the tasks used for this analysis should have the following qualities:

  • The task should reflect a real-world application of RL, rather than a trivial toy problem.

  • The task should be simple enough that the results can be clearly interpreted.

  • The task should facilitate a comprehensive combination of test cases for evaluating the impact of using uncertainty quantification, including exposing the agent to novel behaviours, noise, and missing data.

We felt that the task of collision avoidance, as in (lutjens2018safe), fulfilled these criteria, and was a sensible choice to comprehensively evaluate our methods. As such, in this work we use the PBP-RNN architecture described above for a collision prediction network applied to a collision avoidance task. This network is used within a Model Predictive Controller (MPC) which selects a motion primitive $u$ from a set of motion primitives $U$ as in (lutjens2018safe). Similarly to (lutjens2018safe), $U$ contains 11 discrete motion primitives spanning heading angles $\alpha_h \in [-\frac{\pi}{5}, \frac{\pi}{5}]$ of length $h = 0.05$, and the MPC chooses the lowest-cost motion primitive according to the following criterion:

$$u^{*}_{t:t+h} = \operatorname{argmin}\left( \lambda_v V^{i}_{coll} + \lambda_c P^{i}_{coll} + \lambda_d d_{goal} \right) \tag{16}$$

where $P^{i}_{coll}$ is the collision prediction for the motion primitive at $i$, $V^{i}_{coll}$ denotes the variance associated with the collision prediction, $d_{goal}$ is the distance to the goal, and $\lambda_v$, $\lambda_c$, $\lambda_d$ are their respective coefficients. In this work, we use an epsilon greedy policy (mnih2015human), decreasing $\epsilon$ by $\frac{\epsilon}{50}$ after each episode, to continue adding new experience while monotonically decreasing the probability of random actions. Previous work demonstrates that starting with low $\lambda_v$ and increasing $\lambda_v$ over time is helpful for escaping local minima (lutjens2018safe). As such, we multiply $\lambda_v$ by $1 - \epsilon$, so that $\lambda_v$ increases as $\epsilon$ decreases.
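The selection rule in Equation 16, together with the $(1 - \epsilon)$ scaling of $\lambda_v$, can be sketched as below. The function name and the toy predictions are hypothetical; the default coefficients are the values reported in Section 4.3.

```python
import numpy as np


def select_primitive(p_coll, v_coll, d_goal, eps,
                     lam_v=200.0, lam_c=25.0, lam_d=3.0):
    """Return the index of the lowest-cost primitive and all costs (Eq. 16)."""
    p_coll, v_coll, d_goal = (np.asarray(a, dtype=float)
                              for a in (p_coll, v_coll, d_goal))
    # The variance weight is scaled by (1 - eps), so it grows as eps decays.
    cost = lam_v * (1.0 - eps) * v_coll + lam_c * p_coll + lam_d * d_goal
    return int(np.argmin(cost)), cost


# 11 heading angles spanning [-pi/5, pi/5].
headings = np.linspace(-np.pi / 5, np.pi / 5, 11)

# Made-up predictions: the straight-ahead primitive (index 5) is closest to
# the goal but is the only one with a non-zero predicted variance.
p_coll = np.full(11, 0.1)
v_coll = np.zeros(11)
v_coll[5] = 0.05
d_goal = 1.0 + np.abs(headings)

idx_explore, _ = select_primitive(p_coll, v_coll, d_goal, eps=1.0)
idx_cautious, _ = select_primitive(p_coll, v_coll, d_goal, eps=0.0)
```

With $\epsilon = 1$ the variance term vanishes and the controller heads straight for the goal; with $\epsilon = 0$ the uncertain straight-ahead primitive is penalised and a neighbouring heading is chosen instead.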

Figure 1: Diagram of initial environment settings for training distribution (left) and test distribution (right). Top (orange) circle: dynamic obstacle. Bottom (blue) circle: agent. Bottom star: goal for dynamic obstacle. Top star: goal for agent.

4.1 Environment

For the collision avoidance task, we use a multi-agent OpenAI Gym (brockman2016openai) environment based on (lowe2017multi) with two agents. The agents start each episode facing each other, and each agent is assigned a goal, as illustrated in Figure 1. One agent acts as a dynamic obstacle, following a collaborative Reciprocal Velocity Obstacle (RVO) policy (van2011reciprocal), while the other uses the MPC and collision prediction network. The observation $\mathbf{o}_t$ at time $t$ is defined as:

$$\mathbf{o}_t = [a_{1p,t}, a_{1v,t}; a_{2p,t}, a_{2v,t}, u_t] \tag{17}$$

where $[a_{1p}, a_{1v}]$ and $[a_{2p}, a_{2v}]$ are the positions and velocities for agents 1 and 2 respectively and $u_t$ is the evaluated motion primitive. For the input to the RNN (and the LSTM, which we compare against), we feed the input $\mathbf{o}$, which comprises $l = 8$ observations, from $t - l$ to $t$. After each episode, all inputs are collated into a set of input features $\mathbf{X}$, and a set of labels $\mathbf{y}$ is populated with the collision label, $y \in \{0, 1\}$, for the episode. These are collated into an experience pool of examples, from which samples are randomly drawn to train the collision prediction network.
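Building the fixed-length input sequence can be sketched as follows. The sketch assumes that the window holds the most recent 8 observations up to and including $t$, and that histories shorter than the window are zero-padded at the front; the position of the padding and the helper name are assumptions made for illustration.

```python
import numpy as np


def make_window(observations, t, length=8):
    """Return the last `length` observations up to step t, zero-padded in front."""
    obs = np.asarray(observations, dtype=float)
    start = max(0, t - length + 1)
    window = obs[start:t + 1]
    # Pad with rows of zeros when fewer than `length` observations exist.
    pad = np.zeros((length - len(window), obs.shape[1]))
    return np.vstack([pad, window])


# Toy episode of 12 two-dimensional observations (values are arbitrary).
episode = np.arange(24, dtype=float).reshape(12, 2) + 1.0
early = make_window(episode, t=2)    # only 3 observations available so far
late = make_window(episode, t=10)    # full window of 8 observations
```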

The results reported here were obtained after $\epsilon$ reached 0.1, at which point it was set to 0, and the test cases were run with no further training cycles. While $\epsilon > 0$, the dynamic obstacle starts in the same location each episode, $pos_{xy,o} = [0, 0.25]$, and the agent starts at $pos_{xy,a} = [0, -0.25]$. Once $\epsilon$ is set to 0, we obtain results for 20 episodes using this initialization, after which the starting position of the dynamic obstacle is randomly generated for $pos_{y,o}$ from the range $-0.25$ to $0.25$, as illustrated in Figure 1. At this point we switch to a non-collaborative policy, for which the obstacle no longer tries to avoid the agent, and simply moves in a straight line towards its goal. This combination of random initialization and policy switching for the dynamic obstacle produces novel dynamic object behaviour which is used for testing.

4.2 Network Parameters

Using previous work as a guide (lutjens2018safe), the LSTM Ensemble consists of 5 single-layer LSTM networks, each with 16 units with linear activations and a Rectified Linear Unit (ReLU) activation on the output. This is trained using MSE loss and Adam optimization with an initial learning rate of 0.001. For inference (as in (lutjens2018safe)), we execute 20 forward passes per network with different dropout masks (setting dropout p=0.7), and compute the sample mean and variance from the resulting distribution of 100 predictions. We use an equivalent architecture for the PBP-RNN - a single-layer network consisting of 16 units with linear activations. For both networks we set the number of time steps $T$ equal to the sequence length $l_s$. For shorter sequences (the first 7 time steps in each episode) we zero-pad the input up to $T$.
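The way the MDE turns 5 networks and 20 dropout passes into a mean and a variance can be sketched as below. The `toy_forward` function is a hypothetical stand-in for a trained LSTM with dropout, and its `keep_prob` is an arbitrary illustrative value; only the aggregation over the 100 predictions reflects the procedure described above.

```python
import numpy as np


def mde_predict(stochastic_forward, x, n_networks=5, n_passes=20, rng=None):
    """Sample mean and variance over n_networks * n_passes stochastic passes."""
    rng = np.random.default_rng() if rng is None else rng
    preds = np.array([stochastic_forward(k, x, rng)
                      for k in range(n_networks)
                      for _ in range(n_passes)])
    return preds.mean(), preds.var(ddof=1), preds


def toy_forward(k, x, rng, keep_prob=0.5):
    """Hypothetical ensemble member k: a linear model with input dropout."""
    w = np.full(x.shape, 0.1 * (k + 1))
    mask = rng.random(x.shape) < keep_prob
    return float(np.sum(w * x * mask) / keep_prob)


mean, var, preds = mde_predict(toy_forward, np.ones(4),
                               rng=np.random.default_rng(0))
```

Each additional network or forward pass adds another full evaluation, which is why the cost of this estimate grows with the quality required of it.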

4.3 Training Procedure

For both networks evaluated here we used TensorFlow on a single Power 8+ compute node. During the training cycles, the dynamic obstacle follows a collaborative RVO policy. We first run 100 episodes to seed the experience pool, for which the agent selects random actions. Following this we draw 500 random samples from the pool to train the collision prediction network. We then follow a standard observe-act-train procedure, repeating training after every 10 episodes. To balance the training data, we draw half of the samples at random from examples where collision labels are y=0 and half from a pool where collision labels are y=1.
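The class-balanced draw from the experience pool can be sketched as follows. The helper name is hypothetical, and sampling with replacement when one class has too few examples is an assumption, since the text does not say how that case is handled.

```python
import numpy as np


def sample_balanced(X, y, n_samples=500, rng=None):
    """Draw half of the samples from y == 0 and half from y == 1."""
    rng = np.random.default_rng() if rng is None else rng
    X, y = np.asarray(X), np.asarray(y)
    n0 = n_samples // 2
    n1 = n_samples - n0
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    # Assumption: fall back to sampling with replacement if a class is small.
    pick0 = rng.choice(idx0, size=n0, replace=len(idx0) < n0)
    pick1 = rng.choice(idx1, size=n1, replace=len(idx1) < n1)
    idx = rng.permutation(np.concatenate([pick0, pick1]))
    return X[idx], y[idx]


# Toy pool: 900 non-collision examples and 100 collision examples.
pool_X = np.arange(1000).reshape(-1, 1)
pool_y = np.array([0] * 900 + [1] * 100)
batch_X, batch_y = sample_balanced(pool_X, pool_y, 500,
                                   rng=np.random.default_rng(0))
```

Note that this sketch raises an error if either class is empty, so a real pool would need at least one example of each label before training starts.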

For PBP, we run an initial phase of 5 epochs of training after the first 100 episodes, and 2 epochs of training after each subsequent set of 10 episodes. This relatively small number of training epochs is guided by empirical evidence and prior work demonstrating that PBP requires relatively few epochs for training (benatan2018practical; hernandez-lobato_probabilistic_2015). We found that the MDE performed better with comparatively more epochs of training, and so trained this for 100 epochs after the first 100 episodes, followed by 10 epochs of training after each subsequent set of 10 episodes.

For the MPC, we use the same values for the λ parameters as recommended in (lutjens2018safe) for both the MDE and PBP-RNN, setting λc=25, λv=200 and λd=3.

5 Results

Table 1: Means (μ) and variances (σ²) of recorded variances for PBP-RNN and MDE for the training distribution and three out-of-distribution test cases

| Test case | μ (PBP-RNN) | μ (MDE) | σ² (PBP-RNN) | σ² (MDE) |
|---|---|---|---|---|
| Train | 0.002 | 0.001 | 0.002 | 0.001 |
| Novel | 0.003 | 0.001 | 0.002 | 0.0002 |
| Novel + noise ($\lambda_\xi = 0.005$) | 0.006 | 0.001 | 0.005 | 0.0003 |
| Novel + dropped obs. ($n_{dropped} = 5$) | 0.039 | 0.007 | 0.044 | 0.003 |

Table 2: Means (μ) and variances (σ²) of recorded log-likelihoods for PBP-RNN and MDE for the training distribution and three out-of-distribution test cases

| Test case | μ (PBP-RNN) | μ (MDE) | σ² (PBP-RNN) | σ² (MDE) |
|---|---|---|---|---|
| Train | 0.685 | $-2.1 \times 10^{5}$ | 0.108 | $3.8 \times 10^{5}$ |
| Novel | -0.555 | $-6.2 \times 10^{4}$ | 1.048 | $1.7 \times 10^{5}$ |
| Novel + noise ($\lambda_\xi = 0.005$) | -2.479 | $-8.6 \times 10^{4}$ | 1.440 | $2.5 \times 10^{5}$ |
| Novel + dropped obs. ($n_{dropped} = 5$) | -8.760 | -66.871 | 4.707 | 27.120 |

Figure 2: Box-and-whisker plots of results for collision detection network performance. Top: results from training distribution. Bottom: results from test (novel) distribution.

5.1 Collision Detection Network Performance

Following the completion of all training cycles, we collect a set of in-distribution (training) and a set of out-of-distribution (novel) test data, each for 20 episodes - the entire process is repeated 10 times. For the out-of-distribution data, we randomly initialize the dynamic obstacle as described in Section 4.1. We use this data to evaluate the performance of the collision prediction network through obtaining the log-likelihood, false positive rate (FPR) (the number of times a collision is predicted, but not encountered) and false negative rate (FNR) (the number of times a collision occurred when one was not predicted). Additionally, for all non-collision episodes, we record the minimum distances between the agent and obstacle in order to build an impression of the caution exhibited by the agent.
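The false positive and false negative measures described above can be computed as in the following sketch. The 0.5 decision threshold and the function name are assumptions for illustration; the text does not state how a collision probability is converted into a predicted collision.

```python
import numpy as np


def collision_error_rates(p_coll, collided, threshold=0.5):
    """Return (false positives, false negatives, FPR, FNR) over episodes."""
    predicted = np.asarray(p_coll, dtype=float) >= threshold
    collided = np.asarray(collided, dtype=bool)
    fp = int(np.sum(predicted & ~collided))   # predicted, but no collision
    fn = int(np.sum(~predicted & collided))   # collision, but not predicted
    n_neg = int(np.sum(~collided))
    n_pos = int(np.sum(collided))
    fpr = fp / n_neg if n_neg else 0.0
    fnr = fn / n_pos if n_pos else 0.0
    return fp, fn, fpr, fnr


# Four made-up episodes: predicted collision probability and actual outcome.
fp, fn, fpr, fnr = collision_error_rates([0.9, 0.2, 0.7, 0.1], [1, 0, 0, 1])
```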

As demonstrated in Figure 2, the PBP-RNN achieves substantially better results on the training distribution, with 0 false-positives, false-negatives, and collisions. As would be expected, we see these figures degrade somewhat in the case of novel data for both the PBP-RNN and the MDE; however the PBP-RNN continues to demonstrate better overall performance, with a significantly lower FPR and fewer collisions.

The minimum distance plots in Figure 2 (far right top and bottom) demonstrate that the agent using the PBP-RNN leaves a greater margin when passing the obstacle than the agent using the MDE, indicating more cautious behaviour. This is supported by the variance results in Table 1, which show that the PBP-RNN’s variance values increase between the training and novel scenarios, whereas this isn’t the case for MDE. This indicates that the PBP-RNN has a comparatively higher quality model of uncertainty, as its variances increase with the degree of novelty in the observations - thus resulting in more cautious behaviour.

5.2 Collision Avoidance with Noise and Dropped Observations

Uncertainty-based methods have demonstrated advantageous performance in the presence of noise or dropped observations due to their ability to leverage model uncertainty when selecting actions (lutjens2018safe). Here, we investigate the performance of the MDE and PBP-RNN approaches with varying levels of additive noise by adding a matrix of noise terms $\boldsymbol{\xi}$ multiplied by an incrementally increasing noise coefficient $\lambda_\xi$ to our matrix of observations:

$$\mathbf{o}_t^{\xi} = \mathbf{o}_t + \boldsymbol{\xi} \lambda_\xi \tag{18}$$

where $\boldsymbol{\xi}$ is a matrix of values each generated from the distribution $\mathcal{N}(0, 1)$. For each method, we run 10 episodes for each value of $\lambda_\xi$ after the initial training phase, and repeat this process 10 times. We again use the non-collaborative policy for the dynamic obstacle. As Figure 3 demonstrates, the PBP-RNN exhibits better robustness to noise, with fewer collisions on average. This performance can again be explained by the change in the PBP-RNN’s variance output as the level of noise increases. This is demonstrated in the bottom plot of Figure 3, which shows $\sigma^2$ steadily increasing for the PBP-RNN, whereas this is not the case for MDE - again indicating that the PBP-RNN has a better model of uncertainty.

In the final test case, we combine the non-collaborative policy with dropped observations - whereby, for increasing values of $n_{dropped}$, between 1 and 8 observations in the sequence are randomly selected and set to zero, simulating the kind of behaviour that may occur in electronic sensor systems. The PBP-RNN again exhibits better performance when compared with the MDE, as shown in Figure 4, and the same underlying theme is evident: the $\sigma^2$ values for the PBP-RNN increase with the number of dropped observations, whereas these remain very small (although do increase marginally) for the MDE.
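Both perturbations can be sketched in NumPy as below. The function names and the shape of the toy observation sequence are illustrative assumptions; the noise follows Equation 18 and the dropping step zeroes whole observations chosen at random without repetition.

```python
import numpy as np


def add_noise(obs, lam_xi, rng):
    """Eq. 18: add standard-normal noise scaled by the coefficient lam_xi."""
    return obs + lam_xi * rng.standard_normal(obs.shape)


def drop_observations(obs, n_dropped, rng):
    """Set n_dropped randomly chosen observations (rows) to zero."""
    out = obs.copy()
    rows = rng.choice(obs.shape[0], size=n_dropped, replace=False)
    out[rows] = 0.0
    return out


rng = np.random.default_rng(0)
sequence = np.ones((8, 9))        # toy sequence: 8 observations, 9 features
noisy = add_noise(sequence, 0.005, rng)
dropped = drop_observations(sequence, 5, rng)
```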

Tables 1 and 2 further validate the model quality of the PBP-RNN with respect to the MDE. In Table 1, we see that the PBP-RNN’s mean uncertainty increases as the test scenarios deviate from the training distribution. While the MDE’s variance does eventually increase, it requires the combination of the non-collaborative policy and a significant number of dropped observations for this to occur. Crucially, Table 2 illustrates that the log-likelihood values for the PBP-RNN and MDE differ significantly, with the PBP-RNN obtaining high and consistent log-likelihood scores, while MDE achieves low and inconsistent log-likelihoods. These poor log-likelihoods are a product of MDE’s over-confident predictions due to its lack of a descriptive posterior - a known drawback of MC Dropout (lutjens2018safe; pearce2018bayesian), which is clearly documented here in the variance and log-likelihood values of our results.

Figure 3: Plots depicting results for collision avoidance in increasingly noisy conditions. Top: proportion of collisions as $\lambda_\xi$ is increased. Bottom: $\sigma^2$ (variance) values as $\lambda_\xi$ is increased. Shaded area denotes 95% confidence interval.

Figure 4: Plots depicting results for collision avoidance with increasing numbers of dropped observations. Top: proportion of collisions as the number of dropped observations is increased. Bottom: $\sigma^2$ (variance) values as the number of dropped observations is increased. Shaded area denotes 95% confidence interval.

5.3 Computational Considerations

We analyzed compute time for inference with both the PBP-RNN and the MC Dropout Ensemble. For MDE, we ran the forward passes serially and recorded a mean inference time of 119.8 ± 3.2 ms. In the case of the PBP-RNN, we recorded a mean inference time of 29.7 ± 0.7 ms. While the inference time for MDE can be greatly reduced by running forward passes in parallel (lutjens2018safe), it will still require significantly more compute resource when compared with PBP, due to the requirement of multiple networks and multiple forward passes. Furthermore, in the case of MDE, the compute time required will increase with the quality requirements of the model uncertainty estimates.

Additionally, the PBP-RNN required far fewer epochs of training to achieve better performance than the MDE, with the PBP-RNN executing a total of 27 epochs of training, while the MDE executed 210 epochs of training.
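These totals are consistent with the schedule in Section 4.3 if the initial training phase is followed by 11 further training cycles. The count of 11 is inferred here from the totals and is not stated explicitly in the text:

$$\text{PBP-RNN: } 5 + 11 \times 2 = 27 \text{ epochs}, \qquad \text{MDE: } 100 + 11 \times 10 = 210 \text{ epochs}$$

Under this reading, the MDE executes $210 / 27 \approx 7.8$ times as many training epochs as the PBP-RNN.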

The PBP-RNN is also advantageous in terms of memory footprint, as it only requires twice the memory of its non-probabilistic alternative, yet provides fully Bayesian uncertainty estimates. In contrast, the quality of the MDE’s uncertainty estimates depends on the number of networks in the ensemble - requiring larger ensembles and a larger memory footprint to produce uncertainty estimates of reasonable quality (lakshminarayanan2017simple).

5.4 Variance Weights for Long Term Memory

This work demonstrates that a PBP-RNN is capable of achieving superior performance when compared with an LSTM-based approach. This is particularly interesting when considering that traditional RNNs are typically only performant on very short sequences (graves_supervised_2012), and LSTMs are the state-of-the-art for tasks such as the collision avoidance task used here (lutjens2018safe). We hypothesize that the performance obtained here is possible due to the variance weights which, through learning variances associated with different features at different time steps, may function similarly to the gates used in an LSTM - however, more rigorous investigation is required to confirm whether this is the case.

6 Conclusion

While there has been previous work on Bayesian RNNs (fortunato2017bayesian), our work is the first to demonstrate that PBP can be effectively applied to an RNN architecture to produce a recurrent network with fully-Bayesian model uncertainty estimates. The resulting network is less demanding on both computation and memory resources than popular dropout and ensemble-based methods, such as the recently proposed MC Dropout Ensemble (lutjens2018safe). Crucially, our approach produces much higher quality uncertainty estimates, which are necessary for improving RL agent performance in novel scenarios, as demonstrated by the model’s competitive performance in dynamic obstacle avoidance tasks. In the case of Safe RL, this directly translates to safer behaviour on the part of the RL agent - a crucial step towards feasibly incorporating more complex machine learning algorithms in safety-critical tasks.


Thanks to Clyde Fare, Peter Fenner and Mohab Elkaref for useful discussion. This work was supported by the Science and Technology Facilities Council Hartree Centre’s Innovation Return on Research program, funded by the U.K. Department for Business, Energy, and Industrial Strategy.