Wasserstein Robust Reinforcement Learning

  • 2019-08-10 10:05:35
  • Mohammed Amin Abdullah, Hang Ren, Haitham Bou Ammar, Vladimir Milenkovic, Rui Luo, Mingtian Zhang, Jun Wang
  • 0


Reinforcement learning algorithms, though successful, tend to over-fit totraining environments hampering their application to the real-world. This paperproposes WR$^{2}$L; a robust reinforcement learning algorithm with significantrobust performance on low and high-dimensional control tasks. Our methodformalises robust reinforcement learning as a novel min-max game with aWasserstein constraint for a correct and convergent solver. Apart from theformulation, we also propose an efficient and scalable solver following a novelzero-order optimisation method that we believe can be useful to numericaloptimisation in general. We contribute both theoretically and empirically. Onthe theory side, we prove that WR$^{2}$L converges to a stationary point in thegeneral setting of continuous state and action spaces. Empirically, wedemonstrate significant gains compared to standard and robust state-of-the-artalgorithms on high-dimensional MuJuCo environments.


Quick Read (beta)

Wasserstein Robust Reinforcement Learning

Mohammed Amin Abdullah11footnotemark: 1
Huawei R&D UK
[email protected]
&Hang Ren 11footnotemark: 1 22footnotemark: 2
Huawei R&D UK
Imperial College London
[email protected]
&Haitham Bou-Ammar 11footnotemark: 1
Huawei R&D UK
University College London
[email protected]
\ANDVladimir Milenković22footnotemark: 2
Huawei R&D UK
University of Cambridge
&Rui Luo 22footnotemark: 2
Huawei R&D UK
University College London
&Mingtian Zhang22footnotemark: 2
Huawei R&D UK
University College London
&Jun Wang
Huawei R&D UK
University College London
The first three authors are to be considered as joint first co-authorsHonorary Lecturer Position at University College London.Work done while interning at the Reinforcement Learning Team in Huawei Technologies Research and Development in London.

Reinforcement learning algorithms, though successful, tend to over-fit to training environments hampering their application to the real-world. This paper proposes WR2L – a robust reinforcement learning algorithm with significant robust performance on low and high-dimensional control tasks. Our method formalises robust reinforcement learning as a novel min-max game with a Wasserstein constraint for a correct and convergent solver. Apart from the formulation, we also propose an efficient and scalable solver following a novel zero-order optimisation method that we believe can be useful to numerical optimisation in general. We contribute both theoretically and empirically. On the theory side, we prove that WR2L converges to a stationary point in the model parameters and that our update is a valid gradient step when it comes to policies. Empirically, we demonstrate significant gains compared to standard and robust state-of-the-art algorithms on high-dimensional MuJuCo environments.


Wasserstein Robust Reinforcement Learning

  Mohammed Amin Abdullah11footnotemark: 1 thanks: The first three authors are to be considered as joint first co-authors Huawei R&D UK [email protected] Hang Ren 11footnotemark: 1 22footnotemark: 2 Huawei R&D UK Imperial College London [email protected] Haitham Bou-Ammar 11footnotemark: 1 thanks: Honorary Lecturer Position at University College London. Huawei R&D UK University College London [email protected] Vladimir Milenković22footnotemark: 2 thanks: Work done while interning at the Reinforcement Learning Team in Huawei Technologies Research and Development in London. Huawei R&D UK University of Cambridge Rui Luo 22footnotemark: 2 Huawei R&D UK University College London Mingtian Zhang22footnotemark: 2 Huawei R&D UK University College London Jun Wang Huawei R&D UK University College London


noticebox[b]Based on NeurIPS 2019 Template.\[email protected]

1 Introduction

Reinforcement learning (RL) has become a standard tool for solving decision-making problems with minimal feedback. Applications with these characteristics are ubiquitous, including, but not limited to, computer games (Mnih et al., 2013), robotics (Kober and Peters, 2012; Deisenroth et al., 2013; Bou-Ammar et al., 2014), finance (Fischer, 2018), and personalised medicine (Emmert-Streib and Dehmer, 2018). Although significant progress has been made on developing algorithms for learning large-scale and high-dimensional reinforcement learning tasks, these algorithms often over-fit to training environments and fail to generalise across even slight variations of transition dynamics (Packer et al., 2018; Zhao et al., 2019).

Robustness to changes in transition dynamics, however, is a crucial component for adaptive and safe RL in real-world environments. To illustrate, consider a self-driving car scenario in which we attempt to design an agent capable of driving a vehicle smoothly, safely, and autonomously. A typical reinforcement learning work-flow to solving such a problem consists of building a simulator to emulate real-world scenarios, training in simulation, and then transferring resultant policies to physical systems for control. Unfortunately, such a strategy is easily prone to failure as designing accurate simulators that capture intricate complexities of large cities is extremely challenging. Rather than learning in simulation, another work-flow might consist of constructing a pipeline to directly learn on the hardware system itself. Apart from memory constraints, state-of-the-art reinforcement learning algorithms exhaust hundreds to millions of agent-environment interactions before acquiring successful behaviour. Of course, such high demands on sample complexities prohibit the direct application of learning algorithms on real-systems, leaving robustness to misspecified simulators a largely unresolved problem.

Motivated by real-world applications, recent literature in reinforcement learning has focused on the above problems, proposing a plethora of algorithms for robust decision-making (Morimoto and Doya, 2005; Pinto et al., 2017; Tessler et al., 2019). Most of these techniques borrow from game theory to analyse, typically in a discrete state and actions spaces, worst-case deviations of agents’ policies and/or environments, see Sargent and Hansen (2001); Nilim and El Ghaoui (2005); Iyengar (2005); Namkoong and Duchi (2016) and references therein. These methods have also been extended to linear function approximators Chow et al. (2015), and deep neural networks Peng et al. (2017) showing (modest) improvements in performance gain across a variety of disturbances, e.g., action uncertainties, or dynamical model variations.

Though successful in practice, current techniques to robust decision making remain task-specific, and there are two major lingering drawbacks. First, these algorithms fail to provide a general robustness framework due to their specialised heuristics. In particular, algorithms designed for discrete state-action spaces fail to generalise to continuous domains and vice versa. This is due to their exploiting mathematical properties valid only for discrete state and action spaces, or for convex-concave functions, e.g., minmaxf()=maxminf(). Clearly, such mathematical derivations are only loosely related to implementation, and further insights can be gathered with more rigorous attempts tackling the continuous problem itself. Apart from theoretical limitations, the second drawback of current approaches can be traced back to the process by which empirical validation is conducted. Rarely do the papers presenting these techniques compare gains against other robust algorithms in literature. In fact, focus is mainly diverted to outperforming state-of-the-art reinforcement learning – a criterion easy to satisfy as algorithms from RL were never trained for robustness. Interestingly, upon careful evaluation, we came to realise that robust adversarial reinforcement learning of Pinto et al. (2017), for example, is superior to some of the more recently published works attempting to solve the same problem, see Section 7.

In this paper, we contribute to the above endeavour to solve the robustness problem in RL by proposing a generic framework for robust reinforcement learning that can cope with both discrete and continuous state and actions spaces. The algorithm we introduce, which we call Wasserstein Robust Reinforcement Learning (WR2L), Algorithm 1, aims to find the best policy, where any given policy is judged by the worst-case dynamics amongst all candidate dynamics in a certain set. This set is essentially the average Wasserstein ball around a reference dynamics 𝒫0. The constraints makes the problem well-defined, as searching over arbitrary dynamics can only result in non-performing system. The measure of performance is the standard RL objective form, the expected return. Both the policy and the dynamics are parameterised; the policy parameters 𝜽k may be the weights of a deep neural network, and the dynamics parameters ϕj the parameters of a simulator or differential equation solver. The algorithm performs estimated descent steps in ϕ space and - after (almost) convergence - performs policy gradient steps in 𝜽 space. Since ϕj may be high-dimensional, we adapt a zero’th order sampling method based extending Salimans et al. (2017) to make estimations of gradients, and in order to define the constraint set which ϕj is bounded by, we generalise the technique to estimate Hessians (Proposition 6).

We emphasise that although access to a simulator with parameterisable dynamics are required, the actual reference dynamics 𝒫0 need not be known explicitly nor learnt by our algorithm. Put another way, we are in the “RL setting”, not the “MDP setting” where the transition probability matrix is known a priori. The difference is made obvious, for example, in the fact that we cannot perform dynamic programming, and the determination of a particular probability transition can only be estimated from sampling, not retrieved explicitly. Hence, our algorithm is not model-based in the traditional sense of learning a model to perform planning.

We believe our contribution is useful and novel for two main reasons. Firstly, our framing of the robust learning problem is in terms of dynamics uncertainty sets defined by Wasserstein distance. Whilst we are not the first to introduce the Wasserstein distance into the context of MDPs (see, e.g., Yang (2017) or Lecarpentier and Rachelson (2019)), we believe our formulation is amongst the first suitable for application to the demanding application-space we desire, that being, high-dimensional, continuous state and action spaces. Secondly, we believe our solution approach is both novel and effective (as evidenced by experiments below, see Section 7), and does not place a great demand on model or domain knowledge, merely access to a simulator or differentiable equation solver that allows for the parameterisation of dynamics. Furthermore, it is not computationally demanding, in particular, because it does not attempt to build a model of the dynamics, and operations involving matrices are efficiently executable using the Jacobian-vector product facility of automatic differentiation engines.

The rest of the paper is organised as follows. In the next section, we provide a background on notation and an overview of Wasserstein distance in its various forms. The problem formulation and main algorithm is then presented in Section 3. In Section 4, we demonstrate convergence to a stationary point of the robust objective. In Section 5, we describe the zero’th order method we use to estimate the Hessian matrix used by the algorithm, and the proof of its correctness is given. In Section 6 we survey related literature. Experiments and results are given in Section 7, and finally, conclusions and future work are discussed in Section 8.

2 Background

This section provides background material needed for the remainder of the paper. We first describe the reinforcement learning framework adopted in this paper, and then proceed to detail notions from Wasserstein distances needed to constrain our learning objective.

2.1 Reinforcement Learning

In reinforcement learning, an agent interacts with an unknown and stochastic environment with the goal of maximising a notion of return (Sutton and Barto, 1998; Peters and Schaal, 2008b; Busoniu et al., 2010). These problems are typically formalised as Markov decision processes (MDPs)11 1 Please note that we present reinforcement learning with continuous states and actions. This allows us to easily draw similarities to optimal control as detailed later. Extending these notions to discrete settings is relatively straight-forward. =𝒮,𝒜,𝒫,,γ, where 𝒮d denotes the state space, 𝒜n the action space, 𝒫:𝒮×𝒜×𝒮[0,1] is a state transition probability describing the system’s dynamics, :𝒮×𝒜 is the reward function measuring the agent’s performance, and γ[0,1) specifies the degree to which rewards are discounted over time.

At each time step t, the agent is in state 𝒔t𝒮 and must choose an action 𝒂t𝒜, transitioning it to a new state 𝒔t+1𝒫(𝒔t+1|𝒔t,𝒂t), and yielding a reward (𝒔t,𝒂t). A policy π:𝒮×A[0,1] is defined as a probability distribution over state-action pairs, where π(𝒂t|𝒔t) represents the density of selecting action 𝒂t in state 𝒔t. Upon consequent interactions with the environment, the agent collects a trajectory 𝝉 of state-action pairs. The goal is to determine an optimal policy π by solving:

π=argmaxπ𝔼𝝉pπ(𝝉)[Total(𝝉)], (1)

where pπ(𝝉) denotes the trajectory density function, and Total(𝝉) the return, that is, the total accumulated reward:

pπ(𝝉)=μ0(𝒔0)π(𝒂0|𝒔0)t=1T-1𝒫(𝒔t+1|𝒔t,𝒂t)π(𝒂t|𝒔t)and   Total(𝝉)=t=0T-1γt(𝒔t,𝒂t), (2)

with μ0() denoting the initial state distribution.

2.2 Wasserstein Distance

In our problem definition, we make use of a distance measure to bound allowed variations from a reference transition density 𝒫0(). In general, a number of common metrics for measuring closeness between two probability distributions exist. Examples of which are total variation distance and Kullback-Leibler divergence. In this paper, however, we measure distance between two different dynamics by the Wasserstein distance. This has a number of desirable properties; firstly, it is a genuine distance, exhibiting symmetry, which is a property that K-L divergence lacks. Secondly, it is very flexible in the forms of the distributions that can be compared; it can measure the distance between two discrete distributions, two continuous distributions, and a discrete and continuous distribution (this latter case implying another valuable advantage - that the supports of the distributions can be different). In all cases, the Wasserstein distance is well-defined. Finally, and perhaps most importantly, the Wasserstein distance takes into account the underlying geometry of the space the distributions are defined on, which could be information that is fruitful to exploit in learning optimal control. Indeed this last point is the core motivator for our choosing Wasserstein distance for our algorithm, as shall be explained later.


Given a measurable space (𝒳,) with 𝒳 being a metric space, a pair of discrete measures μ,ν defined on this measurable space can be written as μ=i=1nμiδxi and ν=j=1mνjδyj where all xi,yj𝒳. A coupling κ(,) of μ and ν is a measure over {x1,,xn}×{y1,ym} that preserves marginals, i.e, μi=jκ(μi,νj) i and νj=iκ(μi,νj) j. This then induces a cost of “moving” the mass of μ to ν, given as the (Frobenius) inner product κ,C where the matrix Cn×m has [C]ij=cij=d(xi,yj), i.e., the cost of moving a unit of measure from xi to yj. Minimised over the space of all couplings 𝐊(μ,ν), we get the Wasserstein distance, also known as the Earth-Mover Distance (EMD).

More generally, let 𝒳 be a metric space with metric d(,). Let 𝒞(𝒳) be the space of continuous functions on 𝒳 and let (𝒳) be the set of probability measures on 𝒳. Let μ,ν(𝒳). Let 𝐊(μ,ν) be the set of couplings between μ,ν:

𝐊(μ,ν):={κ(𝒳×𝒳);(A,B)𝒳×𝒳,κ(A×𝒳)=μ(A),κ(𝒳×B)=ν(B)} (3)

That is, the set of joint distributions κ(𝒳×𝒳) whose marginals agree with μ and ν respectively. Given a metric (serving as a cost function) d(,) for 𝒳, the p’th Wasserstein distance Wp(μ,ν) for p1 between μ and ν is defined as:

Wp(μ,ν):=(minκ𝐊(μ,ν)𝒳×𝒴d(x,y)p𝑑κ(x,y))1/p (4)

3 Wasserstein Robust Reinforcement Learning

This section formalises robust reinforcement learning by equipping agents with capabilities of determining well-behaved policies under worst-case models which are bounded in ϵ-Wasserstein balls. Our motivation for formalising WR2L is rooted in robust optimal control – a field dedicated to determining optimal action-selection rules under uncertainties induced by modelling assumptions. Here an agent controlling a plant/system faces an adversary that optimises for a disturbance controller while aiming at minimising rewards received by the agent. Interestingly, these problems relate to two-player min-max games and provide a rich literature with efficient solutions in certain specific scenarios, e.g., discrete states and actions, robust linear quadratic regulators, among others. Though generic min-max objectives can lead to robustness (see Pinto et al. (2017)), typical algorithms rooted in game-theory optimise well-behaved objectives by introducing additional structural assumptions to adversaries. For example, it is not uncommon in robust optimal control to assume the process by which disturbances are applied (e.g., additive, multiplicative), or to only consider a subset adhering to maximally bounded norms 22 2 Please note that this is not to say that robust optimal control is restricted to the above settings, see Doyle et al. (2013)..

Starting from robust optimal control, we derive WR2L’s objective by introducing two major refinements to standard reinforcement learning. Similar to robust control, we enable agents to perturb transition models from a reference simulator with the goal of determining worst-case scenarios. We do not, however, pose additional structural assumptions on the process by which adversaries apply these perturbations. Rather, we posit a parameterised class of disturbances and adopt zero’th order optimisation (see Section 5) for flexibility, thus broadening our application spectrum to high-dimensional and stochastic dynamical systems. Optimisation problems of this nature, on the other hand, tend to be ill-specified due to the unconstrained process by which models are fit. To bound allowed variations in transition models, and ensure correctness and tractability, we then introduce an ϵ-ball Wasserstein constraint around a simulator 𝒫0() to guarantee convergence.

Before continuing, however, it is worth revisiting the motivations for choosing such a distance. Per the definition, constraining the possible dynamics to be within an ϵ-Wasserstein ball of a reference dynamics 𝒫0() means constraining it in a certain way. Wasserstein distance has the form mass×distance. If this quantity is constrained to be less than a constant ϵ, then if the mass is large, the distance is small, and if the distance is large, the mass is small. Intuitively, when modelling the dynamics of a system, it may be reasonable to concede that there could be a systemic error - or bias - in the model, but that bias should not be too large. It is also reasonable to suppose that occasionally, the behaviour of the system may be wildly different to the model, but that this should be a low-probability event. If the model is frequently wrong by a large amount, then there is no use in it. In a sense, the Wasserstein ball formalises these assumptions.

In what comes next, we detail the problem definition as a novel min-max game with Wasserstein constrained dynamics. In Section 3.2, we elaborate a generic algorithm capable of robustly updating both model and policy parameters.

3.1 Problem Definition: Robust Objectives and Constraints

As mentioned earlier, the problem definition we introduce in this paper extends reinforcement learning in two directions. In the first, we introduce a min-max objective with parameterised transition models, while in the second, we incorporate Wasserstein constraints to bound allowed perturbations.

Parameterising Policies and Transition Models:

Due to the continuous nature of the state and action spaces considered in this work, we resort to deep neural networks to parameterise policies, which we write as π𝜽(𝒂t|𝒔t), where 𝜽d1 is a set of tunable hyper-parameters to optimise. For instance, these policies can correspond to multi-layer perceptrons for MuJoCo environments, or to convolutional neural networks in case of high-dimensional states depicted as images. Exact policy details are ultimately application dependent and, consequently, provided in the relevant experiment sections.

In principle, one can similarly parameterise transition models using deep networks (e.g., LSTM-type models) to provide one or more action-conditioned future state predictions. Though appealing, going down this path led us to agents that discover worst-case transition models which minimise training data but lack any valid physical meaning. For instance, original experiments we conducted on CartPole ended up involving transitions that alter angles without any change in angular velocities. More importantly, these effects became more apparent in high-dimensional settings where the number of potential minimisers increases significantly. It is worth noting that we are not the first to realise such an artifact when attempting to model physics-based dynamics using deep networks. Authors in (Lutter et al., 2019) remedy these problems by introducing Lagrangian mechanics to deep networks, while others (Koller et al., 2018; Chen et al., 2018) argue the need to model dynamics given by differential equation structures directly.

Though incorporating physics-based priors to deep networks is an important and challenging task that holds the promise of scaling model-based reinforcement learning for efficient solvers, in this paper we rather study an alternative direction focusing on perturbing differential equation solvers and/or simulators with respect to the dynamic specification parameters ϕd2. Not only would such a consideration reduce the dimensionality of parameter spaces representing transition models, but would also guarantee valid dynamics due to the nature of the discrete differential equation solver. Though tackling some of the above problems, such a direction arrives with a new set of challenges related to computing gradients and Hessians of black-box solvers. In Section 5, we develop an efficient and scalable zero-order method for valid and accurate model updates.

Unconstrained Loss Function:

Having equipped agents with the capability of representing policies and perturbing transition models, we are now ready to present an unconstrained version of WR2L’s loss function. Borrowing from robust optimal control, we define robust reinforcement learning as an algorithm that learns best-case policies under worst-case transitions:

max𝜽[minϕ𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)]], (5)

where p𝜽ϕ(𝝉) is a trajectory density function parameterised by both policies and transition models, i.e., 𝜽 and ϕ, respectively:

p𝜽ϕ(𝝉)=μ0(𝒔0)π(𝒂0|𝒔0)t=1T-1𝒫ϕ(𝒔t+1|𝒔t,𝒂t)specs vector and diff. solverπ𝜽(𝒂t|𝒔t) deep network.

At this stage, it should be clear that our formulation, though inspired from robust optimal control, is, truthfully, more generic as it allows for parameterised classes of transition models without incorporating additional restrictions on the structure or the scope by which variations are executed33 3 Of course, allowed perturbations are ultimately constrained by the hypothesis space. Even then, our model is more general compared to robust optimal control that assumes additive, multiplicative, or other forms of disturbances..

Constraints & Complete Problem Definition:

Clearly, the problem in Equation 5 is ill-defined due to the arbitrary class of parameterised transitions. To ensure well-behaved optimisation objectives, we next introduce constraints to bound search spaces and ensure convergence to feasible transition models. For a valid constraint set, our method assumes access to samples from a reference dynamics model 𝒫0(|𝒔,𝒂), and bounds learnt transitions in an ϵ-Wasserstein ball around 𝒫0(|𝒔,𝒂), i.e., the set defined as:

𝒲ϵ(𝒫ϕ(),𝒫0())={𝒫ϕ():𝒲22(𝒫ϕ(|𝒔,𝒂),𝒫0(|𝒔,𝒂))ϵ,(𝒔,𝒂)𝒮×𝒜}, (6)

where ϵ+ is a hyperparameter used to specify the “degree of robustness” in a similar spirit to maximum norm bounds in robust optimal control. It is worth noting, that though we have access to samples from a reference simulator, our setting is by no means restricted to model-based reinforcement learning in an MDP setting. That is, our algorithm operates successfully given only traces from 𝒫0 accompanied with its specification parameters, e.g., pole lengths, torso masses, etc. – a more flexible framework that does not require full model learners.

Though defining a better behaved optimisation objective, the set in Equation 6 introduces infinite number of constraints when considering continuous state and/or actions spaces. To remedy this problem, we target a relaxed version that considers average Wasserstein constraints instead:

𝒲^ϵ(average)(𝒫ϕ(),𝒫0()) ={𝒫ϕ():(𝒔,𝒂)𝒫(𝒔,𝒂)𝒲22(𝒫ϕ(|𝒔,𝒂),𝒫0(|𝒔,𝒂))d(𝒔,𝒂)ϵ} (7)

In words, Equation 7 defines a set where expected Wasserstein distance is bounded by ϵ. Expectation in the above equation is evaluated over state-action pairs sampled according to 𝒫(𝒔,𝒂) which is policy and transition-model dependent. Precisely, one can factor 𝒫(𝒔,𝒂) as:


where 𝒔𝒮 and 𝒂𝒜 are to be interpreted as events being elements of the state and action sets – a notation better suites for the continuous nature of the considered random variables. Moreover, ρπϕ0(𝒔𝒮) is a uniform distribution over state-actions pairs sampled from a trajectory. Precisely, the way we compute the expected Wasserstein distance is two steps. In the first, given a batch of trajectories sampled according to any policy π, potentially of varying lengths, we create a bucket of state-action pairs44 4 Since we do not require access to current policies in order to compute the average Wasserstein distance, an argument can be made that WR2L support off-policy constraint evaluation. This is an important link that we plan to further exploit in the future to improve efficiency, scalablity, and enable transfer between various tasks. . Given such data, we then compute expected Wasserstein distance over the pairs using Monte-Carlo estimation. The π we use is one which samples actions uniformly at random.

With this in mind, we arrive at WR2L’s optimisation problem allowing for best policies under worst-case yet bounded transition models:


[enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] Wasserstein Robust Reinforcement Learning Objective:

max𝜽[minϕ𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)]]s.t.𝔼(𝒔,𝒂)π()ρπϕ0()[𝒲22(𝒫ϕ(|𝒔,𝒂),𝒫0(|𝒔,𝒂))]ϵ (8)

3.2 Solution Methodology

Having derived our formal problem definition, this section presents our approach to solving for 𝜽 and ϕ in the objective of Equation 8. On a high level, our solution methodology follows an alternating procedure interchangeably updating one variable given the other fixed.

Updating Policy Parameters:

It is clear from Equation 8 that the average Wasserstein distance constraint is independent from 𝜽 and can, in fact, use any other policy π to estimate the expectation. Hence, given a fixed set of model-parameters, 𝜽 can be updated by solving the relevant sub-problem of Equation 8 written as:


Interestingly, this problem is a standard reinforcement learning one with a crucial difference in that traces are sampled according to the transition model given by fixed model-parameters, ϕ that ultimately differ from these of the original simulator ϕ0. Consequently, one can easily adapt any policy search method for updating policies under fixed dynamical models. As described later in Section 5, we make use of proximal policy optimisation (Schulman et al., 2017), for instance, to update such action-selection-rules.

Updating Model Parameters:

Now, we turn our attention to solving the average constraint optimisation problem needed for updating ϕ given a set of fixed policy parameters 𝜽. Contrary to the previous step, here, the Wasserstein constraints play an important role due to their dependence on ϕ. Unfortunately, even with the simplification introduced in Section 3.1 the resultant constraint is still difficult to computer in general, the difficulty being the evaluation of the Wasserstein term55 5 The situation is easier in case of two Gaussian densities. Here, however, we keep the treatment general by proposing an alternative direction using Taylor expansions..

To alleviate this problem, we propose to approximate the constraint in (8) by its Taylor expansion up to second order. That is, defining


The above can be approximated around ϕ0 by a second-order Taylor as:


Recognising that W(ϕ0)=0 (the distance between the same probability densities), and ϕW(ϕ0)=0 since ϕ0 minimises W(ϕ), we can simplify the Hessian approximation by writing:


Substituting our approximation back in the original problem in Equation 8, we reach the following optimisation problem for determining model parameter given fixed policies:

minϕ𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)]s.t.12(ϕ-ϕ0)𝖳𝑯0(ϕ-ϕ0)ϵ, (9)

where 𝑯0=ϕ2𝔼(𝒔,𝒂)π()ρπϕ0()[𝒲22(𝒫ϕ(|𝒔,𝒂),𝒫0(|𝒔,𝒂))]|ϕ=ϕ0 is the Hessian of the expected squared 2-Wasserstein distance evaluated at ϕ0.

Optimisation problems with quadratic constraints can be efficiently solved using interior-point methods. To do so, one typically approximates the loss with a first-order expansion and determines a closed-form solution. Consider a fixed 𝜽[k] and ϕ[k] at some iteration k. To find ϕ[k+1], we solve:


It is easy to show that a minimiser to the above equation can derived in a closed-form as:

ϕ[k+1]=ϕ0-2ϵ𝒈[k],𝖳𝑯0-1𝒈[k]𝑯0-1𝒈[k], (10)

with 𝒈[k] denoting the gradient66 6 Remark: Although this looks superficially similar to an approximation made in TRPO Schulman et al. (2015a), the latter aims to optimise the policy parameter rather than dynamics. Furthermore, the constraint is based on the Kullback-Leibler divergence rather than the Wasserstein distance evaluated at 𝜽[k] and ϕ[k], i.e., 𝒈[k]=ϕ𝔼𝝉p𝜽ϕ(𝝉)𝔼[total(𝝉)]|𝜽[k],ϕ[k].

Generic Algorithm:

Having described the two main steps needed for updating policies and models, we now summarise these findings in the pseudo-code in Algorithm 1.

\[email protected]@algorithmic[1] \STATEInputs: Wasserstein distance Hessian, 𝑯0 evaluated at ϕ0 under any policy π, radius of the Wasserstein ball ϵ, and the reference simulator specification parameters ϕ0 \STATEInitialise ϕ[0] with ϕ0 and policy parameters 𝜽[0] arbitrarily \FORk=0,1, \STATESet 𝒙[0]ϕ[0] \STATEPhase I: Update Model-Parameter while fixing the policy: \FORj=0,1, \STATECompute descent direction for the model parameters as given by Equation 10:
\STATEUpdate candidate solution, while satisfying step size conditions (see discussion of Wolfe conditions below) on the learning rate α:
\ENDFOR\STATEPerform model update setting ϕ[k+1]𝒙[j] \STATEPhase II: Update policy given new model-parameters: \STATEUse any standard reinforcement learning algorithm for ascending in the gradient direction, e.g., 𝜽[k+1]𝜽[k]+β[k]𝜽𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)]|𝜽[k],ϕ[k+1], with β[k] is a policy learning rate. \ENDFOR
\algorithmcfname 1 Wasserstein Robust Reinforcement Learning

As the Hessian77 7 Please note our algorithm does not need to store the Hessian matrix. In line 7 of the algorithm, it is clear that we require Hessian-vector products. These can be easily computed using computational graphs without having access to the full Hessian matrix. of the Wasserstein distance is evaluated based on reference dynamics and any policy π, we pass it, along with ϵ and ϕ0 as inputs. Then Algorithms 1 operates in a descent-ascent fashion in two main phases. In the first, lines 5 to 10 in Algorithm 1, dynamics parameters are updated using the closed-form solution in Equation 10, while ensuring that learning rates abide by the Wolfe conditions (Wolfe (1969)). With this, the second phase (line 11) utilises any state-of-the-art reinforcement learning method to adapt policy parameters generating 𝜽[k+1].

In the next section we provide theoretical guarantees of Algorithm 1. Namely, we show that the inner loop, when idealised, converges to a stationary point in model parameters. Given these results, we show that the update in line 12 of Algorithm 1 amounts a valid gradient-step. This, in turn, reduces the second step of the proof to a standard gradient-based one with an objective only parameterised through policies.

4 Theoretical Guarantees

In this section, we discuss and prove theoretical properties related to Algorithm 1. The algorithm consists of an inner loop which iterates through a sequence of dynamics parameters, and an outer look which updates policies parameters.

In order to analyse the process, we make the following assumptions:

  1. (A1)

    ϕ𝚽d2, and 𝚽 is compact .

  2. (A2)

    For any given 𝜽, 𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)] as a function of ϕ is continuous on 𝚽 and continuously differentiable on int(𝚽).

  3. (A3)

    𝑯0 is positive definite.

  4. (A4)

    For any given 𝜽, ϕ𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)] as a function of ϕ is Lipschitz continuous on int(𝚽).

The need for these assumptions will be made clear, but in essence, they are needed for the analysis below which shows convergence of iterates of the inner loop and that the computation of the gradient step in line 1 in Algorithm 1 is a valid gradient update.

Consider the following function and optimisation problem:

Fixing 𝜽,

minϕF𝜽(ϕ)=𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)]+bart(12(ϕ-ϕ0)𝖳𝑯0(ϕ-ϕ0)-ϵ). (11)

Here, bart is a barrier function parameterised by t{}. A standard barrier function in interior point optimisation methods - and one which we take here - is

bart(x)=-1tlog(-x) (12)

for x<0. As t, (12) approaches 𝟏{x=0}. The barrier term in (11) is only defined for values of ϕ for which 12(ϕ-ϕ0)𝖳𝑯0(ϕ-ϕ0)<ϵ, and when t=, (11) is equivalent to (9). A solution to (9) can be approximated arbitrarily closely by making t large enough.

Consider line 1 in Algorithm 1 where 𝒑[j] is computed. We may consider a modification like so:

𝒑[j](ϕ0-2ϵ𝒈[k],𝖳𝑯0-1𝒈[k]𝑯0-1𝒈[k]-𝒙[j])-νσϕF𝜽(𝒙[j]) (13)

where νσ0 is a non-negative real number which ensures 𝒑[j] is a decent direction for F𝜽 at 𝒙[j]. Such a νσ always exists when ϕF𝜽(𝒙[j])>0, by taking νσ large enough. Indeed, we do not require νσ to be merely a descent direction, but a σ-strict descent direction: Let σ(0,1] be a constant (viewed as a hyperparameter of the algorithm) and define υj(νσ) by

cosυj(νσ):=-ϕF𝜽(𝒙[j])T𝒑[j]ϕF𝜽(𝒙[j])𝒑[j] (14)

for 𝒑[j] defined by (13). Now we choose νσ=min{ν:cosυj(ν)σ}.

The reasons for this criterion will be made clearer later, but one of the implications will be that if t (which can be treated as a hyperparameter) is set to t=, then νσ=0 for any point 𝒙[j] in the interior of the constraint set. Indeed, Algorithm 1 can be viewed as implementing (13) with hyper parameter t=.

In line (1), α is chosen per the Wolfe conditions. In this context, the Wolfe conditions are as follows: For 𝒑[j] a descent direction for F𝜽 at 𝒙[j], and constants 0<c1<c2<1,

F𝜽(𝒙[j]+α𝒑[j]) F𝜽(𝒙[j])+c1αϕF𝜽(𝒙[j])T𝒑[j] (15a)
ϕF𝜽(𝒙[j]+α𝒑[j])T𝒑[j] c2ϕF𝜽(𝒙[j])T𝒑[j] (15b)

Condition (15a) is also known as the Armijo condition.

For notational convenience, define for a positive number a, the constraint set

𝒞a:={ϕd2:12(ϕ-ϕ0)𝖳𝑯0(ϕ-ϕ0)a} (16)

We show below that there is always an α that satisfies (15a)– (15b) when t<. We then use this fact to show that iterates (𝒙[j])j determined in this way converge to a stationary point of F𝜽 in the interior of the constraint set 𝒞ϵ.

Lemma 1.

Suppose t< and Assumptions (A1)(A4) hold. Let L:={ϕΦ:F𝛉(ϕ)F𝛉(ϕ0)}.

Then for any given 𝛉, the following hold:

  1. (i)

    F𝜽 is bounded from below on 𝚽 and continuously differentiable on int(𝚽).

  2. (ii)

    There exists an 0<ϵ(𝜽)<ϵ such that 𝒩:=int(𝒞ϵ(𝜽)).

  3. (iii)

    ϕF𝜽 is Lipschitz continuous on 𝒩.

  4. (iv)

    For 𝒙, there is always an α0 that satisfies (15a)– (15b) with respect to a descent direction 𝒑 of F𝜽 at 𝒙, and also satisfies the condition 𝒙+α𝒑.


(i) Since 𝚽 is compact (Assumption (A1)) and 𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)] is continuous on 𝚽 (Assumption (A2)), it follows that 𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)] is bounded from below on 𝚽. This implies that F𝜽 is bounded from below on 𝚽.

Since 𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)] is continuously differentiable on int(𝚽) (Assumption (A2)), and the chosen barrier function (12) is continuously differentiable on (-,0), it follows F𝜽 is continuously differentiable on int(𝚽).

(ii) Since 𝑯0 is positive definite (Assumption (A3)), we have 0<a<b𝒞a𝒞b (strictly). The barrier function (12) forces F𝜽 to go to + at the boundary of 𝒞ϵ, and per the proof of part (i) above, 𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)] is bounded from below on 𝚽. This means that there is an ϵ(𝜽)<ϵ such that F𝜽(ϕ)>F𝜽(ϕ0) for any ϕ𝒞ϵ𝒞ϵ(𝜽). For such an ϵ(𝜽), we have int(𝒞ϵ(𝜽))

(iii) Clearly, (12) is Lipschitz continuous when x is bounded away from zero. In conjunction with Assumption (A4), we see that ϕF𝜽 is Lipschitz continuous on 𝒩.

(iv) Since 𝒑 is a descent direction, 0<c1<1, F𝜽 is bounded from below on 𝚽 and continuously differentiable on int(𝚽) (from part (i) above), there must exist an α~>0 such that F𝜽(𝒙+α~𝒑)=F𝜽(𝒙)+c1α~ϕF𝜽(𝒙)T𝒑 and any αI:=[0,α~] satisfies (15a). By virtue of satisfying (15a), 𝒙+α𝒑 for any αI.

Regarding condition (15b), observe that by the Mean Value Theorem, there is some α1I such that ϕF𝜽(𝒙+α1𝒑)T𝒑=c1ϕF𝜽(𝒙)T𝒑. Since 𝒑 is a descent direction and 0<c1<c2, there is, by the continuous differentiability of F𝜽, some 0α2α1 such that ϕF𝜽(𝒙+α2𝒑)=c2ϕF𝜽(𝒙)T𝒑. Furthermore, α2I since α2α1 and α1I. Therefore, there is a sub-interval of I for which both Wolfe conditions, (15a) and (15b) are satisfied. ∎

The above gives us the following:

Corollary 2.

For t<, the sequence (𝐱[j])j0 generated by Algorithm 1 with assignment (13) in line 1 converges to a stationary point of F𝛉.


We apply a result of Zoutendijk, reproduced as Proposition 7 in Appendix C from Nocedal and Wright (2006).

By Lemma 1, the conditions of Proposition 7 are satisfied for F𝜽 with 𝒩 defined in the lemma. We have cosυjνσσ, therefore, per the conclusion of the aforementioned proposition, we conclude limjϕF𝜽(𝒙[j])=0.

Note, that whilst the statement of Proposition 7 does not impose the condition that ϕ𝚽 for the sub-level set , the result is still applicable as we showed in part (iv) that there are always choices of α that can be made such that iterates that start in remain in , and that 𝒩.

Let 𝒑~[j] be the bracketed term in (13) (so 𝒑[j]=𝒑~[j]-νσϕF𝜽(𝒙[j])). Let 𝒙[j]int(𝒞ϵ) and for ease of notation, let 𝒈j=ϕ𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)]|ϕ=𝒙[j]. Since 𝒑~[j] was derived from (10), it is a descent direction for 𝒈j. Let D be the diameter of the constraint set 𝒞ϵ, then 𝒑~[j]D. Let υ~j be the angle between 𝒑~[j] and 𝒈j. Then 𝒑~[j]cosυ~jj where j is the distance from 𝒙[j] to the boundary of 𝒞ϵ along the direction 𝒈j. Thus, cosυ~jj/D. Therefore, for any given 𝒙[j]int(𝒞ϵ), ϕF𝜽(𝒙[j])𝒈j as t, implying that υσ0 if σj/D. Furthermore, fixing t and generating a sequence (𝒙[j])j0 from the inner loop (Phase I) of Algorithm 1 using (13), we see that due to the barrier function, there is some lower bound =inf(j)j0, and so there is an upper bound on υσ.

Finally, consider the special case where t=. Then νσ=0, but we can only guarantee to satisfy the the first of the Wolfe conditions, the Armijo condition (15a), and therefore, cannot use the above proof for convergence of to a stationary point of (𝒙[j])j0.

Now we continue to consider the convergence of the algorithm for t<. We define a mapping ϕ*:𝚯𝚽 as follows: ϕ*(𝜽):=limj𝒙[j] when Phase I of Algorithm 1 is executed with 𝒙[0]=ϕ[0] (as it always is), and with 𝜽 as the policy parameter. Thus, if the for loop in Phase I were to run for sufficiently many iterations, then after line 1, we would have ϕk+1=ϕ*(𝜽[k]). For notational convenience, we use the shorthand ϕk+1*=ϕ*(𝜽[k]).

Recalling that F𝜽(ϕ) is a function of both ϕ and 𝜽, we can define a function G(𝜽):=F𝜽(ϕ*(𝜽)).

The point of the above definitions is this: If the assumptions made in Theorem 3 (see below) hold, then Algorithm 1 will be generating a sequence of policy parameters (𝜽[k])k0 through gradient ascent of the function G:


Getting back to the original motivation, we wished to solve the max-min problem (8). As discussed, this problem is intractable in part because the minimisation is over a non-convex function of ϕ. Thus, in Algorithm 1 we opted instead to determine ϕ which was (close to) a stationary point of the (barrier-augmented) objective. This is represented by the function G; specifically, G(𝜽[k]) is the value of the (barrier-augmented) objective at the stationary point.

But then, what of the maximisation? Ideally, we would like to solve max𝜽G(𝜽) as a substitute for the max-min problem (8), but again, due to the non-concavity of the objective with respect to 𝜽, maximisation is generally intractable, and we opt instead to converge to a 𝜽* that is a stationary point of G. The fact that the algorithm does this (and the sufficient conditions for it to do so) is shown in the following theorem.

Theorem 3.

Suppose t< and each sequence (𝐱[j])j0 generated in Phase I of Algorithm 1 is with assignment (13) applied in line 1. If ϕ[k+1]=limj𝐱[j] in line 1 of Algorithm 1, then the 𝛉[k] parameter update in line 1 (which is implemented with ϕ set to ϕ[k+1]) is a gradient ascent step with respect to G at 𝛉[k].


We have

𝜽G(𝜽[k])=𝜽F𝜽|𝜽=𝜽[k],ϕ=ϕk+1*+𝑱(𝜽[k])ϕF𝜽|𝜽=𝜽[k],ϕ=ϕk+1* (17)

where 𝑱(𝜽[k]) is the Jacobian of the function ϕ* evaluated at 𝜽[k].

The RHS will be equal to 𝜽F𝜽|𝜽=𝜽[k],ϕ=ϕk+1* if ϕF𝜽|𝜽=𝜽[k],ϕ=ϕk+1*=0. But this is indeed the implication of Corollary 2 if the sequence (𝒙[j])j0 is left to run to convergence.

Thus, we have


and observe the correspondence to the term in line 1 in Algorithm 1: Indeed, they will be exactly the same if the assumptions in the statement of the Theorem hold.

Consequently, the 𝜽[k] parameter update in line 1 in Algorithm 1 (which is implemented with ϕ fixed to ϕ[k+1]) is a valid gradient step if ϕ[k+1]=limj𝒙[j].

Consequently, we have the following, which follows by standard results from the numerical optimisation literature (see, e.g., Nocedal and Wright (2006))

Corollary 4.

If G is bounded from above and has Lipschitz-continuous first derivatives, then the sequence of step sizes (β[k])k0 can be chosen so as to ensure that the sequence (𝛉[k])k0 will converge to a stationary point of G.

Of course, in practice, we may have to terminate Phase I before the sequence has converged, in which case the update in line 1 is an approximate gradient update which can be made to have an error as small as desired by terminating Phase I as late as is needed.

5 Zero’th Order Wasserstein Robust Reinforcement Learning

So far, we have presented an algorithm for robust reinforcement learning assuming accessibility to first and second-order information of the loss and constraint. It is relatively easy to attain such information if we were to follow a model-based setting that parameterises transitions with deep networks. As mentioned earlier, however, we follow another route that utilises a black-box optimisation scheme as deep neural networks are not always suitable for dynamical systems grounded in Lagrangian mechanics and physics (Lutter et al., 2019; Lutter and Peters, 2019).

This section details our zero-order robust solver, where we present an implementation of our approach for a scenario where training can be done on a simulator for which the dynamics are parameterised and can be altered at will. Whilst we refer to simulators (since much of RL training is done on such), almost exactly the same applies to differential equation solvers or other software based techniques for training a policy before deployment into the real world.

To elaborate, consider a simulator 𝕊ϕ for which the dynamics are parameterised by a real vector ϕ, and for which we can execute steps of a trajectory (i.e., the simulator takes as input an action 𝒂 and gives back a successor state and reward). For generating novel physics-grounded transitions, one can simply alter ϕ and execute the instruction in 𝕊ϕ from some a state 𝒔𝒮, while applying an action 𝒂𝒜. Not only does this ensure valid (under mechanics) transitions, but also promises scalability as specification parameters typically reside in lower dimensional spaces compared to the number of tuneable weights when using deep networks as transition models.

As we do not explicitly model transitions (e.g., the intricate operations of the simulator or differential equations solver), one has to tackle an additional challenge when requiring gradient or Hessian information to perform optimisation. Namely, if the idea of parameterising simulators through dynamic specifications in Phases I and II of Algorithm 1 is to be successfully executed, we require a procedure for estimating first and second-order information based on only function value evaluations of 𝕊ϕ.

Next, we elaborate how one can acquire such estimates by proposing a novel zero-order method for estimating gradients and Hessians that we use in our experiments that demonstrate scalability, and robustness on high-dimensional robotic environments.

Gradient Estimation:

Recalling the update rule in Phase I of Algorithm 1, we realise the need for, estimating the gradient of the loss function with respect to the vector specifying the dynamics of the environment, i.e., 𝒈[k]=ϕ𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)]|𝜽[k],ϕ[k] at each iteration of the inner-loop j. Handling simulators as black-box models, we estimate the gradients by sampling from a Gaussian distribution with mean 𝟎 and σ2𝑰 co-variance matrix. Our choice for such estimates is not arbitrary but rather theoretically grounded as one can easily prove the following proposition:

Proposition 5 (Zero-Order Gradient Estimate).

For a fixed 𝛉 and ϕ, the gradient can be computed as:


The proof of the above proposition can easily be derived by combining the lines of reasoning in (Salimans et al., 2017; Nesterov, 2011), while extending to the parameterisation of dynamical models. To commence, begin by defining 𝒥𝜽(ϕ)=𝔼𝝉p𝜽ϕ[total(𝝉)] for fixed policy parameters 𝜽. Given any perturbation vector, 𝝃𝒩(𝟎,σ2𝑰), we can derive (through a Taylor expansion) the following:

𝒥𝜽(ϕ+𝝃)=𝒥𝜽(ϕ)+𝝃𝖳ϕ𝒥𝜽(ϕ)+12𝝃𝖳ϕ2𝒥𝜽(ϕ)𝝃+𝒪(higher-order terms).

Multiplying by 𝝃, and taking the expectation on both sides of the above equation, we get:

𝔼𝝃𝒩(0,σ2𝑰)[𝝃𝒥𝜽(ϕ+𝝃)] =𝔼𝝃𝒩(𝟎,σ2𝑰)[𝝃𝒥𝜽(ϕ)+𝝃𝝃𝖳ϕ𝒥𝜽(ϕ)+12𝝃𝝃𝖳ϕ2𝒥𝜽(ϕ)𝝃]

Dividing by σ2, we derive the statement of the proposition as:


Hessian Estimation:

Having derived a zero-order gradient estimator, we now generalise these notions to a form allowing us to estimate the Hessian. It is also worth reminding the reader that such a Hessian estimator needs to be performed one time only before executing the instructions in Algorithm 1 (i.e., 𝑯0 is passed as an input). Precisely, we prove the following proposition:

Proposition 6 (Zero-Order Hessian Estimate).

The hessian of the Wasserstein distance around ϕ0 can be estimated based on function evaluations. Recalling that 𝐇0=ϕ2E(𝐬,𝐚)π()ρπϕ0()[W22(Pϕ(|𝐬,𝐚),P0(|𝐬,𝐚))]|ϕ=ϕ0, and defining W(𝐬,𝐚)(ϕ):=W22(Pϕ(|𝐬,𝐚),P0(|𝐬,𝐚)), we prove:

𝑯0 =1σ2𝔼𝝃𝒩(𝟎,σ2𝑰)[1σ2𝝃(𝔼(𝒔,𝒂)π()ρπϕ0()[𝒲(𝒔,𝒂)(ϕ0+𝝃)])𝝃𝖳

Commencing with the right-hand-side of the above equation, we perform second-order Taylor expansions for each of the two terms under under the expectation of 𝝃. Namely, we write:

𝑯0 =1σ2𝔼𝝃𝒩(𝟎,σ2𝑰)[1σ2𝝃(𝔼(𝒔,𝒂)π()ρπϕ0()[𝒲(𝒔,𝒂)(ϕ0+𝝃)])𝝃𝖳 (18)

Now, we analyse each of the above terms separately. For ease of notation, we define the following variables:

𝒈 =ϕ𝔼(𝒔,𝒂)π()ρπϕ0[𝒲(𝒔,𝒂)(ϕ0)]          𝑯=ϕ2𝔼(𝒔,𝒂)π()ρπϕ0[𝒲(𝒔,𝒂)(ϕ0)]
𝑨 =𝝃𝝃𝖳𝒈𝝃𝖳                    𝑩=𝝃𝝃𝖳𝑯𝝃𝝃𝖳
c =𝝃𝖳𝑯𝝃.

Starting with 𝑨, we can easily see that any (i,j) component can be written as 𝑨=n=1d2𝝃i𝝃j𝝃n𝒈n. Therefore, the expectation under 𝝃𝒩(𝟎,σ2𝑰) can be derived as:

𝔼𝝃𝒩(𝟎,σ2𝑰)[𝝃i𝝃j𝝃n𝒈n]=𝒈n𝔼𝝃i𝒩(0,σ2)[𝝃i3]=0   if i=j=n and 0 otherwise.

Thus, we conclude that 𝔼𝝃𝒩(𝟎,σ2𝑰)[𝑨]=𝟎d2×d2.

Continuing with the second term, i.e., 𝑩, we realise that any (i,j) component can be written as 𝑩i,j=n=1d2m=1d2𝝃i𝝃j𝝃n𝝃m𝑯m,n. Now, we consider two cases:

  • Diagonal Elements (i.e., when i=j): The expectation under 𝝃𝒩(𝟎,σ2𝑰) can be further split in three sub-cases

    • Sub-Case I when i=j=m=n: We have 𝔼𝝃𝒩(0,σ2𝑰)[𝝃i𝝃j𝝃n𝝃m𝑯m,n]=𝑯i,i𝔼𝝃i𝒩(0,σ2)=3σ4𝑯i,i.

    • Sub-Case II when i=jm=n: We have 𝔼𝝃𝒩(0,σ2𝑰)[𝝃i𝝃j𝝃n𝝃m𝑯m,n]=𝑯m,m𝔼𝝃𝒩(𝟎,σ2𝑰)[𝝃i2𝝃m2]=σ4𝑯m,m.

    • Sub-Case III when indices are all distinct: We have 𝔼𝝃𝒩(0,σ2𝑰)[𝝃i𝝃j𝝃n𝝃m𝑯m,n]=0.


    [enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] Diagonal Elements Conclusion: Using the above results we conclude that 𝔼𝝃𝒩(𝟎,σ2𝑰)[𝑩i,i]=2σ4𝑯i,i+σ4trace(𝑯).

  • Off-Diagonal Elements (i.e., when ij): The above analysis is now repeated for computing the expectation of the off-diagonal elements of matrix 𝑩. Similarly, this can also be split into three sub-cases depending on indices:

    • Sub-Case I when i=mj=n: We have 𝔼𝝃𝒩(𝟎,σ2𝑰)[𝝃i𝝃j𝝃n𝝃m𝑯m,n]=𝑯i,j𝔼𝝃𝒩(𝟎,σ2𝑰)[𝝃i2𝝃j2]=σ4𝑯i,j.

    • Sub-Case II when i=nj=m: We have 𝔼𝝃𝒩(𝟎,σ2𝑰)[𝝃i𝝃j𝝃n𝝃m𝑯m,n]=𝑯j,i𝔼𝝃𝒩(𝟎,σ2𝑰)[𝝃i2𝝃j2]=σ4𝑯j,i.

    • Sub-Case III when indices are all distinct: We have 𝔼𝝃𝒩(𝟎,σ2𝑰)[𝝃i𝝃j𝝃n𝝃m𝑯m,n]=0.


    [enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] Off-Diagonal Elements Conclusion: Using the above results and due to the symmetric properties of 𝑯, we conclude that 𝔼𝝃𝒩(𝟎,σ2𝑰)[𝑩i,j]=2σ4𝑯i,j

Finally, analysing c, one can realise that 𝔼𝝃𝒩(𝟎,σ2𝑰)[c]=𝔼𝝃𝒩(𝟎,σ2𝑰)[i=1d2j=1d2𝝃i𝝃j𝑯i,j]=σ2trace(𝑯).

Substituting the above conclusions back in the original approximation in Equation 18, and using the linearity of the expectation we can easily achieve the statement of the proposition. ∎

With the above two propositions, we can now perform the updates in Algorithm 1 without the need for performing explicit model learning. This is true as Propositions 6 and 7 devise procedure where gradient and Hessian estimates can be simply based on simulator value evaluations while perturbing ϕ and ϕ0. It is important to note that in order to apply the above, we are required to be able to evaluate 𝔼(𝒔,𝒂)π()ρπϕ0()[𝒲(𝒔,𝒂)(ϕ0)] under random 𝝃 perturbations sampled from 𝒩(𝟎,σ2𝑰). An empirical estimate of the p-Wasserstein distance between two measures μ and ν can be performed by computing the p-Wasserstein distance between the empirical distributions evaluated at sampled data. That is, one can approximation μ by μn=1ni=1nδ𝒙i where 𝒙i are identically and independently distributed according to μ. Approximating νn similarly, we then realise that88 8 In case the dynamics are assumed to be Gaussian, a similar procedure can be followed or a closed form can be used, see Takatsu (2008). 𝒲2(μ,ν)𝒲2(μn,νn).

6 Related work

In this section we review some of the related literature. There is a common theme running through previous works: the formulation of the problem as a max-min (or min-max, depending on whether the goal is to maximise for rewards or minimise for costs) problem. This game-theoretic view is natural formulation when viewing nature as an adversary. Thus, it will be no surprise that the papers discussed below mainly take this approach or close variations of it.

There is a long-standing thread of research on robustness in the classical control community, and the literature in this area is vast, with the method being a standard approach Doyle et al. (2013). This approach was introduced into reinforcement learning by Morimoto and Doya (2005). In that paper, a continuous time reinforcement learning setting was studied for which a max-min problem was formulated involving a modified value function, the optimal solutions of which can be determined by solving Hamilton-Jacobi-Isaacs (HJI) equation.

There is also a line of work on robust MDPs, amongst which are Iyengar (2005); Nilim and El Ghaoui (2005); Wiesemann et al. (2013); Tirinzoni et al. (2018); Petrik and Russell (2019). In particular, Yang (2017) uses the Wasserstein distance to define uncertainty sets of dynamics in similar way to this work, that is, in an ϵ-ball around a particular dynamics (referred to as nominal distribution in that paper). The paper shows that an optimal Markov control policy is possible the max-min Bellman equations and shows how convex-optimisation techniques can be applied to solve it.

Whilst valuable in their own right, these approaches are not sufficient for the RL setting due to the need in the latter case to give efficient solutions for large state and action spaces, and the fact that the dynamics are not known a priori. We emphasise once again that in our setting, cannot explicitly define the MDP of the reference dynamics, since we do not assume knowledge of it. We assume only that we can sample from it, which is the standard assumption made in RL.

Reasonably, one might expect that model-based reinforcement learning may be a plausible route to address robustness. In Asadi et al. (2018), the learning of Lipschitz continuous models is addressed, and a bound on multi-step prediction error is given in terms of the Wasserstein distance. However, the major stumbling-block with model-based RL techniques is that in high-dimensional state building models that are sufficient for controlling an agent can suffer greatly from model mis-specification or excessive computational costs (e.g., as in training a Gaussian Process, see, e.g., Rasmussen (2003); Deisenroth and Rasmussen (2011)).

We now discuss some papers closer in objective and/or technique to our own. Rajeswaran et al. (2017) approaches the robustness problem by training on an ensemble of dynamics in order to be deployed on a target environment. The algorithm introduced, Ensemble Policy Optimisation (EPOpt), alternates between two phases: (i) given a distribution over dynamics for which simulators (or models) are available (the source domain), train a policy that performs well for the whole distribution; (ii) gather data from the deployed environment (target domain) to adapt the distribution. The objective is not max-min, but a softer variation defined by conditional value-at-risk (CVaR). The algorithm samples a set of dynamics {ϕk} from a distribution over dynamics 𝒫𝝍, and for each dynamics ϕk, it samples a trajectory using the current policy parameter 𝜽i. It then selects the worst performing ϵ-fraction of the trajectories to use to update the policy parameter. Clearly this process bears some resemblance to our algorithm, but there is a crucial difference: our algorithm takes descent steps in the ϕ space. The difference if important when the dynamics parameters sit in a high-dimensional space, since in that case, optimisation-from-sampling could demand a considerable number of samples. A counter argument against our technique might be that our zero’th-order method for estimating gradients and Hessians also requires sampling in high dimensions. This is, indeed the case, but obtaining localised estimates (as gradients and Hessians are local properties) could be easier than global properties (the worse set of parameters in the high-dimensional space). In any case, our experiments demonstrate our algorithm performs well even in these high dimensions. The experiments of Rajeswaran et al. (2017) are on Hopper and HalfCheetah, in which their algorithm is compared to TRPO. We note that Rajeswaran et al. (2017) does not have theoretical guarantees. We also note that we were were unable to find the code for this paper, and did not attempt to implement it ourselves.

The CVaR criterion is also adopted in Pinto et al. (2017), in which, rather than sampling trajectories and finding a quantile in terms of performance, two policies are trained simultaneously: a “protagonist” which aims to optimise performance, and an adversary which aims to disrupt the protagonist. The protagonist and adversary train alternatively, with one being fixed whilst the other adapts. The action space for the adversary, in the tests documented in the paper includes forces on the entities (InvertedPendulum, HalfCheetah, Swimmer, Hopper, Walker2D) that aim to destabalise it. We made comparisons against this algorithm in our experiments.

More recently, Tessler et al. (2019) studies robustness with respect to action perturbations. There are two forms of perturbation addressed: (i) Probabilistic Action Robust MDP (PR-MDP), and (ii) Noisy Action Robust MDP (NR-MDP). In PR-MDP, when an action is taken by an agent, with probability α, a different, possibly adversarial action is taken instead. In NR-MDP, when an action is taken, a perturbation is added to the action itself. Like Rajeswaran et al. (2017) and Pinto et al. (2017), the algorithm is suitable for applying deep neural networks, and the paper reports experiments on InvertedPendulum, Hopper, Walker2d and Humanoid. We tested against PR-MDP in some of our experiments, and found it to be lacking in robustness (see Section 7, Figure 1 and Figure 2).

In Lecarpentier and Rachelson (2019) a non-stationary Markov Decision Process model is considered, where the dynamics can change from one time step to another. The constraint is based on Wasserstein distance, specifically, the Wasserstein distance between dynamics at time t and t is bounded by L|t-t|, i.e., is L-Lipschitz with respect to time, for some constant L. They approach the problem by treating nature as an adversary and implement a Minimax algorithm. The basis of their algorithm is that due to the fact that the dynamics changes slowly (due to the Lipschitz constraint), a planning algorithm can project into the future the scope of possible future dynamics and plan for the worst. The resulting algorithm, known as Risk Averse Tree Search, is - as the name implies - a tree search algorithm. It operates on a sequence “snapshots” of the evolving MDP, which are instances of the MDP at points in time. The algorithm is tested on small grid world, and does not appear to be readily extendible to the continuous state and action scenarios our algorithm addresses.

To summarise, our paper uses the Wasserstein distance for addressing, in common with Lecarpentier and Rachelson (2019), but is suited to applying deep neural networks for continuous state and action spaces. Our paper does not require a full dynamics available to it, merely a parameterisable dynamics. It competes well with the above papers, and operates well for high dimensional problems, as evidenced by the experiments.

7 Experiments & Results

We evaluate WR2L on a variety of continuous control benchmarks from the MuJoCo environment. Dynamics in our benchmarks were parameterised by variables defining physical behaviour, e.g., density of the robot’s torso, friction of the ground, and so on. We consider both low and high dimensional dynamics and demonstrate that our algorithm outperforms state-of-the-art from both standard and robust reinforcement learning. We are chiefly interested in policy generalisation across environments with varying dynamics, which we measure using average test returns on novel systems. The comparison against standard reinforcement learning algorithms allows us to understand whether lack of robustness is a critical challenge for sequential decision making, while comparisons against robust algorithms test if we outperform state-of-the-art that considered a similar setting to ours. From standard algorithms, we compare against proximal policy optimisation (PPO) (Schulman et al., 2017), and trust region policy optimisation (TRPO) (Schulman et al., 2015b); an algorithm based on natural actor-crtic (Peters and Schaal, 2008a; Pajarinen et al., 2019). From robust algorithms, we demonstrate how WR2L favours against robust adversarial reinforcement learning (RARL) (Pinto et al., 2017), and action-perturbed Markov decision processes (PR-MDP) proposed in (Tessler et al., 2019).

It is worth noting that we attempted to include deep deterministic policy gradients (DDPG) (Silver et al., 2014) in our comparisons. Results including DDPG were, however, omitted as it failed to show any significant robustness performance even on relatively simple systems, such as the inverted pendulum; see results reported in Appendix A. During initial trials, we also performed experiments parameterising models using deep neural networks. Results demonstrated that these models, though minimising training data error, fail to provide valid physics-grounded dynamics. For instance, we arrived at inverted pendula models that vary pole angles without exerting any angular speeds. This problem became even more apparent in high-dimensional systems, e.g., Hopper, Walker, etc due to the increased number of possible minima. As such, results presented in this section make use of our zero-order method that can be regarded as a scalable alternative for robust solutions.

7.1 MuJoCo benchmarks

Contrary to other methods rooted in model-based reinforcement learning, we evaluate our method both in low and high-dimensional MuJuCo tasks (Brockman et al., 2016). We consider a variety of systems including CartPole, Hopper, and Walker2D; all of which require direct joint-torque control. Keeping with the generality of our method, we utilise these dynamical as-is with no additional alterations. Namely, we use the exact setting of these benchmarks as that shipped with OpenAI gym without any reward shaping, state-space augmentation, feature extraction, or any other modifications of-that-sort. For clarity, we summarise variables parameterising dynamics in Table 1, and detail specifics next.


The goal of this classic control benchmark is to balance a pole by driving a cart along a rail. The state space is composed of the position x and velocity x˙ of the cart, as well as the angle θ and angular velocities of the pole θ˙. We consider two termination conditions in our experiments: 1) pole deviates from the upright position beyond a pre-specified threshold, or 2) cart deviates from its zeroth initial position beyond a certain threshold. To conduct robustness experiments, we parameterise the dynamics of the CartPole by the pole length lp, and test by varying lp[0.3,3].

Hopper: In this benchmark, the agent is required to control a hopper robot to move forward without falling. The state of the hopper is represented by positions, {x,y,z}, and linear velocities, {x˙,y˙,z˙}, of the torso in global coordinate, as well as angles, {θi}i=02, and angular speeds, {θ˙i}i=02, of the three joints. During training, we exploit an early-stopping scheme if “unhealthy” states of the robot were visited. Parameters characterising dynamics included densities {ρi}i=03 of the four links, armature {ai}i=02 and damping {ζi}i=02 of three joints, and the friction coefficient μg. To test for robustness, we varied both frictions and torso densities leading to significant variations in dynamics. We further conducted additional experiments while varying all 11 dimensional specification parameters.

Walker2D: This benchmark is similar to Hopper except that the controlled system is a biped robot with seven bodies and six joints. Dimensions for its dynamics are extended accordingly as reported in Table 1. Here, we again varied the torso density for performing robustness experiments in the range ρ0[500,3000].

Halfcheetah: This benchmark is similar to the above except that the controlled system is a two-dimensional slice of a three-dimensional cheetah robot. Parameters specifying the simulator consist of 21 dimensions, with 7 representing densities. In our two-dimensional experiments we varied the torso-density and floor friction, while in high-dimensional ones, we allowed the algorithm to control all 21 variables.

7.2 Experimental protocol

1D experiment 2D experiment High-dimensional experiment
Inverted Pendulum lp None None
Hopper ρ0 {ρ0,μg} {ρi}i=03{ai}i=02{ζi}i=02μg
Walker2D ρ0 {ρ0,μg} {ρi}i=06{ai}i=05{ζi}i=05μg
HalfCheetah None {ρ0,μg} {ρi}i=07{ai}i=05{ζi}i=05μg
Table 1: Parameterisation of dynamics. See section 7.1 for the physical meaning of these parameters.

Our experiments included training and a testing phases. During the training phase we applied Algorithm 1 for determining robust policies while updating transition model parameters according to the min-max formulation. Training was performed independently for each of the algorithms on the relevant benchmarks while ensuring best operating conditions using hyper-parameter values reported elsewhere (Schulman et al., 2017; Pinto et al., 2017; Tessler et al., 2019).

For all benchmarks, policies were represented using parametrised Gaussian distributions with their means given by a neural network and standard derivations by a group of free parameters. The neural network consisted of two hidden layers with 64 units and hyperbolic tangent activations in each of the layers. The final layer exploited linear activation so as to output a real number. Following the actor-critic framework, we also trained a standalone critic network having the same structure as that of the policy.

For each policy update, we rolled-out in the current worst-case dynamics to collect a number of transitions. The number associated to these transitions was application-dependent and varied between benchmarks in the range of 5,000 to 10,000. The policy was then optimised (i.e., Phase II of Algorithm 1) using proximal policy optimization with a generalised advantage estimation. To solve the minimisation problem in the inner loop of Algorithm 1, we sampled a number of dynamics from a diagonal Gaussian distribution that is centered at the current worst-case dynamics model. The number of sampled dynamics and the variance of the sampled distributions depended on both the benchmark itself, and well as the dimensions of the dynamics. Gradients needed for model updates were estimated using the results in Propositions 7 and 8. Finally, we terminated training when the policy entropy dropped below an application-dependent threshold.

When testing, we evaluated policies on unseen dynamics that exhibited simulator variations as described earlier. We measured performance using returns averaged over 20 episodes with a maximum length of 1,000 time steps on testing environments. We note that we used non-discounted mean episode rewards to compute such averages.

7.3 Comparison Results & Benchmarking

This section summarises robustness results showing that our method significantly outperforms others from both standard and robust reinforcement learning in terms of average testing returns as dynamics vary.

Results with One-Dimensional Model Variation:

Figure 1 shows the robustness of policies on a simple inverted pendulum while varying the pole length in the ranges from 0.3 to 3.0. For a fair comparison, we trained two standard policy gradient methods (TRPO (Schulman et al., 2015b) and PPO (Schulman et al., 2017)), and two robust RL algorithms (RARL (Pinto et al., 2017), PR-MDP (Tessler et al., 2019)) with the reference dynamics preset by our algorithm. The range of evaluation parameters was intentionally designed to include dynamics out of the ϵ-Wasserstein ball. Clearly, WR2L outperforms all baselines in this benchmark.

Figure 1: Robustness results on the inverted pendulum demonstrating that our method outperforms state-of-the-art in terms of average test returns.

Given successful behaviour in low-dimensional state representations, we performed additional experiments on the Hopper and Walker systems to assess robustness against model changes in high-dimensional environments. Figure 2 illustrates these results depicting that our method is again capable of outperforming others including RARL and PR-MDP. It is also interesting to realise that in high-dimensional environments, our algorithm exhibits a trade-off between robustness and optimality due to the min-max definition of WR2L’s objective.


[enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] Experimental Conclusion I: From the above, we conclude that WR2L outperforms others when one-dimensional simulator variations are considered.

Figure 2: Robustness results on Hopper (left) and Walker (right) systems demonstrating that our method outperforms others significantly in terms of average test returns as torso densities vary. It is also interesting to realise that due to the robust problem formulation, our algorithm exhibits a trade-off between optimality and generalisation. Hopper results are with a reference ρ0=1750; PPO2 uses the same implementation as PPO but trained with ρ0=3000. Walker results are attained with a reference model of ρ0=1750.

Results with Two-Dimensional Model Variation:

Though successful when considering one dimensional variation in the environment, it is instructive to analyse how our algorithm performs with multi-dimensional variations. To do so, we performed additional experiments using the Hopper benchmark varying both friction and the density parameters. Results shown in Figure 3 demonstrate that our method is capable of significantly outperforming PPO (i.e., when ϵ=0 – Figure 3(a)), where the range of high (over 2300) test returns is broader even with a small ϵ-ball constraint of 0.003.

(a) Hopper - ϵ=0
(b) Hopper - ϵ=0.003
(c) Hopper - ϵ=0.015
(d) Walker - ϵ=0
(e) Walker - ϵ=0.1
(f) Walker - ϵ=1.0
(g) Cheetah - ϵ=0.0
(h) Cheetah - ϵ=0.005
(i) Cheetah - ϵ=0.03
Figure 3: Results on various benchmarks. Top row represents Hopper results, middle is concerned with Walker, and bottom denotes HalfCheetah. These graphs depict test returns as a function of changes in dynamical parameters and Wasserstein distance. These graphs again show that WR2L outperforms PPO (i.e., when ϵ=0) and that its robustness improves as ϵ increases.

Furthermore, one would expect that such advantages increase with the increase in the radius ϵ. To validate these claims, we re-ran the same experiment devised above while allowing for a larger ϵ of 0.015. It is clear from Figure 3(c) that the robustness range of the policy generated by WR2L does increase with the increase in the ball’s radius.

These results were also verified in two additional benchmarks (i.e., the Walker and HalfCheetah). Here, again our results demonstrate that when 2 dimensional changes are considered, our method outperforms state-of-the-art significantly. We also arrive at the same conclusion that if ϵ increases so does the robustness range. For instance, a robust policy trained with an ϵ of 0.03 achieves high average test returns on a broader range of HalfCheetahs compares to that with an ϵ=0.005, see Figures 4 (h) and 4 (i).

This, in turn, takes us to the following conclusion:


[enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] Experimental Conclusion II: From the above, we conclude that WR2L outperforms others when two-dimensional simulator variations are considered, and that robustness increase with ϵ.

Results with High-Dimensional Model Variation:

Though results above demonstrate robustness, an argument against a min-max objective can be made especially when only considering low-dimensional changes in the simulator. Namely, one can argue the need for such an objective as opposed to simply sampling a set of systems and determining policies performing-well on average similar to the approach proposed in (Rajeswaran et al., 2017).

(a) PPO High-D Var. - HP
(b) Train Low Test High - HP
(c) Train High Test High - HP
(d) PPO High-D Var.- HC
(e) Train Low Test High - HC
(f) Train High Test High - HC
Figure 4: Results evaluating performance when considering high-dimensional variations on the hopper (HP - top row) and HalfCheetah (HC - bottom row) environment. All figures show the empirical distribution of returns on 1,000 testing systems. Figure (a) demonstrates the robustness of PPO. Figure (b) reports empirical test returns of WR2L’s policy trained on only two parameter changes (e.g., friction and density) of the environment but tested on systems with all high-dimensional dynamical parameters modified. Figure (c) trains and tests WR2L altering all dimensional parameters of the simulator. Clearly, our method exhibits robustness even if high-dimensional variations were considered.

A counter-argument to the above is that a gradient-based optimisation scheme is more efficient than a sampling-based one when high-dimensional changes are considered. In other words, a sampling procedure is hardly applicable when more than a few parameters are altered, while WR2L can remain suitable. To assess these claims, we conducted two additional experiments on the Hopper and HalfCheetah benchmarks. In the first, we trained robustly while changing friction and torso densities, and tested on 1000 systems generated by varying all 11 dimensions of the Hopper dynamics, and 21 dimensions of the HalfCheetah system. Results reported in Figures 4(b) and (e) demonstrate that the empirical densities of the average test returns are mostly centered around 3000 for the Hopper, and around 4500 for the Cheetah, which improves that of PPO (Figures 4(a) and (d)) with return masses mostly accumulated at around 1000 in the case of the Hopper and almost equally distributed when considering HalfCheetah.

Such improvements, however, can be an artifact of the careful choice of the low-dimensional degrees of freedom allowed to be modified during Phase I of Algorithm 1. To get further insights, Figures 4(c) and (f) demonstrate the effectiveness of our method trained and tested while allowing to tune all 11 dimensional parameters of the Hopper simulator, and the 21 dimensions of the HalfCheetah. Indeed, our results are in accordance with these of the previous experiment depicting that most of the test returns’ mass remains around 3000 for the Hopper, and improves to accumulate around 4500 for the HalfCheetah. Interestingly, however, our algorithm is now capable of acquiring higher returns on all systems99 9 Please note that we attempted to compare against Rajeswaran et al. (2017). Due to the lack of open-source code, we were not able to regenerate their results. since it is allowed to alter all parameters defining the simulator. As such, we conclude:


[enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] Experimental Conclusion III: From the above, we conclude that WR2L outperforms others when high-dimensional simulator variations are considered.

8 Conclusion & Future Work

In this paper, we proposed a novel robust reinforcement learning algorithm capable of outperforming others in terms of test returns on unseen dynamical systems. Our algorithm formalises a new min-max objective with Wasserstein constraints for policies generalising across varying domains, and considers a zero-order method for scalable solutions. We evaluated our method both theoretically and empirically. On theory side, we proved how our algorithm can attain estimates of gradients and Hessians based on random samples. Empirically, we demonstrated superior performance against state-of-the-art from both standard and robust reinforcement learning on low and high-dimensional MuJuCo environments.

There are a lot of interesting directions we plan to target in the future. First, we aim to consider robustness in terms of other components of MDPs, e.g., state representations, reward functions, and others. Second, we will implement WR2L on real-hardware considering sim-to-real experiments.

9 Acknowledgements

We are grateful to Jan Peters and Andreas Krause for the interesting discussions that helped better-shape this paper. Moreover, we would like to thank each of Rasul Tutunov, and Aivar Sootla for guiding us through discussion on optimisation and control theory. Finally, we thank Jun Yao, Liu Wulong, Chen Zhitang, Jia Zheng, and Zhengguo Li for helping us improve this work with inputs about real-world considerations.


  • Asadi et al. (2018) Asadi, K., Misra, D., and Littman, M. (2018). Lipschitz continuity in model-based reinforcement learning. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 264–273, Stockholmsmässan, Stockholm Sweden. PMLR.
  • Bou-Ammar et al. (2014) Bou-Ammar, H., Eaton, E., Ruvolo, P., and Taylor, M. E. (2014). Online multi-task learning for policy gradient methods. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages II–1206–II–1214. JMLR.org.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym.
  • Busoniu et al. (2010) Busoniu, L., Babuska, R., Schutter, B. D., and Ernst, D. (2010). Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Inc., Boca Raton, FL, USA, 1st edition.
  • Chen et al. (2018) Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 6571–6583. Curran Associates, Inc.
  • Chow et al. (2015) Chow, Y., Tamar, A., Mannor, S., and Pavone, M. (2015). Risk-sensitive and robust decision-making: a cvar optimization approach. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1522–1530. Curran Associates, Inc.
  • Deisenroth et al. (2013) Deisenroth, M., Peters, J., and Neumann, G. (2013). A survey on policy search for robotics. Found. Trends Robot, 2:1–142.
  • Deisenroth and Rasmussen (2011) Deisenroth, M. and Rasmussen, C. E. (2011). Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472.
  • Doyle et al. (2013) Doyle, J. C., Francis, B. A., and Tannenbaum, A. R. (2013). Feedback control theory. Courier Corporation.
  • Emmert-Streib and Dehmer (2018) Emmert-Streib, F. and Dehmer, M. (2018). A machine learning perspective on personalized medicine: An automized, comprehensive knowledge base with ontology for pattern recognition. Machine Learning and Knowledge Extraction, 1(1):149–156.
  • Fischer (2018) Fischer, T. G. (2018). Reinforcement learning in financial markets - a survey. FAU Discussion Papers in Economics 12/2018, Erlangen.
  • Iyengar (2005) Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
  • Kober and Peters (2012) Kober, J. and Peters, J. (2012). Reinforcement Learning in Robotics: A Survey, volume 12, pages 579–610. Springer, Berlin, Germany.
  • Koller et al. (2018) Koller, T., Berkenkamp, F., Turchetta, M., and Krause, A. (2018). Learning-based model predictive control for safe exploration and reinforcement learning. CoRR, abs/1803.08287.
  • Lecarpentier and Rachelson (2019) Lecarpentier, E. and Rachelson, E. (2019). Non-stationary markov decision processes a worst-case approach using model-based reinforcement learning. arXiv preprint arXiv:1904.10090.
  • Lutter and Peters (2019) Lutter, M. and Peters, J. (2019). Deep lagrangian networks for end-to-end learning of energy-based control for under-actuated systems. In International Conference on Intelligent Robots and Systems (IROS).
  • Lutter et al. (2019) Lutter, M., Ritter, C., and Peters, J. (2019). Deep lagrangian networks: Using physics as model prior for deep learning.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. (2013). Playing atari with deep reinforcement learning. CoRR, abs/1312.5602.
  • Morimoto and Doya (2005) Morimoto, J. and Doya, K. (2005). Robust reinforcement learning. Neural Comput., 17(2):335–359.
  • Namkoong and Duchi (2016) Namkoong, H. and Duchi, J. C. (2016). Stochastic gradient methods for distributionally robust optimization with f-divergences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 2216–2224, USA. Curran Associates Inc.
  • Nesterov (2011) Nesterov, Y. (2011). Random gradient-free minimization of convex functions. CORE Discussion Papers 2011001, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE).
  • Nilim and El Ghaoui (2005) Nilim, A. and El Ghaoui, L. (2005). Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.
  • Nocedal and Wright (2006) Nocedal, J. and Wright, S. (2006). Numerical optimization. Springer Science & Business Media.
  • Packer et al. (2018) Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., and Song, D. (2018). Assessing generalization in deep reinforcement learning. CoRR, abs/1810.12282.
  • Pajarinen et al. (2019) Pajarinen, J., Thai, H. L., Akrour, R., Peters, J., and Neumann, G. (2019). Compatible natural gradient policy search. CoRR, abs/1902.02823.
  • Peng et al. (2017) Peng, X. B., Andrychowicz, M., Zaremba, W., and Abbeel, P. (2017). Sim-to-real transfer of robotic control with dynamics randomization. CoRR, abs/1710.06537.
  • Peters and Schaal (2008a) Peters, J. and Schaal, S. (2008a). Natural actor-critic. Neurocomput., 71(7-9):1180–1190.
  • Peters and Schaal (2008b) Peters, J. and Schaal, S. (2008b). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697.
  • Petrik and Russell (2019) Petrik, M. and Russell, R. H. (2019). Beyond confidence regions: Tight bayesian ambiguity sets for robust mdps. arXiv preprint arXiv:1902.07605.
  • Pinto et al. (2017) Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. (2017). Robust adversarial reinforcement learning. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2817–2826, International Convention Centre, Sydney, Australia. PMLR.
  • Rajeswaran et al. (2017) Rajeswaran, A., Ghotra, S., Ravindran, B., and Levine, S. (2017). Epopt: Learning robust neural network policies using model ensembles. International Conference on Learning Representations (ICLR) 2017, arXiv preprint arXiv:1610.01283.
  • Rasmussen (2003) Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer.
  • Salimans et al. (2017) Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
  • Sargent and Hansen (2001) Sargent, T. and Hansen, L. (2001). Robust control and model uncertainty. American Economic Review, 91(2):60–66.
  • Schulman et al. (2015a) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015a). Trust region policy optimization. In Bach, F. and Blei, D., editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1889–1897, Lille, France. PMLR.
  • Schulman et al. (2015b) Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015b). Trust region policy optimization. CoRR, abs/1502.05477.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages I–387–I–395. JMLR.org.
  • Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.
  • Takatsu (2008) Takatsu, A. (2008). Wasserstein geometry of gaussian measures.
  • Tessler et al. (2019) Tessler, C., Efroni, Y., and Mannor, S. (2019). Action robust reinforcement learning and applications in continuous control. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6215–6224, Long Beach, California, USA. PMLR.
  • Tirinzoni et al. (2018) Tirinzoni, A., Petrik, M., Chen, X., and Ziebart, B. (2018). Policy-conditioned uncertainty sets for robust markov decision processes. In Advances in Neural Information Processing Systems, pages 8939–8949.
  • Wiesemann et al. (2013) Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust markov decision processes. Mathematics of Operations Research, 38(1):153–183.
  • Wolfe (1969) Wolfe, P. (1969). Convergence conditions for ascent methods. SIAM review, 11(2):226–235.
  • Yang (2017) Yang, I. (2017). A convex optimization approach to distributionally robust markov decision processes with wasserstein distance. IEEE control systems letters, 1(1):164–169.
  • Zhao et al. (2019) Zhao, C., Sigaud, O., Stulp, F., and Hospedales, T. M. (2019). Investigating generalisation in continuous deep reinforcement learning. CoRR, abs/1902.07015.

Appendix A Deep Deterministic Policy Gradients Results

As mentioned in the experiments section of the main paper, we refrained from presenting results involving deep deterministic policy gradients (DDPG) due to its lack in robustness even on simple systems, such as the CartPole.

Figure 5: Robustness results on the inverted pendulum demonstrating that our method outperforms state-of-the-art in terms of average test returns and that DDPG lacks in robustness performance.

Figure 5 depicts these results showing that DDPG lacks robustness even when minor variations in the pole lenght are introduced. TRPO and PPO, on the other hand, demonstrate an acceptable performance retaining a test return of 1,000 across a broad range of pole lengths variations.

Appendix B Derivation of the Closed Form Solution

In Section 3.2 we presented a closed form solution to the following optimisation problem:


which took the form of:


In this section of the appendix, we derive such an update rule from first principles. We commence transforming the constraint optimisation problem into an unconstrained one using the Lagrangian:


where 𝝀 is a Lagrange multiplier, and 𝒈[k]=ϕ𝔼𝝉p𝜽ϕ(𝝉)[total(𝝉)]|𝜽[k],ϕ[k]𝖳.

Deriving the Lagrangian with respect to the primal parameters ϕ, we write:

ϕ(ϕ,𝝀)=𝒈[k],𝖳+𝝀(ϕ-ϕ0)𝖳𝑯0. (19)

Setting Equation 19 to zero and solving for primal parameters, we attain:


Plugging ϕ back into the equation representing the constraints, we derive:


It is easy to see that with the positive solution for 𝝀, the Karush–Kuhn–Tucker (KKT) conditions are satisfied. Since the objective and constraint are both convex, the KKT conditions are sufficient and necessary for optimality, thus finalising our derivation.

Appendix C Miscellaneous

The following result is used in the proof of Corollary 2. For the reader’s convenience, we reproduce it here from Nocedal and Wright (2006).

Proposition 7 (Zoutendijk, taken from Nocedal and Wright (2006)).

For a differentiable function f:RnR consider any iteration of the form:


where 𝐩j is a descent direction for f at 𝐱j and αj0 is a real number that satisfies the Wolfe conditions. Suppose that f is bounded below in Rn and that f is continuously differentiable in an open set N containing the level set L:={𝐱:f(𝐱)f(𝐱0)} where 𝐱0 is the starting point of the iteration. Assume also that the gradient f is Lipschitz continuous on N, i.e., there exists a constant L>0 such that