Abstract
Reinforcement learning algorithms, though successful, tend to overfit totraining environments hampering their application to the realworld. This paperproposes WR$^{2}$L; a robust reinforcement learning algorithm with significantrobust performance on low and highdimensional control tasks. Our methodformalises robust reinforcement learning as a novel minmax game with aWasserstein constraint for a correct and convergent solver. Apart from theformulation, we also propose an efficient and scalable solver following a novelzeroorder optimisation method that we believe can be useful to numericaloptimisation in general. We contribute both theoretically and empirically. Onthe theory side, we prove that WR$^{2}$L converges to a stationary point in thegeneral setting of continuous state and action spaces. Empirically, wedemonstrate significant gains compared to standard and robust stateoftheartalgorithms on highdimensional MuJuCo environments.
Quick Read (beta)
Wasserstein Robust Reinforcement Learning
Abstract
Reinforcement learning algorithms, though successful, tend to overfit to training environments hampering their application to the realworld. This paper proposes ${\text{WR}}^{2}\text{L}$ – a robust reinforcement learning algorithm with significant robust performance on low and highdimensional control tasks. Our method formalises robust reinforcement learning as a novel minmax game with a Wasserstein constraint for a correct and convergent solver. Apart from the formulation, we also propose an efficient and scalable solver following a novel zeroorder optimisation method that we believe can be useful to numerical optimisation in general. We contribute both theoretically and empirically. On the theory side, we prove that ${\text{WR}}^{2}\text{L}$ converges to a stationary point in the model parameters and that our update is a valid gradient step when it comes to policies. Empirically, we demonstrate significant gains compared to standard and robust stateoftheart algorithms on highdimensional MuJuCo environments.
Wasserstein Robust Reinforcement Learning
Mohammed Amin Abdullah^{1}^{1}footnotemark: 1 ^{†}^{†}thanks: The first three authors are to be considered as joint first coauthors Huawei R&D UK [email protected] Hang Ren ^{1}^{1}footnotemark: 1 ^{2}^{2}footnotemark: 2 Huawei R&D UK Imperial College London [email protected] Haitham BouAmmar ^{1}^{1}footnotemark: 1 ^{†}^{†}thanks: Honorary Lecturer Position at University College London. Huawei R&D UK University College London [email protected] Vladimir Milenković^{2}^{2}footnotemark: 2 ^{†}^{†}thanks: Work done while interning at the Reinforcement Learning Team in Huawei Technologies Research and Development in London. Huawei R&D UK University of Cambridge Rui Luo ^{2}^{2}footnotemark: 2 Huawei R&D UK University College London Mingtian Zhang^{2}^{2}footnotemark: 2 Huawei R&D UK University College London Jun Wang Huawei R&D UK University College London
noticebox[b]Based on NeurIPS 2019 Template.\[email protected]
1 Introduction
Reinforcement learning (RL) has become a standard tool for solving decisionmaking problems with minimal feedback. Applications with these characteristics are ubiquitous, including, but not limited to, computer games (Mnih et al., 2013), robotics (Kober and Peters, 2012; Deisenroth et al., 2013; BouAmmar et al., 2014), finance (Fischer, 2018), and personalised medicine (EmmertStreib and Dehmer, 2018). Although significant progress has been made on developing algorithms for learning largescale and highdimensional reinforcement learning tasks, these algorithms often overfit to training environments and fail to generalise across even slight variations of transition dynamics (Packer et al., 2018; Zhao et al., 2019).
Robustness to changes in transition dynamics, however, is a crucial component for adaptive and safe RL in realworld environments. To illustrate, consider a selfdriving car scenario in which we attempt to design an agent capable of driving a vehicle smoothly, safely, and autonomously. A typical reinforcement learning workflow to solving such a problem consists of building a simulator to emulate realworld scenarios, training in simulation, and then transferring resultant policies to physical systems for control. Unfortunately, such a strategy is easily prone to failure as designing accurate simulators that capture intricate complexities of large cities is extremely challenging. Rather than learning in simulation, another workflow might consist of constructing a pipeline to directly learn on the hardware system itself. Apart from memory constraints, stateoftheart reinforcement learning algorithms exhaust hundreds to millions of agentenvironment interactions before acquiring successful behaviour. Of course, such high demands on sample complexities prohibit the direct application of learning algorithms on realsystems, leaving robustness to misspecified simulators a largely unresolved problem.
Motivated by realworld applications, recent literature in reinforcement learning has focused on the above problems, proposing a plethora of algorithms for robust decisionmaking (Morimoto and Doya, 2005; Pinto et al., 2017; Tessler et al., 2019). Most of these techniques borrow from game theory to analyse, typically in a discrete state and actions spaces, worstcase deviations of agents’ policies and/or environments, see Sargent and Hansen (2001); Nilim and El Ghaoui (2005); Iyengar (2005); Namkoong and Duchi (2016) and references therein. These methods have also been extended to linear function approximators Chow et al. (2015), and deep neural networks Peng et al. (2017) showing (modest) improvements in performance gain across a variety of disturbances, e.g., action uncertainties, or dynamical model variations.
Though successful in practice, current techniques to robust decision making remain taskspecific, and there are two major lingering drawbacks. First, these algorithms fail to provide a general robustness framework due to their specialised heuristics. In particular, algorithms designed for discrete stateaction spaces fail to generalise to continuous domains and vice versa. This is due to their exploiting mathematical properties valid only for discrete state and action spaces, or for convexconcave functions, e.g., $\mathrm{min}\mathrm{max}f(\cdot )=\mathrm{max}\mathrm{min}f(\cdot )$. Clearly, such mathematical derivations are only loosely related to implementation, and further insights can be gathered with more rigorous attempts tackling the continuous problem itself. Apart from theoretical limitations, the second drawback of current approaches can be traced back to the process by which empirical validation is conducted. Rarely do the papers presenting these techniques compare gains against other robust algorithms in literature. In fact, focus is mainly diverted to outperforming stateoftheart reinforcement learning – a criterion easy to satisfy as algorithms from RL were never trained for robustness. Interestingly, upon careful evaluation, we came to realise that robust adversarial reinforcement learning of Pinto et al. (2017), for example, is superior to some of the more recently published works attempting to solve the same problem, see Section 7.
In this paper, we contribute to the above endeavour to solve the robustness problem in RL by proposing a generic framework for robust reinforcement learning that can cope with both discrete and continuous state and actions spaces. The algorithm we introduce, which we call Wasserstein Robust Reinforcement Learning (WR${}^{2}$L), Algorithm 1, aims to find the best policy, where any given policy is judged by the worstcase dynamics amongst all candidate dynamics in a certain set. This set is essentially the average Wasserstein ball around a reference dynamics ${\mathcal{P}}_{0}$. The constraints makes the problem welldefined, as searching over arbitrary dynamics can only result in nonperforming system. The measure of performance is the standard RL objective form, the expected return. Both the policy and the dynamics are parameterised; the policy parameters ${\bm{\theta}}_{k}$ may be the weights of a deep neural network, and the dynamics parameters ${\mathit{\varphi}}_{j}$ the parameters of a simulator or differential equation solver. The algorithm performs estimated descent steps in $\mathit{\varphi}$ space and  after (almost) convergence  performs policy gradient steps in $\bm{\theta}$ space. Since ${\mathit{\varphi}}_{j}$ may be highdimensional, we adapt a zero’th order sampling method based extending Salimans et al. (2017) to make estimations of gradients, and in order to define the constraint set which ${\mathit{\varphi}}_{j}$ is bounded by, we generalise the technique to estimate Hessians (Proposition 6).
We emphasise that although access to a simulator with parameterisable dynamics are required, the actual reference dynamics ${\mathcal{P}}_{0}$ need not be known explicitly nor learnt by our algorithm. Put another way, we are in the “RL setting”, not the “MDP setting” where the transition probability matrix is known a priori. The difference is made obvious, for example, in the fact that we cannot perform dynamic programming, and the determination of a particular probability transition can only be estimated from sampling, not retrieved explicitly. Hence, our algorithm is not modelbased in the traditional sense of learning a model to perform planning.
We believe our contribution is useful and novel for two main reasons. Firstly, our framing of the robust learning problem is in terms of dynamics uncertainty sets defined by Wasserstein distance. Whilst we are not the first to introduce the Wasserstein distance into the context of MDPs (see, e.g., Yang (2017) or Lecarpentier and Rachelson (2019)), we believe our formulation is amongst the first suitable for application to the demanding applicationspace we desire, that being, highdimensional, continuous state and action spaces. Secondly, we believe our solution approach is both novel and effective (as evidenced by experiments below, see Section 7), and does not place a great demand on model or domain knowledge, merely access to a simulator or differentiable equation solver that allows for the parameterisation of dynamics. Furthermore, it is not computationally demanding, in particular, because it does not attempt to build a model of the dynamics, and operations involving matrices are efficiently executable using the Jacobianvector product facility of automatic differentiation engines.
The rest of the paper is organised as follows. In the next section, we provide a background on notation and an overview of Wasserstein distance in its various forms. The problem formulation and main algorithm is then presented in Section 3. In Section 4, we demonstrate convergence to a stationary point of the robust objective. In Section 5, we describe the zero’th order method we use to estimate the Hessian matrix used by the algorithm, and the proof of its correctness is given. In Section 6 we survey related literature. Experiments and results are given in Section 7, and finally, conclusions and future work are discussed in Section 8.
2 Background
This section provides background material needed for the remainder of the paper. We first describe the reinforcement learning framework adopted in this paper, and then proceed to detail notions from Wasserstein distances needed to constrain our learning objective.
2.1 Reinforcement Learning
In reinforcement learning, an agent interacts with an unknown and stochastic environment with the goal of maximising a notion of return (Sutton and Barto, 1998; Peters and Schaal, 2008b; Busoniu et al., 2010). These problems are typically formalised as Markov decision processes (MDPs)^{1}^{1} 1 Please note that we present reinforcement learning with continuous states and actions. This allows us to easily draw similarities to optimal control as detailed later. Extending these notions to discrete settings is relatively straightforward. $\mathcal{M}=\u27e8\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma \u27e9$, where $\mathcal{S}\subseteq {\mathbb{R}}^{d}$ denotes the state space, $\mathcal{A}\subseteq {\mathbb{R}}^{n}$ the action space, $\mathcal{P}:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\to [0,1]$ is a state transition probability describing the system’s dynamics, $\mathcal{R}:\mathcal{S}\times \mathcal{A}\to \mathbb{R}$ is the reward function measuring the agent’s performance, and $\gamma \in [0,1)$ specifies the degree to which rewards are discounted over time.
At each time step $t$, the agent is in state ${\bm{s}}_{t}\in \mathcal{S}$ and must choose an action ${\bm{a}}_{t}\in \mathcal{A}$, transitioning it to a new state ${\bm{s}}_{t+1}\sim \mathcal{P}\left({\bm{s}}_{t+1}{\bm{s}}_{t},{\bm{a}}_{t}\right)$, and yielding a reward $\mathcal{R}({\bm{s}}_{t},{\bm{a}}_{t})$. A policy $\pi :\mathcal{S}\times A\to [0,1]$ is defined as a probability distribution over stateaction pairs, where $\pi ({\bm{a}}_{t}{\bm{s}}_{t})$ represents the density of selecting action ${\bm{a}}_{t}$ in state ${\bm{s}}_{t}$. Upon consequent interactions with the environment, the agent collects a trajectory $\bm{\tau}$ of stateaction pairs. The goal is to determine an optimal policy ${\pi}^{\star}$ by solving:
$${\pi}^{\star}=\mathrm{arg}\underset{\pi}{\mathrm{max}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\pi}(\bm{\tau})}\left[{\mathcal{R}}_{\text{Total}}(\bm{\tau})\right],$$  (1) 
where ${p}_{\pi}(\bm{\tau})$ denotes the trajectory density function, and ${\mathcal{R}}_{\text{Total}}(\bm{\tau})$ the return, that is, the total accumulated reward:
$${p}_{\pi}(\bm{\tau})={\mu}_{0}({\bm{s}}_{0})\pi ({\bm{a}}_{0}{\bm{s}}_{0})\prod _{t=1}^{T1}\mathcal{P}({\bm{s}}_{t+1}{\bm{s}}_{t},{\bm{a}}_{t})\pi ({\bm{a}}_{t}{\bm{s}}_{t})\mathit{\hspace{1em}}\text{and}\mathit{\hspace{1em}\hspace{1em}\u2006}{\mathcal{R}}_{\text{Total}}(\bm{\tau})=\sum _{t=0}^{T1}{\gamma}^{t}\mathcal{R}({\bm{s}}_{t},{\bm{a}}_{t}),$$  (2) 
with ${\mu}_{0}(\cdot )$ denoting the initial state distribution.
2.2 Wasserstein Distance
In our problem definition, we make use of a distance measure to bound allowed variations from a reference transition density ${\mathcal{P}}_{0}(\cdot )$. In general, a number of common metrics for measuring closeness between two probability distributions exist. Examples of which are total variation distance and KullbackLeibler divergence. In this paper, however, we measure distance between two different dynamics by the Wasserstein distance. This has a number of desirable properties; firstly, it is a genuine distance, exhibiting symmetry, which is a property that KL divergence lacks. Secondly, it is very flexible in the forms of the distributions that can be compared; it can measure the distance between two discrete distributions, two continuous distributions, and a discrete and continuous distribution (this latter case implying another valuable advantage  that the supports of the distributions can be different). In all cases, the Wasserstein distance is welldefined. Finally, and perhaps most importantly, the Wasserstein distance takes into account the underlying geometry of the space the distributions are defined on, which could be information that is fruitful to exploit in learning optimal control. Indeed this last point is the core motivator for our choosing Wasserstein distance for our algorithm, as shall be explained later.
Definition:
Given a measurable space $(\mathcal{X},\mathcal{F})$ with $\mathcal{X}$ being a metric space, a pair of discrete measures $\mu ,\nu $ defined on this measurable space can be written as $\mu ={\sum}_{i=1}^{n}{\mu}_{i}{\delta}_{{x}_{i}}$ and $\nu ={\sum}_{j=1}^{m}{\nu}_{j}{\delta}_{{y}_{j}}$ where all ${x}_{i},{y}_{j}\in \mathcal{X}$. A coupling $\kappa (\cdot ,\cdot )$ of $\mu $ and $\nu $ is a measure over $\{{x}_{1},\mathrm{\dots},{x}_{n}\}\times \{{y}_{1},\mathrm{\dots}{y}_{m}\}$ that preserves marginals, i.e, ${\mu}_{i}={\sum}_{j}\kappa ({\mu}_{i},{\nu}_{j})$ $\forall i$ and ${\nu}_{j}={\sum}_{i}\kappa ({\mu}_{i},{\nu}_{j})$ $\forall j$. This then induces a cost of “moving” the mass of $\mu $ to $\nu $, given as the (Frobenius) inner product $\u27e8\kappa ,C\u27e9$ where the matrix $C\in {\mathbb{R}}^{n\times m}$ has ${[C]}_{ij}={c}_{ij}=d({x}_{i},{y}_{j})$, i.e., the cost of moving a unit of measure from ${x}_{i}$ to ${y}_{j}$. Minimised over the space of all couplings $\mathbf{K}(\mu ,\nu )$, we get the Wasserstein distance, also known as the EarthMover Distance (EMD).
More generally, let $\mathcal{X}$ be a metric space with metric $d(\cdot ,\cdot )$. Let $\mathcal{C}(\mathcal{X})$ be the space of continuous functions on $\mathcal{X}$ and let $\mathcal{M}(\mathcal{X})$ be the set of probability measures on $\mathcal{X}$. Let $\mu ,\nu \in \mathcal{M}(\mathcal{X})$. Let $\mathbf{K}(\mu ,\nu )$ be the set of couplings between $\mu ,\nu $:
$$\mathbf{K}(\mu ,\nu ):=\{\kappa \in \mathcal{M}(\mathcal{X}\times \mathcal{X});\forall (A,B)\subset \mathcal{X}\times \mathcal{X},\kappa (A\times \mathcal{X})=\mu (A),\kappa (\mathcal{X}\times B)=\nu (B)\}$$  (3) 
That is, the set of joint distributions $\kappa \in \mathcal{M}(\mathcal{X}\times \mathcal{X})$ whose marginals agree with $\mu $ and $\nu $ respectively. Given a metric (serving as a cost function) $d(\cdot ,\cdot )$ for $\mathcal{X}$, the $p$’th Wasserstein distance ${W}_{p}(\mu ,\nu )$ for $p\ge 1$ between $\mu $ and $\nu $ is defined as:
$${W}_{p}(\mu ,\nu ):={\left(\underset{\kappa \in \mathbf{K}(\mu ,\nu )}{\mathrm{min}}{\int}_{\mathcal{X}\times \mathcal{Y}}d{(x,y)}^{p}\mathit{d}\kappa (x,y)\right)}^{1/p}$$  (4) 
3 Wasserstein Robust Reinforcement Learning
This section formalises robust reinforcement learning by equipping agents with capabilities of determining wellbehaved policies under worstcase models which are bounded in $\u03f5$Wasserstein balls. Our motivation for formalising ${\text{WR}}^{2}\text{L}$ is rooted in robust optimal control – a field dedicated to determining optimal actionselection rules under uncertainties induced by modelling assumptions. Here an agent controlling a plant/system faces an adversary that optimises for a disturbance controller while aiming at minimising rewards received by the agent. Interestingly, these problems relate to twoplayer minmax games and provide a rich literature with efficient solutions in certain specific scenarios, e.g., discrete states and actions, robust linear quadratic regulators, among others. Though generic minmax objectives can lead to robustness (see Pinto et al. (2017)), typical algorithms rooted in gametheory optimise wellbehaved objectives by introducing additional structural assumptions to adversaries. For example, it is not uncommon in robust optimal control to assume the process by which disturbances are applied (e.g., additive, multiplicative), or to only consider a subset adhering to maximally bounded norms ^{2}^{2} 2 Please note that this is not to say that robust optimal control is restricted to the above settings, see Doyle et al. (2013)..
Starting from robust optimal control, we derive ${\text{WR}}^{2}\text{L}$’s objective by introducing two major refinements to standard reinforcement learning. Similar to robust control, we enable agents to perturb transition models from a reference simulator with the goal of determining worstcase scenarios. We do not, however, pose additional structural assumptions on the process by which adversaries apply these perturbations. Rather, we posit a parameterised class of disturbances and adopt zero’th order optimisation (see Section 5) for flexibility, thus broadening our application spectrum to highdimensional and stochastic dynamical systems. Optimisation problems of this nature, on the other hand, tend to be illspecified due to the unconstrained process by which models are fit. To bound allowed variations in transition models, and ensure correctness and tractability, we then introduce an $\u03f5$ball Wasserstein constraint around a simulator ${\mathcal{P}}_{0}(\cdot )$ to guarantee convergence.
Before continuing, however, it is worth revisiting the motivations for choosing such a distance. Per the definition, constraining the possible dynamics to be within an $\u03f5$Wasserstein ball of a reference dynamics ${\mathcal{P}}_{0}(\cdot )$ means constraining it in a certain way. Wasserstein distance has the form $\text{mass}\times \text{distance}$. If this quantity is constrained to be less than a constant $\u03f5$, then if the mass is large, the distance is small, and if the distance is large, the mass is small. Intuitively, when modelling the dynamics of a system, it may be reasonable to concede that there could be a systemic error  or bias  in the model, but that bias should not be too large. It is also reasonable to suppose that occasionally, the behaviour of the system may be wildly different to the model, but that this should be a lowprobability event. If the model is frequently wrong by a large amount, then there is no use in it. In a sense, the Wasserstein ball formalises these assumptions.
In what comes next, we detail the problem definition as a novel minmax game with Wasserstein constrained dynamics. In Section 3.2, we elaborate a generic algorithm capable of robustly updating both model and policy parameters.
3.1 Problem Definition: Robust Objectives and Constraints
As mentioned earlier, the problem definition we introduce in this paper extends reinforcement learning in two directions. In the first, we introduce a minmax objective with parameterised transition models, while in the second, we incorporate Wasserstein constraints to bound allowed perturbations.
Parameterising Policies and Transition Models:
Due to the continuous nature of the state and action spaces considered in this work, we resort to deep neural networks to parameterise policies, which we write as ${\pi}_{\bm{\theta}}({\bm{a}}_{t}{\bm{s}}_{t})$, where $\bm{\theta}\in {\mathbb{R}}^{{d}_{1}}$ is a set of tunable hyperparameters to optimise. For instance, these policies can correspond to multilayer perceptrons for MuJoCo environments, or to convolutional neural networks in case of highdimensional states depicted as images. Exact policy details are ultimately application dependent and, consequently, provided in the relevant experiment sections.
In principle, one can similarly parameterise transition models using deep networks (e.g., LSTMtype models) to provide one or more actionconditioned future state predictions. Though appealing, going down this path led us to agents that discover worstcase transition models which minimise training data but lack any valid physical meaning. For instance, original experiments we conducted on CartPole ended up involving transitions that alter angles without any change in angular velocities. More importantly, these effects became more apparent in highdimensional settings where the number of potential minimisers increases significantly. It is worth noting that we are not the first to realise such an artifact when attempting to model physicsbased dynamics using deep networks. Authors in (Lutter et al., 2019) remedy these problems by introducing Lagrangian mechanics to deep networks, while others (Koller et al., 2018; Chen et al., 2018) argue the need to model dynamics given by differential equation structures directly.
Though incorporating physicsbased priors to deep networks is an important and challenging task that holds the promise of scaling modelbased reinforcement learning for efficient solvers, in this paper we rather study an alternative direction focusing on perturbing differential equation solvers and/or simulators with respect to the dynamic specification parameters $\mathit{\varphi}\in {\mathbb{R}}^{{d}_{2}}$. Not only would such a consideration reduce the dimensionality of parameter spaces representing transition models, but would also guarantee valid dynamics due to the nature of the discrete differential equation solver. Though tackling some of the above problems, such a direction arrives with a new set of challenges related to computing gradients and Hessians of blackbox solvers. In Section 5, we develop an efficient and scalable zeroorder method for valid and accurate model updates.
Unconstrained Loss Function:
Having equipped agents with the capability of representing policies and perturbing transition models, we are now ready to present an unconstrained version of ${\text{WR}}^{2}\text{L}$’s loss function. Borrowing from robust optimal control, we define robust reinforcement learning as an algorithm that learns bestcase policies under worstcase transitions:
$$\underset{\bm{\theta}}{\mathrm{max}}\left[\underset{\mathit{\varphi}}{\mathrm{min}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]\right],$$  (5) 
where ${p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})$ is a trajectory density function parameterised by both policies and transition models, i.e., $\bm{\theta}$ and $\mathit{\varphi}$, respectively:
$${p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})={\mu}_{0}({\bm{s}}_{0})\pi ({\bm{a}}_{0}{\bm{s}}_{0})\prod _{t=1}^{T1}\underset{\text{specs vector and diff. solver}}{\underset{\u23df}{{\mathcal{P}}_{\mathit{\varphi}}({\bm{s}}_{t+1}{\bm{s}}_{t},{\bm{a}}_{t})}}\underset{\text{deep network}}{\underset{\u23df}{{\pi}_{\bm{\theta}}({\bm{a}}_{t}{\bm{s}}_{t})}}.$$ 
At this stage, it should be clear that our formulation, though inspired from robust optimal control, is, truthfully, more generic as it allows for parameterised classes of transition models without incorporating additional restrictions on the structure or the scope by which variations are executed^{3}^{3} 3 Of course, allowed perturbations are ultimately constrained by the hypothesis space. Even then, our model is more general compared to robust optimal control that assumes additive, multiplicative, or other forms of disturbances..
Constraints & Complete Problem Definition:
Clearly, the problem in Equation 5 is illdefined due to the arbitrary class of parameterised transitions. To ensure wellbehaved optimisation objectives, we next introduce constraints to bound search spaces and ensure convergence to feasible transition models. For a valid constraint set, our method assumes access to samples from a reference dynamics model ${\mathcal{P}}_{0}(\cdot \bm{s},\bm{a})$, and bounds learnt transitions in an $\u03f5$Wasserstein ball around ${\mathcal{P}}_{0}(\cdot \bm{s},\bm{a})$, i.e., the set defined as:
$${\mathcal{W}}_{\u03f5}({\mathcal{P}}_{\mathit{\varphi}}(\cdot ),{\mathcal{P}}_{0}(\cdot ))=\{{\mathcal{P}}_{\mathit{\varphi}}(\cdot ):{\mathcal{W}}_{2}^{2}({\mathcal{P}}_{\mathit{\varphi}}(\cdot \bm{s},\bm{a}),{\mathcal{P}}_{0}(\cdot \bm{s},\bm{a}))\le \u03f5,\forall (\bm{s},\bm{a})\in \mathcal{S}\times \mathcal{A}\},$$  (6) 
where $\u03f5\in {\mathbb{R}}_{+}$ is a hyperparameter used to specify the “degree of robustness” in a similar spirit to maximum norm bounds in robust optimal control. It is worth noting, that though we have access to samples from a reference simulator, our setting is by no means restricted to modelbased reinforcement learning in an MDP setting. That is, our algorithm operates successfully given only traces from ${\mathcal{P}}_{0}$ accompanied with its specification parameters, e.g., pole lengths, torso masses, etc. – a more flexible framework that does not require full model learners.
Though defining a better behaved optimisation objective, the set in Equation 6 introduces infinite number of constraints when considering continuous state and/or actions spaces. To remedy this problem, we target a relaxed version that considers average Wasserstein constraints instead:
${\widehat{\mathcal{W}}}_{\u03f5}^{(\text{average})}({\mathcal{P}}_{\mathit{\varphi}}(\cdot ),{\mathcal{P}}_{0}(\cdot ))$  $=\{{\mathcal{P}}_{\mathit{\varphi}}(\cdot ):{\displaystyle {\int}_{(\bm{s},\bm{a})}}\mathcal{P}(\bm{s},\bm{a}){\mathcal{W}}_{2}^{2}({\mathcal{P}}_{\mathit{\varphi}}(\cdot \bm{s},\bm{a}),{\mathcal{P}}_{0}(\cdot \bm{s},\bm{a}))d(\bm{s},\bm{a})\le \u03f5\}$  (7)  
$=\{{\mathcal{P}}_{\mathit{\varphi}}(\cdot ):{\mathbb{E}}_{(\bm{s},\bm{a})\sim \mathcal{P}}\left[{\mathcal{W}}_{2}^{2}({\mathcal{P}}_{\mathit{\varphi}}(\cdot \bm{s},\bm{a}),{\mathcal{P}}_{0}(\cdot \bm{s},\bm{a}))\right]\le \u03f5\}$ 
In words, Equation 7 defines a set where expected Wasserstein distance is bounded by $\u03f5$. Expectation in the above equation is evaluated over stateaction pairs sampled according to $\mathcal{P}(\bm{s},\bm{a})$ which is policy and transitionmodel dependent. Precisely, one can factor $\mathcal{P}(\bm{s},\bm{a})$ as:
$\mathcal{P}(\bm{s}\in \mathcal{S},\bm{a}\in \mathcal{A})=\mathcal{P}(\bm{a}\in \mathcal{A}\bm{s}\in \mathcal{S})\mathcal{P}(\bm{s}\in \mathcal{S})=\pi (\bm{a}\in \mathcal{A}\bm{s}\in \mathcal{S}){\rho}_{\pi}^{{\mathit{\varphi}}_{0}}(\bm{s}\in \mathcal{S}),$ 
where $\bm{s}\in \mathcal{S}$ and $\bm{a}\in \mathcal{A}$ are to be interpreted as events being elements of the state and action sets – a notation better suites for the continuous nature of the considered random variables. Moreover, ${\rho}_{\pi}^{{\mathit{\varphi}}_{0}}(\bm{s}\in \mathcal{S})$ is a uniform distribution over stateactions pairs sampled from a trajectory. Precisely, the way we compute the expected Wasserstein distance is two steps. In the first, given a batch of trajectories sampled according to any policy $\pi $, potentially of varying lengths, we create a bucket of stateaction pairs^{4}^{4} 4 Since we do not require access to current policies in order to compute the average Wasserstein distance, an argument can be made that ${\text{WR}}^{2}\text{L}$ support offpolicy constraint evaluation. This is an important link that we plan to further exploit in the future to improve efficiency, scalablity, and enable transfer between various tasks. . Given such data, we then compute expected Wasserstein distance over the pairs using MonteCarlo estimation. The $\pi $ we use is one which samples actions uniformly at random.
With this in mind, we arrive at ${\text{WR}}^{2}\text{L}$’s optimisation problem allowing for best policies under worstcase yet bounded transition models:
[enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] Wasserstein Robust Reinforcement Learning Objective:
$\underset{\bm{\theta}}{\mathrm{max}}\left[\underset{\mathit{\varphi}}{\mathrm{min}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]\right]\mathit{\hspace{1em}}\text{s.t.}\mathit{\hspace{1em}}{\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\mathit{\varphi}}_{0}}(\cdot )}\left[{\mathcal{W}}_{2}^{2}({\mathcal{P}}_{\mathit{\varphi}}(\cdot \bm{s},\bm{a}),{\mathcal{P}}_{0}(\cdot \bm{s},\bm{a}))\right]\le \u03f5$  (8) 
3.2 Solution Methodology
Having derived our formal problem definition, this section presents our approach to solving for $\bm{\theta}$ and $\mathit{\varphi}$ in the objective of Equation 8. On a high level, our solution methodology follows an alternating procedure interchangeably updating one variable given the other fixed.
Updating Policy Parameters:
It is clear from Equation 8 that the average Wasserstein distance constraint is independent from $\bm{\theta}$ and can, in fact, use any other policy $\pi $ to estimate the expectation. Hence, given a fixed set of modelparameters, $\bm{\theta}$ can be updated by solving the relevant subproblem of Equation 8 written as:
$$\underset{\bm{\theta}}{\mathrm{max}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\varphi}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right].$$ 
Interestingly, this problem is a standard reinforcement learning one with a crucial difference in that traces are sampled according to the transition model given by fixed modelparameters, $\mathit{\varphi}$ that ultimately differ from these of the original simulator ${\mathit{\varphi}}_{0}$. Consequently, one can easily adapt any policy search method for updating policies under fixed dynamical models. As described later in Section 5, we make use of proximal policy optimisation (Schulman et al., 2017), for instance, to update such actionselectionrules.
Updating Model Parameters:
Now, we turn our attention to solving the average constraint optimisation problem needed for updating $\mathit{\varphi}$ given a set of fixed policy parameters $\bm{\theta}$. Contrary to the previous step, here, the Wasserstein constraints play an important role due to their dependence on $\mathit{\varphi}$. Unfortunately, even with the simplification introduced in Section 3.1 the resultant constraint is still difficult to computer in general, the difficulty being the evaluation of the Wasserstein term^{5}^{5} 5 The situation is easier in case of two Gaussian densities. Here, however, we keep the treatment general by proposing an alternative direction using Taylor expansions..
To alleviate this problem, we propose to approximate the constraint in (8) by its Taylor expansion up to second order. That is, defining
$$W(\mathit{\varphi}):={\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\mathit{\varphi}}_{0}}(\cdot )}\left[{\mathcal{W}}_{2}^{2}({\mathcal{P}}_{\mathit{\varphi}}(\cdot \bm{s},\bm{a}),{\mathcal{P}}_{0}(\cdot \bm{s},\bm{a}))\right]$$ 
The above can be approximated around ${\mathit{\varphi}}_{0}$ by a secondorder Taylor as:
$$W(\mathit{\varphi})\approx W({\mathit{\varphi}}_{0})+{\nabla}_{\mathit{\varphi}}W{({\mathit{\varphi}}_{0})}^{\U0001d5b3}\left(\mathit{\varphi}{\mathit{\varphi}}_{0}\right)+\frac{1}{2}{\left(\mathit{\varphi}{\mathit{\varphi}}_{0}\right)}^{\U0001d5b3}{\nabla}_{\mathit{\varphi}}^{2}W({\mathit{\varphi}}_{0})\left(\mathit{\varphi}{\mathit{\varphi}}_{0}\right).$$ 
Recognising that $W({\mathit{\varphi}}_{0})=0$ (the distance between the same probability densities), and ${\nabla}_{\mathit{\varphi}}W({\mathit{\varphi}}_{0})=0$ since ${\mathit{\varphi}}_{0}$ minimises $W(\mathit{\varphi})$, we can simplify the Hessian approximation by writing:
$$W(\mathit{\varphi})\approx \frac{1}{2}{(\mathit{\varphi}{\mathit{\varphi}}_{0})}^{\U0001d5b3}{\nabla}_{\mathit{\varphi}}^{2}W({\mathit{\varphi}}_{0})(\mathit{\varphi}{\mathit{\varphi}}_{0}).$$ 
Substituting our approximation back in the original problem in Equation 8, we reach the following optimisation problem for determining model parameter given fixed policies:
$$\underset{\mathit{\varphi}}{\mathrm{min}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]\mathit{\hspace{1em}}\text{s.t.}\mathit{\hspace{1em}}\frac{1}{2}{(\mathit{\varphi}{\mathit{\varphi}}_{0})}^{\U0001d5b3}{\bm{H}}_{0}(\mathit{\varphi}{\mathit{\varphi}}_{0})\le \u03f5,$$  (9) 
where ${\bm{H}}_{0}={\nabla}_{\mathit{\varphi}}^{2}{\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\mathit{\varphi}}_{0}}(\cdot )}\left[{\mathcal{W}}_{2}^{2}({\mathcal{P}}_{\mathit{\varphi}}(\cdot \bm{s},\bm{a}),{\mathcal{P}}_{0}(\cdot \bm{s},\bm{a}))\right]{}_{\mathit{\varphi}={\mathit{\varphi}}_{0}}$ is the Hessian of the expected squared $2$Wasserstein distance evaluated at ${\mathit{\varphi}}_{0}$.
Optimisation problems with quadratic constraints can be efficiently solved using interiorpoint methods. To do so, one typically approximates the loss with a firstorder expansion and determines a closedform solution. Consider a fixed ${\bm{\theta}}^{[k]}$ and ${\mathit{\varphi}}^{[k]}$ at some iteration $k$. To find ${\mathit{\varphi}}^{[k+1]}$, we solve:
$${\underset{\mathit{\varphi}}{\mathrm{min}}{\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]}_{{\bm{\theta}}^{[k]},{\mathit{\varphi}}^{[k]}}^{\U0001d5b3}(\mathit{\varphi}{\mathit{\varphi}}_{0})\mathit{\hspace{1em}}\text{s.t.}\mathit{\hspace{1em}}\frac{1}{2}{(\mathit{\varphi}{\mathit{\varphi}}_{0})}^{\U0001d5b3}{\bm{H}}_{0}(\mathit{\varphi}{\mathit{\varphi}}_{0})\le \u03f5.$$ 
It is easy to show that a minimiser to the above equation can derived in a closedform as:
${\mathit{\varphi}}^{[k+1]}={\mathit{\varphi}}_{0}\sqrt{{\displaystyle \frac{2\u03f5}{{\bm{g}}^{[k],\U0001d5b3}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]}}}}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]},$  (10) 
with ${\bm{g}}^{[k]}$ denoting the gradient^{6}^{6} 6 Remark: Although this looks superficially similar to an approximation made in TRPO Schulman et al. (2015a), the latter aims to optimise the policy parameter rather than dynamics. Furthermore, the constraint is based on the KullbackLeibler divergence rather than the Wasserstein distance evaluated at ${\bm{\theta}}^{[k]}$ and ${\mathit{\varphi}}^{[k]}$, i.e., ${\bm{g}}^{[k]}={{\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\mathbb{E}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]}_{{\bm{\theta}}^{[k]},{\mathit{\varphi}}^{[k]}}$.
Generic Algorithm:
Having described the two main steps needed for updating policies and models, we now summarise these findings in the pseudocode in Algorithm 1.
$${\bm{p}}^{[j]}\leftarrow {\mathit{\varphi}}_{0}\sqrt{\frac{2\u03f5}{{\bm{g}}^{[k],\U0001d5b3}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]}}}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]}{\bm{x}}^{[j]}$$ 
$${\bm{x}}^{[j+1]}\leftarrow {\bm{x}}^{[j]}+\alpha {\bm{p}}^{[j]}$$ 
As the Hessian^{7}^{7} 7 Please note our algorithm does not need to store the Hessian matrix. In line 7 of the algorithm, it is clear that we require Hessianvector products. These can be easily computed using computational graphs without having access to the full Hessian matrix. of the Wasserstein distance is evaluated based on reference dynamics and any policy $\pi $, we pass it, along with $\u03f5$ and ${\mathit{\varphi}}_{0}$ as inputs. Then Algorithms 1 operates in a descentascent fashion in two main phases. In the first, lines 5 to 10 in Algorithm 1, dynamics parameters are updated using the closedform solution in Equation 10, while ensuring that learning rates abide by the Wolfe conditions (Wolfe (1969)). With this, the second phase (line 11) utilises any stateoftheart reinforcement learning method to adapt policy parameters generating ${\bm{\theta}}^{[k+1]}$.
In the next section we provide theoretical guarantees of Algorithm 1. Namely, we show that the inner loop, when idealised, converges to a stationary point in model parameters. Given these results, we show that the update in line 12 of Algorithm 1 amounts a valid gradientstep. This, in turn, reduces the second step of the proof to a standard gradientbased one with an objective only parameterised through policies.
4 Theoretical Guarantees
In this section, we discuss and prove theoretical properties related to Algorithm 1. The algorithm consists of an inner loop which iterates through a sequence of dynamics parameters, and an outer look which updates policies parameters.
In order to analyse the process, we make the following assumptions:

(A1)
$\mathit{\varphi}\in \mathbf{\Phi}\subset {\mathbb{R}}^{{d}_{2}}$, and $\mathbf{\Phi}$ is compact .

(A2)
For any given $\bm{\theta}$, ${\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]$ as a function of $\mathit{\varphi}$ is continuous on $\mathbf{\Phi}$ and continuously differentiable on $\mathrm{int}(\mathbf{\Phi})$.

(A3)
${\bm{H}}_{0}$ is positive definite.

(A4)
For any given $\bm{\theta}$, ${\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]$ as a function of $\mathit{\varphi}$ is Lipschitz continuous on $\mathrm{int}(\mathbf{\Phi})$.
The need for these assumptions will be made clear, but in essence, they are needed for the analysis below which shows convergence of iterates of the inner loop and that the computation of the gradient step in line 1 in Algorithm 1 is a valid gradient update.
Consider the following function and optimisation problem:
Fixing $\bm{\theta}$,
$$\underset{\mathit{\varphi}}{\mathrm{min}}\mathit{\hspace{1em}}{F}_{\bm{\theta}}(\mathit{\varphi})={\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]+{\mathrm{bar}}_{t}\left(\frac{1}{2}{(\mathit{\varphi}{\mathit{\varphi}}_{0})}^{\U0001d5b3}{\bm{H}}_{0}(\mathit{\varphi}{\mathit{\varphi}}_{0})\u03f5\right).$$  (11) 
Here, ${\mathrm{bar}}_{t}$ is a barrier function parameterised by $t\in \mathbb{R}\cup \{\mathrm{\infty}\}$. A standard barrier function in interior point optimisation methods  and one which we take here  is
$${\mathrm{bar}}_{t}(x)=\frac{1}{t}\mathrm{log}(x)$$  (12) 
for $$. As $t\to \mathrm{\infty}$, (12) approaches $\mathrm{\infty}\cdot {\mathrm{\U0001d7cf}}_{\{x=0\}}$. The barrier term in (11) is only defined for values of $\mathit{\varphi}$ for which $$, and when $t=\mathrm{\infty}$, (11) is equivalent to (9). A solution to (9) can be approximated arbitrarily closely by making $t$ large enough.
Consider line 1 in Algorithm 1 where ${\bm{p}}^{[j]}$ is computed. We may consider a modification like so:
$${\bm{p}}^{[j]}\leftarrow \left({\mathit{\varphi}}_{0}\sqrt{\frac{2\u03f5}{{\bm{g}}^{[k],\U0001d5b3}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]}}}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]}{\bm{x}}^{[j]}\right){\nu}_{\sigma}{\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}({\bm{x}}^{[j]})$$  (13) 
where ${\nu}_{\sigma}\ge 0$ is a nonnegative real number which ensures ${\bm{p}}^{[j]}$ is a decent direction for ${F}_{\bm{\theta}}$ at ${\bm{x}}^{[j]}$. Such a ${\nu}_{\sigma}$ always exists when $\parallel {\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}({\bm{x}}^{[j]})\parallel >0$, by taking ${\nu}_{\sigma}$ large enough. Indeed, we do not require ${\nu}_{\sigma}$ to be merely a descent direction, but a $\sigma $strict descent direction: Let $\sigma \in (0,1]$ be a constant (viewed as a hyperparameter of the algorithm) and define ${\upsilon}_{j}^{({\nu}_{\sigma})}$ by
$$\mathrm{cos}{\upsilon}_{j}^{({\nu}_{\sigma})}:=\frac{{\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}{({\bm{x}}^{[j]})}^{T}{\bm{p}}^{[j]}}{\parallel {\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}({\bm{x}}^{[j]})\parallel \parallel {\bm{p}}^{[j]}\parallel}$$  (14) 
for ${\bm{p}}^{[j]}$ defined by (13). Now we choose ${\nu}_{\sigma}=\mathrm{min}\{\nu :\mathrm{cos}{\upsilon}_{j}^{(\nu )}\ge \sigma \}$.
The reasons for this criterion will be made clearer later, but one of the implications will be that if $t$ (which can be treated as a hyperparameter) is set to $t=\mathrm{\infty}$, then ${\nu}_{\sigma}=0$ for any point ${\bm{x}}^{[j]}$ in the interior of the constraint set. Indeed, Algorithm 1 can be viewed as implementing (13) with hyper parameter $t=\mathrm{\infty}$.
In line (1), $\alpha $ is chosen per the Wolfe conditions. In this context, the Wolfe conditions are as follows: For ${\bm{p}}^{[j]}$ a descent direction for ${F}_{\bm{\theta}}$ at ${\bm{x}}^{[j]}$, and constants $$,
${F}_{\bm{\theta}}({\bm{x}}^{[j]}+\alpha {\bm{p}}^{[j]})$  $\le {F}_{\bm{\theta}}({\bm{x}}^{[j]})+{c}_{1}\alpha {\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}{({\bm{x}}^{[j]})}^{T}{\bm{p}}^{[j]}$  (15a)  
${\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}{({\bm{x}}^{[j]}+\alpha {\bm{p}}^{[j]})}^{T}{\bm{p}}^{[j]}$  $\ge {c}_{2}{\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}{({\bm{x}}^{[j]})}^{T}{\bm{p}}^{[j]}$  (15b) 
Condition (15a) is also known as the Armijo condition.
For notational convenience, define for a positive number $a$, the constraint set
$${\mathcal{C}}_{a}:=\{\mathit{\varphi}\in {\mathbb{R}}^{{d}_{2}}:\frac{1}{2}{(\mathit{\varphi}{\mathit{\varphi}}_{0})}^{\U0001d5b3}{\bm{H}}_{0}(\mathit{\varphi}{\mathit{\varphi}}_{0})\le a\}$$  (16) 
We show below that there is always an $\alpha $ that satisfies (15a)– (15b) when $$. We then use this fact to show that iterates ${({\bm{x}}^{[j]})}_{j}$ determined in this way converge to a stationary point of ${F}_{\bm{\theta}}$ in the interior of the constraint set ${\mathcal{C}}_{\u03f5}$.
Lemma 1.
Suppose $$ and Assumptions (A1)–(A4) hold. Let $\mathrm{L}\mathrm{:=}\mathrm{\{}\mathbf{\varphi}\mathrm{\in}\mathrm{\Phi}\mathrm{:}{F}_{\mathbf{\theta}}\mathit{}\mathrm{(}\mathbf{\varphi}\mathrm{)}\mathrm{\le}{F}_{\mathbf{\theta}}\mathit{}\mathrm{(}{\mathbf{\varphi}}_{\mathrm{0}}\mathrm{)}\mathrm{\}}$.
Then for any given $\mathbf{\theta}$, the following hold:

(i)
${F}_{\bm{\theta}}$ is bounded from below on $\mathbf{\Phi}$ and continuously differentiable on $\mathrm{int}(\mathbf{\Phi})$.

(ii)
There exists an $$ such that $\mathcal{L}\subseteq \mathcal{N}:=\mathrm{int}({\mathcal{C}}_{\u03f5(\bm{\theta})})$.

(iii)
${\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}$ is Lipschitz continuous on $\mathcal{N}$.
 (iv)
Proof.
(i) Since $\mathbf{\Phi}$ is compact (Assumption (A1)) and ${\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]$ is continuous on $\mathbf{\Phi}$ (Assumption (A2)), it follows that ${\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]$ is bounded from below on $\mathbf{\Phi}$. This implies that ${F}_{\bm{\theta}}$ is bounded from below on $\mathbf{\Phi}$.
Since ${\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]$ is continuously differentiable on $\mathrm{int}(\mathbf{\Phi})$ (Assumption (A2)), and the chosen barrier function (12) is continuously differentiable on $(\mathrm{\infty},0)$, it follows ${F}_{\bm{\theta}}$ is continuously differentiable on $\mathrm{int}(\mathbf{\Phi})$.
(ii) Since ${\bm{H}}_{0}$ is positive definite (Assumption (A3)), we have $$ (strictly). The barrier function (12) forces ${F}_{\bm{\theta}}$ to go to $+\mathrm{\infty}$ at the boundary of ${\mathcal{C}}_{\u03f5}$, and per the proof of part (i) above, ${\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]$ is bounded from below on $\mathbf{\Phi}$. This means that there is an $$ such that ${F}_{\bm{\theta}}(\mathit{\varphi})>{F}_{\bm{\theta}}({\mathit{\varphi}}_{0})$ for any $\mathit{\varphi}\in {\mathcal{C}}_{\u03f5}\setminus {\mathcal{C}}_{\u03f5(\bm{\theta})}$. For such an $\u03f5(\bm{\theta})$, we have $\mathcal{L}\subseteq \mathrm{int}({\mathcal{C}}_{\u03f5(\bm{\theta})})$
(iii) Clearly, (12) is Lipschitz continuous when $x$ is bounded away from zero. In conjunction with Assumption (A4), we see that ${\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}$ is Lipschitz continuous on $\mathcal{N}$.
(iv) Since $\bm{p}$ is a descent direction, $$, ${F}_{\bm{\theta}}$ is bounded from below on $\mathbf{\Phi}$ and continuously differentiable on $\mathrm{int}(\mathbf{\Phi})$ (from part (i) above), there must exist an $\stackrel{~}{\alpha}>0$ such that ${F}_{\bm{\theta}}(\bm{x}+\stackrel{~}{\alpha}\bm{p})={F}_{\bm{\theta}}(\bm{x})+{c}_{1}\stackrel{~}{\alpha}{\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}{(\bm{x})}^{T}\bm{p}$ and any $\alpha \in I:=[0,\stackrel{~}{\alpha}]$ satisfies (15a). By virtue of satisfying (15a), $\bm{x}+\alpha \bm{p}\in \mathcal{L}$ for any $\alpha \in I$.
Regarding condition (15b), observe that by the Mean Value Theorem, there is some ${\alpha}_{1}\in I$ such that ${\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}{(\bm{x}+{\alpha}_{1}\bm{p})}^{T}\bm{p}={c}_{1}{\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}{(\bm{x})}^{T}\bm{p}$. Since $\bm{p}$ is a descent direction and $$, there is, by the continuous differentiability of ${F}_{\bm{\theta}}$, some $0\le {\alpha}_{2}\le {\alpha}_{1}$ such that ${\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}(\bm{x}+{\alpha}_{2}\bm{p})={c}_{2}{\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}{(\bm{x})}^{T}\bm{p}$. Furthermore, ${\alpha}_{2}\in I$ since ${\alpha}_{2}\le {\alpha}_{1}$ and ${\alpha}_{1}\in I$. Therefore, there is a subinterval of $I$ for which both Wolfe conditions, (15a) and (15b) are satisfied. ∎
The above gives us the following:
Corollary 2.
Proof.
We apply a result of Zoutendijk, reproduced as Proposition 7 in Appendix C from Nocedal and Wright (2006).
By Lemma 1, the conditions of Proposition 7 are satisfied for ${F}_{\bm{\theta}}$ with $\mathcal{N}$ defined in the lemma. We have $\mathrm{cos}{\upsilon}_{j}^{{\nu}_{\sigma}}\ge \sigma $, therefore, per the conclusion of the aforementioned proposition, we conclude ${lim}_{j\to \mathrm{\infty}}\parallel {\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}({\bm{x}}^{[j]})\parallel =0$.
Note, that whilst the statement of Proposition 7 does not impose the condition that $\mathit{\varphi}\in \mathbf{\Phi}$ for the sublevel set $\mathcal{L}$, the result is still applicable as we showed in part (iv) that there are always choices of $\alpha $ that can be made such that iterates that start in $\mathcal{L}$ remain in $\mathcal{L}$, and that $\mathcal{L}\subseteq \mathcal{N}$.
∎
Let ${\stackrel{~}{\bm{p}}}^{[j]}$ be the bracketed term in (13) (so ${\bm{p}}^{[j]}={\stackrel{~}{\bm{p}}}^{[j]}{\nu}_{\sigma}{\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}({\bm{x}}^{[j]})$). Let ${\bm{x}}^{[j]}\in \mathrm{int}({\mathcal{C}}_{\u03f5})$ and for ease of notation, let ${\bm{g}}_{j}={{\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]}_{\mathit{\varphi}={\bm{x}}^{[j]}}$. Since ${\stackrel{~}{\bm{p}}}^{[j]}$ was derived from (10), it is a descent direction for ${\bm{g}}_{j}$. Let $D$ be the diameter of the constraint set ${\mathcal{C}}_{\u03f5}$, then $\parallel {\stackrel{~}{\bm{p}}}^{[j]}\parallel \le D$. Let ${\stackrel{~}{\upsilon}}_{j}$ be the angle between ${\stackrel{~}{\bm{p}}}^{[j]}$ and ${\bm{g}}_{j}$. Then $\parallel {\stackrel{~}{\bm{p}}}^{[j]}\parallel \mathrm{cos}{\stackrel{~}{\upsilon}}_{j}\ge {\mathrm{\ell}}_{j}$ where ${\mathrm{\ell}}_{j}$ is the distance from ${\bm{x}}^{[j]}$ to the boundary of ${\mathcal{C}}_{\u03f5}$ along the direction ${\bm{g}}_{j}$. Thus, $\mathrm{cos}{\stackrel{~}{\upsilon}}_{j}\ge {\mathrm{\ell}}_{j}/D$. Therefore, for any given ${\bm{x}}^{[j]}\in \mathrm{int}({\mathcal{C}}_{\u03f5})$, ${\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}({\bm{x}}^{[j]})\to {\bm{g}}_{j}$ as $t\to \mathrm{\infty}$, implying that ${\upsilon}_{\sigma}\to 0$ if $\sigma \le {\mathrm{\ell}}_{j}/D$. Furthermore, fixing $t$ and generating a sequence ${({\bm{x}}^{[j]})}_{j\ge 0}$ from the inner loop (Phase I) of Algorithm 1 using (13), we see that due to the barrier function, there is some lower bound $\mathrm{\ell}=inf{({\mathrm{\ell}}_{j})}_{j\ge 0}$, and so there is an upper bound on ${\upsilon}_{\sigma}$.
Finally, consider the special case where $t=\mathrm{\infty}$. Then ${\nu}_{\sigma}=0$, but we can only guarantee to satisfy the the first of the Wolfe conditions, the Armijo condition (15a), and therefore, cannot use the above proof for convergence of to a stationary point of ${({\bm{x}}^{[j]})}_{j\ge 0}$.
Now we continue to consider the convergence of the algorithm for $$. We define a mapping ${\mathit{\varphi}}^{*}:\mathbf{\Theta}\to \mathbf{\Phi}$ as follows: ${\mathit{\varphi}}^{*}(\bm{\theta}):={lim}_{j\to \mathrm{\infty}}{\bm{x}}^{[j]}$ when Phase I of Algorithm 1 is executed with ${\bm{x}}^{[0]}={\mathit{\varphi}}^{[0]}$ (as it always is), and with $\bm{\theta}$ as the policy parameter. Thus, if the for loop in Phase I were to run for sufficiently many iterations, then after line 1, we would have ${\mathit{\varphi}}^{k+1}={\mathit{\varphi}}^{*}({\bm{\theta}}^{[k]})$. For notational convenience, we use the shorthand ${\mathit{\varphi}}_{k+1}^{*}={\mathit{\varphi}}^{*}({\bm{\theta}}^{[k]})$.
Recalling that ${F}_{\bm{\theta}}(\mathit{\varphi})$ is a function of both $\mathit{\varphi}$ and $\bm{\theta}$, we can define a function $G(\bm{\theta}):={F}_{\bm{\theta}}({\mathit{\varphi}}^{*}(\bm{\theta}))$.
The point of the above definitions is this: If the assumptions made in Theorem 3 (see below) hold, then Algorithm 1 will be generating a sequence of policy parameters ${({\bm{\theta}}^{[k]})}_{k\ge 0}$ through gradient ascent of the function $G$:
$${\bm{\theta}}^{[k+1]}\leftarrow {\bm{\theta}}^{[k]}+{\beta}^{[k]}{\nabla}_{\bm{\theta}}G({\bm{\theta}}^{[k]})$$ 
Getting back to the original motivation, we wished to solve the maxmin problem (8). As discussed, this problem is intractable in part because the minimisation is over a nonconvex function of $\mathit{\varphi}$. Thus, in Algorithm 1 we opted instead to determine $\mathit{\varphi}$ which was (close to) a stationary point of the (barrieraugmented) objective. This is represented by the function $G$; specifically, $G({\bm{\theta}}^{[k]})$ is the value of the (barrieraugmented) objective at the stationary point.
But then, what of the maximisation? Ideally, we would like to solve ${\mathrm{max}}_{\bm{\theta}}G(\bm{\theta})$ as a substitute for the maxmin problem (8), but again, due to the nonconcavity of the objective with respect to $\bm{\theta}$, maximisation is generally intractable, and we opt instead to converge to a ${\bm{\theta}}^{*}$ that is a stationary point of $G$. The fact that the algorithm does this (and the sufficient conditions for it to do so) is shown in the following theorem.
Theorem 3.
Suppose $$ and each sequence ${\mathrm{(}{\mathbf{x}}^{\mathrm{[}j\mathrm{]}}\mathrm{)}}_{j\mathrm{\ge}\mathrm{0}}$ generated in Phase I of Algorithm 1 is with assignment (13) applied in line 1. If ${\mathbf{\varphi}}^{\mathrm{[}k\mathrm{+}\mathrm{1}\mathrm{]}}\mathrm{=}{\mathrm{lim}}_{j\mathrm{\to}\mathrm{\infty}}\mathit{}{\mathbf{x}}^{\mathrm{[}j\mathrm{]}}$ in line 1 of Algorithm 1, then the ${\mathbf{\theta}}^{\mathrm{[}k\mathrm{]}}$ parameter update in line 1 (which is implemented with $\mathbf{\varphi}$ set to ${\mathbf{\varphi}}^{\mathrm{[}k\mathrm{+}\mathrm{1}\mathrm{]}}$) is a gradient ascent step with respect to $G$ at ${\mathbf{\theta}}^{\mathrm{[}k\mathrm{]}}$.
Proof.
We have
$${\nabla}_{\bm{\theta}}G({\bm{\theta}}^{[k]})={{\nabla}_{\bm{\theta}}{F}_{\bm{\theta}}}_{\bm{\theta}={\bm{\theta}}^{[k]},\mathit{\varphi}={\mathit{\varphi}}_{k+1}^{*}}+{\bm{J}({\bm{\theta}}^{[k]}){\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}}_{\bm{\theta}={\bm{\theta}}^{[k]},\mathit{\varphi}={\mathit{\varphi}}_{k+1}^{*}}$$  (17) 
where $\bm{J}({\bm{\theta}}^{[k]})$ is the Jacobian of the function ${\mathit{\varphi}}^{*}$ evaluated at ${\bm{\theta}}^{[k]}$.
The RHS will be equal to ${{\nabla}_{\bm{\theta}}{F}_{\bm{\theta}}}_{\bm{\theta}={\bm{\theta}}^{[k]},\mathit{\varphi}={\mathit{\varphi}}_{k+1}^{*}}$ if ${{\nabla}_{\mathit{\varphi}}{F}_{\bm{\theta}}}_{\bm{\theta}={\bm{\theta}}^{[k]},\mathit{\varphi}={\mathit{\varphi}}_{k+1}^{*}}=0$. But this is indeed the implication of Corollary 2 if the sequence ${({\bm{x}}^{[j]})}_{j\ge 0}$ is left to run to convergence.
Thus, we have
$${\nabla}_{\bm{\theta}}G({\bm{\theta}}^{[k]})={{\nabla}_{\bm{\theta}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]}_{\bm{\theta}={\bm{\theta}}^{[k]},\mathit{\varphi}={\mathit{\varphi}}_{k+1}^{*}}$$ 
and observe the correspondence to the term in line 1 in Algorithm 1: Indeed, they will be exactly the same if the assumptions in the statement of the Theorem hold.
Consequently, the ${\bm{\theta}}^{[k]}$ parameter update in line 1 in Algorithm 1 (which is implemented with $\mathit{\varphi}$ fixed to ${\mathit{\varphi}}^{[k+1]}$) is a valid gradient step if ${\mathit{\varphi}}^{[k+1]}={lim}_{j\to \mathrm{\infty}}{\bm{x}}^{[j]}$.
∎
Consequently, we have the following, which follows by standard results from the numerical optimisation literature (see, e.g., Nocedal and Wright (2006))
Corollary 4.
If $G$ is bounded from above and has Lipschitzcontinuous first derivatives, then the sequence of step sizes ${\mathrm{(}{\beta}^{\mathrm{[}k\mathrm{]}}\mathrm{)}}_{k\mathrm{\ge}\mathrm{0}}$ can be chosen so as to ensure that the sequence ${\mathrm{(}{\mathbf{\theta}}^{\mathrm{[}k\mathrm{]}}\mathrm{)}}_{k\mathrm{\ge}\mathrm{0}}$ will converge to a stationary point of $G$.
Of course, in practice, we may have to terminate Phase I before the sequence has converged, in which case the update in line 1 is an approximate gradient update which can be made to have an error as small as desired by terminating Phase I as late as is needed.
5 Zero’th Order Wasserstein Robust Reinforcement Learning
So far, we have presented an algorithm for robust reinforcement learning assuming accessibility to first and secondorder information of the loss and constraint. It is relatively easy to attain such information if we were to follow a modelbased setting that parameterises transitions with deep networks. As mentioned earlier, however, we follow another route that utilises a blackbox optimisation scheme as deep neural networks are not always suitable for dynamical systems grounded in Lagrangian mechanics and physics (Lutter et al., 2019; Lutter and Peters, 2019).
This section details our zeroorder robust solver, where we present an implementation of our approach for a scenario where training can be done on a simulator for which the dynamics are parameterised and can be altered at will. Whilst we refer to simulators (since much of RL training is done on such), almost exactly the same applies to differential equation solvers or other software based techniques for training a policy before deployment into the real world.
To elaborate, consider a simulator ${\mathbb{S}}_{\mathit{\varphi}}$ for which the dynamics are parameterised by a real vector $\mathit{\varphi}$, and for which we can execute steps of a trajectory (i.e., the simulator takes as input an action $\bm{a}$ and gives back a successor state and reward). For generating novel physicsgrounded transitions, one can simply alter $\mathit{\varphi}$ and execute the instruction in ${\mathbb{S}}_{\mathit{\varphi}}$ from some a state $\bm{s}\in \mathcal{S}$, while applying an action $\bm{a}\in \mathcal{A}$. Not only does this ensure valid (under mechanics) transitions, but also promises scalability as specification parameters typically reside in lower dimensional spaces compared to the number of tuneable weights when using deep networks as transition models.
As we do not explicitly model transitions (e.g., the intricate operations of the simulator or differential equations solver), one has to tackle an additional challenge when requiring gradient or Hessian information to perform optimisation. Namely, if the idea of parameterising simulators through dynamic specifications in Phases I and II of Algorithm 1 is to be successfully executed, we require a procedure for estimating first and secondorder information based on only function value evaluations of ${\mathbb{S}}_{\mathit{\varphi}}$.
Next, we elaborate how one can acquire such estimates by proposing a novel zeroorder method for estimating gradients and Hessians that we use in our experiments that demonstrate scalability, and robustness on highdimensional robotic environments.
Gradient Estimation:
Recalling the update rule in Phase I of Algorithm 1, we realise the need for, estimating the gradient of the loss function with respect to the vector specifying the dynamics of the environment, i.e., ${\bm{g}}^{[k]}={{\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]}_{{\bm{\theta}}^{[k]},{\mathit{\varphi}}^{[k]}}$ at each iteration of the innerloop $j$. Handling simulators as blackbox models, we estimate the gradients by sampling from a Gaussian distribution with mean $\mathrm{\U0001d7ce}$ and ${\sigma}^{2}\bm{I}$ covariance matrix. Our choice for such estimates is not arbitrary but rather theoretically grounded as one can easily prove the following proposition:
Proposition 5 (ZeroOrder Gradient Estimate).
For a fixed $\mathbf{\theta}$ and $\mathbf{\varphi}$, the gradient can be computed as:
$${\bm{g}}^{[k]}={\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{\mathit{t}\mathit{o}\mathit{t}\mathit{a}\mathit{l}}}(\bm{\tau})\right]=\frac{1}{{\sigma}^{2}}{\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}\left[\bm{\xi}{\int}_{\bm{\tau}}{p}_{\bm{\theta}}^{\mathit{\varphi}+\bm{\xi}}(\bm{\tau}){\mathcal{R}}_{\text{\mathit{t}\mathit{o}\mathit{t}\mathit{a}\mathit{l}}}(\bm{\tau})\mathit{d}\bm{\tau}\right].$$ 
Proof.
The proof of the above proposition can easily be derived by combining the lines of reasoning in (Salimans et al., 2017; Nesterov, 2011), while extending to the parameterisation of dynamical models. To commence, begin by defining ${\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi})={\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\varphi}}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]$ for fixed policy parameters $\bm{\theta}$. Given any perturbation vector, $\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})$, we can derive (through a Taylor expansion) the following:
${\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi}+\bm{\xi})={\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi})+{\bm{\xi}}^{\U0001d5b3}{\nabla}_{\mathit{\varphi}}{\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi})+{\displaystyle \frac{1}{2}}{\bm{\xi}}^{\U0001d5b3}{\nabla}_{\mathit{\varphi}}^{2}{\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi})\bm{\xi}+\mathcal{O}\left(\text{higherorder terms}\right).$ 
Multiplying by $\bm{\xi}$, and taking the expectation on both sides of the above equation, we get:
${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(0,{\sigma}^{2}\bm{I})}\left[\bm{\xi}{\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi}+\bm{\xi})\right]$  $={\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}\left[\bm{\xi}{\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi})+\bm{\xi}{\bm{\xi}}^{\U0001d5b3}{\nabla}_{\mathit{\varphi}}{\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi})+{\displaystyle \frac{1}{2}}\bm{\xi}{\bm{\xi}}^{\U0001d5b3}{\nabla}_{\mathit{\varphi}}^{2}{\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi})\bm{\xi}\right]$  
$={\sigma}^{2}{\nabla}_{\mathit{\varphi}}{\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi}).$ 
Dividing by ${\sigma}^{2}$, we derive the statement of the proposition as:
${\nabla}_{\mathit{\varphi}}{\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi})={\displaystyle \frac{1}{{\sigma}^{2}}}{\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(0,{\sigma}^{2}\bm{I})}\left[\bm{\xi}{\mathcal{J}}_{\bm{\theta}}(\mathit{\varphi}+\bm{\xi})\right]$ 
∎
Hessian Estimation:
Having derived a zeroorder gradient estimator, we now generalise these notions to a form allowing us to estimate the Hessian. It is also worth reminding the reader that such a Hessian estimator needs to be performed one time only before executing the instructions in Algorithm 1 (i.e., ${\bm{H}}_{0}$ is passed as an input). Precisely, we prove the following proposition:
Proposition 6 (ZeroOrder Hessian Estimate).
The hessian of the Wasserstein distance around ${\mathbf{\varphi}}_{\mathrm{0}}$ can be estimated based on function evaluations. Recalling that ${\mathbf{H}}_{\mathrm{0}}\mathrm{=}{\mathrm{\nabla}}_{\mathbf{\varphi}}^{\mathrm{2}}{\mathrm{E}}_{\mathrm{(}\mathbf{s}\mathrm{,}\mathbf{a}\mathrm{)}\mathrm{\sim}\pi \mathit{}\mathrm{(}\mathrm{\cdot}\mathrm{)}\mathit{}{\rho}_{\pi}^{{\mathbf{\varphi}}_{\mathrm{0}}}\mathit{}\mathrm{(}\mathrm{\cdot}\mathrm{)}}\mathrm{\left[}{\mathrm{W}}_{\mathrm{2}}^{\mathrm{2}}\mathrm{(}{\mathrm{P}}_{\mathbf{\varphi}}\mathrm{(}\mathrm{\cdot}\mathrm{}\mathbf{s}\mathrm{,}\mathbf{a}\mathrm{)}\mathrm{,}{\mathrm{P}}_{\mathrm{0}}\mathrm{(}\mathrm{\cdot}\mathrm{}\mathbf{s}\mathrm{,}\mathbf{a}\mathrm{)}\mathrm{)}\mathrm{\right]}{\mathrm{}}_{\mathbf{\varphi}\mathrm{=}{\mathbf{\varphi}}_{\mathrm{0}}}$, and defining ${\mathrm{W}}_{\mathrm{(}\mathbf{s}\mathrm{,}\mathbf{a}\mathrm{)}}\mathrm{(}\mathbf{\varphi}\mathrm{)}\mathrm{:=}{\mathrm{W}}_{\mathrm{2}}^{\mathrm{2}}\mathrm{(}{\mathrm{P}}_{\mathbf{\varphi}}\mathrm{(}\mathrm{\cdot}\mathrm{}\mathbf{s}\mathrm{,}\mathbf{a}\mathrm{)}\mathrm{,}{\mathrm{P}}_{\mathrm{0}}\mathrm{(}\mathrm{\cdot}\mathrm{}\mathbf{s}\mathrm{,}\mathbf{a}\mathrm{)}\mathrm{)}$, we prove:
${\bm{H}}_{0}$  $={\displaystyle \frac{1}{{\sigma}^{2}}}{\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}[{\displaystyle \frac{1}{{\sigma}^{2}}}\bm{\xi}\left({\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\mathit{\varphi}}_{0}}(\cdot )}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0}+\bm{\xi})\right]\right){\bm{\xi}}^{\U0001d5b3}$  
${\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\mathit{\varphi}}_{0}}(\cdot )}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0}+\bm{\xi})\right]\bm{I}].$ 
Proof.
Commencing with the righthandside of the above equation, we perform secondorder Taylor expansions for each of the two terms under under the expectation of $\bm{\xi}$. Namely, we write:
${\bm{H}}_{0}$  $={\displaystyle \frac{1}{{\sigma}^{2}}}{\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}[{\displaystyle \frac{1}{{\sigma}^{2}}}\bm{\xi}\left({\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\mathit{\varphi}}_{0}}(\cdot )}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0}+\bm{\xi})\right]\right){\bm{\xi}}^{\U0001d5b3}$  (18)  
${\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\mathit{\varphi}}_{0}}(\cdot )}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0}+\bm{\xi})\right]\bm{I}]$  
$\approx {\displaystyle \frac{1}{{\sigma}^{4}}}{\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}[{\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\varphi}_{0}}}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0})\right]\bm{\xi}{\bm{\xi}}^{\U0001d5b3}+\bm{\xi}{\bm{\xi}}^{\U0001d5b3}{\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\varphi}_{0}}}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0})\right]{\bm{\xi}}^{\U0001d5b3}$  
$+{\displaystyle \frac{1}{2}}{\bm{\xi}}^{\U0001d5b3}{\nabla}_{\mathit{\varphi}}^{2}{\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\varphi}_{0}}}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0})\right]\bm{\xi}\bm{I}]$  
${\displaystyle \frac{1}{{\sigma}^{2}}}{\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}[{\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\varphi}_{0}}}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0})\right]\bm{I}+{\bm{\xi}}^{\U0001d5b3}{\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\varphi}_{0}}}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0})\right]\bm{I}$  
$+{\displaystyle \frac{1}{2}}{\bm{\xi}}^{\U0001d5b3}{\nabla}_{\mathit{\varphi}}^{2}{\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\varphi}_{0}}}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0})\right]\bm{\xi}\bm{I}].$ 
Now, we analyse each of the above terms separately. For ease of notation, we define the following variables:
$\bm{g}$  $={\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\varphi}_{0}}}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0})\right]\mathit{\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}}\bm{H}={\nabla}_{\mathit{\varphi}}^{2}{\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\varphi}_{0}}}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0})\right]$  
$\bm{A}$  $=\bm{\xi}{\bm{\xi}}^{\U0001d5b3}\bm{g}{\bm{\xi}}^{\U0001d5b3}\mathit{\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}}\bm{B}=\bm{\xi}{\bm{\xi}}^{\U0001d5b3}\bm{H}\bm{\xi}{\bm{\xi}}^{\U0001d5b3}$  
$c$  $={\bm{\xi}}^{\U0001d5b3}\bm{H}\bm{\xi}.$ 
Starting with $\bm{A}$, we can easily see that any $(i,j)$ component can be written as $\bm{A}={\sum}_{n=1}^{{d}_{2}}{\bm{\xi}}_{i}{\bm{\xi}}_{j}{\bm{\xi}}_{n}{\bm{g}}_{n}$. Therefore, the expectation under $\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})$ can be derived as:
${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}\left[{\bm{\xi}}_{i}{\bm{\xi}}_{j}{\bm{\xi}}_{n}{\bm{g}}_{n}\right]={\bm{g}}_{n}{\mathbb{E}}_{{\bm{\xi}}_{i}\sim \mathcal{N}(0,{\sigma}^{2})}\left[{\bm{\xi}}_{i}^{3}\right]=0\mathit{\hspace{1em}\hspace{1em}\u2006}\text{if}i=j=n\text{and 0 otherwise.}$ 
Thus, we conclude that ${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}[\bm{A}]={\mathrm{\U0001d7ce}}_{{d}_{2}\times {d}_{2}}$.
Continuing with the second term, i.e., $\bm{B}$, we realise that any $(i,j)$ component can be written as ${\bm{B}}_{i,j}={\sum}_{n=1}^{{d}_{2}}{\sum}_{m=1}^{{d}_{2}}{\bm{\xi}}_{i}{\bm{\xi}}_{j}{\bm{\xi}}_{n}{\bm{\xi}}_{m}{\bm{H}}_{m,n}$. Now, we consider two cases:

•
Diagonal Elements (i.e., when $i=j$): The expectation under $\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})$ can be further split in three subcases

–
SubCase I when $i=j=m=n$: We have ${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(0,{\sigma}^{2}\bm{I})}\left[{\bm{\xi}}_{i}{\bm{\xi}}_{j}{\bm{\xi}}_{n}{\bm{\xi}}_{m}{\bm{H}}_{m,n}\right]={\bm{H}}_{i,i}{\mathbb{E}}_{{\bm{\xi}}_{i}\sim \mathcal{N}(0,{\sigma}^{2})}=3{\sigma}^{4}{\bm{H}}_{i,i}.$

–
SubCase II when $i=j\ne m=n$: We have ${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(0,{\sigma}^{2}\bm{I})}\left[{\bm{\xi}}_{i}{\bm{\xi}}_{j}{\bm{\xi}}_{n}{\bm{\xi}}_{m}{\bm{H}}_{m,n}\right]={\bm{H}}_{m,m}{\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}\left[{\bm{\xi}}_{i}^{2}{\bm{\xi}}_{m}^{2}\right]={\sigma}^{4}{\bm{H}}_{m,m}.$

–
SubCase III when indices are all distinct: We have ${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(0,{\sigma}^{2}\bm{I})}\left[{\bm{\xi}}_{i}{\bm{\xi}}_{j}{\bm{\xi}}_{n}{\bm{\xi}}_{m}{\bm{H}}_{m,n}\right]=0.$
[enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] Diagonal Elements Conclusion: Using the above results we conclude that ${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}[{\bm{B}}_{i,i}]=2{\sigma}^{4}{\bm{H}}_{i,i}+{\sigma}^{4}\text{trace}(\bm{H})$.

–

•
OffDiagonal Elements (i.e., when $i\ne j$): The above analysis is now repeated for computing the expectation of the offdiagonal elements of matrix $\bm{B}$. Similarly, this can also be split into three subcases depending on indices:

–
SubCase I when $i=m\ne j=n$: We have ${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}\left[{\bm{\xi}}_{i}{\bm{\xi}}_{j}{\bm{\xi}}_{n}{\bm{\xi}}_{m}{\bm{H}}_{m,n}\right]={\bm{H}}_{i,j}{\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}\left[{\bm{\xi}}_{i}^{2}{\bm{\xi}}_{j}^{2}\right]={\sigma}^{4}{\bm{H}}_{i,j}.$

–
SubCase II when $i=n\ne j=m$: We have ${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}\left[{\bm{\xi}}_{i}{\bm{\xi}}_{j}{\bm{\xi}}_{n}{\bm{\xi}}_{m}{\bm{H}}_{m,n}\right]={\bm{H}}_{j,i}{\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}\left[{\bm{\xi}}_{i}^{2}{\bm{\xi}}_{j}^{2}\right]={\sigma}^{4}{\bm{H}}_{j,i}.$

–
SubCase III when indices are all distinct: We have ${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}\left[{\bm{\xi}}_{i}{\bm{\xi}}_{j}{\bm{\xi}}_{n}{\bm{\xi}}_{m}{\bm{H}}_{m,n}\right]=0.$
[enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] OffDiagonal Elements Conclusion: Using the above results and due to the symmetric properties of $\bm{H}$, we conclude that ${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I}\mathbf{)}}\left[{\bm{B}}_{i,j}\right]=2{\sigma}^{4}{\bm{H}}_{i,j}$

–
Finally, analysing $c$, one can realise that ${\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}[c]={\mathbb{E}}_{\bm{\xi}\sim \mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})}\left[{\sum}_{i=1}^{{d}_{2}}{\sum}_{j=1}^{{d}_{2}}{\bm{\xi}}_{i}{\bm{\xi}}_{j}{\bm{H}}_{i,j}\right]={\sigma}^{2}\text{trace}(\bm{H})$.
Substituting the above conclusions back in the original approximation in Equation 18, and using the linearity of the expectation we can easily achieve the statement of the proposition. ∎
With the above two propositions, we can now perform the updates in Algorithm 1 without the need for performing explicit model learning. This is true as Propositions 6 and 7 devise procedure where gradient and Hessian estimates can be simply based on simulator value evaluations while perturbing $\mathit{\varphi}$ and ${\mathit{\varphi}}_{0}$. It is important to note that in order to apply the above, we are required to be able to evaluate ${\mathbb{E}}_{(\bm{s},\bm{a})\sim \pi (\cdot ){\rho}_{\pi}^{{\mathit{\varphi}}_{0}}(\cdot )}\left[{\mathcal{W}}_{(\bm{s},\bm{a})}({\mathit{\varphi}}_{0})\right]$ under random $\bm{\xi}$ perturbations sampled from $\mathcal{N}(\mathrm{\U0001d7ce},{\sigma}^{2}\bm{I})$. An empirical estimate of the $p$Wasserstein distance between two measures $\mu $ and $\nu $ can be performed by computing the $p$Wasserstein distance between the empirical distributions evaluated at sampled data. That is, one can approximation $\mu $ by ${\mu}_{n}=\frac{1}{n}{\sum}_{i=1}^{n}{\delta}_{{\bm{x}}_{i}}$ where ${\bm{x}}_{i}$ are identically and independently distributed according to $\mu $. Approximating ${\nu}_{n}$ similarly, we then realise that^{8}^{8} 8 In case the dynamics are assumed to be Gaussian, a similar procedure can be followed or a closed form can be used, see Takatsu (2008). ${\mathcal{W}}_{2}(\mu ,\nu )\approx {\mathcal{W}}_{2}({\mu}_{n},{\nu}_{n})$.
6 Related work
In this section we review some of the related literature. There is a common theme running through previous works: the formulation of the problem as a maxmin (or minmax, depending on whether the goal is to maximise for rewards or minimise for costs) problem. This gametheoretic view is natural formulation when viewing nature as an adversary. Thus, it will be no surprise that the papers discussed below mainly take this approach or close variations of it.
There is a longstanding thread of research on robustness in the classical control community, and the literature in this area is vast, with the ${\mathscr{H}}_{\mathrm{\infty}}$ method being a standard approach Doyle et al. (2013). This approach was introduced into reinforcement learning by Morimoto and Doya (2005). In that paper, a continuous time reinforcement learning setting was studied for which a maxmin problem was formulated involving a modified value function, the optimal solutions of which can be determined by solving HamiltonJacobiIsaacs (HJI) equation.
There is also a line of work on robust MDPs, amongst which are Iyengar (2005); Nilim and El Ghaoui (2005); Wiesemann et al. (2013); Tirinzoni et al. (2018); Petrik and Russell (2019). In particular, Yang (2017) uses the Wasserstein distance to define uncertainty sets of dynamics in similar way to this work, that is, in an $\u03f5$ball around a particular dynamics (referred to as nominal distribution in that paper). The paper shows that an optimal Markov control policy is possible the maxmin Bellman equations and shows how convexoptimisation techniques can be applied to solve it.
Whilst valuable in their own right, these approaches are not sufficient for the RL setting due to the need in the latter case to give efficient solutions for large state and action spaces, and the fact that the dynamics are not known a priori. We emphasise once again that in our setting, cannot explicitly define the MDP of the reference dynamics, since we do not assume knowledge of it. We assume only that we can sample from it, which is the standard assumption made in RL.
Reasonably, one might expect that modelbased reinforcement learning may be a plausible route to address robustness. In Asadi et al. (2018), the learning of Lipschitz continuous models is addressed, and a bound on multistep prediction error is given in terms of the Wasserstein distance. However, the major stumblingblock with modelbased RL techniques is that in highdimensional state building models that are sufficient for controlling an agent can suffer greatly from model misspecification or excessive computational costs (e.g., as in training a Gaussian Process, see, e.g., Rasmussen (2003); Deisenroth and Rasmussen (2011)).
We now discuss some papers closer in objective and/or technique to our own. Rajeswaran et al. (2017) approaches the robustness problem by training on an ensemble of dynamics in order to be deployed on a target environment. The algorithm introduced, Ensemble Policy Optimisation (EPOpt), alternates between two phases: (i) given a distribution over dynamics for which simulators (or models) are available (the source domain), train a policy that performs well for the whole distribution; (ii) gather data from the deployed environment (target domain) to adapt the distribution. The objective is not maxmin, but a softer variation defined by conditional valueatrisk (CVaR). The algorithm samples a set of dynamics $\{{\mathit{\varphi}}_{k}\}$ from a distribution over dynamics ${\mathcal{P}}_{\bm{\psi}}$, and for each dynamics ${\mathit{\varphi}}_{k}$, it samples a trajectory using the current policy parameter ${\bm{\theta}}_{i}$. It then selects the worst performing $\u03f5$fraction of the trajectories to use to update the policy parameter. Clearly this process bears some resemblance to our algorithm, but there is a crucial difference: our algorithm takes descent steps in the $\mathit{\varphi}$ space. The difference if important when the dynamics parameters sit in a highdimensional space, since in that case, optimisationfromsampling could demand a considerable number of samples. A counter argument against our technique might be that our zero’thorder method for estimating gradients and Hessians also requires sampling in high dimensions. This is, indeed the case, but obtaining localised estimates (as gradients and Hessians are local properties) could be easier than global properties (the worse set of parameters in the highdimensional space). In any case, our experiments demonstrate our algorithm performs well even in these high dimensions. The experiments of Rajeswaran et al. (2017) are on Hopper and HalfCheetah, in which their algorithm is compared to TRPO. We note that Rajeswaran et al. (2017) does not have theoretical guarantees. We also note that we were were unable to find the code for this paper, and did not attempt to implement it ourselves.
The CVaR criterion is also adopted in Pinto et al. (2017), in which, rather than sampling trajectories and finding a quantile in terms of performance, two policies are trained simultaneously: a “protagonist” which aims to optimise performance, and an adversary which aims to disrupt the protagonist. The protagonist and adversary train alternatively, with one being fixed whilst the other adapts. The action space for the adversary, in the tests documented in the paper includes forces on the entities (InvertedPendulum, HalfCheetah, Swimmer, Hopper, Walker2D) that aim to destabalise it. We made comparisons against this algorithm in our experiments.
More recently, Tessler et al. (2019) studies robustness with respect to action perturbations. There are two forms of perturbation addressed: (i) Probabilistic Action Robust MDP (PRMDP), and (ii) Noisy Action Robust MDP (NRMDP). In PRMDP, when an action is taken by an agent, with probability $\alpha $, a different, possibly adversarial action is taken instead. In NRMDP, when an action is taken, a perturbation is added to the action itself. Like Rajeswaran et al. (2017) and Pinto et al. (2017), the algorithm is suitable for applying deep neural networks, and the paper reports experiments on InvertedPendulum, Hopper, Walker2d and Humanoid. We tested against PRMDP in some of our experiments, and found it to be lacking in robustness (see Section 7, Figure 1 and Figure 2).
In Lecarpentier and Rachelson (2019) a nonstationary Markov Decision Process model is considered, where the dynamics can change from one time step to another. The constraint is based on Wasserstein distance, specifically, the Wasserstein distance between dynamics at time $t$ and ${t}^{\prime}$ is bounded by $Lt{t}^{\prime}$, i.e., is $L$Lipschitz with respect to time, for some constant $L$. They approach the problem by treating nature as an adversary and implement a Minimax algorithm. The basis of their algorithm is that due to the fact that the dynamics changes slowly (due to the Lipschitz constraint), a planning algorithm can project into the future the scope of possible future dynamics and plan for the worst. The resulting algorithm, known as Risk Averse Tree Search, is  as the name implies  a tree search algorithm. It operates on a sequence “snapshots” of the evolving MDP, which are instances of the MDP at points in time. The algorithm is tested on small grid world, and does not appear to be readily extendible to the continuous state and action scenarios our algorithm addresses.
To summarise, our paper uses the Wasserstein distance for addressing, in common with Lecarpentier and Rachelson (2019), but is suited to applying deep neural networks for continuous state and action spaces. Our paper does not require a full dynamics available to it, merely a parameterisable dynamics. It competes well with the above papers, and operates well for high dimensional problems, as evidenced by the experiments.
7 Experiments & Results
We evaluate ${\text{WR}}^{2}\text{L}$ on a variety of continuous control benchmarks from the MuJoCo environment. Dynamics in our benchmarks were parameterised by variables defining physical behaviour, e.g., density of the robot’s torso, friction of the ground, and so on. We consider both low and high dimensional dynamics and demonstrate that our algorithm outperforms stateoftheart from both standard and robust reinforcement learning. We are chiefly interested in policy generalisation across environments with varying dynamics, which we measure using average test returns on novel systems. The comparison against standard reinforcement learning algorithms allows us to understand whether lack of robustness is a critical challenge for sequential decision making, while comparisons against robust algorithms test if we outperform stateoftheart that considered a similar setting to ours. From standard algorithms, we compare against proximal policy optimisation (PPO) (Schulman et al., 2017), and trust region policy optimisation (TRPO) (Schulman et al., 2015b); an algorithm based on natural actorcrtic (Peters and Schaal, 2008a; Pajarinen et al., 2019). From robust algorithms, we demonstrate how ${\text{WR}}^{2}\text{L}$ favours against robust adversarial reinforcement learning (RARL) (Pinto et al., 2017), and actionperturbed Markov decision processes (PRMDP) proposed in (Tessler et al., 2019).
It is worth noting that we attempted to include deep deterministic policy gradients (DDPG) (Silver et al., 2014) in our comparisons. Results including DDPG were, however, omitted as it failed to show any significant robustness performance even on relatively simple systems, such as the inverted pendulum; see results reported in Appendix A. During initial trials, we also performed experiments parameterising models using deep neural networks. Results demonstrated that these models, though minimising training data error, fail to provide valid physicsgrounded dynamics. For instance, we arrived at inverted pendula models that vary pole angles without exerting any angular speeds. This problem became even more apparent in highdimensional systems, e.g., Hopper, Walker, etc due to the increased number of possible minima. As such, results presented in this section make use of our zeroorder method that can be regarded as a scalable alternative for robust solutions.
7.1 MuJoCo benchmarks
Contrary to other methods rooted in modelbased reinforcement learning, we evaluate our method both in low and highdimensional MuJuCo tasks (Brockman et al., 2016). We consider a variety of systems including CartPole, Hopper, and Walker2D; all of which require direct jointtorque control. Keeping with the generality of our method, we utilise these dynamical asis with no additional alterations. Namely, we use the exact setting of these benchmarks as that shipped with OpenAI gym without any reward shaping, statespace augmentation, feature extraction, or any other modifications ofthatsort. For clarity, we summarise variables parameterising dynamics in Table 1, and detail specifics next.
CartPole:
The goal of this classic control benchmark is to balance a pole by driving a cart along a rail. The state space is composed of the position $x$ and velocity $\dot{x}$ of the cart, as well as the angle $\theta $ and angular velocities of the pole $\dot{\theta}$. We consider two termination conditions in our experiments: 1) pole deviates from the upright position beyond a prespecified threshold, or 2) cart deviates from its zeroth initial position beyond a certain threshold. To conduct robustness experiments, we parameterise the dynamics of the CartPole by the pole length ${l}_{p}$, and test by varying ${l}_{p}\in [0.3,3]$.
Hopper: In this benchmark, the agent is required to control a hopper robot to move forward without falling. The state of the hopper is represented by positions, $\{x,y,z\}$, and linear velocities, $\{\dot{x},\dot{y},\dot{z}\}$, of the torso in global coordinate, as well as angles, ${\{{\theta}_{i}\}}_{i=0}^{2}$, and angular speeds, ${\{{\dot{\theta}}_{i}\}}_{i=0}^{2}$, of the three joints. During training, we exploit an earlystopping scheme if “unhealthy” states of the robot were visited. Parameters characterising dynamics included densities ${\{{\rho}_{i}\}}_{i=0}^{3}$ of the four links, armature ${\{{a}_{i}\}}_{i=0}^{2}$ and damping ${\{{\zeta}_{i}\}}_{i=0}^{2}$ of three joints, and the friction coefficient ${\mu}_{g}$. To test for robustness, we varied both frictions and torso densities leading to significant variations in dynamics. We further conducted additional experiments while varying all 11 dimensional specification parameters.
Walker2D: This benchmark is similar to Hopper except that the controlled system is a biped robot with seven bodies and six joints. Dimensions for its dynamics are extended accordingly as reported in Table 1. Here, we again varied the torso density for performing robustness experiments in the range ${\rho}_{0}\in [500,3000]$.
Halfcheetah: This benchmark is similar to the above except that the controlled system is a twodimensional slice of a threedimensional cheetah robot. Parameters specifying the simulator consist of 21 dimensions, with 7 representing densities. In our twodimensional experiments we varied the torsodensity and floor friction, while in highdimensional ones, we allowed the algorithm to control all 21 variables.
7.2 Experimental protocol
1D experiment  2D experiment  Highdimensional experiment  

Inverted Pendulum  ${l}_{p}$  None  None 
Hopper  ${\rho}_{0}$  $\{{\rho}_{0},{\mu}_{g}\}$  ${\{{\rho}_{i}\}}_{i=0}^{3}\cup {\{{a}_{i}\}}_{i=0}^{2}\cup {\{{\zeta}_{i}\}}_{i=0}^{2}\cup {\mu}_{g}$ 
Walker2D  ${\rho}_{0}$  $\{{\rho}_{0},{\mu}_{g}\}$  ${\{{\rho}_{i}\}}_{i=0}^{6}\cup {\{{a}_{i}\}}_{i=0}^{5}\cup {\{{\zeta}_{i}\}}_{i=0}^{5}\cup {\mu}_{g}$ 
HalfCheetah  None  $\{{\rho}_{0},{\mu}_{g}\}$  ${\{{\rho}_{i}\}}_{i=0}^{7}\cup {\{{a}_{i}\}}_{i=0}^{5}\cup {\{{\zeta}_{i}\}}_{i=0}^{5}\cup {\mu}_{g}$ 
Our experiments included training and a testing phases. During the training phase we applied Algorithm 1 for determining robust policies while updating transition model parameters according to the minmax formulation. Training was performed independently for each of the algorithms on the relevant benchmarks while ensuring best operating conditions using hyperparameter values reported elsewhere (Schulman et al., 2017; Pinto et al., 2017; Tessler et al., 2019).
For all benchmarks, policies were represented using parametrised Gaussian distributions with their means given by a neural network and standard derivations by a group of free parameters. The neural network consisted of two hidden layers with 64 units and hyperbolic tangent activations in each of the layers. The final layer exploited linear activation so as to output a real number. Following the actorcritic framework, we also trained a standalone critic network having the same structure as that of the policy.
For each policy update, we rolledout in the current worstcase dynamics to collect a number of transitions. The number associated to these transitions was applicationdependent and varied between benchmarks in the range of 5,000 to 10,000. The policy was then optimised (i.e., Phase II of Algorithm 1) using proximal policy optimization with a generalised advantage estimation. To solve the minimisation problem in the inner loop of Algorithm 1, we sampled a number of dynamics from a diagonal Gaussian distribution that is centered at the current worstcase dynamics model. The number of sampled dynamics and the variance of the sampled distributions depended on both the benchmark itself, and well as the dimensions of the dynamics. Gradients needed for model updates were estimated using the results in Propositions 7 and 8. Finally, we terminated training when the policy entropy dropped below an applicationdependent threshold.
When testing, we evaluated policies on unseen dynamics that exhibited simulator variations as described earlier. We measured performance using returns averaged over 20 episodes with a maximum length of 1,000 time steps on testing environments. We note that we used nondiscounted mean episode rewards to compute such averages.
7.3 Comparison Results & Benchmarking
This section summarises robustness results showing that our method significantly outperforms others from both standard and robust reinforcement learning in terms of average testing returns as dynamics vary.
Results with OneDimensional Model Variation:
Figure 1 shows the robustness of policies on a simple inverted pendulum while varying the pole length in the ranges from $0.3$ to $3.0$. For a fair comparison, we trained two standard policy gradient methods (TRPO (Schulman et al., 2015b) and PPO (Schulman et al., 2017)), and two robust RL algorithms (RARL (Pinto et al., 2017), PRMDP (Tessler et al., 2019)) with the reference dynamics preset by our algorithm. The range of evaluation parameters was intentionally designed to include dynamics out of the $\u03f5$Wasserstein ball. Clearly, ${\text{WR}}^{2}\text{L}$ outperforms all baselines in this benchmark.
Given successful behaviour in lowdimensional state representations, we performed additional experiments on the Hopper and Walker systems to assess robustness against model changes in highdimensional environments. Figure 2 illustrates these results depicting that our method is again capable of outperforming others including RARL and PRMDP. It is also interesting to realise that in highdimensional environments, our algorithm exhibits a tradeoff between robustness and optimality due to the minmax definition of ${\text{WR}}^{2}\text{L}$’s objective.
[enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] Experimental Conclusion I: From the above, we conclude that ${\text{WR}}^{2}\text{L}$ outperforms others when onedimensional simulator variations are considered.
Results with TwoDimensional Model Variation:
Though successful when considering one dimensional variation in the environment, it is instructive to analyse how our algorithm performs with multidimensional variations. To do so, we performed additional experiments using the Hopper benchmark varying both friction and the density parameters. Results shown in Figure 3 demonstrate that our method is capable of significantly outperforming PPO (i.e., when $\u03f5=0$ – Figure 3(a)), where the range of high (over 2300) test returns is broader even with a small $\u03f5$ball constraint of $0.003$.
Furthermore, one would expect that such advantages increase with the increase in the radius $\u03f5$. To validate these claims, we reran the same experiment devised above while allowing for a larger $\u03f5$ of $0.015$. It is clear from Figure 3(c) that the robustness range of the policy generated by ${\text{WR}}^{2}\text{L}$ does increase with the increase in the ball’s radius.
These results were also verified in two additional benchmarks (i.e., the Walker and HalfCheetah). Here, again our results demonstrate that when 2 dimensional changes are considered, our method outperforms stateoftheart significantly. We also arrive at the same conclusion that if $\u03f5$ increases so does the robustness range. For instance, a robust policy trained with an $\u03f5$ of 0.03 achieves high average test returns on a broader range of HalfCheetahs compares to that with an $\u03f5=0.005$, see Figures 4 (h) and 4 (i).
This, in turn, takes us to the following conclusion:
[enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] Experimental Conclusion II: From the above, we conclude that ${\text{WR}}^{2}\text{L}$ outperforms others when twodimensional simulator variations are considered, and that robustness increase with $\u03f5$.
Results with HighDimensional Model Variation:
Though results above demonstrate robustness, an argument against a minmax objective can be made especially when only considering lowdimensional changes in the simulator. Namely, one can argue the need for such an objective as opposed to simply sampling a set of systems and determining policies performingwell on average similar to the approach proposed in (Rajeswaran et al., 2017).
A counterargument to the above is that a gradientbased optimisation scheme is more efficient than a samplingbased one when highdimensional changes are considered. In other words, a sampling procedure is hardly applicable when more than a few parameters are altered, while ${\text{WR}}^{2}\text{L}$ can remain suitable. To assess these claims, we conducted two additional experiments on the Hopper and HalfCheetah benchmarks. In the first, we trained robustly while changing friction and torso densities, and tested on 1000 systems generated by varying all 11 dimensions of the Hopper dynamics, and 21 dimensions of the HalfCheetah system. Results reported in Figures 4(b) and (e) demonstrate that the empirical densities of the average test returns are mostly centered around 3000 for the Hopper, and around 4500 for the Cheetah, which improves that of PPO (Figures 4(a) and (d)) with return masses mostly accumulated at around 1000 in the case of the Hopper and almost equally distributed when considering HalfCheetah.
Such improvements, however, can be an artifact of the careful choice of the lowdimensional degrees of freedom allowed to be modified during Phase I of Algorithm 1. To get further insights, Figures 4(c) and (f) demonstrate the effectiveness of our method trained and tested while allowing to tune all 11 dimensional parameters of the Hopper simulator, and the 21 dimensions of the HalfCheetah. Indeed, our results are in accordance with these of the previous experiment depicting that most of the test returns’ mass remains around 3000 for the Hopper, and improves to accumulate around 4500 for the HalfCheetah. Interestingly, however, our algorithm is now capable of acquiring higher returns on all systems^{9}^{9} 9 Please note that we attempted to compare against Rajeswaran et al. (2017). Due to the lack of opensource code, we were not able to regenerate their results. since it is allowed to alter all parameters defining the simulator. As such, we conclude:
[enhanced, sharp corners, colback=white, colframe=black, drop shadow=black,opacity=1] Experimental Conclusion III: From the above, we conclude that ${\text{WR}}^{2}\text{L}$ outperforms others when highdimensional simulator variations are considered.
8 Conclusion & Future Work
In this paper, we proposed a novel robust reinforcement learning algorithm capable of outperforming others in terms of test returns on unseen dynamical systems. Our algorithm formalises a new minmax objective with Wasserstein constraints for policies generalising across varying domains, and considers a zeroorder method for scalable solutions. We evaluated our method both theoretically and empirically. On theory side, we proved how our algorithm can attain estimates of gradients and Hessians based on random samples. Empirically, we demonstrated superior performance against stateoftheart from both standard and robust reinforcement learning on low and highdimensional MuJuCo environments.
There are a lot of interesting directions we plan to target in the future. First, we aim to consider robustness in terms of other components of MDPs, e.g., state representations, reward functions, and others. Second, we will implement ${\text{WR}}^{2}\text{L}$ on realhardware considering simtoreal experiments.
9 Acknowledgements
We are grateful to Jan Peters and Andreas Krause for the interesting discussions that helped bettershape this paper. Moreover, we would like to thank each of Rasul Tutunov, and Aivar Sootla for guiding us through discussion on optimisation and control theory. Finally, we thank Jun Yao, Liu Wulong, Chen Zhitang, Jia Zheng, and Zhengguo Li for helping us improve this work with inputs about realworld considerations.
References
 Asadi et al. (2018) Asadi, K., Misra, D., and Littman, M. (2018). Lipschitz continuity in modelbased reinforcement learning. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 264–273, Stockholmsmässan, Stockholm Sweden. PMLR.
 BouAmmar et al. (2014) BouAmmar, H., Eaton, E., Ruvolo, P., and Taylor, M. E. (2014). Online multitask learning for policy gradient methods. In Proceedings of the 31st International Conference on International Conference on Machine Learning  Volume 32, ICML’14, pages II–1206–II–1214. JMLR.org.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym.
 Busoniu et al. (2010) Busoniu, L., Babuska, R., Schutter, B. D., and Ernst, D. (2010). Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Inc., Boca Raton, FL, USA, 1st edition.
 Chen et al. (2018) Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., CesaBianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 6571–6583. Curran Associates, Inc.
 Chow et al. (2015) Chow, Y., Tamar, A., Mannor, S., and Pavone, M. (2015). Risksensitive and robust decisionmaking: a cvar optimization approach. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1522–1530. Curran Associates, Inc.
 Deisenroth et al. (2013) Deisenroth, M., Peters, J., and Neumann, G. (2013). A survey on policy search for robotics. Found. Trends Robot, 2:1–142.
 Deisenroth and Rasmussen (2011) Deisenroth, M. and Rasmussen, C. E. (2011). Pilco: A modelbased and dataefficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML11), pages 465–472.
 Doyle et al. (2013) Doyle, J. C., Francis, B. A., and Tannenbaum, A. R. (2013). Feedback control theory. Courier Corporation.
 EmmertStreib and Dehmer (2018) EmmertStreib, F. and Dehmer, M. (2018). A machine learning perspective on personalized medicine: An automized, comprehensive knowledge base with ontology for pattern recognition. Machine Learning and Knowledge Extraction, 1(1):149–156.
 Fischer (2018) Fischer, T. G. (2018). Reinforcement learning in financial markets  a survey. FAU Discussion Papers in Economics 12/2018, Erlangen.
 Iyengar (2005) Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
 Kober and Peters (2012) Kober, J. and Peters, J. (2012). Reinforcement Learning in Robotics: A Survey, volume 12, pages 579–610. Springer, Berlin, Germany.
 Koller et al. (2018) Koller, T., Berkenkamp, F., Turchetta, M., and Krause, A. (2018). Learningbased model predictive control for safe exploration and reinforcement learning. CoRR, abs/1803.08287.
 Lecarpentier and Rachelson (2019) Lecarpentier, E. and Rachelson, E. (2019). Nonstationary markov decision processes a worstcase approach using modelbased reinforcement learning. arXiv preprint arXiv:1904.10090.
 Lutter and Peters (2019) Lutter, M. and Peters, J. (2019). Deep lagrangian networks for endtoend learning of energybased control for underactuated systems. In International Conference on Intelligent Robots and Systems (IROS).
 Lutter et al. (2019) Lutter, M., Ritter, C., and Peters, J. (2019). Deep lagrangian networks: Using physics as model prior for deep learning.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. (2013). Playing atari with deep reinforcement learning. CoRR, abs/1312.5602.
 Morimoto and Doya (2005) Morimoto, J. and Doya, K. (2005). Robust reinforcement learning. Neural Comput., 17(2):335–359.
 Namkoong and Duchi (2016) Namkoong, H. and Duchi, J. C. (2016). Stochastic gradient methods for distributionally robust optimization with fdivergences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 2216–2224, USA. Curran Associates Inc.
 Nesterov (2011) Nesterov, Y. (2011). Random gradientfree minimization of convex functions. CORE Discussion Papers 2011001, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE).
 Nilim and El Ghaoui (2005) Nilim, A. and El Ghaoui, L. (2005). Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.
 Nocedal and Wright (2006) Nocedal, J. and Wright, S. (2006). Numerical optimization. Springer Science & Business Media.
 Packer et al. (2018) Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., and Song, D. (2018). Assessing generalization in deep reinforcement learning. CoRR, abs/1810.12282.
 Pajarinen et al. (2019) Pajarinen, J., Thai, H. L., Akrour, R., Peters, J., and Neumann, G. (2019). Compatible natural gradient policy search. CoRR, abs/1902.02823.
 Peng et al. (2017) Peng, X. B., Andrychowicz, M., Zaremba, W., and Abbeel, P. (2017). Simtoreal transfer of robotic control with dynamics randomization. CoRR, abs/1710.06537.
 Peters and Schaal (2008a) Peters, J. and Schaal, S. (2008a). Natural actorcritic. Neurocomput., 71(79):1180–1190.
 Peters and Schaal (2008b) Peters, J. and Schaal, S. (2008b). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697.
 Petrik and Russell (2019) Petrik, M. and Russell, R. H. (2019). Beyond confidence regions: Tight bayesian ambiguity sets for robust mdps. arXiv preprint arXiv:1902.07605.
 Pinto et al. (2017) Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. (2017). Robust adversarial reinforcement learning. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2817–2826, International Convention Centre, Sydney, Australia. PMLR.
 Rajeswaran et al. (2017) Rajeswaran, A., Ghotra, S., Ravindran, B., and Levine, S. (2017). Epopt: Learning robust neural network policies using model ensembles. International Conference on Learning Representations (ICLR) 2017, arXiv preprint arXiv:1610.01283.
 Rasmussen (2003) Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer.
 Salimans et al. (2017) Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
 Sargent and Hansen (2001) Sargent, T. and Hansen, L. (2001). Robust control and model uncertainty. American Economic Review, 91(2):60–66.
 Schulman et al. (2015a) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015a). Trust region policy optimization. In Bach, F. and Blei, D., editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1889–1897, Lille, France. PMLR.
 Schulman et al. (2015b) Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015b). Trust region policy optimization. CoRR, abs/1502.05477.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning  Volume 32, ICML’14, pages I–387–I–395. JMLR.org.
 Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.
 Takatsu (2008) Takatsu, A. (2008). Wasserstein geometry of gaussian measures.
 Tessler et al. (2019) Tessler, C., Efroni, Y., and Mannor, S. (2019). Action robust reinforcement learning and applications in continuous control. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6215–6224, Long Beach, California, USA. PMLR.
 Tirinzoni et al. (2018) Tirinzoni, A., Petrik, M., Chen, X., and Ziebart, B. (2018). Policyconditioned uncertainty sets for robust markov decision processes. In Advances in Neural Information Processing Systems, pages 8939–8949.
 Wiesemann et al. (2013) Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust markov decision processes. Mathematics of Operations Research, 38(1):153–183.
 Wolfe (1969) Wolfe, P. (1969). Convergence conditions for ascent methods. SIAM review, 11(2):226–235.
 Yang (2017) Yang, I. (2017). A convex optimization approach to distributionally robust markov decision processes with wasserstein distance. IEEE control systems letters, 1(1):164–169.
 Zhao et al. (2019) Zhao, C., Sigaud, O., Stulp, F., and Hospedales, T. M. (2019). Investigating generalisation in continuous deep reinforcement learning. CoRR, abs/1902.07015.
Appendix A Deep Deterministic Policy Gradients Results
As mentioned in the experiments section of the main paper, we refrained from presenting results involving deep deterministic policy gradients (DDPG) due to its lack in robustness even on simple systems, such as the CartPole.
Figure 5 depicts these results showing that DDPG lacks robustness even when minor variations in the pole lenght are introduced. TRPO and PPO, on the other hand, demonstrate an acceptable performance retaining a test return of 1,000 across a broad range of pole lengths variations.
Appendix B Derivation of the Closed Form Solution
In Section 3.2 we presented a closed form solution to the following optimisation problem:
$${\underset{\mathit{\varphi}}{\mathrm{min}}{\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]}_{{\bm{\theta}}^{[k]},{\mathit{\varphi}}^{[k]}}^{\U0001d5b3}(\mathit{\varphi}{\mathit{\varphi}}_{0})\mathit{\hspace{1em}}\text{s.t.}\mathit{\hspace{1em}}\frac{1}{2}{(\mathit{\varphi}{\mathit{\varphi}}_{0})}^{\U0001d5b3}{\bm{H}}_{0}(\mathit{\varphi}{\mathit{\varphi}}_{0})\le \u03f5,$$ 
which took the form of:
${\mathit{\varphi}}^{[k+1]}={\mathit{\varphi}}_{0}\sqrt{{\displaystyle \frac{2\u03f5}{{\bm{g}}^{[k],\U0001d5b3}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]}}}}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]}.$ 
In this section of the appendix, we derive such an update rule from first principles. We commence transforming the constraint optimisation problem into an unconstrained one using the Lagrangian:
$$\mathcal{L}(\mathit{\varphi},\bm{\lambda})={\bm{g}}^{[k],\U0001d5b3}\left(\mathit{\varphi}{\mathit{\varphi}}^{[k]}\right)+\bm{\lambda}\left[\frac{1}{2}{(\mathit{\varphi}{\mathit{\varphi}}_{0})}^{\U0001d5b3}{\bm{H}}_{0}(\mathit{\varphi}{\mathit{\varphi}}_{0})\u03f5\right],$$ 
where $\bm{\lambda}$ is a Lagrange multiplier, and ${\bm{g}}^{[k]}={{\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{\bm{\tau}\sim {p}_{\bm{\theta}}^{\mathit{\varphi}}(\bm{\tau})}\left[{\mathcal{R}}_{\text{total}}(\bm{\tau})\right]}_{{\bm{\theta}}^{[k]},{\mathit{\varphi}}^{[k]}}^{\U0001d5b3}$.
Deriving the Lagrangian with respect to the primal parameters $\mathit{\varphi}$, we write:
${\nabla}_{\mathit{\varphi}}\mathcal{L}(\mathit{\varphi},\bm{\lambda})={\bm{g}}^{[k],\U0001d5b3}+\bm{\lambda}{(\mathit{\varphi}{\mathit{\varphi}}_{0})}^{\U0001d5b3}{\bm{H}}_{0}.$  (19) 
Setting Equation 19 to zero and solving for primal parameters, we attain:
$$\mathit{\varphi}={\mathit{\varphi}}_{0}\frac{1}{\bm{\lambda}}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]}.$$ 
Plugging $\mathit{\varphi}$ back into the equation representing the constraints, we derive:
${\left({\mathit{\varphi}}_{0}{\displaystyle \frac{1}{\bm{\lambda}}}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]}{\mathit{\varphi}}_{0}\right)}^{\U0001d5b3}{\bm{H}}_{0}\left({\mathit{\varphi}}_{0}{\displaystyle \frac{1}{\bm{\lambda}}}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]}{\mathit{\varphi}}_{0}\right)=2\u03f5\u27f9{\bm{\lambda}}^{2}={\displaystyle \frac{1}{2\u03f5}}{\bm{g}}^{[k],\U0001d5b3}{\bm{H}}_{0}^{1}{\bm{g}}^{[k]}.$ 
It is easy to see that with the positive solution for $\bm{\lambda}$, the Karush–Kuhn–Tucker (KKT) conditions are satisfied. Since the objective and constraint are both convex, the KKT conditions are sufficient and necessary for optimality, thus finalising our derivation.
Appendix C Miscellaneous
The following result is used in the proof of Corollary 2. For the reader’s convenience, we reproduce it here from Nocedal and Wright (2006).
Proposition 7 (Zoutendijk, taken from Nocedal and Wright (2006)).
For a differentiable function $f\mathrm{:}{\mathrm{R}}^{n}\mathrm{\to}\mathrm{R}$ consider any iteration of the form:
$${\bm{x}}_{j+1}={\bm{x}}_{j}+{\alpha}_{j}{\bm{p}}_{j}$$ 
where ${\mathbf{p}}_{j}$ is a descent direction for $f$ at ${\mathbf{x}}_{j}$ and ${\alpha}_{j}\mathrm{\ge}\mathrm{0}$ is a real number that satisfies the Wolfe conditions. Suppose that $f$ is bounded below in ${\mathrm{R}}^{n}$ and that $f$ is continuously differentiable in an open set $\mathrm{N}$ containing the level set $\mathrm{L}\mathrm{:=}\mathrm{\{}\mathbf{x}\mathrm{:}f\mathit{}\mathrm{(}\mathbf{x}\mathrm{)}\mathrm{\le}f\mathit{}\mathrm{(}{\mathbf{x}}_{\mathrm{0}}\mathrm{)}\mathrm{\}}$ where ${\mathbf{x}}_{\mathrm{0}}$ is the starting point of the iteration. Assume also that the gradient $\mathrm{\nabla}\mathit{}f$ is Lipschitz continuous on $\mathrm{N}$, i.e., there exists a constant $L\mathrm{>}\mathrm{0}$ such that
$$\parallel \nabla f(\bm{x})\nabla f({\bm{x}}^{\prime})\parallel \le L\parallel \bm{x}{\bm{x}}^{\prime}\parallel ,\forall \bm{x},{\bm{x}}^{\prime}\in \mathcal{N}.$$ 
Then
$$ 
where
$$\mathrm{cos}{\upsilon}_{j}=\frac{\nabla f{({\bm{x}}_{j})}^{T}{\bm{p}}_{j}}{\parallel \nabla f({\bm{x}}_{j})\parallel \parallel {\bm{p}}_{j}\parallel}.$$ 