Behavior-Guided Reinforcement Learning

  • 2019-09-29 16:02:32
  • Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Anna Choromanska, Krzysztof Choromanski, Michael I. Jordan
  • 0

Abstract

We introduce a new approach for comparing reinforcement learning policies,using Wasserstein distances (WDs) in a newly defined latent behavioral space.We show that by utilizing the dual formulation of the WD, we can learn scorefunctions over trajectories that can be in turn used to lead policyoptimization towards (or away from) (un)desired behaviors. Combined withsmoothed WDs, the dual formulation allows us to devise efficient algorithmsthat take stochastic gradient descent steps through WD regularizers. Weincorporate these regularizers into two novel on-policy algorithms,Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, whichwe demonstrate can outperform existing methods in a variety of challengingenvironments. We also provide an open source demo.

 

Quick Read (beta)

Behavior-Guided Reinforcement Learning

Aldo Pacchiano
UC Berkeley &Jack Parker-Holder11footnotemark: 1
University of Oxford &Yunhao Tang11footnotemark: 1
Columbia University \ANDAnna Choromanska
NYU &Krzysztof Choromanski
Google Brain Robotics &Michael I. Jordan
UC Berkeley
Equal contribution, correspondance to [email protected]
Abstract

We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. We show that by utilizing the dual formulation of the WD, we can learn score functions over trajectories that can be in turn used to lead policy optimization towards (or away from) (un)desired behaviors. Combined with smoothed WDs, the dual formulation allows us to devise efficient algorithms that take stochastic gradient descent steps through WD regularizers. We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate can outperform existing methods in a variety of challenging environments. We also provide an open source demo22 2 Available at https://github.com/behaviorguidedRL/BGRL. We emphasize this is not an exact replica of the code from our experiments, but a demo to build intuition and clarify our methods..

\usetikzlibrary

intersections,shapes.arrows

 

Behavior-Guided Reinforcement Learning


  Aldo Pacchianothanks: Equal contribution, correspondance to [email protected] UC Berkeley Jack Parker-Holder11footnotemark: 1 University of Oxford Yunhao Tang11footnotemark: 1 Columbia University Anna Choromanska NYU Krzysztof Choromanski Google Brain Robotics Michael I. Jordan UC Berkeley

\@float

noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

One of the key challenges in reinforcement learning (RL) is to efficiently incorporate the behavioral characteristics of learned policies into optimization algorithms [b1, meyerson, conti]. The fundamental question we aim to shed light on in this paper is:

What is the right measure of similarity between two policies acting on the same underlying MDP and how can we devise algorithms to leverage this information for reinforcement learning?

In simple terms, the main thesis motivating the methods we propose is that:

Two policies may perform similar actions at a local level but result in very different global behaviors.

We propose to define behaviors via so-called Behavioral Embedding Maps (BEMs), which are functions mapping trajectories (realizations of policies) into latent behavioral spaces representing trajectories in a compact way. BEMs enable us to identify policies with their Probabilistic Policy Embeddings (PPEs), which we define as the pushforward distributions over trajectory embeddings as a result of applying a BEM to a policy’s trajectories. Importantly, two policies with distinct distributions over trajectories may result in the same probabilistic embedding. PPEs provide us a way to rigorously define dissimilarity between policies. We do this by equipping them with metrics defined on the manifold of probabilistic measures, namely a class of Wasserstein distances (WDs, [villani]). There are several reasons for choosing WDs:

  • Flexibility. We can use any cost function between embeddings of trajectories, allowing the distance between PPEs to arise organically from an interpretable distance between embedding points.

  • Non-injective BEMs. Different trajectories may be mapped to the same embedding point (for example in the case of the last-state embedding). This precludes the use of likelihood-based distances such as the KL divergence [kullback1951], which we discuss in Section 2.

  • Behavioral Test Functions. Solving the dual formulation of the WD objective yields a pair of test functions over the space of embeddings that can be used to score trajectories.

The behavioral test functions underpin all our algorithms, directing optimization towards desired behaviors. To learn them it suffices to define the BEM and the cost function between points in the PPE space. To mitigate the computational burden of computing WDs, we rely on their entropy-regularized formulations. This allows us to update the learned test functions in a computationally efficient manner via stochastic gradient descent (SGD) on a Reproducing Kernel Hilbert Space (RKHS). We develop a novel method for stochastic optimal transport based on random feature maps [randomfeatures2007] to produce compact and memory-efficient representations of learned behavioral test functions. Finally, having laid the groundwork for comparing trajectories via behavior-driven trajectory scores, we address our core question by introducing two new on-policy RL algorithms:

  • Behavior Guided Policy Gradients (BGPG): We propose to replace the KL-based trust region from [trpo] with a WD-based trust region in the PPE space.

  • Behavior Guided Evolution Strategies (BGES): Inspired by the NSR-ES algorithm from [conti], BGES jointly optimizes for reward and novelty using the WD in the PPE space.

In addition, we also demonstrate a way to harness our methodology for imitation learning (Section 6.3) and repulsion learning (Section 8.4).

The paper is organized as follows. In Section 2 we discuss related work. In Section 3 we introduce Behavioral Embedding Maps (BEMs). In Section 4 we overview Wasserstein distance, in Section 5 we present our algorithms with accompanying theory in Section 9. Experiments are presented in Section 6 and we conclude in Section 7.

2 Related Work

Our work is related to research in multiple areas in neuroevolution and machine learning:

Behavior Characterizations:

The idea of directly optimizing for behavioral diversity was introduced by [Lehman08] and [lehman:thesis], who proposed to search directly for novelty, rather than simply assuming it would naturally arise in the process of optimizing an objective function. This approach has been applied to deep RL [conti] and meta-learning [evolvabilityES]. In all of this work, the policy is represented via a behavioral characterization (BC), typically chosen with knowledge of the environment, for example the final (x,y) coordinate for a locomotion task. Additionally, in most cases these BCs are considered to be deterministic, with Euclidean distances used to compare BCs. In our setting, we move from deterministic BCs to stochastic PPEs, thus requiring the use of metrics capable of comparing probabilistic distributions.

Distance Metrics:

WDs have been used in many different applications in machine learning where guarantees based on distributional similarity are required [jiang2019wasserstein, wgan]. We make use of WDs in our setting for a variety of reasons. First and foremost, the dual formulation of the WD allows us to recover Behavioral Test Functions, thus providing us with behavior-driven trajectory scores. In contrast to KL divergences, WDs are sensitive to user-defined costs between pairs of samples instead of relying only on likelihood ratios. Furthermore, as opposed to KL divergences, it is possible to take SGD steps using entropy-regularized Wasserstein objectives. Computing an estimator of the KL divergence is hard without a density model. Since in our framework multiple unknown trajectories may map to the same behavioral embedding, the likelihood ratio between two embedding distributions may be ill-defined.

WDs for RL:

We are not the first to propose using WDs in RL. [zhang] have recently introduced Wasserstein Gradient Flows (WGFs) for finding efficient RL policies. This approach casts policy optimization as gradient descent flow on the manifold of corresponding probability measures, where geodesic lengths are given as second-order WDs. We note that computing WGFs is a nontrivial task. In [zhang] this is done via particle approximation methods. We show in Section 6 that RL algorithms using these techniques are substantially slower than our methods. The WD has also been employed to replace KL terms in standard Trust Region Policy Optimization [maginnis]. This is a very special case of our more generic framework (cf. Section 5.2). In [maginnis] it is suggested to solve the corresponding RL problems via Fokker-Planck equations and diffusion processes, yet no empirical evidence of the feasibility of this approach is provided. We propose general practical algorithms and provide extensive empirical evaluation.

Distributional RL

Distributional RL (DRL, [Bellemare2017]) expands on traditional off-policy methods [dqn2013] by attempting to learn a distribution of the return from a given state, rather than just the expected value. These approaches have impressive experimental results [Bellemare2017, dabney18a], with a growing body of theory [rowland18a, qu19b, bellemare19a, rowland19statistics]. Superficially it may seem that learning a distribution of returns is similar to our approach to PPEs, when the BEM is a distribution over rewards. Indeed, reward-driven embeddings used in DRL can be thought of as special cases of the general class of BEMs. We note two key differences: 1) DRL methods are off-policy whereas our BGES and BGPG algorithms are on-policy, and 2) DRL is typically designed for discrete domains, since Q-Learning with continuous action spaces is generally much harder. Furthermore, we note that while the WD is used in DRL, it is only for the convergence analysis of the DRL algorithm—the algorithm itself does not use WDs [Bellemare2017].

3 Defining Behavior in Reinforcement Learning

A Markov Decision Process (MDP) is a tuple (𝒮,𝒜,P,R). Here 𝒮 and 𝒜 stand for the sets of states and actions respectively, such that for s,s𝒮 and a𝒜: P(s|a,s) is the probability that the system/agent transitions from s to s given action a and R(s,a,s) is a reward obtained by an agent transitioning from s to s via a. A policy πθ:𝒮𝒜 is a (possibly randomized) mapping (parameterized by θd) from 𝒮 to 𝒜. Let Γ={τ=s0,a0,r0,sH,aH,rH s.t. si𝒮,ai𝒜,ri} be the set of possible trajectories enriched by sequences of partial rewards under some policy π. The undiscounted reward function :Γ (which expectation is to be maximized by optimizing θ) satisfies (τ)=i=0Hri, where ri=R(si+1,ai,si).

3.1 Behavioral Embeddings

Figure 1: Behavioral Embedding Maps (BEMs) map trajectories to points in the behavior embedding space . Two trajectories may map to the same point in .

We start with a Behavioral Embeddng Space (BES) which we denote as and a Behavioral Embedding Map (BEM), Φ:Γ, mapping trajectories to embeddings in (Fig.  1). Importantly, the mapping does not need to be surjective. We will provide examples of BESs and BEMs at the end of the section. Given a policy π, we let π denote the distribution induced over the spaces of trajectories Γ and by πΦ the corresponding pushforward distribution on induced by Φ. We call PπΦ the Probabilistic Policy Embedding (PPE) of a policy π. A policy π can be fully characterized by the distribution π.

Additionally, we require the BES to be equipped with a metric (or cost function) C:×. Given two trajectories τ1,τ2 in Γ, C(Φ(τ1),Φ(τ2)) measures how different these trajectories are in the behavior space. The following are examples of BEMs (with the corresponding BESs) categorized into three main types (we will use examples from all three types in our experiments in Section 6):

  1. 1.

    State-based: the final state Φ1(τ)=sH, the visiting frequency of a fixed state Φ2s(τ)=t=0H𝟏(st=s), the frequency vector of visited states Φ3(τ)=t=0Hest (where es|𝒮| is the one-hot vector corresponding to state s); see also Section 6.2.

  2. 2.

    Action-based: the concatenation of actions Φ4(τ)=[a0,,aH]; see also Section 6.1.

  3. 3.

    Reward-based: the total reward Φ5(τ)=t=0Hrt, reward-to-go vector Φ6(τ)=t=0Hrt(i=0tei) (where eiH+1 is a one-hot vector corresponding to i and with dimensions indexed from 0 to H); see also Section 6.1 and Section 6.3.

For instance, πΦ3 is the frequency with which different states are visited under π. Note that some of the above embeddings are only for the tabular case (|𝒮|,|𝒜|<) while others are universal.

4 Wasserstein Distance & Optimal Transport Problem

Let μ,ν be (Radon) probability measures over domains 𝒳m,𝒴n and let 𝒞:𝒳×𝒴 be a cost function. For γ>0, a smoothed Wasserstein Distance is defined as:

WDγ(μ,ν):=minπΠ(μ,ν)𝒳×𝒴C(𝐱,𝐲)𝑑π(𝐱,𝐲)+γKL(π|ξ), (1)

where Π(μ,ν) is the space of couplings (joint distributions) over 𝒳×𝒴 with marginal distributions μ and ν, KL(|) denotes the KL divergence between distributions π and ρ with support 𝒳×𝒴 defined as: KL(π|ρ)=𝒳×𝒴(log(dπdξ(𝐱,𝐲)))𝑑π(𝐱,𝐲) and ξ is a reference measure over 𝒳×𝒴. When the cost is an p distance and γ=0, WDγ is also known as the Earth mover’s distance and the corresponding optimization problem is known as the optimal transport problem (OTP).

4.1 Wasserstein Distance: Dual Formulation

We will use smoothed WDs to derive efficient regularizers for RL algorithms. To arrive at this goal, we first need to consider the dual form of Equation 1. Under the subspace topology [bourbaki1966general] for 𝒳 and 𝒴, let 𝒞(𝒳) denote the space of continuous functions on 𝒳 and let 𝒞(𝒴) denote the space of continuous functions over 𝒴. The choice of the subspace topology ensures our discussion encompasses the discrete case.

Let C:𝒳×𝒴 be a cost function, interpreted as the “ground cost” to move a unit of mass from x to y. Define 𝕀 as the (0,) indicator function, where the value 0 denotes set membership. Using Fenchel duality, we can obtain the following dual formulation of the problem in Eq. 1:

WDγ(μ,ν)=maxλμ𝒞(𝒳),λν𝒞(𝒴)𝒳λμ(𝐱)𝑑μ(𝐱)-𝒴λν(𝐲)𝑑ν(𝐲)-EC(λμ,λν), (2)

where EC(λμ,λν) is defined as:

EC(λμ,λν):={γ𝒳×𝒴exp(λμ(𝐱)-λν(𝐲)-C(𝐱,𝐲)γ)𝑑ξ(𝐱,𝐲)if γ>0𝕀((λν,λν){(u,v) s.t. (𝐱,𝐲)𝒳×𝒴 u(𝐱)-v(𝐲)C(𝐱,𝐲)})if γ=0. (3)

We will set dξ(𝐱,𝐲)1 for discrete domains and dξ(𝐱,𝐲)=dμ(𝐱)dν(𝐲) otherwise.

If λμ*,λν* are the functions achieving the maximum in Eq. 2, and γ is sufficiently small then WDγ(μ,ν)𝔼μ[λμ*(𝐱)]-𝔼ν[λν*(𝐲)], with equality when γ=0. When for example γ=0, 𝒳=𝒴, and C(x,x)=0 for all x𝒳, it is easy to see λμ*(x)=λν*(x)=λ*(x) for all x𝒳. In this case the difference between 𝔼μ[λ*(𝐱)] and 𝔼μ[λ*(𝐲)] equals the WD. In other words, the function λ* gives higher scores to regions of the space 𝒳 where μ has more mass. This observation is key to the success of our algorithms in guiding optimization towards desired behaviors.

4.2 Computing λμ* and λν*

Figure 2: Behavioral embedding functions corresponding to two policies π1 (green) and π2 (blue) whose BEMs map trajectories to points in the real line.

We combine several techniques to make the optimization of objective from Eq. 2 tractable. First, we replace 𝒳 and 𝒴 with the functions from a RKHS corresponding to universal kernels [micchelli]. This is justified since those function classes are dense in the set of continuous functions of their ambient spaces. In this paper we choose the Gaussian kernel and approximate it using random Fourier feature maps [randomfeatures2007] to increase efficiency. Consequently, the functions λ learned by our algorithms have the following form: λ(𝐱)=(𝐩λ)ϕ(𝐱), where ϕ is a random feature map with m standing for the number of random features and 𝐩λm. For the Gaussian kernel, ϕ is defined as follows: ϕ(𝐳)=1mcos(𝐆𝐳+𝐛) for 𝐳d, where 𝐆m×d is Gaussian with iid entries taken from 𝒩(0,1), bm with iid bis such that biUnif[0,2π] and the cos function acts elementwise.

Input: kernels κ, over 𝒳,𝒴 respectively with corresponding random feature maps ϕκ,ϕ, smoothing parameter γ, gradient step size α, number of optimization rounds M, initial dual vectors 𝐩0μ,𝐩0ν.
for t=0,,M  do

       1. Sample (xt,yt)μν.
2. Update (𝐩tμ𝐩tν)=(𝐩t-1μ𝐩t-1ν)+αt(1-exp((𝐩t-1μ)ϕκ(xt)-(𝐩t-1ν)ϕ(xt)-C(xt,yt)γ))(ϕκ(xt)-ϕ(yt))
Return: 𝐩Mμ,𝐩Mν.
Algorithm 1 Random Features Wasserstein SGD

Henceforth, when we refer to optimization over λ, we mean optimizing over corresponding dual vectors 𝐩λ associated with λ. We can solve for the optimal dual functions by performing SGD over the dual objective in Eq. 2. Algorithm 1 is the random features equivalent of Algorithm 3 in [genevay2016stochastic] and will be a prominent subroutine of our methods. An explanation and proof of why this is the right stochastic gradient is in Lemma 9.2 in the Appendix.

If 𝐩*μ,𝐩*ν are the optimal dual vectors and (x1,y1),,(xk,yk)i.i.dμν, then Algorithm 1 can be used to get an estimator of WDγ(μ,ν) as follows:

WD^γ(μ,ν)=1ki=1k𝐩*μ,ϕκ(xi)-𝐩*ν,ϕ(yi)+1γexp(ϕκ(xi)𝐩*μ-ϕ(yi)𝐩*ν-C(xi,yi)γ) (4)

5 Behavior-Guided Reinforcement Learning

Here we introduce the framework which allows us to incorporate our behavioral approach to reinforcement learning into practical on-policy algorithms. Denote by πθ a policy parameterized by θd. The goal of policy optimization algorithms is to find a policy maximizing, as a function of the policy parameters, the expected total reward (θ):=𝔼τπθ[(τ)].

5.1 Behavioral Test Functions

If C:× is a cost function defined over behavior space , and π1,π2 are two policies, then:

WDγ(π1Φ,π2Φ)𝔼τπ1[λ1*(Φ(τ))]-𝔼τπ2[λ2*(Φ(τ))], (5)

where λ1*,λ2* are the optimal dual functions. The maps s1:=λ1*Φ:Γ and s2:=λ2*Φ:Γ define score functions over the space of trajectories. If γ is close to zero, the score function si gives higher scores to trajectories from πi whose behavioral embedding is common under πi but rarely appears under πj for ji (Fig. 2).

5.2 Algorithms

We propose to solve a WD-regularized objective to tackle behavior-guided policy optimization. All of our algorithms hinge on trying to maximize an objective of the form:

F(θ)=(θ)+βWDγ(πθΦ,bΦ), (6)

where bΦ is a base distribution over behavioral embeddings (possibly dependent on θ) and β could be positive or negative. Although the base distribution bΦ could be arbitrary, our algorithms will instantiate bΦ=1|𝒮|π𝒮πΦ for some family of policies 𝒮 (possibly satisfying |𝒮|=1) we want the optimization to attract to / repel from.

In order to compute approximate gradients for F, we rely on the dual formulation of the WD. After substituting the composition maps resulting from Eq. 5 into Eq. 2, we obtain:

F(θ)𝔼τπθ[(τ)+βs1(τ)]-β𝔼ϕbΦ[λ2*(ϕ)], (7)

where s1:Γ equals s1=λ1*Φ, the Behavioral Test Function of policy πθ and λ2* is the optimal dual function of embedding distribution 𝐛Φ. Consequently θF(θ)θ𝔼τπθ[(τ)+βs1(τ)]. We learn a score function s1 over trajectories that can guide our optimization by favoring those trajectories that show desired global behaviors.

Eq. 7 is an approximation to the true objective from Eq. 2 whenever γ>0. In practice, the entropy regularization requires a damping term as defined in Equation 3. If ξ(πθΦ,bΦ) is the joint distribution of choice then F(θ)=(θ)+βV for

V=maxλπθ𝒞(),λb𝒞()𝔼τπθ[λπθ(Φ(τ))]-𝔼ϕbΦ[λb(ϕ)]+γ𝔼ϕ1,ϕ2ξ(πθΦ,bΦ)[Λ(ϕ1,ϕ2)],

where Λ(ϕ1,ϕ2)=exp(λπθ(ϕ1)-λb(ϕ2)-C(ϕ1,ϕ2)γ). When the embedding space is not discrete and bΦ=πΦ for some policy π, we let ξ(πθΦ,bΦ)=πθΦπΦ, otherwise ξ(πθΦ,bΦ)=1||21, a uniform distribution over ×.

All of our methods perform a version of alternating SGD optimization: we take certain number of SGD steps over the internal dual Wasserstein objective, followed by more SGD steps over the outer objective having fixed the current dual functions. Although in practice the different components that make up the optimization objectives we consider here could be highly nonconvex, in the cases these functions satisfy some convexity assumptions, we can provide a sharp characterization for the convergence rates of our algorithms. Details are given in Section 9 in the Appendix.

We consider two distinct approaches to optimizing this objective, by exploring in the action space and backpropagating, as in policy gradient methods [trpo, schulman2017proximal], and by considering a black-box optimization problem as in Evolution Strategies (ES, [ES]). These two different approaches lead to two new algorithms: Behavior-Guided Policy Gradient (BGPG) and Behavior-Guided Evolution Strategies (BGES), that we discuss next.

5.3 Behavior-Guided Policy Gradient (BGPG)

Our first algorithm seeks to solve the optimization problem in Section 5.2 with policy gradients. We refer to this method as the Behavior-Guided Policy Gradient (BGPG) algorithm (see Algorithm 2 below).

Input: Initialize stochastic policy π0 parametrized by θ0, β<0,η>0, M,L
for t=1,,T  do

       1. Run πt-1 in the environment to get advantage values Aπt-1(s,a) and trajectories {τi(t)}i=1M
2. Update policy and test functions via several alternating gradient steps over the objective:
F(θ) =𝔼τ1,τ2πt-1πθ[i=1HAπt-1(si,ai)πθ(ai|si)πt-1(ai|si)+βλ1(Φ(τ1))
-βλ2(Φ(τ2))+βγexp(λ1(Φ(τ1))-λ2(Φ(τ2))-C(Φ(τ1)),Φ(τ2))γ)]

Where τ1=s0,a0,r0,,sH,aH,rH. Let θt-1(0)=θt-1.  
for  =1,,L  do
             a. Approximate πt-1πθ via 1M{τi(t)}i=1M1M{τiθ}i=1M:=P^πt,πθ where τiθi.i.dπθ
b. Take SGA step θt-1()=θt-1(-1)+η^θF^(θt-1(-1)) using samples from P^πt-1,πθ.  
c. Use samples from P^πt-1,πθ and Algorithm 1 to update λ1,λ2.  
      
Set θt=θt-1(M).
Algorithm 2 Behvaior-Guided Policy Gradient

Specifically, we maintain a stochastic policy πθ and compute policy gradients as in prior work [trpo]. To optimize the Wasserstein distance WDγ, we approximate the gradient of this term via the random-feature Wasserstein SGD . Importantly, this stochastic gradient can be approximated by samples collected from the policy πθ. In its simplest form, the ^θF^ in Step b. in Algorithm 2 can be computed by the vanilla policy gradient over the advantage component and using the reinforce estimator through the components involving Behavioral Test Functions acting on trajectories from πθ. We explain in Appendix 8.1 a lower-variance gradient estimator alternative.

BGPG can be thought of as a variant of Trust Region Policy Optimization with a Wasserstein penalty. As opposed to vanilla TRPO, the optimization path of BGPG flows through policy parameter space while encouraging it to follow a smooth trajectory through the geometry of the PPE space. We proceed to show that given the right embedding and cost function, we can prove a monotonic improvement theorem for BGPG, showing that our methods satisfy at least similar guarantees as TRPO.

For a given policy π, we denote as: Vπ, Qπ and Aπ(s,a)=Qπ(s,a)-Vπ(s) the: value function, Q-function and advantage function (see Appendix: Section 9.5). Furthermore, let V(π) be the expected reward of policy π and ρπ(s)=𝔼τπ[t=0T𝟏(st=s)] be the visitation measure.

Two distinct policies π and π~ can be related via the equation (see: [sutton1998introduction]) V(π~)=V(π)+𝒮ρπ~(s)(𝒜π~(a|s)Aπ(s,a)𝑑a)𝑑s and the linear approximations to V around π via: L(π~)=V(π)+𝒮ρπ(s)(𝒜π~(a|s)Aπ(s,a)𝑑a)𝑑s (see: [kakade2002approximately]). Let 𝒮 be a finite set. Consider the following embedding Φs:Γ|𝒮| defined by (Φ(τ))s=t=0T𝟏(st=s) and related cost function defined as: C(𝐯,𝐰)=𝐯-𝐰1. Then WD0(π~Φs,πΦs) is related to visitation frequencies since WD0(π~Φs,πΦs)s𝒮|ρπ(s)-ρπ~(s)| (see Section 9.5 for the proof). These observations enable us to prove an analogue of Theorem 1 from [trpo], namely:

Theorem 5.1.

If WD0(Pπ~Φs,PπΦs)δ and ϵ=maxs,a|Aπ(s,a)|, then V(π~)L(θ~)-δϵ.

As in [trpo], Theorem 5.1 implies a policy improvement guarantee for BGPG.

5.4 Behavior Guided Evolution Strategies (BGES)

ES takes a black-box optimization approach to RL, by considering a rollout of a policy, parameterized by θ as a black-box function F. This approach has gained in popularity recently [ES, horia, rbo2019]. If we take this approach to optimizing the objective in Eq. 2, the result is a black-box optimization algorithm which seeks to maximize the reward and simultaneously maximizes or minimizes the difference in behavior from the base embedding distribution bΦ. We call this method the Behavior-Guided Evolution Strategies (BGES) algorithm (see Algorithm 3 below).

Input: learning rate η, noise standard deviation σ, iterations T, BEM Φ, β
Initialize: Initial policy π0 parametrized by θ0, Behavioral Test Functions λ1,λ2. Evaluate policy π0 to return trajectory τ0 and subsequently use the BEM to produce an initial PPE ^π0Φ.  
for t=1,,T-1 do

       1. Sample ϵ1,,ϵn independently.  
2. Evaluate policies {πtk}k=1n parameterized by {θt+σϵk}k=1n to return rewards Rk and trajectories τk for all k.  
3. Use BEM to map trajectories τk to produce empirical PPEs ^πtkΦ for all k.  
4. Update λ1 and λ2 using Algorithm 1, where μ=1nk=1n^πt-1kΦ and ν=1nk=1n^πtkΦ are the uniform distribution over the set of PPEs from 3 for t-1 and t.  
5. Approximate WD^γ(πtkΦ,πtΦ) plugging in λ1,λ2 into Eq. 4 for each perturbed policy πk
6. Update Policy: θt+1=θt+ηESF, where:  
ESF=1σk=1n[(1-β)(Rk-Rt)+βWD^γ(πtkΦ,πtΦ)]ϵk
Algorithm 3 Behavior-Guided Evolution Strategies

When β>0, and we take bΦ=πt-1Φ, BGES resembles the NSR-ES algorithm from [conti], an instantiation of novelty search [Lehman08]. The positive weight on the WD-term enforces newly constructed policies to be behaviorally different from the previous ones (improving exploration) while the -term drives the optimization to achieve its main objective, i.e., maximize the reward. The key difference in our approach is the probabilistic embedding map, with WD rather than Euclidean distance. We show in Section 6.2 that BGES outperforms NSR-ES for challenging exploration tasks. The approximation introduced by Step 5 bypasses the need of computing a different pair of behavioral test functions λ1,λ2 for each perturbed policy πk.

If we take β<0, and assume bΦ=πΦ to correspond to embedded trajectories from an oracle or expert policy, we can perform imitation learning. Despite not accessing the expert’s policy (just the trajectories it generates), we show in Section 6.3 that this approach dramatically improves learning.

6 Experiments

Here we seek to test whether our behavior-guided approach to RL translates to performance gains for simulated environments. We individually evaluate our two proposed algorithms, BGPG and BGES, versus their respective baselines for a range of benchmark tasks. We also include a study of using our method for imitation learning. For each subsection we provide additional details in the Appendix.

6.1 Behavior-Guided Policy Gradient

Our key question is whether BGPG can outperform baseline TRPO methods using KL divergence. In Fig. 3, we see this is clearly the case for four continuous control tasks from OpenAI Gym or the DeepMind Control Suite [tassa2018deepmind]. For the BEM, we use the concatenation-of-actions (Section 5). We also confirm results from [trpo] that a trust region greatly improves performance, as we see the black curve (without one) often fails to learn.

(a) Pendulum
(b) Hopper: Stand
(c) Hopper: Hop
(d) Walker: Stand
Figure 3: BGPG vs. TRPO: We compare BGPG and TRPO (KL divergence) on several continuous control tasks. As a baseline we also include results without a trust region (β=0 in Algorithm 2). Plots show the mean±std across 5 random seeds. BGPG consistently outperforms other methods.
Wall Clock Time:

To illustrate computational benefits of alternating optimization (AO) of WD in BGPG, we compare it to the particle approximation (PA) method introduced in [zhang] in Fig. 4. In practice, the WD across different state samples can be optimized in a batched manner using AO (see Appendix for details). We see that AO is substantially faster than PA.

(a) Pendulum (b) Hopper: Stand (c) Hopper: Hop (d) Walker: Stand
Figure 4: The clock-time comparison (in sec) of BGPG (alternating optimization) with particle approximation.

6.2 Behavior-Guided Evolution Strategies

As a novelty-search method, BGES is designed to actively explore the environment by behaving differently for previous policies. With that in mind, we seek to evaluate the ability to solve two key challenges in exploration for RL: deceptive rewards and local maxima.

Deceptive Rewards

A common challenge in model-free RL is deceptive rewards. These arise since agents can only learn from data gathered via exploration in the environment. To test BGES in this setting, we created two intentionally deceptive environments where agents may easily be fooled into learning suboptimal policies. In both cases the agent is penalized at each time step for being far away from a goal. The deception comes from a wall situated in the middle, which means that initially positive rewards from moving directly forward will lead to a suboptimal policy.

Figure 5: Efficient Exploration. On the left we show a visualization of the simulated environment, with the deceptive barrier between the (quadruped) agent and the goal. On the right, we show two plots with the median curve across five seeds, with the IQR shaded for the quadruped and point environment respectively.

We consider two types of agents—a two-dimensional point and a much larger quadruped. Details are provided in the Appendix (Section 8). We compare with state-of-the-art on-policy methods for efficient exploration: NSR-ES from [conti], which assumes the BEM is deterministic and uses the Euclidean distance to compare policies, and NoisyNet-TRPO from [noisynet18]. Results are presented on Fig. 5. Policies avoiding the wall correspond to rewards: R>-5000 and R>-800 for the quadruped and point respectively. In the prior case an agent needs to first learn how to walk and the presence of the wall is enough to prohibit vanilla ES from even learning basic forward locomotion. BGES is the only method that drives the agent to the goal in both settings. For the quadruped agent the BEM is the reward-to-go while for the point agent we used the final state.

Figure 6: Escaping Local Maxima. A comparison of BGES with those using different distances on PPEs.
Escaping Local Maxima.

In Fig. 6 we compare our methods with methods using regularizers based on other distances or divergences (specifically, Hellinger, Jensen-Shannon (JS), KL and Total Variation (TV) distances), as well as vanilla ES (i.e., with no distance regularizer). Experiments were performed on a Swimmer environment from OpenAI Gym [brockman2016openai], where the number of samples of the ES optimizer was drastically reduced. BGES is the only one that manages to obtain good policies which also proves that the benefits come here not just from introducing the regularizer, but from its particular form.

6.3 Imitation Learning

Figure 7: Imitation Learning.

As discussed in Section 5.3, we can also utilize the BGES algorithm for imitation learning, by setting β<0, and using an expert’s trajectories for the PPE. For this experiment we use the reward-to-go BEM (Section 5). In Fig. 7, we show that this approach significantly outperforms vanilla ES on the Swimmer task. Although conceptually simple, we believe this could be a powerful approach with potential extensions, for example in designing safer algorithms.

7 Conclusion and Future Work

In this paper we proposed a new paradigm for on-policy learning in RL, where policies are embedded into expressive latent behavioral spaces and the optimization is conducted by utilizing the repelling/attraction signals in the corresponding probabilistic distribution spaces. The use of Wasserstein distances (WDs) guarantees flexibility in choosing cost funtions between embedded policy trajectories, enables stochastic gradient steps through corresponding regularized objectives (as opposed to KL divergence methods) and provides an elegant method, via their dual formulations, to quantify behaviorial difference of policies through the behavioral test functions. Furthermore, the dual formulations give rise to efficient algorithms optimizing RL objectives regularized with WDs.

We also believe the presented methods shed new light on several other challenging problems of modern RL, including: learning with safety guarantees (a repelling signal can be used to enforce behaviors away from dangerous ones) or anomaly detection for reinforcement learning agents (via the above score functions). We are also excited by the possibility of scaling this approach to a population setting, learning the behavioral embedding maps from data, or adapting the degree of repulsion/attraction during optimization (parameter β).

References

Appendix: Behavior-Guided Reinforcement Learning

8 Further Experimental Details

8.1 BGPG

A Lower-variance Gradient Estimator:

As explained in Section 5.2, the BGPG considers an objective which involves two parts: the conventional surrogate loss function for policy optimization [schulman2017proximal], and a loss function that involves the Behavior Test Functions. Though we could apply vanilla reinforced gradients on both parts, it is straightforward to notice that the second part can be optimized with reparameterized gradients [kingma2013auto], which arguably have lower variance compared to the reinforced gradients. In particular, we note that under random feature approximation (4), as well as the action-concatenation embedding, the Wasserstein distance loss WD^γ(PπθΦ,PbΦ) is a differentiable function of θ. To see this more clearly, notice that under a Gaussian policy a𝒩(μθ(s),σθ(s)2) the actions a=μθ(s)+σθ(s)ϵ are reparametrizable for ϵ being standard Gaussian noises. We can directly apply the reparametrization trick to this second objective to obtain a gradient estimator with potentially much lower variance. In our experiments, we applied this lower-variance gradient estimator.

Trust Region Policy Optimization:

Though the original TRPO [trpo] construct the trust region based on KL-divergence, we propose to construct the trust region with WD. For convenience, we adopt a dual formulation of the trust region method and aim to optimize the augmented objective 𝔼τπθ[R(τ)]-βWDγ(πΦ,πθΦ). We apply the concatenation-of-actions embedding and random feature maps to calculate the trust region. We identify several important hyperparameters: the RKHS (for the test function) is produced by RBF kernel k(x,y)=exp(x-y22/σ2) with σ=0.1; the number of random features is D=100; recall the embedding is Φ(τ)=[a1,a2aH] where H is the horizon of the trajectory, here we take 10 actions per state and embed them together, this is equivalent to reducing the variance of the gradient estimator by increasing the sample size; the regularized entropy coefficient in the WD definition as γ=0.1; the trust region trade-off constant β{0.1,1,10}. The alternate gradient descent is carried out with T=100 alternating steps and test function coefficients 𝐩D are updated with learning rate α𝐩=0.01.

The baseline algorithms are: No trust region, and trust region with KL-divergence. The KL-divergence is identified by a maximum KL-divergence threshold per update, which we set to ϵ=0.01.

Across all algorithms, we adopt the open source implementation [baselines]. Hyper-parameters such as number of time steps per update as well as implementation techniques such as state normalization are default in the original code base.

The additional experiment results can be found in Figure 8 where we show comparison on additional continuous control benchmarks: Tasks with DM are from DeepMind Contol Suites [tassa2018deepmind]. We see that the trust region constructed from the WD consistently outperforms other baselines (importantly, trust region methods are always better than the baseline without trust region, this confirms that trust region methods are critical in stabilizing the updates).

(a) Reacher
(b) MountainCar
(c) Acrobot
Figure 8: Additional Experiment on TRPO. We compare No Trust Region with two alternative trust region constructions: KL-divergence and Wassertein distance (ours).
Wasserstein AO vs. Particle Approximation:

To calculate the regularized Wasserstein distance, we propose a gradient descent method that iteratively updates the test function. The alternting optimization (AO) scheme consists of updating both the test function and the distribution parameters such that the regularized Wasserstein distance of the trainable distribution against the reference distribution is minimized. Alternatively, we can also adopt a particle approximation method to calculate the Wasserstein distance and update the distribution parameters using an approximate gradient descent method [zhang].

One major advantage of AO against particle approximation is its ease of parallization. In particular, when using the concatenation-of-actions embedding, the aggregate Wasserstein distance can be decomposeed into an average of a set of Wasserstein distances over states. To calculate this aggregated gradient, AO can easily leverage the matrix multiplication; on the other hand, particle approximation requires that the dual optimal variables of each subproblem be computed, which is not straightforward to parallelize.

We test both methods in the context of trust region policy search, in which we explicitly calculate the Waserstein distance of consecutive policies and enforce the constraints using a line search as in [trpo]. Both methods require the trust region trade-off parameter β{0.1,1,10}. We adopt the particle method in [zhang] where for each state there are M=16 particles. The gradients are derived based a RKHS where we adaptively adjust the coefficient of the RBF kernel based on the mean distance between particles. For the AO, we find that it suffices to carry out T{1,5,10} gradient descents to approximate the regularized Wasserstein distance.

8.2 BGES

Efficient Exploration:

To demonstrate the effectiveness of our method in exploring deceptive environments, we constructed two new environments using the MuJoCo simulator. For the point environment, we have a 6 dimensional state and 2 dimensional action, with the reward at each timestep calculated as the distance between the agent and the goal. We use a horizon of 50 which is sufficient to reach the goal. The quadruped environment is based on Ant from the Open AI Gym [brockman2016openai], and has a similar reward structure to the point environment but a much larger state space (113) and action space (8). For the quadruped, we use a horizon length of 400.

To leverage the trivially parallelizable nature of ES algorithms, we use the ray library, and distribute the rollouts across 72 workers using AWS. Since we are sampling from an isotropic Gaussian, we are able to pass only the seed to the workers, as in [ES]. However we do need to return trajectory information to the master worker.

For both the point and quadruped agents, we use random features with dimensionality m=1000, and 100 warm-start updates for the WD at each iteration. For point, we use the final state embedding, learning rate η=0.1 and σ = 0.01. For the quadruped, we use the reward-to-go embedding, as we found this was needed to learn locomotion, as well as a learning rate of η=0.02 and σ = 0.02. The hyper-parameters were the same for all ES algorithms. When computing the WD, we used the previous 2 policies, θt-1 and θt-2. An ablation study for the point environment for both the choice of embedding and number of prior policies is shown in Fig 11.

(a) Embeddings (b) Previous Policies
Figure 9: A sensitivity analysis investigating a) the impact of the embedding and b) the number of previous policies θt-i,i1,2,5

For embeddings, we compare the reward-to-go (RTG), concatenation of states (SV) and final state (SF). In both the RTG and SF case the agent learns to navigate past the wall (>-800). For the number of previous policies, we use the SF embedding, and using 2 appears to work best, but both 1 and 5 do learn the correct behavior.

Escaping Local Maxima:

We also demonstrated that our method leads to faster training even in more standard settings, where exploration is not that crucial, but the optimization can be trapped in local maxima. To show it, we compared baseline ES algorithm for ES optimization from [ES] with its enhancements, where regularizers using different metrics on the space of probabilistic distributions corresponding to policy embeddings were used, as in the previous paragraph. We noticed that adding Wasserstein regularizers drastically improved optimization, whereas regularizers based on other distances/divergencies, namely: Hellinger, Jensen-Shannon, KL and TV did not have any impact. We considered Swimmer task from OpenAI Gym and to make it challenging, reduced the number of perturbations per iteration to 80. In that setting our method was the only one that was not trapped in local maxima and managed to learn effective policies.

8.3 Imitation Learning:

For the Imitation Learning experiment we used the reward-to-go embedding, with learning rate η=0.1 and σ = 0.01. We use one oracle policy, which achieves >360 on the environment. The only information provided to the algorithm is the embedded trajectory, used to compute the WD. This has exciting future applications since no additional information about the oracle is required in order to significantly improve learning.

8.4 Repulsion Learning

Input: β,η>0, M
Initialize: Initial stochastic policies π0𝐚,π0𝐛, parametrized by θ0𝐚,θ0𝐛 respectively, Behavioral Test Functions λ1𝐚,λ2𝐛
for t=1,,T  do

       1. Collect M trajectories {τi𝐚}i=1M from πt-1𝐚 and M trajectories {τi𝐛}i=1M from πt-1𝐛. Approximate πt-1𝐚πt-1𝐛 via 1M{τi𝐚}i=1M1M{τi𝐛}i=1M:=P^πt-1𝐚,πt-1𝐛
2. Form two distinct surrogate rewards for joint trajectories of agents 𝐚 and 𝐛:
R~𝐚(τ1,τ2) =(τ1)+βλ1𝐚(Φ(τ1))+βγexp(λ1𝐚(Φ(τ1))-λ2𝐛(Φ(τ2))-C(Φ(τ1)),Φ(τ2))γ)
R~𝐛(τ1,τ2) =(τ2)-βλ2𝐛(Φ(τ2))+βγexp(λ1𝐚(Φ(τ1))-λ2𝐛(Φ(τ2))-C(Φ(τ1)),Φ(τ2))γ)

3. For 𝐜{𝐚,𝐛} use the Reinforce estimator to take gradient steps:
θt𝐜=θt-1𝐜+η𝔼τ𝐚,τ𝐛P^πt-1𝐚,πt-1𝐛[~𝐜(τ𝐚,τ𝐛)(i=0H-1θt-1𝐜log(πt-1𝐜(ai𝐜|si𝐜)))]
Where τ𝐚=s0𝐚,a0𝐚,r0𝐚,,sH𝐚,aH𝐚,rH𝐚 and τ𝐛=s0𝐛,a0𝐛,r0𝐛,,sH𝐛,aH𝐛,rH𝐛.  
5. Use samples from P^πt-1𝐚,πt-1𝐛 and Algorithm 1 to update the Behavioral Test Functions λ1𝐚,λ2𝐛.
Algorithm 4 Behvaior-Guided Repulsion Learning

Although this was not discussed in the main section of the paper, it is possible to use our behavioral approach to simultaneously learn multiple policies exhibiting different behaviors all of which are able to solve the same task. This is not the main focus of the main paper, but we chose to include these results in an attempt to provide the readers with a better understanding of Behavioral Test functions.

Algorithm 4 maintains two policies π𝐚 and π𝐛. Each policy is optimized by taking a policy gradient step (using the Reinforce gradient estimator) in the direction optimizing surrogate rewards ~𝐚 and ~𝐛 that combines the signal from the task’s reward function and the repulsion score encoded by the behavioral test functions λ𝐚 and λ𝐛.

We conducted experiments testing Algorithm 4 on a simple Mujoco environment consisting of a particle that moves on the plane and whose objective is to learn a policy that allows it to reach one of two goals. Each policy outputs a velocity vector and stochasticity is achieved by adding Gaussian noise to the mean velocity encoded by a neural network with two size 5 hidden layers and ReLu activations. If an agent performs action a at state s, it moves to state a+s. The reward of an agent after performing action a at state s equals -a2*30-min(d(s,Goal1),d(s,Goal2))2 where d(x,y) denotes the distance between x and y in 2. The initial state is chosen by sampling a Gaussian distribution with mean (00) and diagonal variance 0.1. In each iteration step we sample 100 trajectories. In the following pictures we plot the policies’ behavior by plotting 100 trajectories of each. The embedding Φ:Γ maps trajectories τ to their mean displacement in the x-axis. We use the squared absolute value difference as the cost function.

(a) π0𝐚 (b) π0𝐛 (c) λ𝐚 and -λ𝐛 at t=0
Figure 10: Initial state of policies π𝐚,π𝐛 and Behavioral Test functions λ𝐚,λ𝐛 in the Multigoal environment.

There are two optimal policies, moving the particle to the left goal or moving it to the right goal. We now plot how the policies’ behavior and evolves throughout optimization and how the Behavioral Test Functions guide the optimization by favouring the two policies to be far apart.

(a) π22𝐚 (b) π22𝐛 (c) λ𝐚 and -λ𝐛 at t=22 (d) π118𝐚 (e) π118𝐛 (f) λ𝐚 and -λ𝐛 at t=118
Figure 11: Evolution of the policies and Behavioral Test Functions throughout optimization.

Let 𝒳 and 𝒴 be the domains of two measures μ, and ν. Recall that in case γ=0, 𝒳=𝒴, and C(x,x)=0 for all x𝒳, then λμ*(x)=λν*(x)=λ*(x) for all x𝒳. In the case of regularized Wasserstein distances with γ>0, this relationship may not hold true even if the cost satisfies the same diagonal assumption. For example when the regularizing measure is the product measure, and μ,ν have disjoint supports, since the soft constraint γexp(λμ(𝐱)-λν(𝐲)-C(𝐱,𝐲)γ) is enforced in expectation over the product measure there may exist optimal solutions λμ*,λν* that do not satisfy λμ*=λν*.

9 Theoretical results

We start by exploring some properties of the Wasserstein distance and its interaction with some simple classes of embeddings. The first lemma we show has the intention to show conditions under which two policies can be shown to be equal provided the Wasserstein distance between its trajectory embeddings is zero. This result implies that our framework is capable of capturing equality of policies when the embedding space equals the space of trajectories.

Lemma 9.1.

Let S and A be finite sets, the MDP be episodic (i.e. of finite horizon H), and Φ(τ)=t=0Hest,at with es,aR|S|+|A| the indicator vector for the state action pair (s,a). Let C(v,w)=v-wpp for p1. If γ=0 and WDγ(PπΦ,PπΦ)=0 then π=π.

Proof.

If WDγ(πΦ,πΦ)=0, there exists a coupling Π between πΦ and πΦ such that:

𝔼u,vΠ[u-vpp]=0

Consequently:

𝔼u,vΠ[(s,a)𝒮×𝒜|us,a-vs,a|p]=(s,a)𝒮×𝒜𝔼u,vΠ[|us,a-vs,a|p]=0

Therefore for all (s,a)𝒮×𝒜:

|𝔼uπΦ[us,a]-𝔼vπΦ[vs,a]|p𝔼u,vΠ[|us,a-vs,a|p]=0

Where us,a and vs,a denote the (s,a) entries of u and v respectively. Notice that for all (s,a)𝒮×𝒜:

πΦ(s,a)=πΦ(s,a) (8)

Since for all s𝒮 and p1:

|a𝒜us,a-vs,a|pa𝒜|us,a-vs,a|p

Therefore for all s𝒮:

|𝔼uπΦ[a𝒜us,a]-𝔼vπΦ[a𝒜vs,a]|p𝔼u,vΠ[a𝒜|us,a-vs,a|p]=0

Consequently πΦ(s)=πΦ(s) for all s𝒮. By Bayes rule, this plus equation 8 yields:

πΦ(a|s)=πΦ(a|s)

And therefore: π=π. ∎

These results can be extended in the following ways:

  1. 1.

    In the case of a continuous state space, it is possible to define embeddings using Kernel density estimators. Under the appropriate smoothness conditions on the visitation frequencies, picking an adequate bandwidth and using the appropriate norm to compare different embeddings it is possible to derive similar results to those in Lemma 9.1 for continuous state spaces.

  2. 2.

    For embeddings such as Φ5 in Section 3.1 or Φ(τ)=t=0Hest,at, when γ=0, if WDγ(πΦ,πΦ)ϵ then |V(π)-V(π)|ϵR for R=maxτΓ(τ) thus implying that a small Wasserstein distance between π and πs PPEs implies a small difference in their value functions.

9.1 Random features stochastic gradients

Let ϕκ and ϕ be two feature maps over 𝒳 and 𝒴 and corresponding to kernels κ and respectively. For this and the following sections we will make use of the following expression:

G(𝐩μ,𝐩ν) =β𝒳(𝐩μ)ϕκ(𝐱)𝑑μ(𝐱,θ)-β𝒴(𝐩ν)ϕ(𝐲)𝑑ν(𝐲)+ (9)
γβ𝒳×𝒴exp((𝐩μ)ϕκ(𝐱)-(𝐩ν)ϕ(𝐲)-C(𝐱,𝐲)γ)𝑑μ(𝐱)𝑑ν(𝐲)

We now show how to compute gradients with respect to the random feature maps:

Lemma 9.2.

The gradient (pμpν)G(pμ,pν) of the objective function from Equation 9 with respect to the parameters (pμpν) satisfies:

(𝐩μ𝐩ν)G(𝐩μ,𝐩ν)=β𝔼(𝐱,𝐲)μν[(1-exp((𝐩μ)ϕκ(𝐱)-(𝐩ν)ϕ-C(𝐱,𝐲)γ))(ϕκ(𝐱)-ϕ(𝐲))]
Proof.

A simple use of the chain rule, taking the gradients inside the expectation, and the fact that 𝐩μ and 𝐩ν are vectors yields the desired result. ∎

The main consequence of this formulation is the stochastic gradients we use in Algorithm 1.

9.2 BGPG, BGES and their theoretical guarantees

Here we provide some theoretical guarantees for our algorithms. Both proposed methods BGPG and BGES follow the alternating optimization algorithmic template. We start by noting that stripped to their bare bones components our algorithms satisfy two paradigms:

  1. 1.

    Min-Max optimization. When β<0 the algorithms we propose for solving the optimization objective in Equation , turn into an Alternating Min-Max optimization procedure. Henceforth whenever we refer to the problem defined by setting β<0 we will call it the Min-Max problem.

  2. 2.

    Max-Max optimization. When β>0 the algorithms we propose for solving the optimization objective in Equation turn into an alternating Max-Max procedure. Henceforth whenever we refer to the problem defined by setting β<0 we will call it the Max-Max problem.

Our theoretical results will therefore center around providing guarantees for optimizing an objective of the form:

F(θ,𝐩πθ,𝐩π) =𝔼τ1,τ2πθπ[(π1)+𝐩πθ,ϕκ(Φ(τ1))-β𝐩π,ϕκ(Φ(τ2))+ (10)
γβexp(𝐩πθ,ϕκ(Φ(τ1))-ϕπ,ϕκ(Φ(τ2))-C(Φ(τ1),Φ(τ2))γ)]

Where πθ is a policy parametrized by θ, π is a target policy, and 𝐩πθ and 𝐩π are the feature vectors corresponding to the Behavioral Test functions from πθ and π.

9.3 Max-Max Problem: theoretical analysis

We will analyze our AO algorithm for the Max-Max optimization problem. We show that obtained solutions converge to the local maxima of the objective function. Consider the function F(θ,𝐩λπθ,𝐩λπ) as in 10. We denote by (θ*,𝐩*λπθ,𝐩*λπ) some of its local maxima. Define F~(θ)=F(θ,𝐩*λπθ,𝐩*λπ), i.e. F~ is F as a function of θ for locally optimal values of 𝐩λπθ and 𝐩λπ.

We will assume that F~ is locally ζ-strongly concave and δ-smooth for some fixed ζ,δ>0 in the neighborhood N(θ*,r) of its optimal value θ*. We will also assume that gradient of F~ is L2-Lipschitz with Lipschitz coefficient ϕ in that neighborhood. The following convergence theorem holds:

Our goal in this section is to prove the following Theorem:

Theorem 9.3.

For the entropy coefficient γ, denote: ϕγ=12γe4γ, and uγ=83γe4γ. Denote ϕ*=max(ϕ,ϕγ) and ξ=min(2δζδ+ζ,uγ). Let s(t)=(θ(t),pλπθ(t),pλπ(t)) be the solution from iteration t and s* the local maximum considered above. Assume that optimization starts in θ0N(θ*,r). If ϕ*<2ξ3, then the error at iteration t+1 of the presented AO algorithm for the Max-Max problem with decaying gradient step size αt=3/2[2ξ-3ϕ*](t+2)+32ϕ* satisfies:

𝔼[s(t+1)-s*22]𝔼[s(0)-s*22](2t+3)32+σ29[2ξ-3ϕ*]2(t+3), (11)

where σ=2(1+e2γ)2+supN(θ*),r)θF~(θ)2.

We will need several auxiliary technical results. We will use the following notation: Wγ(θ,β1,β2)=F(θ,𝐩λπθ,𝐩λπ), where F is the objective function from the main body of the paper parameterised by entropy coefficient γ>0, β1=𝐩λπθ and β2=𝐩λπ. We will apply this notation also in the next section regarding the Min-Max Problem. For completeness, the definitions of strong concavity, smoothness and Lipschitz condition from Theorem 9.3 are given in Section 9.3.1.

We consider the dual optimization problem:

maxθmaxλ1𝒞(𝒳),λ2𝒞(𝒴)Wγ(θ,β1,β2)
=maxθmaxλ1𝒞(𝒳),λ2𝒞(𝒴)𝔼(x,y,κx,1,,κx,D,κy,1,,κy,D)μ×ν×ω××ω
        [(θ)+λ1(Φ(x))+λ2(Φ(y))-γexp(λ1(Φ(x))+λ2(Φ(y))-C(Φ(x),Φ(y))γ)],

where γ is a parameter, μ=πθ, ν=π, Φ is a fixed trajectories’ embedding and furthermore:

λ1(z) = β1f(z)
λ2(z) = β2f(z),

such that f() is a random feature vector. The ith entry of the feature vector is constructed as follows: [f(z)]i=2Dcos(zwz,i+bz,i), where wz,i(0,𝕀n1ρ2) and bz,iUniform(0,2π). For the ease of notation we denote κz,i=(wz,i,bz,i)ω. We consider stochastic gradient ascent optimization strategy of the following form:

  • at time t we receive a single sample (xt,yt,κtxt,1,,κtxt,D,κtyt,1,,κtyt,D)μ×ν×ω××ω, then we form feature vectors [f(xt)]i=2Dcos(xtwtxt,i+btxt,i) and [f(yt)]i=2Dcos(ytwtyt,i+btyt,i), and finally update:

    • β1t+1Π1[β1t+αtf(xt)(1-exp((β1t)f(xt)+(β2t)f(yt)-C(x,y)γ))]

    • β2t+1Π2[β2t+αtf(yt)(1-exp((β1t+1)f(xt)+(β2t)f(yt)-C(x,y)γ))].

Πd (d=1,2) denotes the projection onto the Euclidean ball B2(rd2,βd0) of some given radius rd2 centered at the initial iterate βd0.

Let {β1*,β2*} denote the global optimum of Wγ(β1,β2) computed on the entire data population, i.e. given access to an infinite number of samples (“oracle”). Let B2(r,β) denote the Euclidean ball of radius r centered at β. For the sake of the theoretical analysis we assume that (a lower-bound on) the radii of convergence r1,r2 for β1,β2, respectively, is known to the algorithm (this assumption is potentially easy to eliminate with a more careful choice of the step size in the first iterations). To be more specific, if at any point in time parameter β1 or β2 falls outside the ball B2(r1,β1*) or B2(r2,β2*), respectively, the projection is applied that pushes the parameter of interest to stay in the ball. Also, let β1Wγ1(β1,β2) and β2Wγ1(β1,β2) denote the gradients of Wγ with respect to β1 and β2, respectively, computed for a single sample. Similarly, β1Wγ(β1,β2) and β2Wγ(β1,β2) be the gradient of Wγ with respect to β1 and β2, respectively, computed for the entire data population, i.e. infinite number of samples.

Note that given any initial vector βd0 in the ball of radius rd2 centered at βd*, we are guaranteed that all iterates remain within an rd-ball of βd*. This is true for all d=1,2. The projection is necessary for theoretical analysis but in practice makes little difference. The above is a two-step alternated optimization scheme.

Let the population gradient operator, 𝒢d(β1,β2), where d=1,2, be defined as

𝒢d(β1,β2)βd+αβdWγ(β1,β2).

9.3.1 Assumptions

Let Wγ,1*(β1)=Wγ(β1,β2*) and Wγ,2*(β2)=Wγ(β1*,β2). Let Ω1,Ω2 denote non-empty compact convex sets such β1Ω1 and β2Ω2. The following assumptions are made:

Assumption 9.1 (Strong concavity).

For d=1,2, the function Wγ,d*(βd) is ζd-strongly concave near βd*, i.e. for all pairs (ad,bd) in the neighborhood of βd* the following holds

Wγ,d*(ad)-Wγ,d*(bd)-βdWγ,d*(bd),ad-bd-ζd2ad-bd22,

where ζd>0 is the strong concavity modulus.

Assumption 9.2 (Smoothness).

For d=1,2, the function Wγ,d*(βd) is δd-smooth, i.e. for all pairs (ad,bd) the following holds

Wγ,d*(ad)-Wγ,d*(bd)-βdWγ,d*(bd),ad-bd-δd2ad-bd22,

where δd>0 is the smoothness constant.

Assumption 9.3 (Gradient stability (GS) / Lipschitz condition).

We assume Wγ(β1,β2) satisfy GS (ϕd) condition, for all d=1,2, over Euclidean balls β1B2(r1,β1*),β2B2(r2,β2*) given as follows

βdWγ,d*(βd)-βdWγ(β1,β2)2ϕdβd¯-βd¯*2,

where ϕd>0 and d¯=(dmod2)+1.

Finally, define the bound σ that considers the expected value of the norm of gradients of our objective function as follows: σ=σ12+σ22, where σd2=sup{𝔼[βdWγ1(β1,β2)22]:β1B2(r1,β1*),β2B2(r2,β2*)} for d=1,2.

9.3.2 Main theorems

Theorem 9.4.

Given the stochastic gradient iterates of Max-Max method with decaying step size {αt}t=0 and with ϕ<2ξ3 the error at iteration t+1 satisfies the recursion

𝔼[β1t+1-β1*22+β2t+1-β2*22](1-qt)𝔼[β1t-β1*22+β2t-β2*22]+(αt)21-αtϕσ2,

where ϕ=maxd=1,2(ϕd), qt=1-1-2αtξ+2αtϕ1-αtϕ, and ξ=mind=1,2(2δdζdδd+ζd).

The recursion in Theorem 9.4 is expanded yielding the convergence theorem:

Theorem 9.5.

Given the stochastic gradient iterates of Max-Max method with decaying step size αt=3/2[2ξ-3ϕ](t+2)+32ϕ and assuming that ϕ<2ξ3, the error at iteration t+1 satisfies

𝔼[β1t+1-β1*22+β2t+1-β2*22]𝔼[β10-β1*22+β20-β2*22](2t+3)32+σ29[2ξ-3ϕ]2(t+3),

where ϕ=maxd=1,2(ϕd) and ξ=mind=1,2(2δdζdδd+ζd).

9.3.3 Analysis

The theoretical analysis we provide below is an extension of the analysis in [balakrishnan2017] to the two-step alternated optimization scheme.

Proof of Theorem 9.5 relies on Theorem 9.4, which in turn relies on Theorem 9.7 and Lemma 9.6, both of which are stated below. Proofs of the lemma and theorems follow in the subsequent subsections.

The next result is a standard result from convex optimization (Theorem 2.1.14 in [Nesterov:2014:ILC:2670022]) and is used in the proof of Theorem 9.7 below.

Lemma 9.6.

The gradient operator G1(β1,β2*) under Assumption 9.1 (strong concavity) and Assumption 9.2 (smoothness) with constant step size choice 0<α2δ1+ζ1 is contractive, i.e.

𝒢1(β1,β2*)-β1*2(1-2αδ1ζ1δ1+ζ1)β1-β1*2 (12)

for all β1B2(r1,β1*).

Similarly, the gradient operator G2(β1*,β2) under Assumption 9.1 (strong concavity) and Assumption 9.2 (smoothness) with constant step size choice 0<α2δ2+ζ2 is contractive, i.e.

𝒢2(β1*,β2)-β2*2(1-2αδ2ζ2δ2+ζ2)β2-β2*2 (13)

for all β2B2(r2,β2*).

The next theorem also holds for d=1,2. Let r1,r2>0 and β1B2(r1,β1*),β2B2(r2,β2*).

Theorem 9.7.

For some radius rd>0 (d=1,2) and a triplet (ϕd,ζd,δd) such that 0ϕd<ζdδd, suppose that the function Wγ,d*(βd) is ζd-strongly concave and δd-smooth, and that the GS (ϕd) condition holds. Then the population gradient operator Gd(β1,β2) with step α such that 0<αmind=1,22δd+ζd is contractive over a ball B2(rd,βd*), i.e.

𝒢d(β1,β2)-βd*2(1-ξα)βd-βd*2+αϕβd¯-βd¯*2 (14)

where d¯=(dmod2)+1, ϕmaxd=1,2ϕd, and ξmind=1,22δdζdδd+ζd.

Proof.
𝒢d(β1,β2)-βd*2=βd+αβdWγ(β1,β2)-βd*2
by the triangle inequality (and with dd¯) we further get
βd+αβdWγ,d*(βd)-βd*2+αβdWγ(β1,β2)-βdWγ,d*(βd)2
by the contractivity from Lemma 9.6 and GS condition
(1-2αδdζdδd+ζd)βd-βd*2+αϕdβd¯-βd¯*2.

Proof of Theorem 9.4

Let βdt+1=Πd(β~dt+1), where β~1t+1β1t+αtβ1Wγ1(β1t,β2t) and β~2t+1β2t+αtβ2Wγ1(β1t+1,β2t), where βdWγ1 is the gradient computed with respect to a single sample, β~1 and β~2 are the updates prior to the projection onto a ball B2(rd2,βd0). Let Δdt+1βdt+1-βd* and Δ~dt+1β~dt+1-βd*. Thus

Δdt+122-Δdt22 Δ~dt+122-Δdt22
= β~dt+1-βd*-βdt-βd*
= β~dt+1-βdt,β~dt+1+βdt-2βd*.

Let Q^1tβ1Wγ1(β1t,β2t) and Q^2tβ2Wγ1(β1t+1,β2t). Then we have that β~dt+1-βdt=αtQ^dt. We combine it with Equation 9.4.3 and obtain:

Δdt+122-Δdt22
αtQ^dt,αtQ^dt+2(βdt-βd*)
= (αt)2(Q^dt)Q^dt+2αt(Q^dt)(βdt-βd*)
= (αt)2Q^dt22+2αtQ^dt,Δdt.

Let Q1tβ1Wγ(β1t,β2t) and Q2tβ2Wγ(β1t+1,β2t). By the properties of martingales, i.e. iterated expectations and tower property:

𝔼[Δdt+122]𝔼[Δdt22]+(αt)2𝔼[Q^dt22]+2αt𝔼[Qdt,Δdt] (15)

Let Qd*βdWγ(β1*,β2*). By self-consistency, i.e. βd*=argmaxβdΩdWγ,d*(βd), and convexity of Ωd we have that

Qd*,Δdt=βdWγ(β1*,β2*),Δdt=0.

Combining this with Equation 15 we have

𝔼[Δdt+122] 𝔼[Δdt22]+(αt)2𝔼[Q^dt22]+2αt𝔼[Qdt-Qd*,Δdt].

Define 𝒢dtβdt+αtQdt and 𝒢dt*βd*+αtQd*. Thus

αtQdt-Qd*,Δdt
= 𝒢dt-𝒢dt*-(βdt-βd*),βdt-βd*
= 𝒢dt-𝒢dt*,βdt-βd*-βdt-βd*22
by the fact that 𝒢dt*=βd*+αtQd*=βd* (since Qd*=0):
= 𝒢dt-βd*,βdt-βd*-βdt-βd*22
by the contractivity of 𝒢t from Theorem 9.7:
{(1-αtξ)βdt-βd*+αtϕ(i=1d-1βit+1-βi*2+i=d+12βit-βi*2)}βdt-βd*2-βdt-βd*22
{(1-αtξ)Δdt2+αtϕ(i=1d-1Δit+12+i=d+12Δit2)}Δdt2-Δdt22

Combining this result with Equation 9.3.3 gives

𝔼[Δdt+122] 𝔼[Δdt22]+(αt)2𝔼[Q^dt22]+2𝔼[{(1-αtξ)Δdt2+αtϕ(i=1d-1Δit+12+i=d+12Δit2)}
Δdt2-Δdt22]
𝔼[Δdt22]+(αt)2σd2+2𝔼[{(1-αtξ)Δdt2+αtϕ(i=1d-1Δit+12+i=d+12Δit2)}
Δdt2-Δdt22].

After re-arranging the terms we obtain

𝔼[Δdt+122](αt)2σd2+(1-2αtξ)𝔼[Δdt22]+2αtϕ𝔼[(i=1d-1Δit+12+i=d+12Δit2)Δdt2]
apply 2aba2+b2:
(αt)2σd2+(1-2αtξ)𝔼[Δdt22]+αtϕ𝔼[i=1d-1(Δit+122+Δdt22)]+αtϕ𝔼[i=d+12(Δit22+Δdt22)]
= (αt)2σd2+𝔼[Δdt22][1-2αtξ+αtϕ]+αtϕ𝔼[i=1d-1Δit+122]+αtϕ𝔼[i=d+12Δit22]

We obtained

𝔼[Δdt+122](αt)2σd2+[1-2αtξ+αtϕ]𝔼[Δdt22]+αtϕ𝔼[i=1d-1Δit+122]+αtϕ𝔼[i=d+12Δit22]
we next re-group the terms as follows
𝔼[Δdt+122]-αtϕ𝔼[i=1d-1Δit+122][1-2αtξ+αtϕ]𝔼[Δdt22]+αtϕ𝔼[i=d+12Δit22]+(αt)2σd2
and then sum over d from 1 to 2
𝔼[d=12Δdt+122]-αtϕ𝔼[d=12i=1d-1Δit+122]
[1-2αtξ+αtϕ]𝔼[d=12Δdt22]+αtϕ𝔼[d=12i=d+12Δit22]+(αt)2d=12σd2

Let σ=σ12+σ22. Also, note that

𝔼[d=12Δdt+122]-αtϕ𝔼[d=12Δdt+122]𝔼[d=12Δdt+122]-αtϕ𝔼[d=12i=1d-1Δit+122]

and

[1-2αtξ+αtϕ]𝔼[d=12Δdt22]+αtϕ𝔼[d=12i=d+12Δit22]+(αt)2σ2
[1-2αtξ+αtϕ]𝔼[d=12Δdt22]+αtϕ𝔼[d=12Δdt22]+(αt)2σ2

Combining these two facts with our previous results yields:

[1-αtϕ]𝔼[d=12Δdt+122]
[1-2αtξ+αtϕ]𝔼[d=12Δdt22]+αtϕ𝔼[d=12Δdt22]+(αt)2σ2
= [1-2αtξ+2αtϕ]𝔼[d=12Δdt22]+(αt)2σ2

Thus:

𝔼[d=12Δdt+122] 1-2αtξ+2αtϕ1-αtϕ𝔼[d=12Δdt22]
+ (αt)21-αtϕσ2.

Since ϕ<2ξ3, 1-2αtξ+2αtϕ1-αtϕ<1.

Proof of Theorem 9.5

To obtain the final theorem we need to expand the recursion from Theorem 9.4. We obtained

𝔼[d=12Δdt+122]
1-2αt[ξ-ϕ]1-αtϕ𝔼[d=12Δdt22]+(αt)21-αtϕσ2
=(1-αt[2ξ-3ϕ]1-αtϕ)𝔼[d=12Δdt22]+(αt)21-αtϕσ2

Recall that we defined qt in Theorem 9.4 as

qt=1-1-2αtξ+2αtϕ1-αtϕ=αt[2ξ-3ϕ]1-αtϕ

and denote

ft=(αt)21-αtϕ.

Thus we have

𝔼[d=12Δdt+122](1-qt)𝔼[d=12Δdt22]+ftσ2
(1-qt){(1-qt-1)𝔼[d=12Δdt-122]+ft-1σ2}+ftσ2
= (1-qt)(1-qt-1)𝔼[d=12Δdt-122]+(1-qt)ft-1σ2+ftσ2
(1-qt)(1-qt-1){(1-qt-2)𝔼[d=12Δdt-222]+ft-2σ2}+(1-qt)ft-1σ2+ftσ2
= (1-qt)(1-qt-1)(1-qt-2)𝔼[d=12Δdt-222]
+(1-qt)(1-qt-1)ft-2σ2+(1-qt)ft-1σ2+ftσ2

We end-up with the following

𝔼[d=12Δdt+122] 𝔼[d=12Δd022]i=0t(1-qi)+σ2i=0t-1fij=i+1t(1-qj)+ftσ2.

Set qt=32t+2 and

αt = qt2ξ-3ϕ+qtϕ
= 32[2ξ-3ϕ](t+2)+32ϕ.

Denote A=2ξ-3ϕ and B=32ϕ. Thus

αt=32A(t+2)+B

and

ft=(αt)21-23Bαt=94A(t+2)[A(t+2)+B].
𝔼[d=12Δdt+122]
𝔼[d=12Δd022]i=0t(1-32i+2)+σ2i=0t-194A(i+2)[A(i+2)+B]j=i+1t(1-32j+2)
+σ294A(t+2)[A(t+2)+B]
= 𝔼[d=12Δd022]i=2t+2(1-32i)+σ2i=2t+194Ai[Ai+B]j=i+1t+2(1-32j)+σ294A(t+2)[A(t+2)+B]

Since A>0 and B>0 thus

𝔼[d=12Δdt+122]
𝔼[d=12Δd022]i=2t+2(1-32i)+σ2i=2t+194Ai[Ai+B]j=i+1t+2(1-32j)+σ294A(t+2)[A(t+2)+B]
𝔼[d=12Δd022]i=2t+2(1-32i)+σ2i=2t+194(Ai)2j=i+1t+2(1-32j)+σ294[A(t+2)]2

We can next use the fact that for any a(1,2):

i=τ+1t+2(1-ai)(τ+1t+3)a.

The bound then becomes

𝔼[d=12Δdt+122]
𝔼[d=12Δd022]i=2t+2(1-32i)+σ2i=2t+194(Ai)2j=i+1t+2(1-32j)+σ294[A(t+2)]2
𝔼[d=12Δd022](2t+3)32+σ2i=2t+194(Ai)2(i+1t+3)32+σ294[A(t+2)]2
= 𝔼[d=12Δd022](2t+3)32+σ2i=2t+294(Ai)2(i+1t+3)32

Note that (i+1)322i for i=2,3,, thus

𝔼[d=12Δdt+122]
𝔼[d=12Δd022](2t+3)32+σ294A2(t+3)32i=2t+2(i+1)32i2
𝔼[d=12Δd022](2t+3)32+σ292A2(t+3)32i=2t+21i12
finally note that i=2t+21i121t+21x12𝑑x2(t+3)12. Thus
𝔼[d=12Δd022](2t+3)32+σ29A2(t+3)
substituting A=2ξ-3ϕ gives
= 𝔼[d=12Δd022](2t+3)32+σ29[2ξ-3ϕ]2(t+3)

This leads us to the final theorem.

Proof of Theorem 9.3

In order to prove Theorem 9.3 it suffices to apply Theorem 9.5 and notice that:

  • function h𝐯,C:d defined as follows: h𝐯,C(𝐰)=𝐰𝐯-Ae𝐰𝐯λ for A=γe-Cγ is 2γe4γ-smooth, 2γ-strongly concave and its gradient is Lipschitz with Lipschitz coefficient 12γe4γ (with respect to L2-norm) for C>0 and 𝐯 satisfying: |𝐰𝐯|1,

  • h𝐯,C(𝐯)222(1+e2γ)2 under above conditions.

9.4 MinMax Problem: theoretical analysis

In this section we aim to obtain similar results for Min-Max Problem as for Max-Max problem. We will use the same notation as in the main body of the paper. We prove the following results:

Theorem 9.8.

Denote ϕ*=max(ϕ,ϕγ) and ξ=min(2δζδ+ζ,uγ), where ϕγ and uγ are as in Theorem 9.3. Let s(t)=(θ(t),pλπθ(t),pλπ(t)) be the solution obtained in iteration t and s* the local optimum. Assume that optimization starts in θ0N(θ*,r). If ϕ*<ξ3, then the error at iteration t+1 of the alternating optimization algorithm for the Min-Max problem with decaying gradient step size αt=3/2[2ξ-6ϕ*](t+2)+3ϕ* satisfies:

𝔼[s(t+1)-s*22]𝔼[s(0)-s*22](2t+3)32+σ29[2ξ-6ϕ*]2(t+3), (16)

where σ=2(1+e2γ)2+supN(θ*),r)θF~(θ)2.

We use the same technical notation as in the previous section, with the only exception that this time we denote: Wγ(θ,β1,β2)=-F(θ,𝐩λπθ,𝐩λπ). We consider the MinMax problem of the following form

minθmaxλ1𝒞(𝒳),λ2𝒞(𝒴)Wγ(θ,β1,β2)
=minθmaxλ1𝒞(𝒳),λ2𝒞(𝒴)𝔼(x,y,κx,1,,κx,D,κy,1,,κy,D)μ×ν×ω××ω
        [-(θ)+λ1(Φ(x))+λ2(Φ(y))-γexp(λ1(Φ(x))+λ2(Φ(y))-C(Φ(x),Φ(y))γ)],

where γ is a parameter and the remaining notation is analogous to the Max-Max case.

We consider mixed stochastic gradient descent/ascend optimization strategy of the following form:

  • at time t we receive a single sample (xt,yt,κtxt,1,,κtxt,D,κtyt,1,,κtyt,D)πθt×ν×ω××ω, then we form feature vectors [f(xt)]i=2Dcos(xtwtxt,i+btxt,i) and [f(yt)]i=2Dcos(ytwtyt,i+btyt,i), and finally update:

    • θt+1Π1[θt-αtθ=θtWγ(θ,β1t,β2t)]11 1 later we also use alternative notation for gradient θ=θtWγ(θ,β1t,β2t) as θWγ(θt,β1t,β2t)

    • β1t+1Π2[β1t+αtf(xt)(1-exp((β1t)f(xt)+(β2t)f(yt)-C(x,y)γ))]

    • β2t+1Π3[β2t+αtf(yt)(1-exp((β1t+1)f(xt)+(β2t)f(yt)-C(x,y)γ))].

Π1 denotes the projection onto the Euclidean ball B2(r12,θ0) and Πd (d=1,2) denotes the projection onto the Euclidean ball B2(r12,βd0).

Let {θ*,β1*,β2*} denote the the global optimal solution of the saddle point problem minθmaxβ1,β2Wγ(θ,β1,β2) computed on the entire data population, i.e. given access to an infinite number of samples (“oracle”). As before, we assume that (a lower-bound on) the radii of convergence r1,r2,r3 for θ,β1,β2, respectively, is known to the algorithm and thus the projection is applied to control θ,β1,β2 to stay in their respective balls. Also, let θWγ1(θ,β1,β2), β1Wγ1(θ,β1,β2) and β2Wγ1(θ,β1,β2) denote the gradients of Wγ with respect to θ,β1 and β2, respectively, computed for a single sample. Similarly, θWγ(θ,β1,β2), β1Wγ(θ,β1,β2) and β2Wγ(θ,β1,β2) be the gradient of Wγ with respect to θ, β1 and β2, respectively, computed for the entire data population, i.e. infinite number of samples.

Note that given any initial vector θ0 in the ball of radius r12 centered at θ*, we are guaranteed that all iterates remain within an r1-ball of θ* and given any initial vector βd0 (d=1,2) in the ball of radius rd2 centered at βd*, we are guaranteed that all iterates remain within an rd-ball of βd*. The projection is necessary for theoretical analysis but in practice makes little difference. The above is a three-step alternated optimization scheme.

Let the population gradient operator, 𝒢d(θ,β1,β2), where d=1,2,3, be defined as

𝒢1(θ,β1,β2)θ-αθWγ(θ,β1,β2)

and

𝒢d(θ,β1,β2)βd+αβiWγ(θ,β1,β2)ford=2,3.

9.4.1 Assumptions

Let Wγ,1*(θ)=Wγ(θ,β1*,β2*), Wγ,2*(β1)=Wγ(θ*,β1,β2*) and Wγ,3*(β2)=Wγ(θ*,β1*,β2). Let Ω1,Ω2,Ω3 denote non-empty compact convex sets such θΩ1,β1Ω2 and β2Ω3. The following assumptions are made:

Assumption 9.4 (Strong convexity/concavity).

The function Wγ,1*(θ) is ζ1-strongly convex near θ* and the functions Wγ,2*(β1) and Wγ,3*(β2) are ζ2- and ζ3-strongly concave, respectively, near β1* and β2*, respectively, where ζ1,ζ2,ζ3>0.

Assumption 9.5 (Smoothness).

The functions Wγ,1*(θ),Wγ,2*(β1), and Wγ,3*(β2) are δ1-,δ2-, and δ3-smooth, respectively, where δ1,δ2,δ3>0 are the smoothness constants.

Assumption 9.6 (Gradient stability (GS) / Lipschitz condition).

We assume Wγ(θ,β1,β2) satisfy GS (ϕd) condition, for all d=1,2,3, over Euclidean balls θB2(r1,θ*),β1B2(r2,β1*),β2B2(r3,β2*) given as follows

θWγ,1*(θ)-θWγ(θ,β1,β2)2ϕ1d=12βd-βd*2,

and for d=1,2

βdWγ,d+1*(βd)-βdWγ(θ,β1,β2)2ϕd(θ-θ*2+βd¯-βd¯*2),

where ϕd>0 and d¯=(dmod2)+1.

Finally, as before, define the bound σ that considers the expected value of the norm of gradients of our objective function as follows: σ=σ12+σ22+σ32, where σ12=sup{𝔼[θWγ1(θ,β1,β2)22]:θB2(r1,θ*),β1B2(r2,β1*),β2B2(r3,β2*)} and for d=1,2 σd+12=sup{𝔼[βdWγ1(θ,β1,β2)22]:θB2(r1,θ*),β1B2(r2,β1*),β2B2(r3,β2*)}.

9.4.2 Main theorems

Theorem 9.9.

Given the stochastic gradient iterates of MinMax method with decaying step size {αt}t=0 and with ϕ<ξ3 the error at iteration t+1 satisfies the recursion

𝔼[θt+1-θ*22+β1t+1-β1*22+β2t+1-β2*22]
       (1-qt)𝔼[θt-θ*22+β1t-β1*22+β2t-β2*22]+(αt)21-2αtϕσ2,

where ϕ=maxd=1,2,3(ϕd), qt=1-1-2αtξ+4αtϕ1-2αtϕ, and ξ=mind=1,2,3(2δdζdδd+ζd).

The recursion in Theorem 9.4 is expanded yielding the convergence theorem:

Theorem 9.10.

Given the stochastic gradient iterates of MinMax method with decaying step size αt=3/2[2ξ-6ϕ](t+2)+3ϕ and assuming that ϕ<ξ3, the error at iteration t+1 satisfies

𝔼[θt+1-θ*22+β1t+1-β1*22+β2t+1-β2*22]
       𝔼[θ0-θ*22+β10-β1*22+β20-β2*22](2t+3)32+σ29[2ξ-6ϕ]2(t+3),

where ϕ=maxd=1,2,3(ϕd) and ξ=mind=1,2,3(2δdζdδd+ζd).

Proof of Theorem 9.10 relies on Theorem 9.9, which in turn relies on Theorem 9.12 and Lemma 9.11, both of which are stated below. Proofs of the lemma and theorems follow in the subsequent subsections.

9.4.3 Analysis

The next result is a standard result from convex optimization (Theorem 2.1.14 in [Nesterov:2014:ILC:2670022]) and is used in the proof of Theorem 9.12 below.

Lemma 9.11.

The gradient operator G1(θ,β1*,β2*) under strong convexity and smoothness assumptions with constant step size choice 0<α2δ1+ζ1 is contractive, i.e.

𝒢1(θ,β1*,β2*)-θ*2(1-2αδ1ζ1δ1+ζ1)θ-θ*2

for all θB2(r1,θ*).

Similarly, the gradient operator G1(θ*,β1,β2*) under strong concavity and smoothness assumptions with constant step size choice 0<α2δ2+ζ2 is contractive, i.e.

𝒢1(θ*,β1,β2*)-β1*2(1-2αδ2ζ2δ2+ζ2)β1-β1*2

for all β1B2(r2,β1*).

And similarly, the gradient operator G2(θ*,β1*,β2) under strong concavity and smoothness assumptions with constant step size choice 0<α2δ3+ζ3 is contractive, i.e.

𝒢2(θ*,β1*,β2)-β2*2(1-2αδ3ζ3δ3+ζ3)β2-β2*2

for all β2B2(r3,β2*).

The next theorem holds for d=1,2,3. Let r1,r2,r3>0 and θB2(r1,θ*),β1B2(r2,β1*),β2B2(r3,β2*).

Theorem 9.12.

For some radius r1 and a triplet (ϕ1,ζ1,δ1) such that 0ϕ1<ζ1δ1, suppose that the function Wγ,1*(θ) is ζ1-strongly convex and δ1-smooth and that the GS (ϕ1) condition holds. Then the population gradient operator G1(θ,β1,β2) with step α such that 0<αmind=1,2,32δd+ζd is contractive over a ball B2(r1,θ*), i.e.

𝒢1(θ,β1,β2)-θ*2(1-ξα)θ-θ*2+αϕd=12βd-βd*2

where ϕmaxd=1,2,3ϕd and ξmind=1,2,32δdζdδd+ζd.

For some radius rd (d=2,3) and a triplet (ϕd,ζd,δd) such that 0ϕd<ζdδd, suppose that the function Wγ,d*(βd-1) is ζd-strongly concave and δd-smooth and that the GS (ϕd) condition holds. Then the population gradient operator Gd(θ,β1,β2) with step α such that 0<αmind=1,2,32δd+ζd is contractive over a ball B2(rd,βd*), i.e.

𝒢d(θ,β1,β2)-βd*2(1-ξα)βd-βd*2+αϕ(βd¯-βd¯*2+θ-θ*2).

where d¯=((d-1)mod2)+1, ϕmaxd=1,2,3ϕd, and ξmind=1,2,32δdζdδd+ζd.

Proof.
𝒢1(θ,β1,β2)-θ*2=θ-αθWγ(θ,β1,β2)-θ*2
by the triangle inequality we further get
θ-αθWγ,1*-θ*2+αθWγ(θ,β1,β2)-θWγ,1*2
by the contractivity from Lemma 9.11 and GS condition
(1-2αδ1ζ1δ1+ζ1)θ-θ*2+αϕ1d=12βd-βd*2.

The proof of the rest of the theorem is analogous to the proof of Theorem 9.7.

Proof of Theorem 9.9

Let θ1=θ, θ2=β1, and θ3=β2.

Let θdt+1=Πd(θ~dt+1), where θ~1t+1θ1t-αtθ1Wγ1(θ1t,θ2t,θ3t), θ~2t+1θ2t+αtθ2Wγ1(θ1t+1,θ2t,θ3t), and θ~3t+1θ3t+αtθ3Wγ1(θ1t+1,θ2t+1,θ3t), where θdWγ1 is the gradient computed with respect to a single sample, θ~1, θ~2, and θ~3 are the updates prior to the projection. Let Δ1t+1-θ1t+1+θ1* and for d=2,3, Δ~dt+1θ~dt+1-θd*. Thus

Δdt+122-Δdt22 Δ~dt+122-Δdt22
= θ~dt+1-θd*-θdt-θd*
= θ~dt+1-θdt,θ~dt+1+θdt-2θd*.

Let Q^1tθ1Wγ1(θ1t,θ2t,θ3t), Q^2tθ2Wγ1(θ1t+1,θ2t,θ3t), and Q^3tθ3Wγ1(θ1t+1,θ2t+1,θ3t). Thus:

Δdt+122-Δdt22
αtQ^dt,αtQ^dt+2(θdt-θd*)
= (αt)2(Q^dt)Q^dt+2αt(Q^dt)(θdt-θd*)
= (αt)2Q^dt22+2αtQ^dt,Δdt.

Let Q1tθ1Wγ(θ1t,θ2t,θ3t), Q2tθ2Wγ(θ1t+1,θ2t,θ3t), and Q3tθ3Wγ(θ1t+1,θ2t+1,θ3t). By the properties of martingales, i.e. iterated expectations and tower property:

𝔼[Δdt+122] 𝔼[Δdt22]+(αt)2𝔼[Q^dt22]+2αt𝔼[Qdt,Δdt]

Let Qd*θdWγ(θ1*,θ2*,θ3*). By self-consistency, i.e. θd*=argmaxθdΩdWγ,d*(θd), and convexity of Ωd we have that

Qd*,Δdt=θdWγ(θ1*,θ2*,θ3*),Δdt=0.

Combining this with the above inequality yields

𝔼[Δdt+122] 𝔼[Δdt22]+(αt)2𝔼[Q^dt22]+2αt𝔼[Qdt-Qd*,Δdt].

Define 𝒢1tθ1t-αtQ1t and 𝒢1t*θ1*-αtQ1*. Also, for d=2,3 define 𝒢dtθdt+αtQdt and 𝒢dt*θd*+αtQd*. Thus

αtQdt-Qd*,Δdt
= 𝒢dt-𝒢dt*-(θdt-θd*),θdt-θd*
= 𝒢dt-𝒢dt*,θdt-θd*-θdt-θd*22
by the fact that 𝒢dt*=θd*+αtQd*=θd* (since Qd*=0):
= 𝒢dt-θd*,θdt-θd*-θdt-θd*22
by the contractivity of 𝒢t from Theorem 9.7:
{(1-αtξ)θdt-θd*+αtϕ(i=1d-1θit+1-θi*2+i=d+13θit-θi*2)}θdt-θd*2-θdt-θd*22
{(1-αtξ)Δdt2+αtϕ(i=1d-1Δit+12+i=d+13Δit2)}Δdt2-Δdt22

Thus

𝔼[Δdt+122] 𝔼[Δdt22]+(αt)2𝔼[Q^dt22]+2αt𝔼[Qdt-Qd*,Δdt]
= 𝔼[Δdt22]+(αt)2𝔼[Q^dt22]+2𝔼[{(1-αtξ)Δdt2+αtϕ(i=1d-1Δit+12+i=d+13Δit2)}
                                                       Δdt2-Δdt22]
𝔼[Δdt22]+(αt)2σd2+2𝔼[{(1-αtξ)Δdt2+αtϕ(i=1d-1Δit+12+i=d+13Δit2)}
                                                       Δdt2-Δdt22].

After re-arranging the terms we obtain

𝔼[Δdt+122](αt)2σd2+(1-2αtξ)𝔼[Δdt22]+2αtϕ𝔼[(i=1d-1Δit+12+i=d+13Δit2)Δdt2]
apply 2aba2+b2:
(αt)2σd2+(1-2αtξ)𝔼[Δdt22]+αtϕ𝔼[i=1d-1(Δit+122+Δdt22)]+αtϕ𝔼[i=d+13(Δit22+Δdt22)]
= (αt)2σd2+𝔼[Δdt22][1-2αtξ+2αtϕ]+αtϕ𝔼[i=1d-1Δit+122]+αtϕ𝔼[i=d+13Δit22]

We obtained

𝔼[Δdt+122](αt)2σd2+[1-2αtξ+2αtϕ]𝔼[Δdt22]+αtϕ𝔼[i=1d-1Δit+122]+αtϕ𝔼[i=d+13Δit22]
we next re-group the terms as follows
𝔼[Δdt+122]-αtϕ𝔼[i=1d-1Δit+122][1-2αtξ+2αtϕ]𝔼[Δdt22]+αtϕ𝔼[i=d+13Δit22]+(αt)2σd2
and then sum over d from 1 to 3
𝔼[d=13Δdt+122]-αtϕ𝔼[d=13i=1d-1Δit+122]
[1-2αtξ+2αtϕ]𝔼[d=13Δdt22]+αtϕ𝔼[d=13i=d+13Δit22]+2(αt)2d=13σd2

Note that

𝔼[d=13Δdt+122]-2αtϕ𝔼[d=13Δdt+122]𝔼[d=13Δdt+122]-αtϕ𝔼[d=13i=1d-1Δit+122]

and

[1-2αtξ+2αtϕ]𝔼[d=13Δdt22]+αtϕ𝔼[d=13i=d+13Δit22]+(αt)2σ2
[1-2αtξ+2αtϕ]𝔼[d=13Δdt22]+2αtϕ𝔼[d=13Δdt22]+(αt)2σ2

Combining these two facts with our previous results yields:

[1-2αtϕ]𝔼[d=13Δdt+122]
[1-2αtξ+2αtϕ]𝔼[d=13Δdt22]+2αtϕ𝔼[d=13Δdt22]+(αt)2σ2
= [1-2αtξ+2αtϕ]𝔼[d=13Δdt22]+(αt)2σ2

Thus:

𝔼[d=13Δdt+122] 1-2αtξ+4αtϕ1-2αtϕ𝔼[d=13Δdt22]
+ (αt)21-2αtϕσ2.

Since ϕ<ξ3, 1-2αtξ+4αtϕ1-2αtϕ<1.

Proof of Theorem 9.10

To obtain the final theorem we need to expand the recursion from Theorem 9.4. We obtained

𝔼[d=13Δdt+122]
1-2αt[ξ-2ϕ]1-2αtϕ𝔼[d=13Δdt22]+(αt)21-2αtϕσ2
=(1-αt[2ξ-6ϕ]1-2αtϕ)𝔼[d=13Δdt22]+(αt)21-2αtϕσ2

Recall that we defined qt in Theorem 9.4 as

qt=1-1-2αtξ+4αtϕ1-2αtϕ=αt[2ξ-6ϕ]1-2αtϕ

and denote

ft=(αt)21-2αtϕ.

Thus we have

𝔼[d=13Δdt+122](1-qt)𝔼[d=13Δdt22]+ftσ2
(1-qt){(1-qt-1)𝔼[d=13Δdt-122]+ft-1σ2}+ftσ2
= (1-qt)(1-qt-1)𝔼[d=13Δdt-122]+(1-qt)ft-1σ2+ftσ2
(1-qt)(1-qt-1){(1-qt-2)𝔼[d=13Δdt-222]+ft-2σ2}+(1-qt)ft-1σ2+ftσ2
= (1-qt)(1-qt-1)(1-qt-2)𝔼[d=13Δdt-222]
+(1-qt)(1-qt-1)ft-2σ2+(1-qt)ft-1σ2+ftσ2

We end-up with the following

𝔼[d=13Δdt+122] 𝔼[d=13Δd022]i=0t(1-qi)+σ2i=0t-1fij=i+1t(1-qj)+ftσ2.

Set qt=32t+2 and

αt = qt2ξ-6ϕ+2qtϕ
= 32[2ξ-6ϕ](t+2)+3ϕ.

Denote A=2ξ-6ϕ and B=3ϕ. Thus

αt=32A(t+2)+B

and

ft=(αt)21-23Bαt=94A(t+2)[A(t+2)+B].
𝔼[d=12Δdt+122]
𝔼[d=12Δd022]i=0t(1-32i+2)+σ2i=0t-194A(i+2)[A(i+2)+B]j=i+1t(1-32j+2)
+σ294A(t+2)[A(t+2)+B]
= 𝔼[d=12Δd022]i=2t+2(1-32i)+σ2i=2t+194Ai[Ai+B]j=i+1t+2(1-32j)+σ294A(t+2)[A(t+2)+B]

Since A>0 and B>0 thus

𝔼[d=13Δdt+122]
𝔼[d=13Δd022]i=2t+2(1-32i)+σ2i=2t+194Ai[Ai+B]j=i+1t+2(1-32j)+σ294A(t+2)[A(t+2)+B]
𝔼[d=13Δd022]i=2t+2(1-32i)+σ2i=2t+194(Ai)2j=i+1t+2(1-32j)+σ294[A(t+2)]2

We can next use the fact that for any a(1,2):

i=τ+1t+2(1-ai)(τ+1t+3)a.

The bound then becomes

𝔼[d=13Δdt+122]
𝔼[d=13Δd022]i=2t+2(1-32i)+σ2i=2t+194(Ai)2j=i+1t+2(1-32j)+σ294[A(t+2)]2
𝔼[d=13Δd022](2t+3)32+σ2i=2t+194(Ai)2(i+1t+3)32+σ294[A(t+2)]2
= 𝔼[d=13Δd022](2t+3)32+σ2i=2t+294(Ai)2(i+1t+3)32

Note that (i+1)322i for i=2,3,, thus

𝔼[d=13Δdt+122]
𝔼[d=12Δd022](2t+3)32+σ294A2(t+3)32i=2t+2(i+1)32i2
𝔼[d=13Δd022](2t+3)32+σ292A2(t+3)32i=2t+21i12
finally note that i=2t+21i121t+21x12𝑑x2(t+3)12. Thus
𝔼[d=13Δd022](2t+3)32+σ29A2(t+3)
substituting A=2ξ-6ϕ gives
= 𝔼[d=13Δd022](2t+3)32+σ29[2ξ-6ϕ]2(t+3)

This leads us to the final theorem. To obtain Theorem 9.8, we proceed in an analogous way as form Theorem 9.3, but this time applying Theorem 9.10 that we have just proved.

9.5 Behavior Guided Policy Gradient and Wasserstein trust region

The chief goal of this section is to prove Theorem 5.1. We restate the section’s definitions here for the reader’s convenience: To ease the discussion we make the following assumptions:

  1. 1.

    Finite horizon T.

  2. 2.

    Undiscounted MDP.

  3. 3.

    States are time indexed. In other words, states visited at time t can’t be visited at any other time.

  4. 4.

    𝒮 and 𝒜 are finite sets.

The third assumption is solely to avoid having to define a time indexed Value function. It can be completely avoided. We chose not to do this in the spirit of notational simplicity. These assumptions can be relaxed, most notably we can show similar results for the discounted and infinite horizon case. We chose to present the finite horizon proof because of the nature of our experimental results.

Let Φ=id be the identity embedding so that =Γ. In this case πΦ denotes the distribution of trajectories corresponding to policy π. We define the value function Vπ:𝒮 as

Vπ(st=s)=𝔼τπid[=tTR(s+1,a,s)|st=s]

The Q-function Qπ:𝒮×𝒜 as:

Qπ(st,at=a)=𝔼τπid[=tTR(s+1,a,s)]

Similarly, the advantage function is defined as:

Aπ(s,a)=Qπ(s,a)-Vπ(s)

We denote by V(π)=𝔼τπid[t=0TR(st+1,at,st)] the expected reward of policy π and define the visitation frequency as:

ρπ(s)=𝔼τπid[t=0T𝟏(st=s)]

The first observation in this section is the following lemma:

Lemma 9.13.

two distinct policies π and π~ can be related via the following equation :

V(π~)=V(π)+s𝒮(ρπ~(s)(a𝒜π~(a|s)Aπ(s,a)))
Proof.

Notice that Aπ(s,a)=𝔼sP(s|a,s)[R(s,a,s)+Vπ(s)-Vπ(s)]. Therefore:

𝔼τπ~id[t=0TAπ(st,at)] =𝔼τπ~id[t=0TR(st+1,at,st)+Vπ(st+1)-Vπ(st)]
=𝔼τπ~id[t=0TR(st+1,at,st)]-𝔼s0[Vπ(s0)]
=-V(π)+V(π~)

The result follows. ∎

See [sutton1998introduction] for an alternative proof. We also consider the following linear approximation to V around policy π (see: [kakade2002approximately]):

L(π~)=V(π)+s𝒮(ρπ(s)(a𝒜π~(a|s)Aπ(s,a)))

Where the only difference is that ρπ~ was substituted by ρπ. Consider the following embedding Φs:Γ|𝒮| defined by (Φ(τ))s=t=0T𝟏(st=s), and related cost function defined as: C(𝐯,𝐰)=𝐯-𝐰1.

Lemma 9.14.

The Wasserstein distance WD0(Pπ~Φs,PπΦs) is related to visit frequencies since:

WD0(π~Φs,πΦs)s𝒮|ρπ(s)-ρπ~(s)|
Proof.

Let Π be the optimal coupling between π~Φs and πΦs. Then:

WD0(π~Φs,πΦs) =𝔼u,vΠ[u-v1]
=s𝒮𝔼u,vΠ[|us-vs|]

Where us and vs denote the s𝒮 indexed entry of the u and v vectors respectively. Notice that for all s𝒮 the following is true:

|𝔼uπΦs[us]ρπ(s)-𝔼vπΦs[vs]ρπ(s)|𝔼u,vΠ[|us-vs|]

The result follows.

These observations enable us to prove an analogue of Theorem 1 from [trpo], namely:

Theorem 9.15.

If WD0(Pπ~Φs,PπΦs)δ and ϵ=maxs,a|Aπ(s,a)|, then V(π~)L(θ~)-δϵ.

As in [trpo], Theorem 5.1 implies a policy improvement guarantee for BGPG from Section 5.3.

Proof.

Notice that:

V(π~)-L(π~)=s𝒮((ρπ~(s)-ρπ(s))(a𝒜π~(a|s)Aπ(s,a)))

Therefore by Holder inequality:

|V(π~)-L(π~)|(s𝒮|ρπ(s)-ρπ~(s)|)WD0(π~Φs,πΦs)δ(sups𝒮|a𝒜π~(a|s)Aπ(s,a)|)ϵ

The result follows. ∎

We can leverage the results of Theorem 9.15 to show wasserstein trust regions methods with embedding Φs give a monotonically improving sequence of policies. The proof can be concluded by following the logic of Section 3 in [trpo].