A Reduction from Reinforcement Learning to No-Regret Online Learning

  • 2019-11-14 00:47:47
  • Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon
  • 1

Abstract

We present a reduction from reinforcement learning (RL) to no-regret onlinelearning based on the saddle-point formulation of RL, by which "any" onlinealgorithm with sublinear regret can generate policies with provable performanceguarantees. This new perspective decouples the RL problem into two parts:regret minimization and function approximation. The first part admits astandard online-learning analysis, and the second part can be quantifiedindependently of the learning algorithm. Therefore, the proposed reduction canbe used as a tool to systematically design new RL algorithms. We demonstratethis idea by devising a simple RL algorithm based on mirror descent and thegenerative-model oracle. For any $\gamma$-discounted tabular RL problem, withprobability at least $1-\delta$, it learns an $\epsilon$-optimal policy usingat most$\tilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\log(\frac{1}{\delta})}{(1-\gamma)^4\epsilon^2}\right)$samples. Furthermore, this algorithm admits a direct extension to linearlyparameterized function approximators for large-scale applications, withcomputation and sample complexities independent of$|\mathcal{S}|$,$|\mathcal{A}|$, though at the cost of potential approximationbias.

 

Quick Read (beta)

Abstract

We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which any online algorithm with sublinear regret can generate policies with provable performance guarantees. This new perspective decouples the RL problem into two parts: regret minimization and function approximation. The first part admits a standard online-learning analysis, and the second part can be quantified independently of the learning algorithm. Therefore, the proposed reduction can be used as a tool to systematically design new RL algorithms. We demonstrate this idea by devising a simple RL algorithm based on mirror descent and the generative-model oracle. For any γ-discounted tabular RL problem, with probability at least 1-δ, it learns an ϵ-optimal policy using at most O~(|SS||𝒜|log(1δ)(1-γ)4ϵ2) samples. Furthermore, this algorithm admits a direct extension to linearly parameterized function approximators for large-scale applications, with computation and sample complexities independent of |SS|,|𝒜|, though at the cost of potential approximation bias.

 

A Reduction from Reinforcement Learning to
No-Regret Online Learning


 


Ching-An Cheng                        Remi Tachet des Combes                        Byron Boots                        Geoff Gordon

Georgia Tech                        Microsoft Research Montreal                        UW                        Microsoft Research Montreal

1 INTRODUCTION

Reinforcement learning (RL) is a fundamental problem for sequential decision making in unknown environments. One of its core difficulties, however, is the need for algorithms to infer long-term consequences based on limited, noisy, short-term feedback. As a result, designing RL algorithms that are both scalable and provably sample efficient has been challenging.

In this work, we revisit the classic linear-program (LP) formulation of RL [26, 13] in an attempt to tackle this long-standing question. We focus on the associated saddle-point problem of the LP (given by Lagrange duality), which has recently gained traction due to its potential for computationally efficient algorithms with theoretical guarantees [34, 7, 36, 24, 35, 25, 12, 6, 23]. But in contrast to these previous works based on stochastic approximation, here we consider a reformulation through the lens of online learning, i.e. regret minimization. Since the pioneering work of Gordon [14], Zinkevich [37], online learning has evolved into a ubiquitous tool for systematic design and analysis of iterative algorithms. Therefore, if we can identify a reduction from RL to online learning, we can potentially leverage it to build efficient RL algorithms.

We will show this idea is indeed feasible. We present a reduction by which any no-regret online algorithm, after observing N samples, can find a policy π^N in a policy class Π satisfying Vπ^N(p)Vπ*(p)-o(1)-ϵΠ, where Vπ(p) is the accumulated reward of policy π with respect to some unknown initial state distribution p, π* is the optimal policy, and ϵΠ0 is a measure of the expressivity of Π (see creftype 4.2 for definition).

Our reduction is built on a refinement of online learning, called Continuous Online Learning (COL), which was proposed to model problems where loss gradients across rounds change continuously with the learner’s decisions [9]. COL has a strong connection to equilibrium problems (EPs) [4, 3], and any monotone EP (including our saddle-point problem of interest) can be framed as no-regret learning in a properly constructed COL problem [9]. Using this idea, our reduction follows naturally by first converting an RL problem to an EP and then the EP to a COL problem.

Framing RL as COL reveals new insights into the relationship between approximate solutions to the saddle-point problem and approximately optimal policies. Importantly, this new perspective shows that the RL problem can be separated into two parts: regret minimization and function approximation. The first part admits standard treatments from the online learning literature, and the second part can be quantified independently of the learning process. For example, one can accelerate learning by adopting optimistic online algorithms [31, 10] that account for the predictability in COL, without worrying about function approximators. Because of these problem-agnostic features, the proposed reduction can be used to systematically design efficient RL algorithms with performance guarantees.

As a demonstration, we design an RL algorithm based on arguably the simplest online learning algorithm: mirror descent. Assuming a generative model11 1 In practice, it can be approximated by running a behavior policy with sufficient exploration [22]., we prove that, for any tabular Markov decision process (MDP), with probability at least 1-δ, this algorithm learns an ϵ-optimal policy for the γ-discounted accumulated reward, using at most O~(|SS||𝒜|log(1δ)(1-γ)4ϵ2) samples, where |SS|,|𝒜| are the sizes of state and action spaces, and γ is the discount rate. Furthermore, thanks to the separation property above, our algorithm admits a natural extension with linearly parameterized function approximators, whose sample and per-round computation complexities are linear in the number of parameters and independent of |SS|,|𝒜|, though at the cost of policy performance bias due to approximation error.

This sample complexity improves the current best provable rate of the saddle-point RL setup [34, 7, 36, 24] by a large factor of |SS|2(1-γ)2, without making any assumption on the MDP.22 2 [36] has the same sample complexity but requires the MDP to be ergodic under any policy. This improvement is attributed to our new online-learning-style analysis that uses a cleverly selected comparator in the regret definition. While it is possible to devise a minor modification of the previous stochastic mirror descent algorithms, e.g. [36], achieving the same rate with our new analysis, we remark that our algorithm is considerably simpler and removes a projection required in previous work [34, 7, 36, 24].

Finally, we do note that the same sample complexity can also be achieved, e.g., by model-based RL and (phased) Q-learning [22, 21]. However, these methods either have super-linear runtime, with no obvious route for improvement, or could become unstable when using function approximators without further assumption.

2 SETUP & PRELIMINARIES

Let SS and 𝒜 be state and action spaces, which can be discrete or continuous. We consider γ-discounted infinite-horizon problems for γ[0,1). Our goal is to find a policy π(a|s) that maximizes the discounted average return Vπ(p)𝔼sp[Vπ(s)], where p is the initial state distribution,

Vπ(s)(1-γ)𝔼ξρπ(s)[t=0γtr(st,at)] (1)

is the value function of π at state s, r:SS×𝒜[0,1] is the reward function, and ρπ(s) is the distribution of trajectory ξ=s0,a0,s1, generated by running π from s0=s in an MDP. We assume that the initial distribution p(s0), the transition 𝒫(s|s,a), and the reward function r(s,a) in the MDP are unknown but can be queried through a generative model, i.e. we can sample s0 from p, s from 𝒫, and r(s,a) for any sSS and a𝒜. We remark that the definition of Vπ in (1) contains a (1-γ) factor. We adopt this setup to make writing more compact. We denote the optimal policy as π* and its value function as V* for short.

2.1 Duality in RL

Our reduction is based on the linear-program (LP) formulation of RL. We provide a short recap here (please see creftype A and [30] for details).

To show how maxπVπ(p) can be framed as a LP, let us define the average state distribution under π, dπ(s)(1-γ)t=0γtdtπ(s), where dtπ is the state distribution at time t visited by running π from p (e.g. d0π=p). By construction, dπ satisfies the stationarity property,

dπ(s)=(1-γ)p(s)+γ𝔼sdπ𝔼aπ|s[𝒫(s|s,a)]. (2)

With dπ, we can write Vπ(p)=𝔼sdπ𝔼aπ|s[r(s,a)] and our objective maxπVπ(p) equivalently as:

max𝝁|SS||𝒜|:𝝁𝟎𝐫𝝁s.t.(1-γ)𝐩+γ𝐏𝝁=𝐄𝝁 (3)

where 𝐫|SS||𝒜|, 𝐩|SS|, and 𝐏|SS||𝒜|×|SS| are vector forms of r, p, and 𝒫, respectively, and 𝐄=𝐈𝟏|SS||𝒜|×|SS| (we use || to denote the cardinality of a set, the Kronecker product, 𝐈|SS|×|SS| is the identity, and 𝟏|𝒜| the vector of ones). In (3), SS and 𝒜 may seem to have finite cardinalities, but the same formulation extends to countable or even continuous spaces (under proper regularity assumptions; see [17]). We adopt this abuse of notation (emphasized by bold-faced symbols) for compactness.

The variable 𝝁 of the LP in (3) resembles a joint distribution dπ(s)π(a|s). To see this, notice that the constraint in (3) is reminiscent of (2), and implies 𝝁1=1, i.e. 𝝁 is a probability distribution. Then one can show μ(s,a)=dπ(s)π(a|s) when the constraint is satisfied, which implies that (3) is the same as maxπVπ(p) and its solution 𝝁* corresponds to μ*(s,a)=dπ*(s)π*(a|s) of the optimal policy π*.

As (3) is a LP, it suggests looking at its dual, which turns out to be the classic LP formulation of RL33 3 Our setup in (4) differs from the classic one in the (1-γ) factor in the constraint due to the average setup.,

min𝐯|SS|𝐩𝐯s.t.(1-γ)𝐫+γ𝐏𝐯𝐄𝐯. (4)

It can be verified that for all 𝐩>0, the solution to (4) satisfies the Bellman equation [2] and therefore is the optimal value function 𝐯* (the vector form of V*). We note that, for any policy π, Vπ by definition satisfies a stationarity property

Vπ(s)=𝔼aπ|s[(1-γ)r(s,a)+γ𝔼s𝒫|s,a[Vπ(s)]] (5)

which can be viewed as a dual equivalent of (2) for dπ. Because, for any sSS and a𝒜, r(s,a) is in [0,1], (5) implies Vπ(s) lies in [0,1] too.

2.2 Toward RL: the Saddle-Point Setup

The LP formulations above require knowing the probabilities p and 𝒫 and are computationally inefficient. When only generative models are available (as in our setup), one can alternatively exploit the duality relationship between the two LPs in (3) and (4), and frame RL as a saddle-point problem [34]. Let us define

𝐚𝐯𝐫+11-γ(γ𝐏-𝐄)𝐯 (6)

as the advantage function with respect to 𝐯 (where 𝐯 is not necessarily a value function). Then the Lagrangian connecting the two LPs can be written as

(𝐯,𝝁)𝐩𝐯+𝝁𝐚𝐯, (7)

which leads to the saddle-point formulation,

min𝐯𝒱max𝝁(𝐯,𝝁), (8)

where the constraints are

𝒱 ={𝐯|SS|:𝐯0,𝐯1} (9)
={𝝁|SS||𝒜|:𝝁0,𝝁11}. (10)

The solution to (8) is exactly (𝐯*,𝝁*), but notice that extra constraints on the norm of 𝝁 and 𝐯 are introduced in 𝒱,, compared with (3) and (4). This is a common practice, which uses known bound on the solutions of (3) and (4) (discussed above) to make the search spaces 𝒱 and in (8) compact and as small as possible so that optimization converges faster.

Having compact variable sets allows using first-order stochastic methods, such as stochastic mirror descent and mirror-prox [28, 19], to efficiently solve the problem. These methods only require using the generative model to compute unbiased estimates of the gradients 𝐯=𝐛𝝁 and 𝝁=𝐚𝐯, where we define

𝐛𝝁𝐩+11-γ(γ𝐏-𝐄)𝝁 (11)

as the balance function with respect to 𝝁. 𝐛𝝁 measures whether 𝝁 violates the stationarity constraint in (3) and can be viewed as the dual of 𝐚𝐯. When the state or action space is too large, one can resort to function approximators to represent 𝐯 and 𝝁, which are often realized by linear basis functions for the sake of analysis [6].

2.3 COL and EPs

Finally, we review the COL setup in [9], which we will use to design the reduction from the saddle-point problem in (8) to online learning in the next section.

Recall that an online learning problem describes the iterative interactions between a learner and an opponent. In round n, the learner chooses a decision xn from a decision set 𝒳, the opponent chooses a per-round loss function ln:𝒳 based on the learner’s decisions, and then information about ln (e.g. its gradient ln(xn)) is revealed to the learner. The performance of the learner is usually measured in terms of regret with respect to some x𝒳,

RegretN(x)n=1Nln(xn)-n=1Nln(x).

When ln is convex and 𝒳 is compact and convex, many no-regret (i.e. RegretN(x)=o(N)) algorithms are available, such as mirror descent and follow-the-regularized-leader [5, 33, 15].

COL is a subclass of online learning problems where the loss sequence changes continuously with respect to the played decisions of the learner [9]. In COL, the opponent is equipped with a bifunction f:(x,x)fx(x), where any fixed x𝒳, fx(x) is continuous in x𝒳. The opponent selects per-round losses based on f, but the learner does not know f: in round n, if the learner chooses xn, the opponent sets

ln(x)=fxn(x), (12)

and returns, e.g., a stochastic estimate of ln(xn) (the regret is still measured in terms of the noise-free ln).

In [9], a natural connection is shown between COL and equilibrium problems (EPs). As EPs include the saddle-point problem of interest, we can use this idea to turn (8) into a COL problem. Recall an EP is defined as follows: Let 𝒳 be compact and F:(x,x)F(x,x) be a bifunction s.t. x,x𝒳, F(,x) is continuous, F(x,) is convex, and F(x,x)0.44 4 We restrict ourselves to this convex and continuous case as it is sufficient for our problem setup. The problem EP(𝒳,F) aims to find x𝒳 s.t.

F(x,x)0,x𝒳. (13)

By its definition, a natural residual function to quantify the quality of an approximation solution x to EP is rep(x)-minx𝒳F(x,x) which describes the degree to which (13) is violated at x. We say a bifunction F is monotone if, x,x𝒳, F(x,x)+F(x,x)0, and skew-symmetric if the equality holds.

EPs with monotone bifunctions represent general convex problems, including convex optimization problems, saddle-point problems, variational inequalities, etc. For instance, a convex-concave problem miny𝒴maxz𝒵ϕ(y,z) can be cast as EP(𝒳,F) with 𝒳=𝒴×𝒵 and the skew-symmetric bifunction [18]

F(x,x)ϕ(y,z)-ϕ(y,z), (14)

where x=(y,z) and x=(y,z). In this case, rep(x)=maxz𝒵ϕ(y,z)-miny𝒴ϕ(y,z) is the duality gap.

Cheng et al. [9] show that a learner achieves sublinear dynamic regret in COL if and only if the same algorithm can solve EP(𝒳,F) with F(x,x)=fx(x)-fx(x). Concretely, they show that, given a monotone EP(𝒳,F) with F(x,x)=0 (which is satisfied by (14)), one can construct a COL problem by setting fx(x)F(x,x), i.e. ln(x)=F(xn,x), such that any no-regret algorithm can generate an approximate solution to the EP.

Proposition 1.

[9] If F is skew-symmetric and ln(x)=F(xn,x), then rep(x^N)1N𝑅𝑒𝑔𝑟𝑒𝑡N, where 𝑅𝑒𝑔𝑟𝑒𝑡N=maxxX𝑅𝑒𝑔𝑟𝑒𝑡N(x), and x^N=1Nn=1Nxn; the same guarantee holds also for the best decision in {xn}n=1N.

3 AN ONLINE LEARNING VIEW

We present an alternate online-learning perspective on the saddle-point formulation in (8). This analysis paves a way for of our reduction in the next section. By reduction, we mean realizing the two steps below:

  1. 1.

    Define a sequence of online losses such that any algorithm with sublinear regret can produce an approximate solution to the saddle-point problem.

  2. 2.

    Convert the approximate solution in the first step to an approximately optimal policy in RL.

Methods to achieve these two steps individually are not new. The reduction from convex-concave problems to no-regret online learning is well known [1]. Likewise, the relationship between the approximate solution of (8) and policy performance is also available; this is how the saddle-point formulation [36] works in the first place. So couldn’t we just use these existing approaches? We argue that purely combining these two techniques fails to fully capture important structure that resides in RL. While this will be made precise in the later analyses, we highlight the main insights here.

Instead of treating (8) as an adversarial two-player online learning problem [1], we adopt the recent reduction to COL [9] reviewed in creftype 2.3. The main difference is that the COL approach takes a single-player setup and retains the Lipschitz continuity in the source saddle-point problem. This single-player perspective is in some sense cleaner and, as we will show in creftype 4.2, provides a simple setup to analyze effects of function approximators. Additionally, due to continuity, the losses in COL are predictable and therefore make designing fast algorithms possible.

With the help of the COL reformulation, we study the relationship between the approximate solution to (8) and the performance of the associated policy in RL. We are able to establish a tight bound between the residual and the performance gap, resulting in a large improvement of |SS|2(1-γ)2 in sample complexity compared with the best bounds in the literature of the saddle-point setup, without adding extra constraints on 𝒳 and assumptions on the MDP. Overall, this means that stronger sample complexity guarantees can be attained by simpler algorithms, as we demonstrate in creftype 5.

The missing proofs of this section are in creftype B.

3.1 The COL Formulation of RL

First, let us exercise the above COL idea with the saddle-point formulation of RL in (8). To construct the EP, we can let 𝒳={x=(𝐯,𝝁):𝐯𝒱,𝝁}, which is compact. According to (14), the bifunction F of the associated EP(𝒳,F) is naturally given as

F(x,x) (𝐯,𝝁)-(𝐯,𝝁)
=𝐩𝐯+𝝁𝐚𝐯-𝐩𝐯-𝝁𝐚𝐯 (15)

which is skew-symmetric, and x*(𝒗*,𝝁*) is a solution to EP(𝒳,F). This identification gives us a COL problem with the loss in the nth round defined as

ln(x)𝐩𝐯+𝝁n𝐚𝐯-𝐩𝐯n-𝝁𝐚𝐯n (16)

where xn=(𝐯n,𝝁n). We see ln is a linear loss. Moreover, because of the continuity in , it is predictable, i.e. ln can be (partially) inferred from past feedback as the MDP involved in each round is the same.

3.2 Policy Performance and Residual

By creftype 1, any no-regret algorithm, when applied to (16), provides guarantees in terms of the residual function rep(x) of the EP. But this is not the end of the story. We also need to relate the learner decision x𝒳 to a policy π in RL and then convert bounds on rep(x) back to the policy performance Vπ(p). Here we follow the common rule in the literature and associate each x=(𝐯,𝝁)𝒳 with a policy π𝝁 defined as

π𝝁(a|s)μ(s,a). (17)

In the following, we relate the residual rep(x) to the performance gap V*(p)-Vπ𝝁(p) through a relative performance measure defined as

rep(x;x)F(x,x)-F(x,x)=-F(x,x) (18)

for x,x𝒳, where the last equality follows from the skew-symmetry of F in (3.1). Intuitively, we can view rep(x;x) as comparing the performance of x with respect to the comparator x under an optimization problem proposed by x, e.g. we have ln(xn)-ln(x)=rep(xn;x). And by the definition in (18), it holds that rep(x;x)maxx𝒳-F(x,x)=rep(x).

We are looking for inequalities in the form V*(p)-Vπ𝝁(p)κ(rep(x;x)) that hold for all x𝒳 with some strictly increasing function κ and some x𝒳, so we can get non-asymptotic performance guarantees once we combine the two steps described at the beginning of this section. For example, by directly applying results of [9] to the COL in (16), we get V*(p)-Vπ^N(p)κ(RegretNN), where π^N is the policy associated with the average/best decision in {xn}1=nN.

3.2.1 The Classic Result

Existing approaches (e.g. [7, 36, 24]) to the saddle-point point formulation in (8) rely on the relative residual rep(x;x*) with respect to the optimal solution to the problem x*, which we restate in our notation.

Proposition 2.

For any x=(v,𝛍)X, if E𝛍(1-γ)p, rep(x;x*)(1-γ)minsp(s)v*-vπ𝛍.

Therefore, although the original saddle-point problem in (8) is framed using 𝒱 and , in practice, an extra constraint, such as 𝐄𝝁(1-γ)𝐩, is added into , i.e. these algorithms consider instead

={𝝁|SS||𝒜|:𝝁,𝐄𝝁(1-γ)𝐩}, (19)

so that the marginal of the estimate 𝝁 can have the sufficient coverage required in creftype 2. This condition is needed to establish non-asymptotic guarantees on the performance of the policy generated by 𝝁 [34, 36, 24], but it can sometimes be impractical to realize, e.g., when 𝐩 is unknown. Without it, extra assumptions (like ergodicity [36]) on the MDP are needed.

However, creftype 2 is undesirable for a number of reasons. First, the bound is quite conservative, as it concerns the uniform error 𝐯*-𝐯π𝝁 whereas the objective in RL is about the gap V*(p)-Vπ𝝁(p)=𝐩(𝐯*-𝐯π𝝁) with respect to the initial distribution p (i.e. a weighted error). Second, the constant term (1-γ)minsp(s) can be quite small (e.g. when p is uniform, it is 1-γ|SS|) which can significantly amplify the error in the residual. Because a no-regret algorithm typically decreases the residual in O(N-1/2) after seeing N samples, the factor of 1-γ|SS| earlier would turn into a multiplier of |SS|2(1-γ)2 in sample complexity. This makes existing saddle-point approaches sample inefficient in comparison with other RL methods like Q-learning [21]. Lastly, enforcing 𝐄𝝁(1-γ)𝐩 requires knowing 𝐩 (which is unavailable in our setup) and adds extra projection steps during optimization. When 𝐩 is unknown, while it is possible to modify this constraint to use a uniform distribution, this might worsen the constant factor and could introduce bias.

One may conjecture that the bound in creftype 2 could perhaps be tightened by better analyses. However, we prove this is impossible in general.

Proposition 3.

There is a class of MDPs such that, for some xX, creftype 2 is an equality.

We note that creftype 3 does not hold for all MDPs. Indeed, if one makes stronger assumptions on the MDP, such as that the Markov chain induced by every policy is ergodic [36], then it is possible to show, for all x𝒳, rep(x;x*)=c𝐯*-𝐯π𝝁 for some constant c independent of γ and |SS|, when one constrains 𝐄𝝁(1-γ+γc)𝐩. Nonetheless, this construct still requires adding an undesirable constraint to 𝒳.

3.2.2 Curse of Covariate Shift

Why does this happen? We can view this issue as a form of covariate shift, i.e. a mismatch between distributions. To better understand it, we notice a simple equality, which has often been used implicitly, e.g. in the technical proofs of [36].

Lemma 1.

For any x=(v,𝛍), if xX satisfies (2) and (5) (i.e. v and 𝛍 are the value function and state-action distribution of policy π𝛍), rep(x;x)=-𝛍av.

creftype 1 implies rep(x;x*)=-𝝁𝐚𝐯*, which is non-negative. This term is similar to an equality called the performance difference lemma [29, 20].

Lemma 2.

Let vπ and 𝛍π denote the value and state-action distribution of some policy π. Then for any function v, it holds that p(vπ-v)=(𝛍π)av. In particular, it implies Vπ(p)-Vπ(p)=(𝛍π)avπ.

From creftypeplural 2\crefpairconjunction1, we see that the difference between the residual rep(x;x*)=-𝝁𝐚𝐯* and the performance gap Vπ𝝁(p)-Vπ*(p)=(𝝁π𝝁)𝐚𝐯* is due to the mismatch between 𝝁 and 𝝁π𝝁, or more specifically, the mismatch between the two marginals 𝐝=𝐄𝝁 and 𝐝π𝝁=𝐄𝝁π𝝁. Indeed, when 𝐝=𝐝π𝝁, the residual is equal to the performance gap. However, in general, we do not have control over that difference for the sequence of variables {xn=(𝐯n,𝝁n)𝒳} an algorithm generates. The sufficient condition in creftype 2 attempts to mitigate the difference, using the fact 𝐝π𝝁=(1-γ)𝐩+γ𝐏π𝝁𝐝π𝝁 from (2), where 𝐏π𝝁 is the transition matrix under π𝝁. But the missing half γ𝐏π𝝁𝐝π𝝁 (due to the long-term effects in the MDP) introduces the unavoidable, weak constant (1-γ)minsp(s), if we want to have an uniform bound on 𝐯*-𝐯π𝝁. The counterexample in creftype 3 was designed to maximize the effect of covariate shift, so that 𝝁 fails to captures state-action pairs with high advantage. To break the curse, we must properly weight the gap between 𝐯* and 𝐯π𝝁 instead of relying on the uniform bound on 𝐯*-𝐯π𝝁 as before.

4 THE REDUCTION

The analyses above reveal both good and bad properties of the saddle-point setup in (8). On the one hand, we showed that approximate solutions to the saddle-point problem in (8) can be obtained by running any no-regret algorithm in the single-player COL problem defined in (16); many efficient algorithms are available from the online learning literature. On the other hand, we also discovered a root difficulty in converting an approximate solution of (8) to an approximately optimal policy in RL (creftype 2), even after imposing strong conditions like (19). At this point, one may wonder if the formulation based on (8) is fundamentally sample inefficient compared with other approaches to RL, but this is actually not true.

Our main contribution shows that learning a policy through running a no-regret algorithm in the COL problem in (16) is, in fact, as sample efficient in policy performance as other RL techniques, even without the common constraint in (19) or extra assumptions on the MDP like ergodicity imposed in the literature.

Theorem 1.

Let XN={xnX}n=1N be any sequence. Let π^N be the policy given by x^N via (17), which is either the average or the best decision in XN. Define yN*(vπ^N,𝛍*). Then Vπ^N(p)V*(p)-𝑅𝑒𝑔𝑟𝑒𝑡N(yN*)N.

creftype 1 shows that if XN has sublinear regret, then both the average policy and the best policy in XN converge to the optimal policy in performance with a rate O(RegretN(yN*)/N). Compared with existing results obtained through creftype 2, the above result removes the factor (1-γ)minsp(s) and impose no assumption on XN or the MDP. Indeed creftype 1 holds for any sequence. For example, when XN is generated by stochastic feedback of ln, creftype 1 continues to hold, as the regret is defined in terms of ln, not of the sampled loss. Stochasticity only affects the regret rate.

In other words, we have shown that when 𝝁 and 𝐯 can be directly parameterized, an approximately optimal policy for the RL problem can be obtained by running any no-regret online learning algorithm, and that the policy quality is simply dictated by the regret rate. To illustrate, in creftype 5 we will prove that simply running mirror descent in this COL produces an RL algorithm that is as sample efficient as other common RL techniques. One can further foresee that algorithms leveraging the continuity in COL—e.g. mirror-prox [19] or PicCoLO [10]—and variance reduction can lead to more sample efficient RL algorithms.

Below we will also demonstrate how to use the fact that COL is single-player (see creftype 2.3) to cleanly incorporate the effects of using function approximators to model 𝝁 and 𝐯. We will present a corollary of creftype 1, which separates the problem of learning 𝝁 and 𝐯, and that of approximating and 𝒱 with function approximators. The first part is controlled by the rate of regret in online learning, and the second part depends on only the chosen class of function approximators, independently of the learning process. As these properties are agnostic to problem setups and algorithms, our reduction leads to a framework for systematic synthesis of new RL algorithms with performance guarantees. The missing proofs of this section are in creftype C.

4.1 Proof of creftype 1

The main insight of our reduction is to adopt, in defining rep(x;x), a comparator x𝒳 based on the output of the algorithm (represented by x), instead of the fixed comparator x* (the optimal pair of value function and state-action distribution) that has been used conventionally, e.g. in creftype 2. While this idea seems unnatural from the standard saddle-point or EP perspective, it is possible, because the regret in online learning is measured against the worst-case choice in 𝒳, which is allowed to be selected in hindsight. Specifically, we propose to select the following comparator to directly bound V*(p)-Vπ^N(p) instead of the conservative measure V*-Vπ^N used before.

Proposition 4.

For x=(v,𝛍)X, define yx*(vπ𝛍,𝛍*)X. It holds rep(x;yx*)=V*(p)-Vπ𝛍(p).

To finish the proof, let x^N be either 1Nn=1Nxn or argminxXNrep(x;yx*), and let π^N denote the policy given by (17). First, V*(p)-Vπ^N(p)=rep(x^N;yN*) by creftype 4. Next we follow the proof idea of creftype 1 in [9]: because F is skew-symmetric and F(yN*,) is convex, we have by (18)

V*(p)-Vπ^N(p)=rep(x^N;yN*)=-F(x^N,yN*)
=F(yN*,x^N)1Nn=1NF(yN*,xn)
=1Nn=1N-F(xn,yN*)=1NRegretN(yN*).

4.2 Function Approximators

When the state and action spaces are large or continuous, directly optimizing 𝐯 and 𝝁 can be impractical. Instead we can consider optimizing over a subset of feasible choices parameterized by function approximators

𝒳Θ={𝐱θ=(ϕθ,𝝍θ):𝝍θ,θΘ}, (20)

where ϕθ and 𝝍θ are functions parameterized by θΘ, and Θ is a parameter set. Because COL is a single-player setup, we can extend the previous idea and creftype 1 to provide performance bounds in this case by a simple rearrangement (see creftype C), which is a common trick used in the online imitation learning literature [32, 8, 11]. Notice that, in (20), we require only 𝝍θ, but not ϕθ𝒱, because for the performance bound in our reduction to hold, we only need the constraint (see creftype 4 in proof of creftype 4).

Corollary 1.

Let XN={xnXθ}n=1N be any sequence. Let π^N be the policy given either by the average or the best decision in XN. It holds that

Vπ^N(p)V*(p)-𝑅𝑒𝑔𝑟𝑒𝑡N(Θ)N-ϵΘ,N

where ϵΘ,N=minxθXθrep(x^N;yN*)-rep(x^N;xθ) measures the expressiveness of Xθ, and 𝑅𝑒𝑔𝑟𝑒𝑡N(Θ)n=1Nln(xn)-minxXΘn=1Nln(x).

We can quantify ϵΘ,N with the basic Hölder’s inequality.

Proposition 5.

Let x^N=(v^N,𝛍^N). Under the setup in creftype 1, regardless of the parameterization, it is true that ϵΘ,N is no larger than

min(𝐯θ,𝝁θ)𝒳Θ𝝁θ-𝝁*11-γ+min𝐰:𝐰1𝐛𝝁^N1,𝐰𝐯θ-𝐯π^N,1/𝐰
min(𝐯θ,𝝁θ)𝒳Θ11-γ(𝝁θ-𝝁*1+2𝐯θ-𝐯π^N).

where the norms are defined as x1,w=iwi|xi| and x,1/w=maxiwi-1|xi|.

creftype 5 says ϵΘ,N depends on how well 𝒳Θ captures the value function of the output policy 𝐯π^N and the optimal state-action distribution 𝝁*. We remark that this result is independent of how 𝐯π^N is generated. Furthermore, creftype 5 makes no assumption whatsoever on the structure of function approximators. It even allows sharing parameters θ between 𝐯=ϕθ and 𝝁=𝝍θ, e.g., they can be a bi-headed neural network, which is common for learning shared feature representations. More precisely, the structure of the function approximator would only affect whether ln((ϕθ,𝝍θ)) remains a convex function in θ, which determines the difficulty of designing algorithms with sublinear regret.

In other words, the proposed COL formulation provides a reduction which dictates the policy performance with two separate factors: 1) the rate of regret RegretN(Θ) which is controlled by the choice of online learning algorithm; 2) the approximation error ϵΘ,N which is determined by the choice of function approximators. These two factors can almost be treated independently, except that the choice of function approximators would determine the properties of ln((ϕθ,𝝍θ)) as a function of θ, and the choice of Θ needs to ensure (20) is admissible.

5 SAMPLE COMPLEXITY OF MIRROR DESCENT

{algorithm}

[t] Mirror descent for RL \[email protected]@algorithmic [1] \ENSUREϵ optimality of the γ-average return
      δ maximal failure probability
      generative model of an MDP \REQUIREπ^N=π𝝁^N \STATEx1=(𝐯1,𝝁1) where 𝝁1 is uniform and 𝐯1𝒱 \STATESet N=Ω~(|SS||𝒜|log(1δ)(1-γ)2ϵ2) and η=(1-γ)(|SS||𝒜|N)-1/2 \STATESet the Bregman divergence as (22) \FORn=1N-1 \STATESample gn according to (24) \STATEUpdate to xn+1 according to (21)
\ENDFOR\STATESet (𝐯^N,𝝁^N)=x^N=1Nn=1Nxn

We demonstrate the power of our reduction by applying perhaps the simplest online learning algorithm, mirror descent, to the proposed COL problem in (16) with stochastic feedback (creftype 5). For transparency, we discuss the tabular setup. We will show a natural extension to basis functions at the end.

Recall that mirror descent is a first-order algorithm, whose update rule can be written as

xn+1=argminx𝒳gn,x+1ηBR(x||xn) (21)

where η>0 is the step size, gn is the feedback direction, and BR(x||x)=R(x)-R(x)-R(x),x-x is the Bregman divergence generated by a strictly convex function R. Based on the geometry of 𝒳=𝒱×, we consider a natural Bregman divergence of the form

BR(x||x)=12|SS|𝐯-𝐯22+KL(𝝁||𝝁) (22)

This choice mitigates the effects of dimension (e.g. if we set x1=(𝐯1,𝝁1) with 𝝁1 being the uniform distribution, it holds BR(x||x1)=O~(1) for any x𝒳).

To define the feedback direction gn, we slightly modify the per-round loss ln in (16) and consider a new loss

hn(x)𝐛𝝁n𝐯+𝝁(11-γ𝟏-𝐚𝐯n) (23)

that shifts ln by a constant, where 𝟏 is the vector of ones. One can verify that ln(x)-ln(x)=hn(x)-hn(x), for all x,x𝒳 when 𝝁,𝝁 in x and x satisfy 𝝁1=𝝁1 (which holds for creftype 5). Therefore, using hn does not change regret. The reason for using hn instead of ln is to make 𝝁hn((𝐯,𝝁)) (and its unbiased approximation) a positive vector, so the regret bound can have a better dimension dependency. This is a common trick used in online learning (e.g. EXP3 [16]) for optimizing variables living in a simplex (𝝁 here).

We set the first-order feedback gn as an unbiased sampled estimate of hn(xn). In round n, this is realized by two independent calls of the generative model:

gn=[𝐩~n+11-γ(γ𝐏~n-𝐄n)𝝁~n|SS||𝒜|(11-γ𝟏^n-𝐫^n-11-γ(γ𝐏^n-𝐄^n)𝐯n)] (24)

Let gn=[𝐠n,v;𝐠n,μ]. For 𝐠n,v, we sample 𝐩, sample 𝝁n to get a state-action pair, and query the transition 𝐏 at the state-action pair sampled from 𝝁n. (𝐩~n, 𝐏~n, and 𝝁~n denote the single-sample estimate of these probabilities.) For 𝐠n,μ, we first sample uniformly a state-action pair (which explains the factor |SS||𝒜|), and then query the reward 𝐫 and the transition 𝐏. (𝟏^n, 𝐫^n, 𝐏^n, and 𝐄^n denote the single-sample estimates.) To emphasize, we use ~ and ^ to distinguish the empirical quantities obtained by these two independent queries. By construction, we have 𝐠n,μ0. It is clear that this direction gn is unbiased, i.e. 𝔼[gn]=hn(xn). Moreover, it is extremely sparse and can be computed using O(1) sample, computational, and memory complexities.

Below we show this algorithm, despite being extremely simple, has strong theoretical guarantees. In other words, we obtain simpler versions of the algorithms proposed in [34, 36, 6] but with improved performance.

Theorem 2.

With probability 1-δ, creftype 5 learns an ϵ-optimal policy with O~(|SS||A|log(1δ)(1-γ)2ϵ2) samples.

Note that the above statement makes no assumption on the MDP (except the tabular setup for simplifying analysis). Also, because the definition of value function in (1) is scaled by a factor (1-γ), the above result translates into a sample complexity in O~(|SS||𝒜|log(1δ)(1-γ)4ϵ2) for the conventional discounted accumulated rewards.

5.1 Proof Sketch of creftype 2

The proof is based on the basic property of mirror descent and martingale concentration. We provide a sketch here; please refer to creftype D for details. Let yN*=(𝐯π^N,𝝁*). We bound the regret in creftype 1 by the following rearrangement, where the first equality below is because hn is a constant shift from ln.

RegretN(yN*)=n=1Nhn(xn)-n=1Nhn(yN*)
(n=1N(hn(xn)-gn)xn)+(maxx𝒳n=1Ngn(xn-x))
+(n=1N(gn-hn(xn))yN*)

We recognize the first term is a martingale, because xn does not depend on gn. Therefore, we can appeal to a Bernstein-type martingale concentration and prove it is in O~(N|SS||𝒜|log(1δ)1-γ). For the second term, by treating gnx as the per-round loss, we can use standard regret analysis of mirror descent and show a bound in O~(N|SS||𝒜|1-γ). For the third term, because 𝐯π^N in yN*=(𝐯π^N,𝝁*) depends on {gn}n=1N, it is not a martingale. Nonetheless, we are able to handle it through a union bound and show it is again no more than O~(N|SS||𝒜|log(1δ)1-γ). Despite the union bound, it does not increase the rate because we only need to handle 𝐯π^N, not 𝝁* which induces a martingale. To finish the proof, we substitute this high-probability regret bound into creftype 1 to obtain the desired claim.

5.2 Extension to Function Approximators

The above algorithm assumes the tabular setup for illustration purposes. In creftype E, we describe a direct extension of creftype 5 that uses linearly parameterized function approximators of the form xθ=(𝚽𝜽v,𝚿𝜽μ), where columns of bases 𝚽,𝚿 belong to 𝒱 and , respectively, and (𝜽v,𝜽μ)Θ.

Overall the algorithm stays the same, except the gradient is computed by chain-rule, which can be done in O(dim(Θ)) time and space. While this seems worse, the computational complexity per update actually improves to O(dim(Θ)) from the slow O(|SS||𝒜|) (required before for the projection in (21)), as now we only optimize in Θ. Moreover, we prove that its sample complexity is also better, though at the cost of bias ϵΘ,N in creftype 1. Therefore, the algorithm becomes applicable to large-scale or continuous problems.

Theorem 3.

Under a proper choice of Θ and BR, with probability 1-δ, creftype 5 learns an (ϵ+ϵΘ,N)-optimal policy with O~(dim(Θ)log(1δ)(1-γ)2ϵ2) samples.

The proof is in creftype E, and mainly follows creftype 5.1. First, we choose some Θ to satisfy (20) so we can use creftype 1 to reduce the problem into regret minimization. To make the sample complexity independent of |SS|,|𝒜|, the key is to uniformly sample over the columns of 𝚿 (instead of over all states and actions like (24)) when computing unbiased estimates of 𝜽μhn((𝜽v,𝜽μ)). The intuition is that we should only focus on the places our basis functions care about (of size dim(Θ)), instead of wasting efforts to visit all possible combinations (of size |SS||𝒜|).

6 CONCLUSION

We propose a reduction from RL to no-regret online learning that provides a systematic way to design new RL algorithms with performance guarantees. Compared with existing approaches, our framework makes no assumption on the MDP and naturally works with function approximators. To illustrate, we design a simple RL algorithm based on mirror descent; it achieves similar sample complexity as other RL techniques, but uses minimal assumptions on the MDP and is scalable to large or continuous problems. This encouraging result evidences the strength of the online learning perspective. As a future work, we believe even faster learning in RL is possible by leveraging control variate for variance reduction and by applying more advanced online techniques [31, 10] that exploit the continuity in COL to predict the future gradients.

Acknowledgements

This research is partially supported by NVIDIA Graduate Fellowship.

References

  • [1] J. Abernethy, P. L. Bartlett, and E. Hazan (2011) Blackwell approachability and no-regret learning are equivalent. In Annual Conference on Learning Theory, pp. 27–46. Cited by: §3, §3.
  • [2] R. Bellman (1954) The theory of dynamic programming. Bulletin of the American Mathematical Society 60 (6), pp. 503–515. Cited by: §A.1, §2.1.
  • [3] M. Bianchi and S. Schaible (1996) Generalized monotone bifunctions and equilibrium problems. Journal of Optimization Theory and Applications 90 (1), pp. 31–43. Cited by: §1.
  • [4] E. Blum (1994) From optimization and variational inequalities to equilibrium problems. Math. student 63, pp. 123–145. Cited by: §1.
  • [5] N. Cesa-Bianchi and G. Lugosi (2006) Prediction, learning, and games. Cambridge university press. Cited by: §2.3.
  • [6] Y. Chen, L. Li, and M. Wang (2018) Scalable bilinear π learning using state and action features. arXiv preprint arXiv:1804.10328. Cited by: §1, §2.2, §5.
  • [7] Y. Chen and M. Wang (2016) Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516. Cited by: §1, §1, §3.2.1.
  • [8] C. Cheng and B. Boots (2018) Convergence of value aggregation for imitation learning. International Conference on Artificial Intelligence and Statistics. Cited by: §4.2.
  • [9] C. Cheng, J. Lee, K. Goldberg, and B. Boots (2019) Online learning with continuous variations: dynamic regret and reductions. arXiv preprint arXiv:1902.07286. Cited by: §1, §2.3, §2.3, §2.3, §2.3, §3.2, §3, §4.1, Proposition 1.
  • [10] C. Cheng, X. Yan, N. Ratliff, and B. Boots (2019) Predictor-corrector policy optimization. In International Conference on Machine Learning, Cited by: §1, §4, §6.
  • [11] C. Cheng, X. Yan, E. A. Theodorou, and B. Boots (2019) Accelerating imitation learning with predictive models. In International Conference onArtificial Intelligence and Statistics, Cited by: §4.2.
  • [12] B. Dai, A. Shaw, N. He, L. Li, and L. Song (2018) Boosting the actor with dual critic. In International Conference on Learning Representation, Cited by: §1.
  • [13] E. V. Denardo and B. L. Fox (1968) Multichain Markov renewal programs. SIAM Journal on Applied Mathematics 16 (3), pp. 468–487. Cited by: §A.2, §1.
  • [14] G. J. Gordon (1999) Regret bounds for prediction problems. In Conference on Learning Theory, Vol. 99, pp. 29–40. Cited by: §1.
  • [15] E. Hazan et al. (2016) Introduction to online convex optimization. Foundations and Trends® in Optimization 2 (3-4), pp. 157–325. Cited by: §2.3.
  • [16] E. Hazan (2016) Introduction to online convex optimization.. Foundations and Trends in Optimization 2 (3-4), pp. 157–325. External Links: Link Cited by: §5.
  • [17] O. Hernández-Lerma and J. B. Lasserre (2012) Discrete-time markov control processes: basic optimality criteria. Vol. 30, Springer Science & Business Media. Cited by: §2.1.
  • [18] A. Jofré and R. J. Wets (2014) Variational convergence of bifunctions: motivating applications. SIAM Journal on Optimization 24 (4), pp. 1952–1979. Cited by: §2.3.
  • [19] A. Juditsky, A. Nemirovski, and C. Tauvel (2011) Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems 1 (1), pp. 17–58. Cited by: §2.2, §4.
  • [20] S. Kakade and J. Langford (2002) Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, Vol. 2, pp. 267–274. Cited by: §3.2.2.
  • [21] S. M. Kakade et al. (2003) On the sample complexity of reinforcement learning. Ph.D. Thesis, University of London London, England. Cited by: §1, §3.2.1.
  • [22] M. J. Kearns and S. P. Singh (1999) Finite-sample convergence rates for q-learning and indirect algorithms. In Advances in neural information processing systems, pp. 996–1002. Cited by: §1, footnote 1.
  • [23] C. Lakshminarayanan, S. Bhatnagar, and C. Szepesvári (2018) A linearly relaxed approximate linear program for Markov decision processes. IEEE Transactions on Automatic Control 63 (4), pp. 1185–1191. Cited by: §1.
  • [24] D. Lee and N. He (2018) Stochastic primal-dual Q-learning. arXiv preprint arXiv:1810.08298. Cited by: §1, §1, §3.2.1, §3.2.1.
  • [25] Q. Lin, S. Nadarajah, and N. Soheili (2017) Revisiting approximate linear programming using a saddle point based reformulation and root finding solution approach. Technical report working paper, U. of Il. at Chicago and U. of Iowa. Cited by: §1.
  • [26] A. S. Manne et al. (1959) Linear programming and sequential decision models. Technical report Cowles Foundation for Research in Economics, Yale University. Cited by: §A.2, §1.
  • [27] C. McDiarmid (1998) Concentration. In Probabilistic methods for algorithmic discrete mathematics, pp. 195–248. Cited by: §D.1, §D.2, Lemma 6.
  • [28] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro (2009) Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization 19 (4), pp. 1574–1609. Cited by: §2.2.
  • [29] A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In International Conference on Machine Learning, Vol. 99, pp. 278–287. Cited by: §3.2.2.
  • [30] M. L. Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §2.1.
  • [31] A. Rakhlin and K. Sridharan (2012) Online learning with predictable sequences. arXiv preprint arXiv:1208.3728. Cited by: §1, §6.
  • [32] S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pp. 627–635. Cited by: §4.2.
  • [33] S. Shalev-Shwartz et al. (2012) Online learning and online convex optimization. Foundations and Trends® in Machine Learning 4 (2), pp. 107–194. Cited by: §2.3.
  • [34] M. Wang and Y. Chen (2016) An online primal-dual method for discounted Markov decision processes. In Conference on Decision and Control, pp. 4516–4521. Cited by: §1, §1, §2.2, §3.2.1, §5.
  • [35] M. Wang (2017) Primal-dual π learning: sample complexity and sublinear run time for ergodic Markov decision problems. arXiv preprint arXiv:1710.06100. Cited by: §1.
  • [36] M. Wang (2017) Randomized linear programming solves the discounted Markov decision problem in nearly-linear running time. ArXiv e-prints. Cited by: §B.3, §1, §1, §3.2.1, §3.2.1, §3.2.1, §3.2.2, §3, §5, footnote 2.
  • [37] M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pp. 928–936. Cited by: §1.

Appendix

Appendix A Review of RL Setups

We provide an extended review of different formulations of RL for interested readers. First, let us recall the problem setup. Let SS and 𝒜 be state and action spaces, and let π(a|s) denote a policy. For γ[0,1), we are interested in solving a γ-discounted infinite-horizon RL problem:

maxπVπ(p),s.t.Vπ(p)(1-γ)𝔼s0p𝔼ξρπ(s0)[t=0γtr(st,at)] (25)

where Vπ(p) is the discounted average return, r:SS×𝒜[0,1] is the reward function, ρπ(s0) denotes the distribution of trajectory ξ=s0,a0,s1, generated by running π from state s0 in a Markov decision process (MDP), and p is a fixed but unknown initial state distribution.

A.1 Coordinate-wise Formulations

RL in terms of stationary state distribution

Let dtπ(s) denote the state distribution at time t given by running π starting from p. We define its γ-weighted mixture as

dπ(s)(1-γ)t=0γtdtπ(s) (26)

We can view dπ in (26) as a form of stationary state distribution of π, because it is a valid probability distribution of state and satisfies the stationarity property below,

dπ(s)=(1-γ)p(s)+γ𝔼sdπ𝔼aπ|s[𝒫(s|s,a)] (2)

where 𝒫(s|s,a) is the transition probability of the MDP. The definition in (26) generalizes the concept of stationary distribution of MDP; as γ1, dπ is known as the limiting average state distribution, which is the same as the stationary distribution of the MDP under π, if one exists. Moreover, with the property in (2), dπ summarizes the Markov structure of RL, and allows us to write (25) simply as

maxπVπ(p),s.t.Vπ(p)=𝔼sdπ𝔼aπ|s[r(s,a)] (27)

after commuting the order of expectation and summation. That is, an RL problem aims to maximize the expected reward under the stationary state-action distribution generated by the policy π.

RL in terms of value function

We can also write (25) in terms of value function. Recall

Vπ(s)(1-γ)𝔼ξρπ(s0)|s0=s[t=0γtr(st,at)] (1)

is the value function of π. By definition, Vπ (like dπ) satisfies a stationarity property

Vπ(s)=𝔼aπ|s[(1-γ)r(s,a)+γ𝔼s𝒫|s,a[Vπ(s)]] (5)

which can be viewed as a dual equivalent of (2). Because r is in [0,1], (5) implies Vπ lies in [0,1].

The value function V* (a shorthand of Vπ*) of the optimal policy π* of the RL problem satisfies the so-called Bellman equation [2]: V*(s)=maxa𝒜(1-γ)r(s,a)+γ𝔼s𝒫|s,a[V*(s)], where the optimal policy π* can be recovered as the argmax. Equivalently, by the definition of max, the Bellman equation amounts to finding the smallest V such that V(s)(1-γ)r(s,a)+γ𝔼s𝒫|s,a[V(s)], sSS,a𝒜. In other words, the RL problem in (25) can be written as

minV𝔼sp[V(s)]  s.t.V(s)(1-γ)r(s,a)+γ𝔼s𝒫|s,a[V(s)],sSS,a𝒜 (28)

A.2 Linear Programming Formulations

We now connect the above two alternate expressions through the classical LP setup of RL [26, 13].

LP in terms of value function

The classic LP formulation55 5 Our setup in (4) differs from the classic one in the (1-γ) factor in the constraint to normalize the problem. is simply a restatement of (28):

min𝐯𝐩𝐯  s.t.(1-γ)𝐫+γ𝐏𝐯𝐄𝐯 (4)

where 𝐩|SS|, 𝐯|SS|, and 𝐫|SS||𝒜| are the vector forms of p, V, r, respectively, 𝐏|SS||𝒜|×|SS| is the transition probability66 6 We arrange the coordinates in a way such that along the |SS||𝒜| indices are contiguous in actions., and 𝐄=𝐈𝟏|SS||𝒜|×|SS| (we use || to denote the cardinality of a set, the Kronecker product, 𝐈|SS|×|SS| is the identity, and 𝟏|𝒜| a vector of ones). It is easy to verify that for all 𝐩>0, the solution to (4) is the same and equal to 𝐯* (the vector form of V*).

LP in terms of stationary state-action distribution

Define the Lagrangian function

(𝐯,𝐟)𝐩𝐯+𝐟((1-γ)𝐫+γ𝐏𝐯-𝐄𝐯) (29)

where 𝐟𝟎|SS||𝒜| is the Lagrangian multiplier. By Lagrangian duality, the dual problem of (4) is given as max𝐟𝟎min𝐯(𝐯,𝐟). Or after substituting the optimality condition of 𝐯 and define 𝝁(1-γ)𝐟, we can write the dual problem as another LP problem

max𝝁𝟎𝐫𝝁  s.t.(1-γ)𝐩+γ𝐏𝝁=𝐄𝝁 (3)

Note that this problem like (4) is normalized: we have 𝝁1=1 because 𝐩1=1, and

𝝁1=𝟏𝐄𝝁=(1-γ)𝟏𝐩+γ𝟏𝐏𝝁=(1-γ)𝐩1+γ𝝁1

where we use the facts that 𝝁𝟎 and 𝐏 is a stochastic transition matrix. This means that 𝝁 is a valid state-action distribution, from which we see that the equality constraint in (3) is simply a vector form (2). Therefore, (3) is the same as (27) if we define the policy π as the conditional distribution based on 𝝁.

Appendix B Missing Proofs of creftype 3

B.1 Proof of creftype 1

See 1

Proof.

First note that F(x,x)=0. Then as x satisfies stationarity, we can use creftype 2 below and write

rep(x;x) =F(x,x)-F(x,x)
=-F(x,x)
=-(𝐩𝐯-𝐩𝐯)-𝝁𝐚𝐯+𝝁𝐚𝐯 ( Definition of F in (3.1))
=-𝝁𝐚𝐯-𝝁𝐚𝐯+𝝁𝐚𝐯 ( creftype 2)
=-𝝁𝐚𝐯

B.2 Proof of creftype 2

See 2

Proof.

This is the well-known performance difference lemma. The proof is based on the stationary properties in (2) and (5), which can be stated in vector form as

(𝝁π)𝐄𝐯π=(𝝁π)((1-γ)𝐫+γ𝐏𝐯π)  and  (1-γ)𝐩+γ𝐏𝝁π=𝐄𝝁π

The proof is a simple application of these two properties.

𝐩(𝐯π-𝐯) =11-γ(𝐄𝝁π-γ𝐏𝝁π)(𝐯π-𝐯)
=11-γ(𝝁π)((𝐄-γ𝐏)𝐯π-(𝐄-γ𝐏)𝐯)
=11-γ(𝝁π)((1-γ)𝐫-(𝐄-γ𝐏)𝐯)=(𝝁π)𝐚𝐯

where we use the stationarity property of 𝝁π in the first equality and that 𝐯π in the third equality. ∎

B.3 Proof of creftype 2

See 2

Proof.

This proof mainly follows the steps in [36] but written in our notation. First creftype 1 shows rep(x;x*)=-𝝁𝐚𝐯*. We then lower bound -𝝁𝐚𝐯* by reversing the proof of the performance difference lemma (creftype 2).

𝝁𝐚𝐯* =11-γ𝝁((1-γ)𝐫-(𝐄-γ𝐏)𝐯*) ( Definition of 𝐚𝐯*)
=11-γ𝝁((𝐄-γ𝐏)𝐯π𝝁-(𝐄-γ𝐏)𝐯*) ( Stationarity of 𝐯π𝝁)
=11-γ𝝁(𝐄-γ𝐏)(𝐯π𝝁-𝐯*)
=11-γ𝐝(𝐈-γ𝐏π𝝁)(𝐯π𝝁-𝐯*)

where we define 𝐝𝐄𝝁 and 𝐏π𝝁 as the state-transition of running policy π𝝁.

We wish to further upper bound this quantity. To proceed, we appeal to the Bellman equation of the optimal value function 𝐯* and the stationarity of 𝐯π𝝁:

𝐯*(1-γ)𝐫π𝝁+γ𝐏π𝝁𝐯*  and  𝐯π𝝁=(1-γ)𝐫π𝝁+γ𝐏π𝝁𝐯π𝝁,

which together imply that (𝐈-γ𝐏π𝝁)(𝐯π𝝁-𝐯*)0. We will also use the stationarity of 𝐝π𝝁 (the average state distribution of π𝝁): 𝐝π𝝁=(1-γ)𝐩+γ𝐏𝝅𝝁𝐝π𝝁.

Since 𝐝(1-γ)𝐩 in the assumption, we can then write

𝝁𝐚𝐯* =11-γ𝐝(𝐈-γ𝐏π𝝁)(𝐯π𝝁-𝐯*)
𝐩(𝐈-γ𝐏π𝝁)(𝐯π𝝁-𝐯*)
-minsp(s)(𝐈-γ𝐏π𝝁)(𝐯π𝝁-𝐯*)
-minsp(s)(1-γ)𝐯π𝝁-𝐯*.

Finally, flipping the sign of the inequality concludes the proof. ∎

B.4 Proof of creftype 3

See 3

Proof.

We show this equality holds for a class of MDPs. For simplicity, let us first consider an MDP with three states 1, 2, 3 and for each state there are three actions (left, right, stay). They correspond to intuitive, deterministic transition dynamics

𝒫(max{s-1,1}|s,left)=1,𝒫(min{s+1,3}|s,right)=1,𝒫(s|s,stay)=1.

We set the reward as r(s,right)=1 for s=1,2,3 and zero otherwise. It is easy to see that the optimal policy is π*(right|s)=1, which has value function 𝐯*=[1,1,1].

Now consider x=(𝐯,𝝁)𝒳. To define 𝝁, let μ(s,a)=d(s)π𝝁(a|s). We set

π𝝁(right|1)=1,π𝝁(stay|2)=1,π𝝁(right|3)=1

That is, π𝝁 is equal to π* except when s=2. One can verify the value function of this policy is 𝐯π𝝁=[(1-γ),0,1].

As far as d is concerned (𝐝=𝐄𝝁), suppose the initial distribution is uniform, i.e. 𝐩=[1/3,1/3,1/3], we choose d as 𝐝=(1-γ)𝐩+γ[1,0,0], which satisfies the assumption in creftype 2. Therefore, we have 𝝁 and we will let 𝐯 be some arbitrary point in 𝒱.

Now we show for this choice x=(𝐯,𝝁)𝒱×, the equality in creftype 2 holds. By creftype 1, we know rep(x;x)=-𝝁𝐚𝐯*. Recall the advantage is defined as 𝐚𝐯*=𝐫+11-γ(γ𝐏-𝐄)𝐯*. Let AV*(s,a) denote the functional form of 𝐚𝐯* and define the expected advantage:

AV*(s,π𝝁)𝔼aπ𝝁[AV*(s,a)].

We can verify it has the following values:

AV*(1,π𝝁)=0,AV*(2,π𝝁)=-1,AV*(3,π𝝁)=0.

Thus, the above construction yields

rep(x;x*)=-𝝁𝐚𝐯*=(1-γ)3=(1-γ)minsp(s)𝐯*-𝐯π𝝁

One can easily generalize this 3-state MDP to an |SS|-state MDP where states are partitioned into three groups. ∎

Appendix C Missing Proofs of creftype 4

C.1 Proof of creftype 4

See 4

Proof.

First we generalize creftype 1.

Lemma 3.

Let x=(v,𝛍) be arbitrary. Consider x~=(v+u,𝛍), where v and 𝛍 are the value function and state-action distribution of policy π𝛍, and u is arbitrary. It holds that rep(x;x~)=-𝛍av-b𝛍u.

To proceed, we write yx*=(𝐯*+(𝐯π𝝁-𝐯*),𝝁*) and use creftype 3, which gives rep(x;yx*)=-𝝁𝐚𝐯*-𝐛𝝁(𝐯π𝝁-𝐯*). To relate this equality to the policy performance gap, we also need the following equality.

Lemma 4.

For 𝛍M, it holds that -𝛍av*=V*(p)-Vπ𝛍(p)+b𝛍(vπ𝛍-v*).

Together they imply the desired equality rep(x;yx*)=V*(p)-Vπ𝝁(p). ∎

C.1.1 Proof of creftype 3

See 3

Proof.

Let x=(𝐯,𝝁). As shorthand, define 𝐟𝐯+𝐮, and 𝐋11-γ(γ𝐏-𝐄) (i.e. we can write 𝐚𝐟=𝐫+𝐋𝐟). Because rep(x;x)=-F(x,x)=-(𝐩𝐯+𝝁𝐚𝐯-𝐩𝐯-𝝁𝐚𝐯), we can write

rep(x;x~) =-𝐩𝐟-𝝁𝐚𝐟+𝐩𝐯+𝝁𝐚𝐯
=(-𝐩𝐯-𝝁𝐚𝐯+𝐩𝐯+𝝁𝐚𝐯)-𝐩𝐮-𝝁𝐋𝐮
=rep(x;x)-𝐩𝐮-𝝁𝐋𝐮
=rep(x;x)-𝐛𝝁𝐮

Finally, by creftype 1, we have also rep(x;x)=-𝝁𝐚𝐯 and therefore the final equality. ∎

C.1.2 Proof of creftype 4

See 4

Proof.

Following the setup in creftype 3, we prove the statement by the rearrangement below:

-𝝁𝐚𝐯 =-(𝝁π𝝁)𝐚𝐯+(𝝁π𝝁)𝐚𝐯-𝝁𝐚𝐯
=Vπ(p)-Vπ𝝁(p)+(𝝁π𝝁-𝝁)𝐚𝐯
=(Vπ(p)-Vπ𝝁(p))+(𝝁π𝝁-𝝁)𝐫+(𝝁π𝝁-𝝁)𝐋𝐯

where the second equality is due to the performance difference lemma, i.e. creftype 2, and the last equality uses the definition 𝐚𝐯=𝐫+𝐋𝐯. For the second term above, let 𝐫π𝝁 and 𝐏π𝝁 denote the expected reward and transition under π𝝁. Because 𝝁, we can rewrite it as

(𝝁π𝝁-𝝁)𝐫 =(𝐄𝝁π𝝁-𝐄𝝁)𝐫π𝝁
=((1-γ)𝐩+γ𝐏𝝁π𝝁-𝐄𝝁)𝐫π𝝁
=(1-γ)𝐛𝝁𝐫π𝝁+γ(𝝁π𝝁-𝝁)𝐏𝐫π𝝁
=(1-γ)𝐛𝝁(𝐫π𝝁+γ𝐏π𝝁𝐫π𝝁+γ2𝐏π𝝁2𝐫π𝝁+)
=𝐛𝝁𝐯π𝝁

where the second equality uses the stationarity of 𝝁𝝅𝝁 given by (2). For the third term, it can be written

(𝝁π𝝁-𝝁)𝐋𝐯=(-𝐩-𝐋𝝁)𝐯=-𝐛𝝁𝐯

where the first equality uses stationarity, i.e. 𝐛𝝁π𝝁=𝐩+𝐋𝝁π𝝁=0. Finally combining the three steps, we have

-𝝁𝐚𝐯 =Vπ(p)-Vπ𝝁(p)+𝐛𝝁(𝐯π𝝁-𝐯)

C.2 Proof of creftype 1

See 1

Proof.

This can be proved by a simple rearrangement

V*(p)-Vπ^N(p)=rep(x^N;yN*)=ϵΘ,N+maxxθ𝒳θrep(x^N;xθ)ϵΘ,N+RegretN(Θ)N

where the first equality is creftype 4 and the last inequality is due to the skew-symmetry of F, similar to the proof of creftype 1. ∎

C.3 Proof of creftype 5

See 5

Proof.

For shorthand, let us set x=(𝐯,𝝁)=x^N and write also π𝝁=π^N as the associated policy. Let yx*=(𝐯π𝝁,𝝁*) and similarly let xθ=(𝐯θ,𝝁θ)𝒳Θ. With rep(x;x)=-F(x,x) and (3.1), we can write

rep(x;yx*)-rep(x;xθ) =(-𝐩𝐯π𝝁-𝝁𝐚𝐯π𝝁+𝐩𝐯+𝝁*𝐚𝐯)-(-𝐩𝐯θ-𝝁𝐚𝐯θ+𝐩𝐯+𝝁θ𝐚𝐯)
=𝐩(𝐯θ-𝐯π𝝁)+(𝝁*-𝝁θ)𝐚𝐯+𝝁(𝐚𝐯θ-𝐚𝐯π𝝁)
=𝐛𝝁(𝐯θ-𝐯π𝝁)+(𝝁*-𝝁θ)𝐚𝐯

Next we quantize the size of 𝐚𝐯 and 𝐛𝝁.

Lemma 5.

For (v,𝛍)X, av11-γ and b𝛍121-γ.

Proof of creftype 5.

Let Δ denote the set of distributions

𝐚𝐯 =11-γ(1-γ)𝐫+γ𝐏𝐯-𝐄𝐯11-γmaxa,b[0,1]|a-b|11-γ
𝐛𝝁1 =11-γ(1-γ)𝐩+γ𝐏𝝁-𝐄𝝁111-γmax𝐪,𝐪Δ𝐪-𝐪121-γ

Therefore, we have preliminary upper bounds:

(𝝁*-𝝁θ)𝐚𝐯 𝐚𝐯𝝁*-𝝁θ111-γ𝝁*-𝝁θ1
𝐛𝝁(𝐯θ-𝐯π𝝁) 𝐛𝝁1𝐯θ-𝐯π𝝁21-γ𝐯θ-𝐯π𝝁

However, the second inequality above can be very conservative, especially when 𝐛𝝁0 which can be likely when it is close to the end of policy optimization. To this end, we introduce a free vector 𝐰1. Define norms 𝐯,1/𝐰=maxi|vi|wi and 𝜹1,𝐰=iwi|δi|. Then we can instead have an upper bound

𝐛𝝁(𝐯θ-𝐯π𝝁)min𝐰:𝐰1𝐛𝝁1,𝐰𝐯θ-𝐯π𝝁,1/𝐰

Notice that when 𝐰=𝟏 the above inequality reduces to 𝐛𝝁(𝐯θ-𝐯π𝝁)𝐛𝝁1𝐯θ-𝐯π𝝁, which as we showed has an upper bound 21-γ𝐯θ-𝐯π𝝁.

Combining the above upper bounds, we have an upper bound on ϵΘ,N:

ϵΘ,N=rep(x;yx*)-rep(x;xθ) 11-γ𝝁θ-𝝁*1+min𝐰:𝐰1𝐛𝝁1,𝐰𝐯θ-𝐯π𝝁,1/𝐰
11-γ(𝝁θ-𝝁*1+2𝐯θ-𝐯π𝝁).

Since it holds for any θΘ, we can minimize the right-hand side over all possible choices. ∎

Appendix D Proof of Sample Complexity of Mirror Descent

See 2

The proof is a combination of the basic property of mirror descent (creftype 9) and the martingale concentration. Define K=|SS||𝒜| and κ=11-γ as shorthands. We first slightly modify the per-round loss used to compute the gradient. Recall ln(x)𝐩𝐯+𝝁n𝐚𝐯-𝐩𝐯n-𝝁𝐚𝐯n and let us consider instead a loss function

hn(x)𝐛𝝁n𝐯+𝝁(κ𝟏-𝐚𝐯n)

which shifts ln by a constant in each round. (Note for all the decisions (𝐯n,𝝁n) produced by creftype 5 𝝁n always satisfies 𝝁n1=1). One can verify that ln(x)-ln(x)=hn(x)-hn(x), for all x,x𝒳, when 𝝁,𝝁 in x and x satisfy 𝝁1=𝝁1 (which holds for creftype 5). As the definition of regret is relative, we may work with hn in online learning and use it to define the feedback.

The reason for using hn instead of ln is to make 𝝁hn((𝐯,𝝁)) (and its unbiased approximation) a positive vector (because κ𝐚𝐯 for any 𝐯𝒱), so that the regret bound can have a better dependency on the dimension for learning 𝝁 that lives in the simplex . This is a common trick used in the online learning, e.g. in EXP3.

To run mirror descent, we set the first-order feedback gn received by the learner as an unbiased estimate of hn(xn). For round n, we construct gn based on two calls of the generative model:

gn=[𝐠n,v𝐠n,μ]=[𝐩~n+11-γ(γ𝐏~n-𝐄n)𝝁~nK(κ𝟏^n-𝐫^n-11-γ(γ𝐏^n-𝐄^n)𝐯n)]

For 𝐠n,v, we sample 𝐩, then sample 𝝁n to get a state-action pair, and finally query the transition dynamics 𝐏 at the state-action pair sampled from 𝝁n. (𝐩~n, 𝐏~n, and 𝝁~n denote the single-sample empirical approximation of these probabilities.) For 𝐠n,μ, we first sample uniformly a state-action pair (which explains the factor K), and then query the reward 𝐫 and the transition dynamics 𝐏. (𝟏^n, 𝐫^n, 𝐏^n, and 𝐄^n denote the single-sample empirical estimates.) To emphasize, we use ~ and ^ to distinguish the empirical quantities obtained by these two independent queries. By construction, we have 𝐠n,μ0. It is clear that this direction gn is unbiased, i.e. 𝔼[gn]=hn(xn). Moreover, it is extremely sparse and can be computed using O(1) sample, computational, and memory complexities.

Let yN*=(𝐯π^N,𝝁*). We bound the regret by the following rearrangement.

RegretN(yN*) =n=1Nln(xn)-n=1Nln(yN*)
=n=1Nhn(xn)-n=1Nhn(yN*)
=n=1Nhn(xn)(xn-yN*)
=(n=1N(hn(xn)-gn)xn)+(n=1Ngn(xn-yN*))+(n=1N(gn-hn(xn))yN*)
(n=1N(hn(xn)-gn)xn)+(maxx𝒳n=1Ngn(xn-x))+(n=1N(gn-hn(xn))yN*), (30)

where the third equality comes from hn being linear. We recognize the first term is a martingale MN=n=1N(hn(xn)-gn)xn, because xn does not depend on gn. Therefore, we can appeal to standard martingale concentration property. For the second term, it can be upper bounded by standard regret analysis of mirror descent, by treating gnx as the per-round loss. For the third term, because yN*=(𝐯π^N,𝝁*) depends on {gn}n=1N, it is not a martingale. Nonetheless, we will be able to handle it through a union bound. Below, we give details for bounding these three terms.

D.1 The First Term: Martingale Concentration

For the first term, n=1N(hn(xn)-gn)xn, we use a martingale concentration property. Specifically, we adopt a Bernstein-type inequality [27, Theorem 3.15]:

Lemma 6.

[27, Theorem 3.15] Let M0,,MN be a martingale and let F0F1Fn be the filtration such that Mn=E|Fn[Mn+1]. Suppose there are b,σ< such that for all n, given Fn-1, Mn-Mn-1b, and V|Fn-1[Mn-Mn-1]σ2 almost surely. Then for any ϵ0,

P(MN-M0ϵ)exp(-ϵ22Nσ2(1+bϵ3Nσ2)).

creftype 6 implies, with probability at least 1-δ,

MN-M02Nσ2(1+o(1))log(1δ),

where o(1) means convergence to 0 as N.

To apply creftype 6, we need to provide bounds on the properties of the martingale difference:

Mn-Mn-1 =(hn(xn)-gn)xn
=(κ𝟏-𝐚𝐯n-𝐠n,μ)𝝁n+(𝐛𝝁n-𝐠n,v)𝐯n.

For the first term (κ𝟏-𝐚𝐯n-𝐠n,μ)𝝁n, we use the lemma below:

Lemma 7.

Let 𝛍M be arbitrary, chosen independently from the randomness of gn,μ when Fn-1 is given. Then it holds |(κ1-avn-gn,μ)𝛍|2(1+K)1-γ and V|Fn-1[(κ1-avn-gn,μ)𝛍]4K(1-γ)2.

Proof.

By triangular inequality,

|(κ𝟏-𝐚𝐯n-𝐠n,μ)𝝁||(κ𝟏-𝐚𝐯n)𝝁|+|𝐠n,μ𝝁|.

For the deterministic part, using creftype 5 and Hölder’s inequality,

|(κ𝟏-𝐚𝐯n)𝝁|κ+𝐚𝐯n𝝁121-γ.

For the stochastic part, let in be index of the sampled state-action pair and jn be the index of the transited state sampled at the pair given by in. With abuse of notation, we will use in to index both SS×𝒜 and SS. With this notation, we may derive

|𝐠n,μ𝝁| =|K𝝁(κ𝟏^n-𝐫^n-11-γ(γ𝐏^n-𝐄^n)𝐯n)|
=Kμin|κ-rin-γvn,jn-vn,in1-γ|
2Kμin1-γ2K1-γ

where we use the facts that rin,vn,jn,vn,in[0,1] and μin1.

For 𝕍|Fn-1[(κ𝟏-𝐚𝐯n-𝐠n,μ)𝝁n], we can write it as

𝕍|Fn-1[(κ𝟏-𝐚𝐯n-𝐠n,μ)𝝁] =𝕍|Fn-1[𝐠n,μ𝝁]
𝔼|Fn-1[|𝐠n,μ𝝁n|2]
=in1K𝔼jn|in[K2μin2(κ-rin-γvn,jn-vn,in1-γ)2]
4K(1-γ)2inμin2
4K(1-γ)2(inμin)24K(1-γ)2

where in the second inequality we use the fact that |κ-rin-γvn,jn-vn,in1-γ|21-γ. ∎

For the second term (𝐛𝝁n-𝐠n,v)𝐯n, we use the following lemma.

Lemma 8.

Let vV be arbitrary, chosen independently from the randomness of gn,v when Fn-1 is given.. Then it holds that |(b𝛍n-gn,v)v|41-γ and V|Fn-1[(b𝛍n-gn,v)v]4(1-γ)2.

Proof.

We appeal to creftype 5, which shows 𝐛𝝁n1,𝐠n,v121-γ, and derive

|(𝐛𝝁n-𝐠n,v)𝐯|(𝐛𝝁n1+𝐠n,v1)𝐯41-γ.

Similarly, for the variance, we can write

𝕍|Fn-1[(𝐛𝝁n-𝐠n,v)𝐯] =𝕍|Fn-1[𝐠n,v𝐯]𝔼|Fn-1[(𝐠n,v𝐯)2]4(1-γ)2.

Thus, with helps from the two lemmas above, we are able to show

Mn-Mn-1|(κ𝟏-𝐚𝐯n-𝐠n,μ)𝝁n|+|(𝐛𝝁n-𝐠n,v)𝐯n|4+2(1+K)1-γ

as well as (because 𝐠n,μ and 𝐠n,b are computed using independent samples)

𝕍|Fn-1[Mn-Mn-1] 𝔼|Fn-1[|(κ𝟏-𝐚𝐯n-𝐠n,μ)𝝁n|2]+𝔼|Fn-1[|(𝐛𝝁n-𝐠n,v)𝐯n|2]4(1+K)(1-γ)2

Now, since M0=0, by martingale concentration in creftype 6, we have

P(n=1N(hn(xn)-gn)xn>ϵ)exp(-ϵ22Nσ2(1+bϵ3Nσ2))

with b=6+2K1-γ and σ2=4(1+K)(1-γ)2. This implies that, with probability at least 1-δ, it holds

n=1N(hn(xn)-gn)xnN8(1+K)(1-γ)2(1+o(1))log(1δ)=O~(NKlog(1δ)1-γ)

D.2 Static Regret of Mirror Descent

Next we move onto deriving the regret bound of mirror descent with respect to the online loss sequence:

maxx𝒳n=1Ngn(xn-x)

This part is quite standard; nonetheless, we provide complete derivations below.

We first recall a basic property of mirror descent

Lemma 9.

Let X be a convex set. Suppose R is 1-strongly convex with respect to some norm . Let g be an arbitrary vector and define, for xX,

y=argminx𝒳g,x+BR(x||x)

Then for all zX,

g,y-z BR(z||x)-BR(z||y)-BR(y||x) (31)
Proof.

Recall the definition BR(x||x)=R(x)-R(x)-R(x),x-x. The optimality of the proximal map can be written as

g+R(y)-R(x),y-z0,z𝒳.

By rearranging the terms, we can rewrite the above inequality in terms of Bregman divergences as follows and derive the first inequality (31):

g,y-z R(x)-R(y),y-z
=BR(z||x)-BR(z||y)+R(x)-R(y),y-R(x),x+R(y),y+R(x)-R(y)
=BR(z||x)-BR(z||y)+R(x),y-x+R(x)-R(y)
=BR(z||x)-BR(z||y)-BR(y||x),

which concludes the lemma. ∎

Let x𝒳 be arbitrary. Applying this lemma to the nth iteration of mirror descent in (21), we get

gn,xn+1-x 1η(BR(x||xn)-BR(x||xn+1)-BR(xn+1||xn))

By a telescoping sum, we then have

n=1Ngn,xn-x 1ηBR(x||x1)+n=1Ngn,xn+1-xn-1ηBR(xn+1||xn).

We bound the right-hand side as follows. Recall that based on the geometry of 𝒳=𝒱×, we considered a natural Bregman divergence of the form:

BR(x||x)=12|SS|𝐯-𝐯22+KL(𝝁||𝝁)

Let x1=(𝐯1,𝝁1) where 𝝁1 is uniform. By this choice, we have:

1ηBR(x||x1)1ηmaxx𝒳BR(x||x1)1η(12+log(K)).

We now decompose each item in the above sum as:

gn,xn+1-xn-1ηBR(xn+1||xn)= (𝐠n,v(𝐯n+1-𝐯n)-12η|SS|𝐯n-𝐯n+122)
+(𝐠n,μ(𝝁n+1-𝝁n)-1ηKL(𝝁n+1||𝝁n))

and we upper bound them using the two lemmas below (recall 𝐠n,μ0 due to the added κ𝟏 term).

Lemma 10.

For any vector x,y,g and scalar η>0, it holds g,x-y-12ηx-y22ηg222.

Proof.

By Cauchy-Swartz inequality, g,x-y-12ηx-y22g2x-y2-12ηx-y22ηg222. ∎

Lemma 11.

Suppose BR(x||y)=KL(x||y) and x,y are probability distributions, and g0 element-wise. Then, for η>0,

-1ηBR(y||x)+g,x-yη2ixi(gi)2=η2gx2.
Proof.

Let Δ denotes the unit simplex.

-BR(y||x)+ηg,x-y ηg,x+maxyΔ-ηg,y-BR(y||x)
=ηg,x+log(ixiexp(-ηgi)) ( convex conjugate of KL divergence)
ηg,x+log(ixi(1-ηgi+12(ηgi)2)) ( ex1+x+12x2 for x0)
=ηg,x+log(1+ixi(-ηgi+12(ηgi)2))
ηg,x+ixi(-ηgi+12(ηgi)2) ( log(x)x-1)
=12ixi(ηgi)2=η22gx2.

Finally, dividing both sides by η, we get the desired result. ∎

Thus, we have the upper bound gn,xn+1-xn-1ηBR(xn+1||xn)=η|SS|𝐠n,v222+η𝐠n,μ𝝁n22. Together with the upper bound on 1ηBR(x||x1), it implies that

n=1Ngn,xn-x 1ηBR(x||x1)+n=1Ngn,xn+1-xn-1ηBR(xn+1||xn)
1η(12+log(K))+η2n=1N|SS|𝐠n,v22+𝐠n,μ𝝁n2. (32)

We can expect, with high probability, n=1N|SS|𝐠n,v22+𝐠n,μ𝝁n2 concentrates toward its expectation, i.e.

n=1N|SS|𝐠n,v22+𝐠n,μ𝝁n2n=1N𝔼[|SS|𝐠n,v22+𝐠n,μ𝝁n2]+o(N).

Below we quantify this relationship using martingale concentration. First we bound the expectation.

Lemma 12.

𝔼[𝐠n,v22]4(1-γ)2 and E[gn,μ𝛍n2]4K(1-γ)2.

Proof.

For the first statement, using the fact that 21 and creftype 5, we can write

𝔼[𝐠n,v22]𝔼[𝐠n,v12]=𝔼[𝐩~n+11-γ(γ𝐏~n-𝐄n)𝝁~n12]4(1-γ)2.

For the second statement, let in be the index of the sampled state-action pair and jn be the index of the transited-to state sampled at the pair given by in. With abuse of notation, we will use in to index both SS×𝒜 and SS.

𝔼[𝐠n,μ𝝁n2] =𝔼[in1K𝔼jn|in[K2μin(κ-rin-γvn,jn-vn,in1-γ)2]]
4K(1-γ)2𝔼[inμin]4K(1-γ)2.

To bound the tail, we resort to the Höffding-Azuma inequality of martingale [27, Theorem 3.14].

Lemma 13 (Azuma-Hoeffding).

Let M0,,MN be a martingale and let F0F1Fn be the filtration such that Mn=E|Fn[Mn+1]. Suppose there exists b< such that for all n, given Fn-1, |Mn-Mn-1|b. Then for any ϵ0,

P(MN-M0ϵ)exp(-2ϵ2Nb2)

To apply creftype 13, we consider the martingale

MN=n=1N|SS|𝐠n,v22+𝐠n,μ𝝁n2-(n=1N𝔼[|SS|𝐠n,v22+𝐠n,μ𝝁n2])

To bound the change of the size of martingale difference |Mn-Mn-1|, we follow similar steps to creftype 12.

Lemma 14.

𝐠n,v224(1-γ)2 and gn,μ𝛍n24K2(1-γ)2.

Note 𝐠n,μ𝝁2 is K-factor larger than 𝔼[𝐠n,μ𝝁2]) and K1. Therefore, we have

|Mn-Mn-1||SS|𝐠n,v22+𝐠n,μ𝝁n2+|SS|𝔼[𝐠n,v22]+𝔼[𝐠n,μ𝝁n2]8(|SS|+K2)(1-γ)2

Combining these results, we have, with probability as least 1-δ,

n=1N|SS|𝐠n,v22+𝐠n,μ𝝁n2 n=1N𝔼[|SS|𝐠n,v22+𝐠n,μ𝝁n2]+42(|SS|+K2)(1-γ)2Nlog(1δ)
4(K+|SS|)(1-γ)2N+42(|SS|+K2)(1-γ)2Nlog(1δ)

Now we suppose we set η=1-γKN. From (E.5), we then have

n=1Ngn,xn-x 1η(12+log(K))+η2n=1N|SS|𝐠n,v22+𝐠n,μ𝝁n2
KN1-γ(12+log(K))+1-γKN(2(K+|SS|)(1-γ)2N+22(|SS|+K2)(1-γ)2Nlog(1δ))
O~(KN1-γ+K3log1δ1-γ).

D.3 Union Bound

Lastly, we provide an upper bound on the last component:

n=1N(gn-hn(xn))yN*.

Because yN* depends on gn, this term does not obey martingale concentration like the first component n=1N(hn(xn)-gn)xn which we analyzed in creftype D.1 To resolve this issue, we utilize the concept of covering number and derive a union bound.

Recall for a compact set 𝒵 in a norm space, the covering number 𝒩(𝒵,ϵ) with ϵ>0 is the minimal number of ϵ-balls that covers 𝒵. That is, there is a set {zi𝒵}i=1𝒩(𝒵,ϵ) such that maxz𝒵minzB(𝒵,ϵ)z-zϵ. Usually the covering number 𝒩(𝒵,ϵ) is polynomial in ϵ and perhaps exponential in the ambient dimension of 𝒵.

The idea of covering number can be used to provide a union bound of concentration over compact sets, which we summarize as a lemma below.

Lemma 15.

Let f,g be two random L-Lipschitz functions. Suppose for some a>0 and some fixed zZ selected independently of f,g, it holds

P(|f(z)-g(z)|>ϵ)exp(-aϵ2)

Then it holds that

P(supz𝒵|f(z)-g(z)|>ϵ)𝒩(𝒵,ϵ4L)exp(-aϵ24)
Proof.

Let 𝒞 denote a set of covers of size 𝒩(𝒵,ϵ4L) Then, for any z𝒵 which could depend on f,g,

|f(z)-g(z)| minz𝒞|f(z)-f(z)|+|f(z)-g(z)|+|g(z)-g(z)|
minz𝒞2Lz-z+|f(z)-g(z)]|
ϵ2+maxz𝒞|f(z)-g(z)|

Thus, supz𝒵|f(z)-g(z)|>ϵmaxz𝒞|f(z)-g(z)|>ϵ2. Therefore, we have the union bound.

P(supz𝒵|f(z)-𝔼[f(z)]|>ϵ) 𝒩(𝒵,ϵ4L)exp(-aϵ24).

We now use creftype 15 to bound the component n=1N(gn-hn(xn))yN*. We recall by definition, for x=(𝐯,𝝁),

(hn(xn)-gn)x=(κ𝟏-𝐚𝐯n-𝐠n,μ)𝝁+(𝐛𝝁n-𝐠n,v)𝐯

Because yN*=(𝐯π^N,𝝁*), we can write the sum of interest as

n=1N(gn-hn(xn))yN* =n=1N(𝐠n,μ-κ𝟏+𝐚𝐯n)𝝁*+n=1N(𝐠n,v-𝐛𝝁n)𝐯π^N

For the first term, because 𝝁* is set beforehand by the MDP definition and does not depend on the randomness during learning, it is a martingale and we can apply the steps in creftype D.1 to show,

n=1N(𝐠n,μ-κ𝟏+𝐚𝐯n)𝝁*=O~(NKlog(1δ)1-γ)

For the second term, because 𝐯π^N depends on the randomness in the learning process, we need to use a union bound. Following the steps in creftype D.1, we see that for some fixed 𝐯𝒱, it holds

P(|n=1N(𝐠n,v-𝐛𝝁n)𝐯|>ϵ)exp(-(1-γ)2Nϵ2)

where some constants were ignored for the sake of conciseness. Note also that it does not have the K factor because of creftype 8. To apply creftype 15, we need to know the order of covering number of 𝒱. Since 𝒱 is an |SS|-dimensional unit cube in the positive orthant, it is straightforward to show 𝒩(𝒱,ϵ)max(1,(1/ϵ)|SS|) (by simply discretizing evenly in each dimension). Moreover, the functions n=1N𝐠n,v𝐯 and n=1N𝐛𝝁n𝐯 are N1-γ-Lipschitz in .

Applying creftype 15 then gives us:

P(supv𝒱|n=1N(𝐠n,v-𝐛𝝁n)𝐯|>ϵ)𝒩(𝒱,ϵ(1-γ)4N)exp(-(1-γ)24Nϵ2).

For a given δ, we thus want to find the smallest ϵ such that:

δ 𝒩(𝒱,ϵ(1-γ)4N)exp(-(1-γ)24Nϵ2).

That is:

log(1δ) (1-γ)24Nϵ2+|SS|min(0,log(ϵ(1-γ)4N)).

Picking ϵ=O(log(N)Nlog(1δ)1-γ)=O~(Nlog(1δ)1-γ) guarantees that the inequality is verified asymptotically.

Combining these two steps, we have shown overall, with probability at least 1-δ,

n=1N(gn-hn(xn))yN*=O~(NKlog(1δ)1-γ).

D.4 Summary

In the previous subsections, we have provided high probability upper bounds for each term in the decomposition

RegretN(yN*)(n=1N(hn(xn)-gn)xn)+(maxx𝒳n=1Ngn(xn-x))+(n=1N(gn-hn(xn))yN*)

implying with probability at least 1-δ,

RegretN(yN*)O~(NKlog(1δ)1-γ)+O~(KN1-γ+K3log1δ1-γ)=O~(N|SS||𝒜|log(1δ)1-γ)

By creftype 1, this would imply with probability at least 1-δ,

Vπ^N(p)V*(p)-RegretN(yN*)NV*(p)-O~(|SS||𝒜|log(1δ)(1-γ)N)

In other words, the sample complexity of mirror descent to obtain an ϵ approximately optimal policy (i.e. V*(p)-Vπ^N(p)ϵ) is at most O~(|SS||𝒜|log(1δ)(1-γ)2ϵ2).

Appendix E Sample Complexity of Mirror Descent with Basis Functions

Here we provide further discussions on the sample complexity of running creftype 5 with linearly parameterized function approximators and the proof of creftype 3. See 3

E.1 Setup

We suppose that the decision variable is parameterized in the form xθ=(𝚽𝜽v,𝚿𝜽μ), where 𝚽,𝚿 are given nonlinear basis functions and (𝜽v,𝜽μ)Θ are the parameters to learn. For modeling the value function, we suppose each column in 𝚽 is a vector (i.e. function) such that its is less than one. For modeling the state-action distribution, we suppose each column in 𝚿 is a state-action distribution from which we can draw samples. This choice implies that every column of 𝚽 belongs to 𝒱, and every column of 𝚿 belongs to .

Considering the geometry of 𝚽 and 𝚿, we consider a compact and convex parameter set

Θ={θ=(𝜽v,𝜽μ):𝜽v2Cvdim(𝜽v),𝜽μ0,𝜽μ11}

where Cv<. The constant Cv acts as a regularization in learning: if it is too small, the bias (captured as ϵΘ,N in creftype 1 restated below) becomes larger; if it is too large, the learning becomes slower.

This choice of Θ makes sure, for θ=(𝜽v,𝜽μ)Θ, 𝚿𝜽μ and 𝚽𝜽v𝜽v1Cv. Therefore, by the above construction, we can verify that the requirement in creftype 1 is satisfied, i.e. for θ=(𝜽v,𝜽μ)Θ, we have (𝚽𝜽v,𝚿μ𝜽μ)𝒳Θ. See 1

E.2 Online Loss and Sampled Gradient

Let θ=(𝜽v,𝜽μ)Θ. In view of the parameterization above, we can identify the online loss in (23) in the parameter space as

hn(θ)𝐛𝝁n𝚽𝜽v+𝜽μ𝚿(11-γ𝟏-𝐚𝐯n) (33)

where we have the natural identification xn=(𝐯n,𝝁n)=(𝚽𝜽v,n,𝚿𝜽μ,n) and θn=(𝜽v,n,𝜽μ,n)Θ is the decision made by the online learner in the nth round. Note that because this extension of creftype 5 makes sure 𝜽μ,n1=1 for every iteration, we can still use hn. For writing convenience, we will continue to overload hn as a function of parameter θ in the following analyses.

Mirror descent requires gradient estimates of hn(θn). Here we construct an unbiased stochastic estimate of hn(θn) as

gn=[𝐠n,v𝐠n,μ]=[𝚽(𝐩~n+11-γ(γ𝐏~n-𝐄n)𝝁~n)dim(𝜽μ)𝚿^n(11-γ𝟏^n-𝐫^n-11-γ(γ𝐏^n-𝐄^n)𝐯n)] (34)

using two calls of the generative model (again we overload the symbol gn for the analyses in this section):

  • The upper part 𝐠n,v is constructed similarly as before in (24): First we sample the initial state from the initial distribution, the state-action pair using the learned state-action distribution, and then the transited-to state at the sampled state-action pair. We evaluate 𝚽’s values at those samples to construct 𝐠n,v. Thus, 𝐠n,v would generally be a dense vector of size dim(𝜽v) (unless the columns of 𝚽 are sparse to begin with).

  • The lower part 𝐠n,μ is constructed slightly differently. Recall for the tabular version in (24), we uniformly sample over the state and action spaces. Here instead we first sample uniformly a column (i.e. a basis function) in 𝚿 and then sample a state-action pair according to the sampled column, which is a distribution by design. Therefore, the multiplier due to uniform sampling in the second row of (34) is now dim(𝜽μ) rather than |SS||𝒜| in (24). The matrix 𝚿^n is extremely sparse, where only the single sampled entry (the column and the state-action pair) is one and the others are zero. In fact, one can think of the tabular version as simply using basis functions 𝚿=𝐈, i.e. each column is a delta distribution. Under this identification, the expression in (34) matches the one in (24).

It is straightforward to verify that 𝔼[gn]=hn(θn) for gn in (34).

E.3 Proof of creftype 3

We follow the same steps of the analysis of the tabular version. We will highlight the differences/improvement due to using function approximations.

First, we use creftype 1 in place of creftype 1. To properly handle the randomness, we revisit its derivation to slightly tighten the statement, which was simplified for the sake of cleaner exposition. Define

yN,θ*=(𝐯N,θ*,𝝁θ*)argmaxxθ𝒳θrep(x^N;xθ).

For writing convenience, let us also denote θN*=(𝜽v,N*,𝜽μ*)Θ as the corresponding parameter of yN,θ*. We remark that 𝝁θ* (i.e. 𝜽μ*), which tries to approximate 𝝁*, is fixed before the learning process, whereas 𝐯N,θ* (i.e. 𝜽v,N*) could depend on the stochasticity in the learning. Using this new notation and the steps in the proof of creftype 1, we can write

V*(p)-Vπ^N(p)=rep(x^N;yN*)
=ϵΘ,N+rep(x^N;yN,θ*)ϵΘ,N+RegretN(yN,θ*)N

where the first equality is creftype 4, the last inequality follows the proof of creftype 1, and we recall the definition ϵΘ,N=rep(x^N;yN*)-rep(x^N;yN,θ*).

The rest of the proof is very similar to that of creftype 1, because linear parameterization does not change the convexity of the loss sequence. Let yN*=(𝐯π^N,𝝁*). We bound the regret by the following rearrangement.

RegretN(yN,θ*) =n=1Nln(xn)-n=1Nln(yN,θ*)
=n=1Nhn(θn)-n=1Nhn(θN*)
=n=1Nhn(θn)(θn-θN*)
=(n=1N(hn(θn)-gn)θn)+(n=1Ngn(θn-θN*))+(n=1N(gn-hn(θn))θN*)
(n=1N(hn(θn)-gn)θn)+(maxθΘn=1Ngn(θn-θ))+(n=1N(gn-hn(θn))θN*) (35)

where the second equality is due to the identifcation in (33).

We will solve this online learning problem with mirror descent

θn+1=argminθΘgn,θ+1ηBR(θ||θn) (36)

with step size η>0 and a Bregman divergence that is a straightforward extension of (22)

BR(θ||θ)=12dim(θv)Cv2𝜽v-𝜽v22+KL(𝜽μ||𝜽μ) (37)

where the constant dim(θv)Cv2 is chosen to make the size of Bregman divergence dimension-free (at least up to log factors). Below we analyze the size of the three terms in (E.3) like what we did for creftype 2.

E.4 The First Term: Martingale Concentration

The first term is a martingale. We will use this part to highlight the different properties due to using basis functions. The proof follows the steps in creftype D.1, but now the martingale difference of interest is instead

Mn-Mn-1 =(hn(θn)-gn)θn
=(𝚿(κ𝟏-𝐚𝐯n)-𝐠n,μ)𝜽μ,n+(𝚽n𝐛𝝁n-𝐠n,v)𝜽v,n

They now have nicer properties due to the way 𝐠n,μ is sampled.

For the first term (𝚿(κ𝟏-𝐚𝐯n)-𝐠n,μ)𝜽μ,n, we use the lemma below, where we recall the filtration Fn is naturally defined as {θ1,,θn}.

Lemma 16.

Let θ=(𝛉v,𝛉μ)Θ be arbitrary that is chosen independent of the randomness of gn,μ when Fn-1 is given. Then it holds |(κ1-avn-gn,μ)𝛉|2(1+dim(𝛉𝛍))1-γ and V|Fn-1[(κ1-avn-gn,μ)𝛉n]4dim(𝛉𝛍)(1-γ)2.

Proof.

By triangular inequality,

|(𝚿(κ𝟏-𝐚𝐯n)-𝐠n,μ)𝜽μ||(κ𝟏-𝐚𝐯n)𝚿𝜽μ|+|𝐠n,μ𝜽μ|

For the deterministic part, using creftype 5 and Hölder’s inequality,

|(κ𝟏-𝐚𝐯n)𝚿𝜽μ|κ+𝐚𝐯n𝚿𝜽μ121-γ

For the stochastic part, let kn denote the sampled column index, in be index of the sampled state-action pair using the column of kn, and jn be the index of the transited state sampled at the pair given by in. With abuse of notation, we will use in to index both SS×𝒜 and SS. Let 𝝁=𝚿μ𝜽μ. With this notation, we may derive

|𝐠n,μ𝜽μ| =|dim(𝜽μ)𝜽𝝁𝚿^n(κ𝟏^n-𝐫^n-11-γ(γ𝐏^n-𝐄^n)𝐯n)|
=dim(𝜽μ)θμ,kn|κ-rin-γvn,jn-vn,in1-γ|
2dim(𝜽μ)θμ,kn1-γ2dim(𝜽μ)1-γ

where we use the facts rin,vn,jn,vn,in[0,1] and θμ,kn1.

Let 𝝍μ(k) denote the kth column of 𝚿. For 𝕍|Fn-1[(κ𝟏-𝐚𝐯n-𝐠n,μ)𝜽n], we can write it as

𝕍|Fn-1[(𝚿(κ𝟏-𝐚𝐯n)-𝐠n,μ)𝜽μ] =𝕍|Fn-1[𝐠n,μ𝜽n]
𝔼|Fn-1[|𝐠n,μ𝜽n|2]
=kn1dim(𝜽μ)inψμ,in(kn)𝔼jn|in[dim(𝜽μ)2θμ,kn2(κ-rin-γvn,jn-vn,in1-γ)2]
4dim(𝜽μ)(1-γ)2knθμ,kn2inψμ,in(kn)
4dim(𝜽μ)(1-γ)2(knθμ,kn)24dim(𝜽μ)(1-γ)2

where in the second inequality we use the fact that |κ-rin-γvn,jn-vn,in1-γ|21-γ. ∎

For the second term (𝚽n𝐛𝝁n-𝐠n,v)𝜽v,n, we use this lemma.

Lemma 17.

Let 𝛉V be arbitrary that is chosen independent of the randomness of gn,v when Fn-1 is given. Then it holds |(Φnb𝛍n-gn,v)𝛉|4Cv1-γ and V|Fn-1[(Φnb𝛍n-gn,v)𝛉]4Cv2(1-γ)2.

Proof.

We appeal to creftype 5, which shows 𝐛𝝁n121-γ and

𝐩~n+11-γ(γ𝐏~n-𝐄n)𝝁~n121-γ

Therefore, overall we can derive

|(𝚽𝐛𝝁n-𝐠n,v)𝜽| (𝐛𝝁n1+𝐩~n+11-γ(γ𝐏~n-𝐄n)𝝁~n1)𝚽𝜽v4Cv1-γ

where we use again each column in 𝚽 has less than one, and 2. Similarly, for the variance, we can write

𝕍|Fn-1[(𝚽n𝐛𝝁n-𝐠n,v)𝜽] =𝕍|Fn-1[𝐠n,v𝜽]𝔼|Fn-1[(𝐠n,v𝜽)2]4Cv2(1-γ)2

From the above two lemmas, we see the main difference from the what we had in creftype D.1 for the tabular case is that, the martingale difference now scales in O(Cv+dim(𝜽𝝁)1-γ) instead of O(|SS||𝒜|1-γ), and its variance scales in O(Cv2+dim(𝜽𝝁)(1-γ)2) instead of O(|SS||𝒜|(1-γ)2). We note the constant Cv is universal, independent of the problem size.

Following the similar steps in creftype D.1, these new results imply that

P(n=1N(hn(θn)-gn)θn>ϵ)exp(-ϵ22Nσ2(1+bϵ3Nσ2))

with b=O(Cv+dim(𝜽𝝁)1-γ) and O(Cv2+dim(𝜽𝝁)(1-γ)2). This implies that, with probability at least 1-δ, it hold

n=1N(hn(θn)-gn)θn=O~(N(Cv2+dim(𝜽𝝁))log(1δ)1-γ)

E.5 Static Regret of Mirror Descent

Again the steps here are very similar to those in creftype D.2. We concern bounding the static regret.

maxθΘn=1Ngn(θn-θ)

From creftype D.2, we recall this can be achieved by the mirror descent’s optimality condition. The below inequality is true, for any θΘ:

n=1Ngn,θn-θ 1ηBR(θ||θ1)+n=1Ngn,θn+1-θn-1ηBR(θn+1||θn)

Based on our choice of Bregman divergence given in (37), i.e.

BR(θ||θ)=12dim(θv)Cv2𝜽v-𝜽v22+KL(𝜽μ||𝜽μ), (37)

we have 1ηBR(θ||θ1)O~(1)η. For each gn,θn+1-θn-1ηBR(θn+1||θn), we will use again the two basic lemmas we proved in creftype D.2. See 10 See 11 Thus, we have the upper bound

gn,θn+1-θn-1ηBR(θn+1||θn)=Cv2dim(θv)η𝐠n,v222+η𝐠n,μ𝜽μ,n22

Together with the upper bound on 1ηBR(x||x1), it implies that

n=1Ngn,xn-x 1ηBR(x||x1)+n=1Ngn,xn+1-xn-1ηBR(xn+1||xn)
O~(1)η+η2n=1NCv2dim(θv)𝐠n,v22+𝐠n,μ𝜽μ,n2 (38)

We can expect, with high probability, n=1NCv2dim(θv)𝐠n,v22+𝐠n,μ𝜽μ,n2 concentrates toward its expectation, i.e.

n=1NCv2dim(θv)𝐠n,v22+𝐠n,μ𝜽μ,n2n=1N𝔼[Cv2dim(θv)𝐠n,v22+𝐠n,μ𝜽μ,n2]+o(N)

To bound the right-hand side, we will use the upper bounds below, which largely follow the proof of creftype 16 and creftype 17.

Lemma 18.

𝔼[𝐠n,v22]4dim(𝜽v)(1-γ)2 and E[gn,μ𝛉μ,n2]4dim(𝛉μ)(1-γ)2.

Lemma 19.

𝐠n,v224dim(𝜽v)(1-γ)2 and gn,μ𝛉μ,n24dim(𝛉μ)2(1-γ)2.

By Azuma-Hoeffding’s inequality in creftype 13,

n=1NCv2dim(θv)𝐠n,v22+𝐠n,μ𝜽𝝁,n2 n=1N𝔼[Cv2dim(θv)𝐠n,v22+𝐠n,μ𝜽𝝁,n2]+O(Cv2+dim(𝜽μ)2(1-γ)2Nlog(1δ))
O(Cv2+dim(𝜽μ)(1-γ)2N)+O(Cv2+dim(𝜽μ)2(1-γ)2Nlog(1δ))

Now we suppose we set η=O(1-γN(Cv2+dim(𝜽μ))). We have

n=1Ngn,θn-θ O~(1)η+η2n=1NCv2dim(θv)𝐠n,v22+𝐠n,μ𝜽μ,n2O~((Cv2+dim(𝜽μ))N1-γ)

E.5.1 Union Bound

Lastly we use an union bound to handle the term

n=1N(gn-hn(θn))θN*

We follow the steps in creftype D.3: we will use again the fact that θN*=(𝜽v,N*,𝜽μ*)Θ, so we can handle the part with 𝜽μ* using the standard martingale concentration, and the part with 𝜽v,N* using the union bound.

Using the previous analyses, we see can first show that the martingale due to the part 𝜽μ* concentrates in O~(Ndim(𝜽𝝁)log(1δ)1-γ). Likewise, using the union bound, we can show the martingale due to the part 𝜽v,N* concentrates in O~(NCv2log(𝒩δ)1-γ) where 𝒩 some proper the covering number of the set {𝜽v:𝜽v2Cvdim(𝜽v)}. Because log𝒩=O(dim(𝜽v)) for an Euclidean ball. We can combine the two bounds and show together

n=1N(gn-hn(θn))θN*=O~(N(Cv2dim(𝜽v)+dim(𝜽𝝁))log(1δ)1-γ)

E.5.2 Summary

Combining the results of the three parts above, we have, with probability 1-δ,

RegretN(yN,θ*)
(n=1N(hn(θn)-gn)θn)+(maxθΘn=1Ngn(θn-θ))+(n=1N(gn-hn(θn))θN*)
=O~(N(dim(𝜽𝝁)+Cv2)log(1δ)1-γ)+O~((Cv2+dim(𝜽μ))N1-γ)+O~(N(Cv2dim(𝜽v)+dim(𝜽𝝁))log(1δ)1-γ)
=O~(Ndim(Θ)log(1δ)1-γ)

where the last step is due to Cv is a universal constant. Or equivalently, the above bounds means a sample complexity in O~(dim(Θ)log(1δ)(1-γ)2ϵ2). Finally, we recall the policy performance has a bias ϵΘ,N in creftype 1 due to using function approximators. Considering this effect, we have the final statement.