Provably Efficient Reinforcement Learning with Linear Function Approximation

  • 2019-08-08 07:23:12
  • Chi Jin, Zhuoran Yang, Zhaoran Wang, Michael I. Jordan
  • 0

Abstract

Modern Reinforcement Learning (RL) is commonly applied to practical problemswith an enormous number of states, where function approximation must bedeployed to approximate either the value function or the policy. Theintroduction of function approximation raises a fundamental set of challengesinvolving computational and statistical efficiency, especially given the needto manage the exploration/exploitation tradeoff. As a result, a core RLquestion remains open: how can we design provably efficient RL algorithms thatincorporate function approximation? This question persists even in a basicsetting with linear dynamics and linear rewards, for which only linear functionapproximation is needed. This paper presents the first provable RL algorithm with both polynomialruntime and polynomial sample complexity in this linear setting, withoutrequiring a "simulator" or additional assumptions. Concretely, we prove that anoptimistic modification of Least-Squares Value Iteration (LSVI)---a classicalalgorithm frequently studied in the linear setting---achieves$\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambientdimension of feature space, $H$ is the length of each episode, and $T$ is thetotal number of steps. Importantly, such regret is independent of the number ofstates and actions.

 

Quick Read (beta)

Provably Efficient Reinforcement Learning with Linear Function Approximation

Chi Jin
University of California, Berkeley
[email protected]
   Zhuoran Yang
Princeton University
[email protected]
   Zhaoran Wang
Northwestern University
[email protected]
   Michael I. Jordan
University of California, Berkeley
[email protected]
Abstract

Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed.

This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a “simulator” or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)—a classical algorithm frequently studied in the linear setting—achieves 𝒪~(d3H3T) regret, where d is the ambient dimension of feature space, H is the length of each episode, and T is the total number of steps. Importantly, such regret is independent of the number of states and actions.

1 Introduction

Reinforcement Learning (RL) is a control-theoretic problem in which an agent tries to maximize its expected cumulative reward by interacting with an unknown environment over time [41]. Modern RL commonly engages practical problems with an enormous number of states, where function approximation must be deployed to approximate the (action-)value function—the expected cumulative reward starting from a state-action pair—or the policy—the mapping from a state to its subsequent action. Function approximation, especially based on deep neural networks, lies at the heart of the recent practical successes of RL in domains such as Atari games [30], Go [38], robotics [23], and dialogue systems [27]. Moreover, deep neural networks serve as essential components of generic deep RL algorithms, including Deep Q-Network (DQN) [30], Asynchronous Advantage Actor-Critic (A3C) [31], and Trust Region Policy Optimization (TRPO) [36].

Despite the empirical successes of function approximation in RL, most existing theoretical guarantees apply only to tabular RL [see, e.g., 20, 33, 8, 22], in which the states and actions are discrete, and the value function is represented by a table. Due to the curse of dimensionality, only relatively small problems can be tackled by tabular RL. Thus, researchers have turned to function approximation [see, e.g., 40, 12, 43], in theory and in practice. While function approximation greatly expands the potential reach of RL, particularly via deep RL architectures, it raises a number of fundamental theoretical challenges. For example, while the effective state and action spaces can be much larger when function approximation is used, the neighborhoods of most states are not visited even once during a set of learning episodes, which makes it difficult to obtain reliable estimates of value functions [see, e.g., 41, 42, 26]. To cope with this challenge, relatively simple function classes, including linear function classes, are often used. This introduces, however, a bias, even in the limit of infinite training data, given that the optimal value function and policy may not be linear [see, e.g., 10, 11, 43]. Thus, both in theory and in practice, the design of RL systems must cope with fundamental statistical problems of sparsity and misspecification, all in the context of a dynamical system. Moreover, a core distinguishing feature of RL is that it requires addressing the tradeoff between exploration and exploitation. Addressing this tradeoff algorithmically requires exactly the kinds of statistical estimates that are challenging to obtain in the RL setting due to sparsity, misspecification, and dynamics. Thus the following fundamental question remains open:

Is it possible to design provably efficient RL algorithms in the function approximation setting?

By “efficient” we mean efficient in both runtime and sample complexity—the runtime and the sample complexity should not depend on the number of states, but should depend instead on an intrinsic complexity measure of the function class.

Several recent attempts have been made to attack this fundamental problem. However, they either require the access to a “simulator” [49] which alleviates the difficulty of exploration, or assume the transition dynamics to be deterministic [47, 48], to have a low variance [19], or are parametrizable by a relatively small matrix [50], which alleviates the difficulty in estimating the transition dynamics (see Section 1.1 for more details).

Focusing on a linear setting in which the transition dynamics and reward function are assumed to be linear, we present the first algorithm that is provably efficient in both runtime and sample complexity, without requiring additional oracles or stronger assumptions. Concretely, in the general setting of an episodic Markov Decision Process (MDP), we prove that an optimistic version of Least-Squares Value Iteration (LSVI) [12, 33]—a classical algorithm frequently studied in the linear setting—achieves 𝒪~(d3H3T) regret, where d is the ambient dimension of feature space, H is the length of each episode, T is the total number of steps, and 𝒪~() hides only absolute constant and poly-logarithmic factors. Importantly, such regret is independent of S and A—the number of states and actions. Our algorithm runs in 𝒪(d2AKT) time and 𝒪(d2H+dAT) space, which are again independent of S and thus efficient in practice. In addition, our result is robust to the linear assumption: When the underlying transition model is not linear, but ζ-close to linear in total variation distance (Assumption B), our algorithm achieves 𝒪~(d3H3T+ζdHT) regret. That is, in addition to the standard T regret, the algorithm also suffers from a linear regret term that scales with an error ζ that arises due to the function class misspecification.

1.1 Related Work

Tabular RL:

Tabular RL is well studied in both model-based [20, 33, 8, 17] and model-free settings [39, 22]. See also [24, 6, 7, 25, 37, 45] for a simplified setting with access to a “simulator” (also called a generative model), which is a strong oracle that allows the algorithm to query arbitrary state-action pairs and return the reward and the next state. The “simulator” significantly alleviates the difficulty of exploration, since a naive exploration strategy which queries all state-action pairs uniformly at random already leads to the most efficient algorithm for finding an optimal policy [7].

In the episodic setting with nonstationary dynamics and no “simulators,” the best regrets achieved by existing model-based and model-free algorithms are 𝒪~(H2SAT) [8] and 𝒪~(H3SAT) [22], respectively, both of which (nearly) attain the minimax lower bound Ω(H2SAT) [20, 32, 22]. Here S and A denote the numbers of states and actions, respectively. Although these algorithms are (nearly) minimax-optimal, they can not cope with large state spaces, as their regret scales linearly in S, where S is often exponentially large in practice [see, e.g., 30, 38, 23, 27]. Moreover, the minimax lower bound suggests that, information-theoretically, a large state space cannot be handled efficiently unless further problem-specific structure is exploited. Compared with this line of work, in the current paper we exploit the linear structure of the reward and transition functions and show that the regret of optimistic LSVI scales polynomially in the ambient dimension d rather than the number of states S.

Linear bandits:

To enable function approximation, another line of related work studies stochastic linear bandits or stochastic linear contextual bandits [see, e.g., 5, 16, 28, 35, 14, 2], which is a special case of the linear MDP studied in this paper (Assumption A) with the episode length H set equal to one. See [13, 26] and the references therein for a detailed survey. The best regrets achieved by existing algorithms are 𝒪~(dT) for linear bandits [2] and 𝒪~(dT) for linear contextual bandits [5, 14], both of which scale polynomially in the ambient dimension d. We note, however, that while an MDP has state transition, linear bandits do not. This temporal structure captures the fundamental difference in their difficulties of exploration: a naive adaptation of existing linear bandit algorithms to the linear MDP setting yields a regret exponential in H—the length of each episode.

RL with function approximation:

In the setting of linear function approximation, there is a long line of classical work on the design of algorithms, but this work does not provide polynomial sample efficiency guarantees [see, e.g., 12, 29, 41, 33, 9]. Recently, Yang and Wang [49] revisited the setting of linear transitions and rewards [12, 29] (Assumption A), and presented a sample-efficient algorithm assuming the access to a “simulator”. Similar to the case of tabular setting, the “simulator” greatly alleviates the difficulty of exploration. We also note that their very recent work [50], developed independently of the current paper, provides sample efficiency guarantees for exploration in the linear MDP setting. Compared with the current paper, [50] differs in that requires one additional key assumption—that the transition model can be parameterized by a relatively small matrix. This additional assumption reduces the number of free parameters in the transition model from potentially being infinite (for the case with an infinite number of states) to small and finite, and thus mitigates the challenges in estimating the transition model. As a result, their algorithm and main mechanism are based on estimating the unknown matrix, which differs from our approach. Finally, in a broader context, without the assumption of a linear MDP, sample efficiency guarantees have been established for RL under other assumptions, such as that the transition dynamics are fully deterministic [47, 48], or have low variances [19]. These assumptions can be potentially restrictive in practice, and may not hold even in the tabular setting. In contrast, our results directly cover the standard tabular case with no extra assumptions.

In the setting of general function approximation, Jiang et al. [21] present a generic algorithm Olive, which enjoys sample efficiency if a complexity measure that they refer to as “Bellman rank” is small. It can be shown that Bellman rank is at most d under Assumption A, and thus Olive is sample efficient in our setting. In contrast to our results, Olive is not computationally efficient in general and it does not provide a T regret bound. Meanwhile, a recent line of work [51, 46] studies a nonparametric setting with Hölder smooth reward and transition model. The sample complexities provided therein are exponential in dimensionality in the worst case.

2 Preliminaries

We consider the setting of an episodic Markov decision process, denoted by MDP(𝒮,𝒜,H,,r), where 𝒮 and 𝒜 are the sets of possible states and actions, respectively, H+ is the length of each episode, ={h}h=1H and r={rh}h=1H are the state transition probability measures and the reward functions, respectively. We assume that 𝒮 is a measurable space with possibly infinite number of elements and 𝒜 is a finite set with cardinality A. Moreover, for each h[H], h(|x,a) denotes the transition kernel over the next states if action a is taken for state x at step h[H], and rh:𝒮×𝒜[0,1] is the deterministic reward function at step h.11 1 While we study deterministic reward functions for notational simplicity, our results readily generalize to random reward functions. Also, we assume the reward lies in [0,1] without loss of generality.

An agent interacts with this episodic MDP as follows. In each episode, an initial state x1 is picked arbitrarily by an adversary. Then, at each step h[H], the agent observes the state xh𝒮, picks an action ah𝒜, and receives a reward rh(xh,ah). Moreover, the MDP evolves into a new state xh+1 that is drawn from the probability measure h(|xh,ah). The episode terminates when xH+1 is reached. We note that the agent cannot take an action at xH+1 and hence receives no reward.

A policy π of an agent is a function π:𝒮×[H]𝒜, where π(x,h) is the action that the agent takes at state x and at the hth step in the episode. Moreover, for each h[H], we define the value function Vhπ:𝒮 as the expected value of cumulative rewards received under policy π when starting from an arbitrary state at the hth step. Specifically, we have

Vhπ(x):=𝔼[h=hHrh(xh,π(xh,h))|xh=x],x𝒮,h[H].

Accordingly, we also define the action-value function Qhπ:𝒮×𝒜 which gives the expected value of cumulative rewards when the agent starts from an arbitrary state-action pair at the h-th step and follows policy π afterwards; that is,

Qhπ(x,a):=rh(x,a)+𝔼[h=h+1Hrh(xh,π(xh,h))|xh=x,ah=a],(x,a)𝒮×𝒜,h[H].

Since the action spaces and the episode length are both finite, there always exists an optimal policy π which gives the optimal value Vh(x)=supπVhπ(x) for all x𝒮 and h[H] [see, e.g., 34]). To simplify the notation, we denote [hVh+1](x,a):=𝔼xh(|x,a)Vh+1(x). Using this notation, the Bellman equation associated with a policy π becomes

Qhπ(x,a)=(rh+hVh+1π)(x,a),Vhπ(x)=Qhπ(x,πh(x)),VH+1π(x)=0, (1)

which holds for all (x,a)𝒮×𝒜. Similarly, the Bellman optimality equation is

Qh(x,a)=(rh+hVh+1)(x,a),Vh(x)=maxa𝒜Qh(x,a),VH+1(x)=0. (2)

This implies that the optimal policy π is the greedy policy with respect to the optimal action-value function {Qh}h[H]. Thus, to find the optimal policy π, it suffices to estimate the optimal action-value functions.

Furthermore, under the setting of an episodic MDP, the agent aims to learn the optimal policy by interacting with the environment during a set of episodes. For each k1, at the beginning of the kth episode, the adversary picks the initial state x1k and the agent chooses policy πk. The difference in values between V1πk(x1k) and V1(x1k) serves as the expected regret or the suboptimality of the agent at the k-th episode. Thus, after playing for K episodes, the total (expected) regret is

Regret(K)=k=1K[V1(x1k)-V1πk(x1k)].

2.1 Linear Markov decision processes

We focus on a setting of a linear Markov decision process, where the transition kernels and the reward function are assumed to be linear. This assumption implies that the action-value function is linear, as we will show. Note that this is not the same as the assumption that the policy is a linear function—an assumption that has been the focus of much of the literature. Rather, it is akin to a statistical modeling assumption, in which we make assumptions about how data are generated and then study various estimators. Formally, we make the following definition.

Assumption A (Linear MDP [12, 29]).

MDP(𝒮,𝒜,H,,r) is a linear MDP with a feature map ϕ:𝒮×𝒜d, if for any h[H], there exist d unknown (signed) measures 𝝁h=(μh(1),,μh(d)) over 𝒮 and an unknown vector 𝜽hd, such that for any (x,a)𝒮×𝒜, we have

h(|x,a)=ϕ(x,a),𝝁h(),rh(x,a)=ϕ(x,a),𝜽h. (3)

Without loss of generality, we assume ϕ(x,a)1 for all (x,a)𝒮×𝒜, and max{𝝁h(𝒮),𝜽h}d for all h[H].

By definition, in a linear MDP, both the Markov transition model and the reward functions are linear in a feature mapping ϕ. We remark that despite being linear, the Markov transition model h(|x,a) can still have infinite degrees of freedom as the measure 𝝁h is unknown. This is a key difference from the linear quadratic regulator [1, 18, 4, 3, 15] or the recent work of Yang and Wang [50], whose transition models are completely specified by a finite-dimensional matrix such that the degrees of freedom are bounded.

Recall that we assume the reward functions are bounded in [0,1], which implies that the value functions are bounded in [0,H]. Our choice of normalization conditions in Assumption A implies that the following concrete examples serve as special cases of a linear MDP.

Example 2.1 (Tabular MDP).

For the scenario with finitely many states and actions, letting d=|𝒮|×|𝒜|, then each coordinate can be indexed by state-action pair (x,a)𝒮×𝒜. Let ϕ(x,a)=𝐞(x,a) be the canonical basis in d. Then if we set 𝐞(x,a)𝝁h()=h(|x,a) and 𝐞(x,a)𝜽h=rh(x,a) for any h[H], we recover the tabular MDP.

Example 2.2 (Simplex Feature Space).

When the feature space, {ϕ(x,a):(x,a)𝒮×𝒜}, is a subset of the d-dimensional simplex, {𝝍|i=1dψi=1 and ψi0 for all i}, a linear MDP can be instantiated by choosing 𝐞i𝝁h to be an arbitrary probability measure over 𝒮 and letting 𝜽h be any vector such that 𝜽h1.

As mentioned earlier, a crucial property of the linear MDP is that, for all policies, the action-value functions are always linear in the feature map ϕ. Therefore, when designing RL algorithms, it suffices to focus on linear action-value functions.

Proposition 2.3.

For a linear MDP, for any policy π, there exist weights {whπ}h[H] such that for any (x,a,h)S×A×[H], we have Qhπ(x,a)=ϕ(x,a),whπ.

We provide a proof of this proposition in Appendix A, where we also present additional discussion of the basic properties of a linear MDP.

3 Main Results

In this section, we present our main results, which provide sample complexity guarantees for Algorithm 3 in the linear MDP setting (Theorem 3.1) and in a misspecified setting (Theorem 3.2).

{algorithm}

[t] Least-Squares Value Iteration with UCB (LSVI-UCB) \[email protected]@algorithmic[1] \Forepisode k=1,,K \StateReceive the initial state x1k. \Forstep h=H,,1 \StateΛhτ=1k-1ϕ(xhτ,ahτ)ϕ(xhτ,ahτ)+λ𝐈. \State𝐰hΛh-1τ=1k-1ϕ(xhτ,ahτ)[rh(xhτ,ahτ)+maxaQh+1(xh+1τ,a)]. \StateQh(,)min{𝐰hϕ(,)+β[ϕ(,)Λh-1ϕ(,)]1/2,H}. \EndFor\Forstep h=1,,H \StateTake action ahkargmaxa𝒜Qh(xhk,a), and observe xh+1k. \EndFor\EndFor

We first lay out our algorithm (Algorithm 3)—an optimistic modification of Least-Square Value Iteration (LSVI), where the optimism is realized by Upper-Confidence Bounds (UCB). At a high level, each episode consists of two passes (or loops) over all steps. The first pass (line 3-6) updates the parameters (𝐰h,Λh) that are used to form the action-value function Qh. The second pass (line 7-8) executes the greedy policy, ah=argmaxa𝒜Qh(xh,a), according to the Qh obtained in the first pass. We note QH+1(,)0 since the agent receives no reward after the Hth step. For the first episode k=1, since the summation in line 4-5 is from τ=1 to 0, we simply have Λhλ𝐈 and 𝐰h0. Line 6 specifies the dependency of the action-value function Qh on the parameters 𝐰h and Λh, and no actual updates need to be performed.

The idea of Least-Square Value Iteration [12, 33] stems from the classical value-iteration algorithm, which finds the optimal policy (or action-value function) by applying the Bellman optimality equation Eq. (2) recursively:

Qh(x,a)[rh+hmaxa𝒜Qh+1(,a)](x,a),(x,a)𝒮×𝒜.

In practical RL with linear function approximation, there are two challenges to face in implementing the updates: First, h is unknown, and it is replaced by the samples observed empirically. Second, in the setting of large state space, we cannot iterate over all (x,a). We parametrize Qh(x,a) by a linear form 𝐰hϕ(x,a) instead. A natural idea here is to replace the Bellman update by solving for 𝐰h in a least-squares problem. In fact, the update of 𝐰h in Algorithm 3 solves precisely the following regularized least-squares problem:

𝐰hargmin𝐰dτ=1k-1[rh(xhτ,ahτ)+maxa𝒜Qh+1(xh+1τ,a)-𝐰ϕ(xhτ,ahτ)]2+λ𝐰2.

Algorithm 3 additionally adds an UCB bonus term of form β(ϕΛh-1ϕ)1/2 to encourage exploration, where Λh is the Gram matrix of the regularized least-squares problem, and β is a scalar. This form of bonus is common in the literature on linear bandits [13, 26]. Intuitively, m:=(ϕΛh-1ϕ)-1 represents the effective number of samples the agent has observed so far along the ϕ direction, and thus the bonus term β/m represents the uncertainty along the ϕ direction. It is called an upper confidence bound because, by choosing a proper value for β we can prove that, with high probability, Qh in line 3 of Algorithm 3 is always an upper bound of Qh for all state-action pair (see Lemma B.5).

We are now ready to state our main theorem, which gives a T-regret bound in the linear MDP setting without any further assumptions. Here, T=KH is the total number of steps.

Theorem 3.1.

Under Assumption A, there exists an absolute constant c>0 such that, for any fixed p(0,1), if we set λ=1 and β=cdHι in Algorithm 3 with ι:=log(2dT/p), then with probability 1-p, the total regret of LSVI-UCB (Algorithm 3) is at most O(d3H3Tι2), where O() hides only absolute constants.

Theorem 3.1 asserts that when λ and β are set properly, LSVI-UCB will suffer total regret at most 𝒪~(d3H3T). We emphasize that while a naive adaptation of existing linear bandit algorithms to this linear MDP setting easily yields a regret exponential in H, our regret is only polynomial in H. Avoiding this exponential dependency on the planning horizon is a key step in efficiently solving the sequential RL problem. Additionally, comparing to the minimax regret in a tabular setting, Θ~(H2SAT), our regret replaces the number of state-action pairs SA by a polynomial dependency on the intrinsic complexity measure of feature space, d. In fact, our regret is completely independent of S and A, which is crucial in the large state-space setting where function approximation is necessary. Please see also Section 5 for more discussion on the optimal dependencies on d and H.

We remark that Algorithm 3 only needs to store Λh,𝐰h, r(xhk,ahk) and {ϕ(xhk,a)}a𝒜 for all (h,k)[H]×[K], which takes 𝒪(d2H+dAT) space. When we compute Λh-1 by the Sherman-Morrison formula, the computational complexity of Algorithm 3 is dominated by line 5 in computing maxaQh+1(xh+1τ,a) for all τ[k]. This takes 𝒪(d2AK) time per step, which gives a total runtime 𝒪(d2AKT).

Finally, similarly to the discussion in Section 3.1 of [22], our regret bound (Theorem 3.1) directly translates to a sample complexity guarantee (or a PAC guarantee) in the following sense. When the initial state x1 is fixed for all episodes, then, with at least constant probability, we can learn an ε-optimal policy π which satisfies V(x1)-Vπ(x1)ε using 𝒪~(d3H4/ε2) samples. The algorithm to achieve this is to simply run Algorithm 3 for K=𝒪~(d3H3/ε2) episodes, and then output the greedy policy according to the action-value function Q at the kth episode, where k is sampled uniformly from [K].

3.1 Results for a misspecified setting

Theorem 3.1 hinges on the fact that the MDP has a linear structure. A natural follow-up question arises: what would happen if the underlying MDP is not linear, and thus misspecified? We first present a definition for an approximate linear model.

Assumption B (ζ-Approximate Linear MDP).

For any ζ1, we say that MDP(𝒮,𝒜,H,,r) is a ζ-approximate linear MDP with a feature map ϕ:𝒮×𝒜d, if for any h[H], there exist d unknown (signed) measures 𝝁h=(μh(1),,μh(d)) over 𝒮 and an unknown vector 𝜽hd such that for any (x,a)𝒮×𝒜, we have

h(|x,a)-ϕ(x,a),𝝁h()TVζ,|rh(x,a)-ϕ(x,a),𝜽h|ζ. (4)

Without loss of generality, we assume that ϕ(x,a)1 for all (x,a)𝒮×𝒜, and max{𝝁h(𝒮),𝜽h}d for all h[H].

By definition, an MDP is an ζ-approximately linear MDP if there exists a linear MDP such that their Markov transition dynamics and reward functions are close. Here the closeness between transition dynamics is measured in terms of total variation distance.

In general, an algorithm designed for a linear MDP could break down entirely if the underlying MDP is not linear. The following theorem states that this is not the case for our algorithm. It is in fact robust to small model misspecification. To achieve this, we need only to adopt a different hyperparameter β in different episodes.

Theorem 3.2.

Under Assumption B, there exists an absolute constant c>0 such that, for any fixed p(0,1), if we set λ=1 and βk=c(dι+ζkd)H in Algorithm 3 with ι:=log(2dT/p), then with probability 1-p, the total regret of LSVI-UCB (Algorithm 3) is at most O(d3H3Tι2+ζdHTι).

Compared with Theorem 3.1, Theorem 3.2 asserts that the LSVI-UCB algorithm will incur at most an additional 𝒪~(ζdHT) regret when the model is misspecified. This additional term is inevitably linear in T due the intrinsic bias introduced by linear approximation. When ζ is sufficiently small, i.e., the underlying MDP is not far away from being linear, our algorithm will still enjoy good theoretical guarantees.

Theorem 3.2 can also be converted to a PAC guarantee with a similar flavor. When the initial state x1 is fixed for all episodes, then, with at least constant probability, we can learn an ε-optimal policy π which satisfies V(x1)-Vπ(x1)ε+𝒪~(ζdH2) using 𝒪~(d3H4/ε2) samples.

4 Mechanisms

In this section, we overview several of the key ideas behind the regret bound in Theorem 3.1. We defer the full proof of Theorem 3.1 and Theorem 3.2 to Appendix B and Appendix C respectively.

In Section 3, we mentioned that the LSVI algorithm is motivated from the Bellman optimality equation Eq. (2). It remains to verify that line 3 in Algorithm 3 indeed well approximates the Bellman optimality equation, which turns out to require not only the linear MDP structure but also hinges on several other facts.

To simplify our presentation, in this section we treat the regularization parameter λ loosely as being sufficiently small so that Λh-1τ=1k-1ϕ(xhτ,ahτ)ϕ(xhτ,ahτ)𝐈. We will focus in this section on a fixed episode k, and drop the dependency of parameters and value functions on k when it is clear from the context. Now, ignoring the UCB bonus, the least-squares solution (line 3) gives the following estimate of the action-value function:

Qh(x,a)ϕ(x,a)𝐰h=ϕ(x,a)Λh-1τ=1k-1ϕ(xhτ,ahτ)[rh(xhτ,ahτ)+Vh+1(xh+1τ)],

where Vh+1()=maxa𝒜Qh+1(,a). Plugging in rh(,)=ϕ(,)𝜽h, we know the first term on the right-hand side approximates rh(x,a). Comparing this to Eq. (2), it remains to show why the second term of right-hand side approximates hVh+1(x,a). We thus define our empirical Markov transition measure as

^h(|x,a):=ϕ(x,a)Λh-1τ=1k-1ϕ(xhτ,ahτ)δ(,xh+1τ),

where the δ-measure δ(,x) puts an atom on element x. It remains to verify that ^hVh+1(x,a)hVh+1(x,a). To establish this, we use a measure ¯h to bridge these two quantities:

¯h(|x,a):=ϕ(x,a)Λh-1τ=1k-1ϕ(xhτ,ahτ)h(|xhτ,ahτ). (5)

Our analysis depends on the following two key steps.

Step 1: Prove ^hVh+1(x,a)¯hVh+1(x,a) via Value-Aware Uniform Concentration.

Computing the difference, we have (^h-¯h)Vh+1(x,a)=ϕ(x,a)Λh-1τ=1k-1ϕ(xhτ,ahτ)[Vh+1(xh+1τ)-hVh+1(xhτ,ahτ)]. Since 𝐱h+1τ is a sample from the distribution h(|xhτ,ahτ), we would expect this term to be small due to concentration. This would be the case if function Vh+1 is fixed and independent of the samples {xh+1τ}τ=1k-1. Then, Vh+1(xh+1τ)-hVh+1(xhτ,ahτ) is a zero-mean random variable in [-H,H], and we could aim to use a concentration inequality for self-normalized processes to bound (^h-¯h)Vh+1(x,a). Please see Theorem D.3 or [2] for more detail on this approach.

However, the function Vh+1 in Algorithm 3 is again computed by least-squares value iteration in later steps [h+1,H] and it thus inevitably depends on the choices of actions {ah+1τ}τ=1k-1, and thus also samples {xh+1τ}τ=1k-1. Therefore, the concentration of self-normalized process does not apply directly. To resolve this issue, we establish the uniform concentration over all value functions in the following class:

𝒱={V()|V()=min{maxa𝒜ϕ(,a)𝐰+βϕ(,a)Λ-1ϕ(,a),H},𝐰d,β,Λd×d}, (6)

where the parameters 𝐰,β,Λ are all bounded. We ensure that Algorithm 3 only uses value functions within this class 𝒱, which has a reasonably small covering number. This gives, with high probability, |(^h-¯h)Vh+1(x,a)|𝒪~(dH)(ϕ(x,a)Λh-1ϕ(x,a))1/2 (Lemma B.3).

Step 2: Show ¯hVh+1(x,a)hVh+1(x,a) due to Linear Markov Transitions.

One big challenge in RL with function approximation is that, due to the large state space, the learner may never visit the neighborhood of a state-action pair twice. This raises a question of how to use the experiences from other state-action pairs to infer information about a state-action pair of interest. In Eq. (5), P¯h(|x,a) provides such an estimate via regularized least-squares. Our modeling assumption of a linear MDP (Assumption A) ensures that this least-square estimate is valid: since h(|x,a)=ϕ(x,a)𝝁h() for any (x,a) pair, we have

¯h(|x,a)=ϕ(x,a)Λh-1τ=1k-1ϕ(xhτ,ahτ)ϕ(xhτ,ahτ)𝝁h()ϕ(x,a)𝝁h()=h(|x,a).

In summary, combining step 1 and step 2, we establish ^hVh+1(x,a)hVh+1(x,a), and hence show that LSVI approximates the optimal Bellman equation. We emphasize that despite being linear, the Markov transition model h(|x,a)=ϕ(x,a)𝝁h() can still have infinite degrees of freedom since the measure 𝝁h is unknown. Therefore, within a finite number of samples, no algorithm can establish that ^h and h are close in total variation distance. In contrast, our algorithm only requires ^hVh+1(x,a)hVh+1(x,a) for all value functions Vh+1 in a small function class 𝒱 (especially in step 1). This bypasses the need for fully learning the transition model h. Thus, our algorithm can also be viewed as “model-free” in this sense.

Finally, with the above key observations in mind, our proof proceeds by leveraging and adapting techniques from the literature on tabular MDP and linear bandits. Please see Appendix B and C for the details.

5 Conclusion

In this paper, we have presented the first provable RL algorithm with both polynomial runtime and polynomial sample complexity for linear MDPs, without requiring a “simulator” or additional assumptions. The algorithm is simply Least-Squares Value Iteration—a classical RL algorithm commonly studied in the setting of linear function approximation—with a UCB bonus. We hope that our work may serve as a first step towards a better understanding of efficient RL with function approximation.

We provide a few additional concluding observations.

On the optimal dependencies on d and H.

Theorem 3.1 claims the total regret to be upper bounded by 𝒪~(d3H3T). One immediate question is what the optimal dependencies on d and H are. Since our setting covers the standard tabular setting, as in shown in Example 2.1, a lower bound can be directly obtained through a reduction from the tabular setting, which gives Ω(dH2T) for the case of nonstationary transitions [22]. We believe the H difference between this lower bound and our upper bound is expected because the exploration bonus used in this paper is intrinsically “Hoeffding-type.” Using a “Bernstein-type” bonus can potentially help shave off one H factor (see [8, 22] for a similar phenomenon in the tabular setting).

In contrast, the optimal dependency on dimension d is more important but is also less clear. In the case where the number of actions is very large, one may attempt to use the lower bound in the linear bandit setting, Ω(dT), for the case H=1. We comment that as soon as H2 (where the Markov transition matters), the assumption of a linear MDP imposes structure on the feature space {ϕ(x,a)|(x,a)𝒮×𝒜} (see Proposition A.1). Technically, the standard constructions for the hard instances in the linear bandit lower bound do not respect this structure, so the lower bound does not directly apply. It remains an interesting future direction to determine this optimal dependency on d.

On the assumption of linear transition dynamics.

The main assumption in this paper is the linear MDP assumption (Assumption A), which requires the Markov transition h(|x,a) to be linear in ϕ(x,a). This requirement could be strong in practice. It turns out that our proof only relies on a weaker version of this assumption:

hV(x,a)=ϕ(x,a),𝐰V, for all V𝒱, (7)

where 𝐰V is a vector independent of (x,a) and 𝒱 is the class of value functions considered in this paper, as in Eq. (6). That is, we effectively only need that h(|x,a) appears to be linear when we apply it to a value function V. When there is additional problem structure in the feature map ϕ so that 𝒱 is relatively small and structured, Eq. (7) can potentially provide a usefully weaker condition compared to Assumption A.

When both the feature map ϕ and the policy π are fully generic, we comment that under mild conditions, the assumption of linear transition is then in fact necessary for the Bellman error to be zero for all policies π. Indeed, defining the Bellman operator 𝕋hπ associated with π as

(𝕋hπQ)(x,a)=rh(x,a)+𝔼xh(|x,a){Q(x,π(x))},(x,a)𝒮×𝒜, (8)

for any Q:𝒮×𝒜, we have the following proposition.

Proposition 5.1.

Let Q={Q|Q(,)=ϕ(,)w,wRd} be the family of linear action-value functions. Suppose that S is a finite set, and for any xS, there exist two actions a,a¯A such that ϕ(x,a)ϕ(x,a¯). Then, ThπQQ for all π only if the Markov transition measures Ph are linear in ϕ.

Finally, it remains an interesting future question whether an RL algorithm can be proved to be efficient without assuming a linear structure in the transition dynamics.

Acknowledgements

We thank Alekh Agarwal, Zeyuan Allen-Zhu, Sebastian Bubeck, Nan Jiang and Akshay Krishnamurthy for valuable discussions. This work was supported in part by the DARPA program on Lifelong Learning Machines.

References

  • Abbasi-Yadkori and Szepesvári [2011] Y. Abbasi-Yadkori and C. Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Conference on Learning Theory, pages 1–26, 2011.
  • Abbasi-Yadkori et al. [2011] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • Abbasi-Yadkori et al. [2019] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvári. Model-free linear quadratic control via reduction to expert prediction. In International Conference on Artificial Intelligence and Statistics, pages 3108–3117, 2019.
  • Abeille and Lazaric [2018] M. Abeille and A. Lazaric. Improved regret bounds for Thompson sampling in linear quadratic control problems. In International Conference on Machine Learning, pages 1–9, 2018.
  • Auer [2002] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Azar et al. [2011] M. G. Azar, R. Munos, M. Ghavamzadaeh, and H. J. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.
  • Azar et al. [2012] M. G. Azar, R. Munos, and B. Kappen. On the sample complexity of reinforcement learning with a generative model. arXiv preprint arXiv:1206.6461, 2012.
  • Azar et al. [2017] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272, 2017.
  • Azizzadenesheli et al. [2018] K. Azizzadenesheli, E. Brunskill, and A. Anandkumar. Efficient exploration through bayesian deep q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.
  • Baird [1995] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, pages 30–37, 1995.
  • Boyan and Moore [1995] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, pages 369–376, 1995.
  • Bradtke and Barto [1996] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
  • Bubeck and Cesa-Bianchi [2012] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Chu et al. [2011] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • Cohen et al. [2019] A. Cohen, T. Koren, and Y. Mansour. Learning linear-quadratic regulators efficiently with only T regret. arXiv preprint arXiv:1902.06223, 2019.
  • Dani et al. [2008] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.
  • Dann et al. [2017] C. Dann, T. Lattimore, and E. Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
  • Dean et al. [2018] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
  • Du et al. [2019] S. S. Du, Y. Luo, R. Wang, and H. Zhang. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. arXiv preprint arXiv:1906.06321, 2019.
  • Jaksch et al. [2010] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4):1563–1600, 2010.
  • Jiang et al. [2017] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low bellman rank are pac-learnable. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1704–1713. JMLR. org, 2017.
  • Jin et al. [2018] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • Kober and Peters [2012] J. Kober and J. Peters. Reinforcement learning in robotics: A survey. In Reinforcement Learning, pages 579–610. Springer, 2012.
  • Koenig and Simmons [1993] S. Koenig and R. G. Simmons. Complexity analysis of real-time reinforcement learning. In Association for the Advancement of Artificial Intelligence, pages 99–107, 1993.
  • Lattimore and Hutter [2012] T. Lattimore and M. Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320–334, 2012.
  • Lattimore and Szepesvári [2018] T. Lattimore and C. Szepesvári. Bandit algorithms. preprint, 2018.
  • Li et al. [2016] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
  • Li et al. [2010] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, pages 661–670, 2010.
  • Melo and Ribeiro [2007] F. S. Melo and M. I. Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322. Springer, 2007.
  • Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. [2016] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • Osband and Van Roy [2016] I. Osband and B. Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
  • Osband et al. [2014] I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
  • Puterman [2014] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • Rusmevichientong and Tsitsiklis [2010] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  • Schulman et al. [2015] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • Sidford et al. [2018] A. Sidford, M. Wang, X. Wu, and Y. Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In ACM-SIAM Symposium on Discrete Algorithms, pages 770–787, 2018.
  • Silver et al. [2016] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Strehl et al. [2006] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, pages 881–888, 2006.
  • Sutton [1988] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
  • Sutton and Barto [2011] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2011.
  • Szepesvári [2010] C. Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.
  • Tsitsiklis and Van Roy [1997] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-diffference learning with function approximation. In Advances in Neural Information Processing Systems, pages 1075–1081, 1997.
  • Vershynin [2010] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
  • Wainwright [2019] M. J. Wainwright. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
  • Wang et al. [2019] T. Wang, W. Ye, D. Geng, and C. Rudin. Towards practical Lipschitz stochastic bandits. arXiv preprint arXiv:1901.09277, 2019.
  • Wen and Van Roy [2013] Z. Wen and B. Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
  • Wen and Van Roy [2017] Z. Wen and B. Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762–782, 2017.
  • Yang and Wang [2019a] L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019a.
  • Yang and Wang [2019b] L. F. Yang and M. Wang. Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019b.
  • Zhu and Dunson [2019] X. Zhu and D. B. Dunson. Lipschitz bandit optimization with improved efficiency. arXiv preprint arXiv:1904.11131, 2019.

Appendix A Properties of Linear MDP

In this section, we present some of the basic properties of linear MDPs.

We start with the most important property of a linear MDP: the action-value function is always linear in the feature map ϕ for any policy.

See 2.3

Proof.

The linearity of the action-value functions directly follows from the Bellman equation in Eq. (1):

Qhπ(x,a)=r(x,a)+(hVh+1π)(x,a)=ϕ(x,a),𝜽h+𝒮Vh+1π(x)ϕ(x,a),d𝝁h(x).

Therefore, we have Qhπ(x,a)=ϕ(x,a),𝐰hπ where 𝐰hπ is given by 𝐰hπ=𝜽h+𝒮Vh+1π(x)d𝝁h(x). ∎

Second, we show that, under mild conditions, the assumption of a linear transition is necessary for the Bellman error to be zero for all policies π.

See 5.1

Proof.

For any fixed state x0𝒮, by assumption, there exist two actions a0 and a¯0 such that ϕ(x0,a0)ϕ(x0,a¯0). Then there exists 𝐰0d such that

𝐰0[ϕ(x0,a0)-ϕ(x0,a¯0)]=1. (9)

We define the function Q0(,)=ϕ(,)𝐰0. Additionally, let two policies π1 and π2 satisfy

π1(x)=π2(x),x𝒮\{x0},andπ1(x0)=a0,π2(x0)=a¯0. (10)

Now consider 𝕋hπ1Q0-𝕋hπ2Q0 for any h. By the definition of Bellman operator in Eq. (8), for any (x,a)𝒮×𝒜, we have

𝕋hπ1Q0(x,a)-𝕋hπ2Q0(x,a)=x𝒮h(x|x,a){Q0[x,π1(x)]-Q0[x,π2(x)]}
  =h(x0|x,a)[Q0(x0,a0)-Q0(x0,x¯0)]=h(x0|x,a)[ϕ(x0,a0)-ϕ(x0,x¯0)]𝐰0, (11)

where the second equality holds due to Eq. (10). Thus, by combining Eq. (9) and Eq. (A), we have

𝕋hπ1Q0(x,a)-𝕋hπ2Q0(x,a)=h(x0|x,a),(x,a)𝒮×𝒜.

Since 𝕋hπ𝒬𝒬 for all π, we know both 𝕋hπ1Q0 and 𝕋hπ2Q0 are elements of 𝒬, so is h(x0|,), which implies that h(x0|,) is a linear function of ϕ(,). That is, there exists a vector 𝝁(x0) independent of (x,a) so that h(x0|x,a)=ϕ(x,a),𝝁(x0) for all (x,a). Because this holds for all x0𝒮, we have h(|x,a)=ϕ(x,a),𝝁(). This concludes the proof. ∎

Finally, we note Assumption A also implicitly enforces the following structure on the feature space since h(|x,a) must be a probability measure over 𝒮 for any (x,a)𝒮×𝒜.

Proposition A.1.

For a linear MDP, for any (x,a,h)S×A×[H], we have

ϕ(x,a)𝝁h(𝒮)=1,ϕ(x,a)𝝁h()0, measurable 𝒮. (12)
Proof.

This proposition immediately follows from the fact that h(|x,a) is a probability measure over 𝒮 for any (x,a,h)𝒮×𝒜×[H]. ∎

In particular, the first condition in Eq. (12) requires the image of ϕ, {ϕ(x,a)|(x,a)𝒮×𝒜}, to be contained in a (d-1)-dimensional hyperphane.

Appendix B Proof of Theorem 3.1

In this section, we prove Theorem 3.1. We first introduce the notation that is used throughout this section. Then, we present lemmas and their proofs. Finally, we combine the lemmas to prove Theorem 3.1.

Notation:

Throughout this section, we denote Λhk, 𝐰hk, and Qhk as the parameters and the Q-value function estimate in episode k. Denote value function Vhk as Vhk(x)=maxaQhk(x,a). We also denote πk as the greedy policy induced by {Qhk}h=1H. To simplify our presentation, we always denote ϕhk:=ϕ(xhk,ahk).

First, we prove two lemmas which state that the linear weights 𝐰h in both the action-value functions and Algorithm 3 are bounded.

Lemma B.1 (Bound on Weights of Value Functions).

Under Assumption A, for any fixed policy π, let {whπ}h[H] be the corresponding weights such that Qhπ(x,a)=ϕ(x,a),whπ for all (x,a,h)S×A×[H]. Then, we have

h[H],𝐰hπ2Hd.
Proof.

By the Bellman equation in Eq. (1), we know, for any h[H]:

Qhπ(x,a)=(rh+hVh+1π)(x,a).

Since MDP is linear, by definition, this gives:

𝐰hπ=𝜽h+Vh+1π(x)d𝝁h(x).

Under the normalization conditions of Assumption A, the reward at each step is in [0,1], thus Vh+1π(x)H for any state x. Therefore, 𝜽hd, and Vh+1π(x)d𝝁h(x)Hd, which finishes the proof. ∎

Lemma B.2 (Bound on Weights in Algorithm).

For any (k,h)[K]×[H], the weight whk in Algorithm 3 satisfies:

𝐰hk2Hdk/λ.
Proof.

For any vector 𝐯d, we have

|𝐯𝐰hk| =|𝐯(Λhk)-1τ=1k-1ϕhτ[r(xhτ,ahτ)+maxaQh+1(xh+1τ,a)]|
τ=1k-1|𝐯(Λhk)-1ϕhτ|2H[τ=1k-1𝐯(Λhk)-1𝐯][τ=1k-1(ϕhτ)(Λhk)-1ϕhτ]2H
2H𝐯dk/λ,

where the last step is due to Lemma D.1. The remainder of the proof follows from the fact that 𝐰hk=max𝐯:𝐯=1|𝐯𝐰hk|. ∎

Second, we present our main concentration lemma, which is crucial in controlling the fluctuations in least-squares value iteration.

Lemma B.3.

Under the setting of Theorem 3.1, let cβ be the constant in our definition of β (i.e., β=cβdHι). There exists an absolute constant C that is independent of cβ such that for any fixed p[0,1], if we let E be the event that:

(k,h)[K]×[H]:τ=1k-1ϕhτ[Vh+1k(xh+1τ)-hVh+1k(xhτ,ahτ)](Λhk)-1CdHχ,

where χ=log[2(cβ+1)dT/p], then P(E)1-p/2.

Proof.

For all (k,h)[K]×[H], by Lemma B.2 we have 𝐰hk2Hdk/λ. In addition, by the construction of Λhk, the minimum eigenvalue of Λhk is lower bounded by λ. Thus, by combining Lemmas D.4 and D.6, we have for any fixed ε>0 that:

τ=1k-1ϕhτ[Vh+1k(xh+1τ)-hVh+1k(xhτ,ahτ)](Λhk)-12 (13)
  4H2[d2log(k+λλ)+dlog(1+8Hdkελ)+d2log(1+8d1/2β2ε2λ)+log(2p)]+8k2ε2λ.

Notice that we choose the hyperparameters λ=1 and β=CdHι where C is an absolute constant. Finally, picking ε=dH/k, by Eq. (13), there exists a absolute constant C>0 that is independent of cβ such that

τ=1k-1ϕhτ[Vh+1k(xh+1τ)-hVh+1k(xhτ,ahτ)](Λhk)-12Cd2H2log[2(cβ+1)dT/p],

which concludes the proof. ∎

Next, we recursively bound the difference between the value function maintained in Algorithm 3 (without bonus) and the true value function of any policy π. We bound this difference using their expected difference at next step, plus a error term. This error term can be upper bounded by our bonus with high probability. This is the key technical lemma in this section.

Lemma B.4.

There exists an absolute constant cβ such that for β=cβdHι where ι=log(2dT/p), and for any fixed policy π, on the event E defined in Lemma B.3, we have for all (x,a,h,k)S×A×[H]×[K] that:

ϕ(x,a),𝐰hk-Qhπ(x,a)=h(Vh+1k-Vh+1π)(x,a)+Δhk(x,a),

for some Δhk(x,a) that satisfies |Δhk(x,a)|βϕ(x,a)(Λhk)-1ϕ(x,a).

Proof.

By Proposition 2.3 and the Bellman equation Eq. (1), we know for any (x,a,h)𝒮×𝒜×[H]:

Qhπ(x,a):=ϕ(x,a),𝐰hπ=(rh+hVh+1π)(x,a).

This gives:

𝐰hk-𝐰hπ =(Λhk)-1τ=1k-1ϕhτ[rhτ+Vh+1k(xh+1τ)]-𝐰hπ
=(Λhk)-1{-λ𝐰hπ+τ=1k-1ϕhτ[Vh+1k(xh+1τ)-hVh+1π(xhτ,ahτ)]}
=-λ(Λhk)-1𝐰hπ𝐪1+(Λhk)-1τ=1k-1ϕhτ[Vh+1k(xh+1τ)-hVh+1k(xhτ,ahτ)]𝐪2
+(Λhk)-1τ=1k-1ϕhτh(Vh+1k-Vh+1π)(xhτ,ahτ)𝐪3.

Now, we bound the terms on the right-hand side individually. For the first term,

|ϕ(x,a),𝐪1|=|λϕ(x,a),(Λhk)-1𝐰hπ|λ𝐰hπϕ(x,a)(Λhk)-1ϕ(x,a).

For the second term, given the event 𝔈 defined in Lemma B.3, we have:

|ϕ(x,a),𝐪2|c0dHχϕ(x,a)(Λhk)-1ϕ(x,a)

for an absolute constant c0 independent of cβ, and χ=log[2(cβ+1)dT/p]. For the third term,

ϕ(x,a),𝐪3 =ϕ(x,a),(Λhk)-1τ=1k-1ϕhτh(Vh+1k-Vh+1π)(xhτ,ahτ)
=ϕ(x,a),(Λhk)-1τ=1k-1ϕhτ(ϕhτ)(Vh+1k-Vh+1π)(x)d𝝁h(x)
=ϕ(x,a),(Vh+1k-Vh+1π)(x)d𝝁h(x)p1-λϕ(x,a),(Λhk)-1(Vh+1k-Vh+1π)(x)d𝝁h(x)p2,

where, by Eq. (3), we have

p1=h(Vh+1k-Vh+1π)(x,a),|p2|2Hdλϕ(x,a)(Λhk)-1ϕ(x,a).

Finally, since ϕ(x,a),𝐰hk-Qhπ(x,a)=ϕ(x,a),𝐰hk-𝐰hπ=ϕ(x,a),𝐪1+𝐪2+𝐪3, by Lemma B.1 and our choice of parameter λ, we have

|ϕ(x,a),𝐰hk-Qhπ(x,a)-h(Vh+1k-Vh+1π)(x,a)|cdHχϕ(x,a)(Λhk)-1ϕ(x,a),

for an absolute constant c independent of cβ. Finally, to prove this lemma, we only need to show that there exists a choice of absolute constant cβ so that

cι+log(cβ+1)cβι, (14)

where ι=log(2dT/p). We know ι[log2,) by its definition, and c is an absolute constant independent of cβ. Therefore, we can pick an absolute constant cβ which satisfies clog2+log(cβ+1)cβlog2. This choice of cβ will make Eq. (14) hold for all ι[log2,), which finishes the proof. ∎

Lemma B.4 implies that by adding appropriate bonuses, Qhk in Algorithm 3 can be always an upper bound of Qh with high confidence.

Lemma B.5 (UCB).

Under the setting of Theorem 3.1, on the event E defined in Lemma B.3, we have Qhk(x,a)Qh(x,a) for all (x,a,h,k)S×A×[H]×[K].

Proof.

We prove this lemma by induction.

First, we prove the base case, at the last step H. The statement holds because QHk(x,a)QH(x,a). Since the value function at H+1 step is zero, by Lemma B.4, we have:

|ϕ(x,a),𝐰Hk-QH(x,a)|βϕ(x,a)(ΛHk)-1ϕ(x,a).

Therefore, we know:

QH(x,a)min{ϕ(x,a),𝐰Hk+βϕ(x,a)(ΛHk)-1ϕ(x,a),H}=QHk(x,a).

Now, suppose the statement holds true at step h+1 and consider step h. Again, by Lemma B.4, we have:

|ϕ(x,a),𝐰hk-Qh(x,a)-h(Vh+1k-Vh+1)(x,a)|βϕ(x,a)(Λhk)-1ϕ(x,a).

By the induction assumption that h(Vh+1k-Vh+1)(x,a)0, we have:

Qh(x,a)min{ϕ(x,a),𝐰hk+βϕ(x,a)(Λhk)-1ϕ(x,a),H}=Qhk(x,a),

which concludes the proof. ∎

Lemma B.4 also easily transforms to a recursive formula for δhk=Vhk(xhk)-Vhπk(xhk). This formula will be very useful in proving the main theorem.

Lemma B.6 (Recursive formula).

Let δhk=Vhk(xhk)-Vhπk(xhk), and ζh+1k=E[δh+1k|xhk,ahk]-δh+1k. Then, on the event E defined in Lemma B.3, we have the following for any (k,h)[K]×[H]:

δhkδh+1k+ζh+1k+2β(ϕhk)(Λhk)-1ϕhk.
Proof.

By Lemma B.4 we have that for any (x,a,h,k)𝒮×𝒜×[H]×[K]:

Qhk(x,a)-Qhπk(x,a)h(Vh+1k-Vh+1πk)(x,a)+2βϕ(x,a)(Λhk)-1ϕ(x,a)

and finally, by Algorithm 3 and the definition of Vπk we have

δhk=Qhk(xhk,ahk)-Qhπk(xhk,ahk),

which finishes the proof. ∎

Finally, we are ready to prove the main theorem. We restate our main theorem as follows.

See 3.1

Proof.

We use the notion of δhk and ζhk as in Lemma B.6. We condition on the event 𝔈 defined in Lemma B.3 with δ=p/2. By Lemmas B.5 and B.6, we have

Regret(K) =k=1K[V1(x1k)-V1πk(x1k)]k=1Kδ1kk=1Kh=1Hζhk+2βk=1Kh=1H(ϕhk)(Λhk)-1ϕhk. (15)

We now bound the two terms on the right-hand side of Eq. (15) separately. For the first term, since the computation of Vhk is independent of the new observation xhk at episode k, we obtain that {ζhk} is a martingale difference sequence satisfying |ζhk|2H for all (k,h). Therefore, by the Azuma-Hoeffding inequality, for any t>0, we have

(k=1Kh=1Hζhk>t)exp(-t22TH2).

Hence, with probability at least 1-p/2, we have

k=1Kh=1Hζhk2TH2log(2/p)2HTι, (16)

where ι=log(2dT/p). Furthermore, for the second term, note that the minimum eigenvalue of Λhk is at least λ (which equals to 1) for all (k,h)[K]×[H]. Also notice that ϕhk1. By Lemma D.2, for any h[H], we have

k=1K(ϕhk)(Λhk)-1ϕhk2log[det(Λhk+1)det(Λh1)].

Moreover, note that Λhk+1=τ=1kϕhk(ϕhk)+λ𝐈λ+k; this gives

k=1K(ϕhk)(Λhk)-1ϕhk2dlog[λ+kλ]2dι. (17)

Now, by the Cauchy-Schwartz inequality, we have

k=1Kh=1H(ϕhk)(Λhk)-1ϕhkh=1HK[k=1K(ϕhk)(Λhk)-1ϕhk]1/2H2dKι, (18)

which yields an upper bound on the second term in Eq. (15). Finally, combining Eq. (15), Eq. (16), Eq. (18), and with our choice of β=cdHι for some absolute constant c, we conclude that with probability 1-p:

Regret(K)2HTι+βH2dKιcd3H3Tι2,

for some absolute constant c. This concludes the proof. ∎

Appendix C Proof of Theorem 3.2

In this section, we prove Theorem 3.2. At a high level, the proof structure is similar to the structure in Appendix B. We will particularly focus on the parts that require different treatments in the misspecified setting.

Notation:

Throughout this section, we denote Λhk, 𝐰hk, and Qhk as the parameters and the Q-value functions estimated in episode k. Denote the value function Vhk as Vhk(x)=maxaQhk(x,a). We denote πk as the greedy policy induced by {Qhk}h=1H. To simplify the presentation, we denote ϕhk:=ϕ(xhk,ahk).

First, we establish a lemma that is the counterpart of Lemma 2.3 in the misspecified setting: for any policy π, its action-value function is always close to a linear function.

Lemma C.1.

For a ζ-nearly linear MDP, for any policy π, there exist corresponding weights {whπ}h[H] where whπ=𝛉h+Vh+1π(x)d𝛍h(x) such that for any (x,a,h)S×A×[H]:

|Qhπ(x,a)-ϕ(x,a),𝐰hπ|2Hζ.
Proof.

Since 𝝁h and 𝜽h satisfy Eq.(4), we have:

|Qhπ(x,a)-ϕ(x,a),𝐰hπ|
  |rh(x,a)-ϕ(x,a),𝜽h|+|hVh+1π(x,a)-ϕ(x,a),Vh+1π(x)d𝝁h(x)|
  ζ+Hζ2Hζ,

which finishes the proof. ∎

We can again show that the linear weights defined in Lemma C.1 are bounded.

Lemma C.2 (Bound on Weights of Value Functions).

Under Assumption B, for any policy π, let {whπ}h[H] be the corresponding weights as defined in Lemma C.1. Then, we have

h[H],𝐰hπ2Hd.
Proof.

Under the normalization conditions of Assumption B, the reward at each step is in [0,1], thus Vh+1π(x)H for any state x. Therefore, 𝜽hd, and Vh+1π(x)d𝝁h(x)Hd, which finishes the proof. ∎

Similar to Lemma B.3, we also bound the stochastic noise in concentration.

Lemma C.3.

Under the setting of Theorem 3.2, let cβ be the constant in our choice of βk (i.e. βk=cβ(dι+ζkd)H), There exists an absolute constant C that is independent of cβ such that for any fixed p[0,1], if we let E be the event that:

(k,h)[K]×[H]:τ=1k-1ϕhτ[Vh+1k(xh+1τ)-hVh+1k(xhτ,ahτ)](Λhk)-1CdHχ,

where χ=log[2(cβ+1)dT/p], then P(E)1-p/2.

Proof.

The proof is essentially the same as the proof for Lemma B.3, with the only difference that βk is now bounded by cβ(dι+ζKd)H instead of cβdHι as in Lemma B.3. Because ζ1 as in Assumption B, the new bound of βk only affects the choice of absolute C in Lemma C.3. ∎

In the misspecified case, we also need to bound an error term where the noise can be potentially adversarial instead of stochastic. The adversarial noise is precisely due to model misspecification.

Lemma C.4.

Let {ετ} be any sequence so that |ετ|B for any τ. Then, we have for any (h,k)[H]×[K] and any ϕRd that:

|ϕ(Λhk)-1τ=1k-1ϕhτετ|Bdkϕ(Λhk)-1ϕ.
Proof.

By the Cauchy-Schwarz inequality,

|ϕ(Λhk)-1τ=1k-1ϕhτετ| τ=1k-1|ϕ(Λhk)-1ϕhτ|B[τ=1k-1ϕ(Λhk)-1ϕ][τ=1k-1(ϕhτ)(Λhk)-1ϕhτ]B
Bdkϕ(Λhk)-1ϕ,

where the last inequality is due to Lemma D.1. ∎

Now we are ready to prove the key lemma, which is the counterpart of Lemma B.4.

Lemma C.5.

There exists an absolute constant cβ such that for βk=cβ(dι+ζkd)H where ι=log(2dT/p), and for any fixed policy π, on the event E defined in Lemma C.3, we have for all (x,a,h,k)S×A×[H]×[K] that:

ϕ(x,a),𝐰hk-Qhπ(x,a)=h(Vh+1k-Vh+1π)(x,a)+Δhk(x,a),

for some Δhk(x,a) that satisfies |Δhk(x,a)|βkϕ(x,a)(Λhk)-1ϕ(x,a)+4Hζ.

Proof.

By Lemma C.1, there exists 𝐰hπ=𝜽h+Vh+1π(x)d𝝁h(x) so that for any (x,a,h)𝒮×𝒜×[H]:

|Qhπ(x,a)-ϕ(x,a),𝐰hπ|2Hζ.

On the other hand, let ~(|x,a)=ϕ(x,a),𝝁h(). Then, for any (x,a)𝒮×𝒜, we have

ϕ(x,a),𝐰hπ=ϕ(x,a),𝜽h+~hVh+1π(x,a).

This further gives

𝐰hk-𝐰hπ=(Λhk)-1τ=1k-1ϕhτ[rhτ+Vh+1k(xh+1τ)]-𝐰hπ
=(Λhk)-1{-λ𝐰hπ+τ=1k-1ϕhτ[rhτ+Vh+1k(xh+1τ)-(ϕhτ)𝜽h-~hVh+1π(xhτ,ahτ)]}
=-λ(Λhk)-1𝐰hπ𝐪1+(Λhk)-1τ=1k-1ϕhτ[Vh+1k(xh+1τ)-hVh+1k(xhτ,ahτ)]𝐪2+(Λhk)-1τ=1k-1ϕhτ~h(Vh+1k-Vh+1π)(xhτ,ahτ)𝐪3
 +(Λhk)-1τ=1k-1ϕhτ[rhτ-(ϕhτ)𝜽h+(h-~h)Vh+1k(xhτ,ahτ)]𝐪4.

Now, we bound the terms on the right-hand side individually. For the first term,

|ϕ(x,a),𝐪1|=|λϕ(x,a),(Λhk)-1𝐰hπ|λ𝐰hπϕ(x,a)(Λhk)-1ϕ(x,a).

For the second term, given the event 𝔈 defined in Lemma C.3, we have:

|ϕ(x,a),𝐪2|c0dHχϕ(x,a)(Λhk)-1ϕ(x,a),

for an absolute constant c0 independent of cβ, and χ=log[2(cβ+1)dT/p]. For the third term,

ϕ(x,a),𝐪3 =ϕ(x,a),(Λhk)-1τ=1k-1ϕhτ~h(Vh+1k-Vh+1π)(xhτ,ahτ)
=ϕ(x,a),(Λhk)-1τ=1k-1ϕhτ(ϕhτ)(Vh+1k-Vh+1π)(x)d𝝁h(x)
=ϕ(x,a),(Vh+1k-Vh+1π)(x)d𝝁h(x)p1-λϕ(x,a),(Λhk)-1(Vh+1k-Vh+1π)(x)d𝝁h(x)p2,

where by definition of ~h, we have

p1=~h(Vh+1k-Vh+1π)(x,a),|p2|2Hdλϕ(x,a)(Λhk)-1ϕ(x,a).

Since ~h-hTVζ, we have

|p1-h(Vh+1k-Vh+1π)(x,a)|=|(h-P~h)(Vh+1k-Vh+1π)(x,a)|2Hζ.

For the fourth term, by Lemma C.4, we have

|ϕ(x,a),𝐪4|2Hζdkϕ(x,a)(Λhk)-1ϕ(x,a).

Finally, since ϕ(x,a),𝐰hk-𝐰hπ=ϕ(x,a),𝐪1+𝐪2+𝐪3+𝐪4, we have:

|ϕ(x,a),𝐰hk-Qhπ(x,a)-h(Vh+1k-Vh+1π)(x,a)|(cdχ+2ζkd)Hϕ(x,a)(Λhk)-1ϕ(x,a)+4Hζ,

for an absolute constant c independent of cβ. As in the proof of Lemma B.4, to prove this lemma, we only need to show that there exists a choice of absolute constant cβ so that cβ2, and

cι+log(cβ+1)cβι for any ι[log2,).

This can be done by an picking absolute constant cβ that satisfies clog2+log(cβ+1)cβlog2. ∎

Given Lemma C.5, we can now easily proceed to prove that Qhk is a upper bound of Qh up to an error that depends linearly on the misspecification ζ.

Lemma C.6 (UCB).

Under the setting of Theorem 3.2, on the event E defined in Lemma C.3, we have Qhk(x,a)Qh(x,a)-4H(H+1-h)ζ for all (x,a,h,k)S×A×[H]×[K].

Proof.

We prove this lemma by induction.

First, we consider the base case. The statement holds for the last step H, i.e., QHk(x,a)QH(x,a)-4Hζ. Since the value function at H+1 step is zero, by Lemma C.5, we have:

|ϕ(x,a),𝐰Hk-QH(x,a)|βkϕ(x,a)(ΛHk)-1ϕ(x,a)+4Hζ.

Therefore, we obtain that

QH(x,a)-4Hζmin{ϕ(x,a),𝐰Hk+βkϕ(x,a)(ΛHk)-1ϕ(x,a),H}=QHk(x,a).

Now, suppose the statement holds true at step h+1, and consider step h. Again, by Lemma C.5, we have:

|ϕ(x,a),𝐰hk-Qh(x,a)-h(Vh+1k-Vh+1)(x,a)|βkϕ(x,a)(Λhk)-1ϕ(x,a)+4Hζ.

By the induction assumption that h(Vh+1k-Vh+1)(x,a)-4H(H-h)ζ, we have:

Qh(x,a)-4H(H+1-h)ζmin{ϕ(x,a),𝐰hk+βkϕ(x,a)(Λhk)-1ϕ(x,a),H}=Qhk(x,a).

Therefore, we conclude the proof of this lemma. ∎

The gap δhk=Vhk(xhk)-Vhπk(xhk) also has a recursive formula similar to Lemma B.6.

Lemma C.7 (Recursive formula).

Let δhk=Vhk(xhk)-Vhπk(xhk), and ζh+1k=E[δh+1k|xhk,ahk]-δh+1k. Then, on the event E defined in Lemma C.3, we have the following for any (k,h)[K]×[H]:

δhkδh+1k+ζh+1k+2βk(ϕhk)(Λhk)-1ϕhk.
Proof.

This is because by Lemma C.5, we have for any (x,a,h,k)𝒮×𝒜×[H]×[K]:

Qhk(x,a)-Qhπk(x,a)h(Vh+1k-Vh+1πk)(x,a)+2βkϕ(x,a)(Λhk)-1ϕ(x,a)+4Hζ.

Finally, by Algorithm 3 and the definition of Vπk we have

δhk=Qhk(xhk,ahk)-Qhπk(xhk,ahk),

which finishes the proof. ∎

Finally, we are ready to combine all previous lemmas to prove the main theorem in the misspecified setting.

See 3.2

Proof of Theorem 3.2.

The proof of this theorem is similar to that of Theorem 3.1. We condition on the event 𝔈 defined in Lemma C.3. For for any (k,h)[K]×[H], we define δhk=Vhk(xhk)-Vhπk(xhk). By Lemma C.6, we have Q1k(x,a)Q1*(x,a)-4H2ζ for all k[K], which implies that V1(x1k)-V1πk(x1k)δ1k+4H2ζ. Furthermore, by Lemma C.7, on the event 𝔈 we have:

Regret(K) =k=1K[V1(x1k)-V1πk(x1k)]k=1K[δ1k+4H2ζ]
k=1Kh=1Hζhk+2k=1Kβkh=1H(ϕhk)(Λhk)-1ϕhk+4HTζ, (19)

where we use the fact that T=HK. Since {ζhk} is a martingale difference sequence with each term bounded by 2H, the Azuma-Hoeffding inequality implies that

k=1Kh=1Hζhk2TH2log(2/p)2HTι (20)

holds with probability at least 1-p/2, where ι=log(2dT/p). Moreover, by the Cauchy-Schwarz inequality, we have

k=1Kβk(ϕhk)(Λhk)-1ϕhk[k=1Kβk2]1/2[k=1K(ϕhk)(Λhk)-1ϕhk]1/2. (21)

Similarly to Eq. (17), we have

[k=1K(ϕhk)(Λhk)-1ϕhk]1/22dι. (22)

Moreover, since we set βk=c(dι+ζkd)H for some absolute constant c>0, we have

k=1Kβk2=c2k=1K(dι+ζkd)2H22c2k=1K(d2ι+ζ2kd)H22c2(d2HTι+ζ2T2d),

which implies that

(k=1Kβk2)1/22c(dHTι+ζTd). (23)

Therefore, combining Eq. (21), Eq. (22), and Eq. (23), we have

k=1Kβkh=1H(ϕhk)(Λhk)-1ϕhk2c(d3H3Tι2+ζdHTι). (24)

Finally, combining Eq. (C), Eq. (20), and Eq. (24), we obtain

Regret(K)c′′(d3H3Tι2+ζdHTι),

for some absolute constant c′′. This concludes the proof of the theorem. ∎

Appendix D Auxiliary Lemmas

This section presents several auxiliary lemmas and their proofs.

D.1 Important inequalities for summations

First, we present a few important short inequalities for summations.

Lemma D.1.

Let Λt=λI+i=1tϕiϕi where ϕiRd and λ>0. Then:

i=1tϕi(Λt)-1ϕid.
Proof.

We have i=1tϕi(Λt)-1ϕi=i=1ttr(ϕi(Λt)-1ϕi)=tr((Λt)-1i=1tϕiϕi). Given the eigenvalue decomposition i=1tϕiϕi=𝐔diag(λ1,,λd)𝐔, we have Λt=𝐔diag(λ1+λ,,λd+λ)𝐔, and tr((Λt)-1i=1tϕiϕi)=j=1dλj/(λj+λ)d. ∎

Lemma D.2 ([2]).

Let {ϕt}t0 be a bounded sequence in Rd satisfying supt0ϕt1. Let Λ0Rd×d be a positive definite matrix. For any t0, we define Λt=Λ0+j=1tϕjϕj. Then, if the smallest eigenvalue of Λ0 satisfies λmin(Λ0)1, we have

log[det(Λt)det(Λ0)]j=1tϕjΛj-1-1ϕj2log[det(Λt)det(Λ0)].
Proof.

Since λmin(Λ0)1 and ϕt1 for all j0, we have

ϕjΛj-1-1ϕj[λmin(Λ0)]-1ϕj21,j0.

Note that, for any x[0,1], it holds that log(1+x)x2log(1+x). Therefore, we have

j=1tlog(1+ϕjΛj-1-1ϕj)j=1tϕjΛj-1-1ϕj2j=1tlog(1+ϕjΛj-1-1ϕj). (25)

Moreover, for any t0, by the definition of Λt, we have

det(Λt)=det(Λt-1+ϕtϕt)=det(Λt-1)det(𝐈+Λt-1-1/2ϕtϕtΛt-1-1/2).

Since det(𝐈+Λt-1-1/2ϕtϕtΛt-1-1/2)=1+ϕtΛt-1-1ϕt, the recursion gives:

j=1tlog(1+ϕjΛj-1-1ϕj)=logdet(Λt)-logdet(Λ0). (26)

Therefore, combining Eq. (25) and Eq. (26), we conclude the proof. ∎

D.2 Concentration inequalities for self-normalized processes

Next, we present a few concentration inequalities. The following one provides a concentration inequality for the standard self-normalized processes.

Theorem D.3 (Concentration of Self-Normalized Processes [2]).

Let {εt}t=1 be a real-valued stochastic process with corresponding filtration {Ft}t=0. Let εt|Ft-1 be zero-mean and σ-subGaussian; i.e. E[εt|Ft-1]=0, and

λ,𝔼[eλεt|t-1]eλ2σ2/2.

Let {ϕt}t=0 be an Rd-valued stochastic process where ϕtFt-1. Assume Λ0 is a d×d positive definite matrix, and let Λt=Λ0+s=1tϕsϕs. Then for any δ>0, with probability at least 1-δ, we have for all t0:

s=1tϕsεsΛt-122σ2log[det(Λt)1/2det(Λ0)-1/2δ].

When specializing this concentration inequality to our setting, we require uniform concentration over all value functions V within a function class 𝒱. This uniform concentration incurs an additional term that depends logarithmically on the covering number of 𝒱.

Lemma D.4.

Let {xτ}τ=1 be a stochastic process on state space S with corresponding filtration {Fτ}τ=0. Let {ϕτ}τ=0 be an Rd-valued stochastic process where ϕτFτ-1, and ϕτ1. Let Λk=λI+τ=1kϕτϕτ. Then for any δ>0, with probability at least 1-δ, for all k0, and any VV so that supx|V(x)|H, we have:

τ=1kϕτ{V(xτ)-𝔼[V(xτ)|τ-1]}Λk-124H2[d2log(k+λλ)+log𝒩εδ]+8k2ε2λ,

where Nε is the ε-covering number of V with respect to the distance dist(V,V)=supx|V(x)-V(x)|.

Proof.

For any V𝒱, we know there exists a V~ in the ε-covering such that

V=V~+ΔV and supx|ΔV(x)|ε.

This gives following decomposition:

τ=1kϕτ{V(xτ)-𝔼[V(xτ)|τ-1]}Λk-12
  2τ=1kϕτ{V~(xτ)-𝔼[V~(xτ)|τ-1]}Λk-12+2τ=1kϕτ{ΔV(xτ)-𝔼[ΔV(xτ)|τ-1]}Λk-12,

where we can apply Theorem D.3 and a union bound to the first term. Also, it is not hard to bound the second term by 8k2ε2/λ. ∎

To compute the covering number of function class 𝒱, we first require a basic result on the covering number of a Euclidean ball as follows. We refer readers to classical material, such as Lemma 5.2 in [44], for its proof.

Lemma D.5 (Covering Number of Euclidean Ball).

For any ε>0, the ε-covering number of the Euclidean ball in Rd with radius R>0 is upper bounded by (1+2R/ε)d.

Now, we are ready to compute the covering number of 𝒱.

Lemma D.6.

Let V denote a class of functions mapping from S to R with following parametric form

V()=min{maxa𝐰ϕ(,a)+βϕ(,a)Λ-1ϕ(,a),H},

where the parameters (w,β,Λ) satisfy wL, β[0,B] and the minimum eigenvalue satisfies λmin(Λ)λ. Assume ϕ(x,a)1 for all (x,a) pairs, and let Nε be the ε-covering number of V with respect to the distance dist(V,V)=supx|V(x)-V(x)|. Then

log𝒩εdlog(1+4L/ε)+d2log[1+8d1/2B2/(λε2)].
Proof.

Equivalently, we can reparametrize the function class 𝒱 by let 𝐀=β2Λ-1, so we have

V()=min{maxa𝐰ϕ(,a)+ϕ(,a)𝐀ϕ(,a),H} (27)

for 𝐰L and 𝐀B2λ-1. For any two functions V1,V2𝒱, let them take the form in Eq. (27) with parameters (𝐰1,𝐀1) and (𝐰2,𝐀2), respectively. Then, since both min{,H} and maxa are contraction maps, we have

dist(V1,V2) supx,a|[𝐰1ϕ(x,a)+ϕ(x,a)𝐀2ϕ(x,a)]-[𝐰2ϕ(x,a)+ϕ(x,a)𝐀2ϕ(x,a)]|
supϕ:ϕ1|[𝐰1ϕ+ϕ𝐀2ϕ]-[𝐰2ϕ+ϕ𝐀2ϕ]|
supϕ:ϕ1|(𝐰1-𝐰2)ϕ|+supϕ:ϕ1|ϕ(𝐀1-𝐀2)ϕ|
=𝐰1-𝐰2+𝐀1-𝐀2𝐰1-𝐰2+𝐀1-𝐀2F, (28)

where the second last inequality follows from the fact that |x-y||x-y| holds for any x,y0. For matrices, and F denote the matrix operator norm and Frobenius norm respectively.

Let 𝒞𝐰 be an ε/2-cover of {𝐰d|𝐰L} with respect to the 2-norm, and 𝒞𝐀 be an ε2/4-cover of {𝐀d×d|𝐀Fd1/2B2λ-1} with respect to the Frobenius norm. By Lemma D.5, we know:

|𝒞𝐰|(1+4L/ε)d,|𝒞𝐀|[1+8d1/2B2/(λε2)]d2.

By Eq. (D.2), for any V1𝒱, there exists 𝐰2𝒞𝐰 and 𝐀2𝒞𝐀 such that V2 parametrized by (𝐰2,𝐀2) satisfies dist(V1,V2)ε. Hence, it holds that 𝒩ε|𝒞𝐰||𝒞𝐀|, which gives:

log𝒩εlog|𝒞𝐰|+log|𝒞𝐀|dlog(1+4L/ε)+d2log[1+8d1/2B2/(λε2)].

This concludes the proof. ∎