Finite-Sample Analysis of Nonlinear Stochastic Approximation with Applications in Reinforcement Learning

  • 2020-07-30 00:56:28
  • Zaiwei Chen, Sheng Zhang, Thinh T. Doan, John-Paul Clarke, Siva Theja Maguluri
  • 0

Abstract

Motivated by applications in Reinforcement Learning (RL), in this paper, westudy a nonlinear Stochastic Approximation (SA) algorithm under Markoviannoise, and derive its finite-sample convergence bounds. Our proof is based onthe Lyapunov drift arguments, and to handle the Markovian noise, we exploit thefast mixing of the underlying Markov chain. Our result is used to show the finite-sample bounds of the popular Q-learningwith linear function approximation algorithm for solving the RL problem. SinceQ-learning with linear function approximation may diverge in general, we studyit under a condition on the behavior policy that ensures the stability of thealgorithm. Due to the generality of our SA results, we do not need to make theunnatural assumption that the samples are i.i.d. (since they are Markovian),and do not require an additional projection step in the algorithm to maintainthe boundedness of the iterates.

 

Quick Read (beta)

Finite-Sample Analysis of Nonlinear Stochastic Approximation with Applications in
Reinforcement Learning

Zaiwei Chen, Sheng Zhang, Thinh T. Doan, John-Paul Clarke, Siva Theja Maguluri All authors are affiliated with Georgia Institute of Technology
Abstract

Motivated by applications in Reinforcement Learning (RL), in this paper, we study a nonlinear Stochastic Approximation (SA) algorithm under Markovian noise, and derive its finite-sample convergence bounds. Our proof is based on the Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain.

Our result is used to show the finite-sample bounds of the popular Q-learning with linear function approximation algorithm for solving the RL problem. Since Q-learning with linear function approximation may diverge in general, we study it under a condition on the behavior policy that ensures the stability of the algorithm. Due to the generality of our SA results, we do not need to make the unnatural assumption that the samples are i.i.d. (since they are Markovian), and do not require an additional projection step in the algorithm to maintain the boundedness of the iterates.

1 Introduction

Reinforcement Learning (RL) is a framework to solve sequential decision-making problems by interacting with the environment [30]. This approach has demonstrated tremendous successes for solving many practical problems in several different areas, such as robotics [15], power management [31], autonomous driving [26], and board games [27].

The RL problem can be modeled as a Markov Decision Process (MDP) and so reduces to solving the so-called Bellman’s equation [30, 21]. However, unlike in an MDP, since the environmental model is unknown in RL, such system of equations are usually solved using the Stochastic Approximation (SA) method, where the solution is iteratively estimated using noisy samples collected from the system [4, 23]. A typical example is the popular Q-learning algorithm proposed in [36], which can be viewed as a nonlinear SA algorithm with martingale difference noise (conditionally mean zero noise). However, when using Q-learning along with function approximation, the corresponding SA algorithm involves Markovian noise, and cannot be modeled by martingale difference. This motivates one to study the behavior of general nonlinear SA algorithms under Markovian sampling. This paper focuses on establishing finite-sample convergence guarantee for such SA algorithms, and apply it to the Q-learning with linear function approximation algorithm.

In general SA framework, usually the target equation can be written in the form 𝔼μ[F(X,θ)]=0, where X is a random variable with distribution μ (μ is unknown), and F is a general nonlinear mapping. The SA algorithm for solving such an equation is then given by: θk+1=θk+ϵkF(Xk,θk), where {Xk} is a sequence of samples of the random variable X, and ϵk is the step size. Besides their applications in RL, SA algorithms of this form are widely used in optimization. For example, when F(x,θ)=-cf(θ)+x for some cost function f(θ), the SA algorithm is known as the Stochastic Gradient Descent (SGD) method for minimizing f(θ) [20, 25]. The behavior of such an SA algorithm, critically depends upon the following: (a) The way we collect the sample sequence {Xk}, and (b) The properties of the function F(x,θ).

In existing literature, when the function F(x,θ) is a linear function of θ, finite-sample convergence bounds can be derived for both i.i.d. sampling (Xkμ being i.i.d.), and Markovian sampling ({Xk} being a Markov chain with stationary distribution μ)[5, 28]. When F is nonlinear, then finite-sample bounds are only derived in the case where F represents the stochastic gradient of some convex function f. Moreover, unlike i.i.d. sampling, when {Xk} is a Markov chain, an artificial projection (onto a large enough ball) is introduced in the algorithm to ensure that the iterates are bounded [10]. Such a projection step is not practical in RL because one needs to know the problem parameters to pick the projection set so that the solution θ* lies in it. In this paper, we provide finite-sample convergence guarantees for general nonlinear stochastic approximation with Markovian noise without needing such an impractical projection step.

Recall that our motivation for studying nonlinear SA with Markovian sampling is to analyze the Q-learning with linear function approximation algorithm for solving the RL problem. However, unlike tabular Q-learning, Q-learning with linear function approximation may in general diverge [1]. Therefore, besides the nonlinearity and the Markovian noise, another fundamental problem is to maintain the stability of the algorithm. In existing literature, it is usually done by placing a requirement on the behavior policy, i.e., the policy used to collect samples [19, 16]. We work with a similar condition in this paper, and provide detailed discussion about our proposed condition. We also empirically show how a well-known divergent counter-example of Q-learning with linear function approximation can be made to converge if our proposed condition is satisfied.

The main contribution of this paper are as follows.

  • We derive finite-sample convergence guarantee for nonlinear SA with Markovian noise and without needing an artificial projection step. Then we study its behavior under several common choices of step sizes. The idea is to apply a suitable Lyapunov function on the stochastic iterates, and show that in expectation it produces a negative drift. To handle the nonlinearity, we use the Lipchitz continuity of F to show that F(x,θ) can be bounded affinely by θ. To handle the Markovian noise, we exploit the geometric mixing of the underlying Markov chain to control the cumulative bias caused by the Markovian sampling.

  • We apply our results on SA to the Q-learning with linear function approximation algorithm in RL. Since it is known to diverge, we study it under a condition that involves the behavior policy, feature vectors and the discount factor. We show that when using constant step sizes, Q-learning achieves an exponential convergence rate to a ball centered at the solution of the so-called projected Bellman’s equation, and the size of the ball is proportional to the step size and the mixing time of the underlying Markov chain. When using diminishing step sizes with a proper decay rate, we show that the convergence rate is roughly O(ln(k)/k). Moreover, we provide detailed discussion for our proposed condition on the behavior policy.

  • We present an empirical case study of our theoretical convergent results using numerical experiments on a well-known divergent counter-example of Q-learning from [1]. In particular, we show that if our condition is satisfied, the algorithm converges, and the rates match the theoretical results we derived.

1.1 Related Work

Stochastic approximation methods, originally proposed in [23], have been widely used in reinforcement learning for solving the Bellman’s equation. Typical examples are the TD-learning algorithm for solving the policy evaluation problem [29], and the Q-learning algorithm for estimating the optimal state-action value function [36]. In the tabular case (i.e., without using function approximation), they both can be modeled as an SA algorithm involving a contraction operator and martingale difference noise, and their asymptotic convergence has been established in [33, 13, 7, 6]. As for the finite-sample analysis, [2, 8, 35] studies the mean square error bounds of synchronous Q-learning, and [11, 22, 18] provides concentration results for asynchronous Q-learning.

When using RL algorithms along with linear function approximation, the noise cannot be modeled as a martingale difference, and one should directly study Markovian noise. In the case of TD-learning with linear function approximation, which is a linear SA, asymptotic convergence was established in [34] using an Ordinary Differential Equation (ODE) approach [3]. As for the finite-sample convergence bounds, it was derived in [5] for both i.i.d. sampling and Markovian sampling. However, in the latter case, an additional projection step (onto a ball) is needed for the proof. Later in [28], this artificial projection step was successfully removed. When using Q-learning along with linear function approximation, it was shown in [1] that Q-learning does not necessarily converge. Its asymptotic convergence was proved in [19, 16] under an additional condition on the behavior policy. Under a similar condition, we derive its finite-sample bounds by exploiting the fast mixing of finite state Markov chains, which was a technique developed in [28] to study the convergence rate of linear SA with Markovian noise. However, note that Q-learning is a nonlinear SA even under linear function approximation. To establish the results, some natural properties of Q-learning and our stability condition on the behavior policy play important roles in our analysis.

2 Nonlinear Stochastic Approximation with Markovian Noise

2.1 Problem Setting

We are interested in the problem of solving for θ*d in the equation

F¯(θ):=𝔼μ[F(X,θ)]=0, (1)

where X𝒳 is a random vector with distribution μ, and 𝒳 is a finite set. The function F:𝒳×dd is a general nonlinear mapping. When the distribution μ of 𝒳 is unknown, Eq. (1) cannot be solved analytically. Therefore, we consider solving the equation by appealing to the stochastic approximation method proposed in [23]. Specifically, suppose that we can collect a sequence of samples {Xk} of the random vector X. Then, with initialization θ0, the estimate θk of θ* is iteratively updated by

θk+1=θk+ϵkF(Xk,θk). (2)

Here {ϵk} is the step size sequence which we can choose.

To analyze the behavior of Algorithm (2), one popular approach is to compare the iterates {θk} with the trajectory of the following ODE:

θ˙(t)=F¯(θ(t)). (3)

Under some mild conditions on the sample sequence {Xk} and the step sizes {ϵk}, it was shown in [7, 6] that θk converges to θ* almost surely as long as the ODE (3) is globally asymptotically stable (GAS), with θ* being its unique equilibrium point. Such a stability property of the ODE is usually analyzed by constructing a Lyapunov function V(θ), and studying its time derivative along the trajectory of the ODE.

Our goal is to obtain finite-sample bounds for the stochastic iterative algorithm (2). A natural approach is to use the ODE Lyapunov function to study the iterates of Algorithm (2). However, since Algorithm (2) is a discrete and stochastic counterpart of ODE (3), a major challenge is to handle the error caused by the discretization and the noise {Xk}.

2.2 Assumptions

Recall that to study the stochastic approximation algorithm (2), the properties of the function F(x,θ) and the nature of the noise sequence {Xk} are important. We next present our main assumptions regarding these two aspects. Let be the 2-norm for vectors and induced 2-norm for matrices.

Assumption 2.1.

The function F(x,θ) is Lipschitz continuous with respect to θ uniformly in x, i.e., there exists L>0 such that F(x,θ1)-F(x,θ2)Lθ1-θ2 for all θ1, θ2, and x. Moreover, we assume that LmaxxXF(x,0).

In the special case where F(x,θ) is a linear function of θ as considered in [5, 28], i.e., F(x,θ)=A(x)θ+b(x), since the state-space 𝒳 is finite, Assumption 2.1 is automatically satisfied. Note that Assumption 2.1 also implies the Lipschitz continuity of the function F¯, which is sufficient to study the nonlinear dynamical system (3), e.g., it guarantees that ODE (3) has a unique solution.

Although F(x,θ) is in general a nonlinear function of θ, Assumption 2.1 restricts that the growth rate of both F(x,θ) and F¯(θ) can at most be affine in terms of θ. To see this, let θ1=θ and θ2=0 in Assumption 2.1. Then we have by triangle inequality and our assumption Lmaxx𝒳F(x,0) that

F(x,θ)Lθ+F(x,0)L(θ+1). (4)

Moreover, using Jensen’s inequality and we further obtain

F¯(θ)𝔼μ[F(X,θ)]L(θ+1). (5)

These properties for F(x,θ) and F¯(θ) essentially let us prove a finite-sample convergence bound akin to the case when F(x,θ) is a linear function of θ.

Assumption 2.2.

The equation F¯(θ)=0 has a unique solution, denoted by θ*, and there exists α>0 such that (θ-θ*)F¯(θ)-αθ-θ*2 for all θRd.

Assumption 2.2 is satisfied when -F is a strongly monotone function [25]. Also, it can be viewed as an exponential dissipativeness property of the ODE (3) with a quadratic storage function [12]. In fact, this assumption guarantees that θ* is the unique globally exponentially stable (GES) equilibrium point of ODE (3). To see this, let V(θ)=θ-θ*2 be a candidate Lyapunov function. Then we have by Assumption 2.2 that

ddtV(θ(t))=2(θ(t)-θ*)θ˙(t)-2αV(θ(t)), (6)

which implies V(θ(t))V(θ(0))e-2αt for all t0. The quantity α is called the negative drift, and we see that the larger α is, the faster θ(t) converges.

Assumption 2.3.

The sequence {Xk} is an irreducible and aperiodic Markov chain on state space X, with unique stationary distribution μ.

Assumption 2.3 is often made to study the asymptotic convergence of stochastic approximation under Markovian noise when the state space is finite; see for examples [6, 34, 9]. When compared to the case where {Xk} is i.i.d., the main difference for {Xk} being Markovian is that there is a bias in the update of Algorithm (2), i.e., 𝔼[F(Xk,θ)X0=x]F¯(θ). As will be seen later, Assumption 2.3 enables us to control the bias and show that it is not strong enough to alter the desired direction of the update.

In addition to these assumptions, the choice of the step size sequence {ϵk} is important. In order to state certain conditions on the step sizes we pick, we need to use the mixing time of the Markov chain {Xk}. We begin with the definition of the total variation distance between probability distributions, which is used to introduce the mixing time.

Definition 2.1.

The total variation distance between two probability distributions ν1 and ν2 on X is defined by ν1-ν2𝑇𝑉=12xX|ν1(x)-ν2(x)|.

Denote Pk(x,) as the distribution of Xk with initial state x, and let

dmax(k):=maxx𝒳Pk(x,)-μTV

be the maximal distance between Pk(x,) and the stationary distribution μ. Then the mixing time of the Markov chain {Xk} with precision δ is defined by tmix(δ)=min{k:dmax(k)δ}. Intuitively, smaller mixing time means faster convergence of the Markov chain {Xk} to its stationary distribution. Consider the extreme case where {Xk} is i.i.d., we have tmix(δ)=1 for any δ>0.

Let tδ=tmix(δ2L), where L is the Lipschitz constant given in Assumption 2.1. The mixing time tδ is defined this way so that it accounts for both the Markov chain {Xk} and the function F(x,θ) (through its Lipschitz constant). We will later use tδ to state our requirement on the step size sequence {ϵk}. Before that, we need the following property regarding tδ, whose proof is presented in Appendix A.1.

Claim 2.1.

There exists a constant L1>0 that depends only on the Markov chain {Xk} and the Lipschitz constant L such that tδL1(ln(1/δ)+1).

Claim 2.1 states that the mixing time tδ grows at most affinely in terms of ln(1/δ), which further implies limδ0δtδ=0. Since we will always choose δ=ϵk, where ϵk is the step size, for simplicity, we will write tk for tϵk in the following. Now we state our condition on the step sizes {ϵk}. For simplicity of notations, denote ϵi,j=k=ijϵk.

Condition 2.1.

The step size sequence {ϵk} satisfies the following conditions.

  1. (1)

    The step size sequence {ϵk} is non-increasing and ϵ0(0,1).

  2. (2)

    There exists a minimal integer K>0 such that ϵ0,K-114L and ϵk-tk,k-1<α114L2 for all kK.

Condition 2.1 (1) is standard [4]. Condition 2.1 (2) is mainly used to control the discretization error and the stochastic error in Algorithm (2). In fact, since limϵ0ϵtϵ=0 under Claim 2.1, we can show that Condition 2.1 (2) is satisfied when using either small enough constant step sizes or diminishing step sizes of the form ϵk=ϵ(k+h)ξ for all ϵ>0, and ξ(0,1], provided that h is chosen properly.

2.3 Finite-Sample Convergence Bounds for Nonlinear SA

In this subsection, we state our main results. We begin with the finite-sample convergence bounds of Algorithm (2), whose proof is presented in Subsection 2.4.

Theorem 2.1.

Consider {θk} of Algorithm (2). Suppose that Assumptions 2.12.3 are satisfied, and {ϵk} satisfies Condition 2.1. Then we have for all kK:

𝔼[θk-θ*2]β1j=Kk-1(1-αϵj)+β2i=Kk-1ϵ^ij=i+1k-1(1-αϵj), (7)

where β1=(θ0+θ0-θ*+1)2, β2=114L2(θ*+1)2, and ϵ^i=ϵiϵi-ti,i-1.

On the right-hand side of Eq. (7), the first term represents the bias due to the initial guess θ0, and the second term captures the variance due to the noise. We will see later that the Markovian nature of the noise mainly impacts the second term.

Theorem 2.1 is one of our main contribution in that (1) The function F(x,θ) is allowed to be nonlinear, (2) It holds when {Xk} is sampled in a Markovian manner, and (3) No modifications on Algorithm (2) (e.g., adding a projection step) is needed to establish the result.

After establishing the finite-sample bound of Algorithm (2) in its general form, we next consider several common choices of step sizes, and see more explicitly what are the corresponding convergence rates. We begin by presenting the result for using constant step sizes, whose proof can be found in Appendix A.2.

Corollary 2.1.

Let ϵk=ϵ for all k0, where ϵ is chosen such that ϵtϵα114L2. Then we have for all ktϵ:

𝔼[θk-θ*2]β1(1-αϵ)k-tϵ+β2ϵtϵ/α.

We see from Corollary 2.1 that when using constant step sizes, the finite-sample bound is composed by two terms. The first term converges to zero exponentially fast as the number of iterations tends to infinity, while the second term is a constant which is independent of k. Geometrically, this means that the iterates of Algorithm (2) achieves an exponential convergence rate in expectation to a ball containing θ*, with radius proportional to ϵtϵ. Intuitively, constant step sizes efficiently eliminate the bias. However, since the noise is constantly added to the iterates without being progressively suppressed, the variance does not converge to zero.

We next consider using diminishing step sizes in Algorithm (2). Let ϵk=ϵ(k+h)ξ, where ϵ>0, ξ(0,1], and h is chosen properly so that Condition 2.1 is satisfied for some K>0. The verification of Condition 2.1 and the proof of the following result can be found in Appendix A.3.

Corollary 2.2.

Suppose {ϵk} is chosen as above, then we have the following finite-sample bounds.

  1. (1)

    When ξ=1, we have for all kK:

    𝔼[θk-θ*2]{β1(K+hk+h)αϵ+8β2ϵ2L11-αϵ[ln(k+hϵ)+1](k+h)αϵ,αϵ(0,1),β1(K+hk+h)+8β2ϵ2L1ln(k+hK+h)[ln(k+hϵ)+1]k+h,αϵ=1,β1(K+hk+h)αϵ+8eβ2ϵ2L1αϵ-1[ln(k+hϵ)+1]k+h,αϵ(1,).
  2. (2)

    When ξ(0,1) and ϵ>0, suppose in addition that K+h[2ξ/(αϵ)]1/(1-ξ), then we have for all kK:

    𝔼[θk-θ*2]β1exp[-αϵ1-ξ((k+h)1-ξ-(K+h)1-ξ)]+4β2ϵ2L1αϵ[ln(k+hϵ)+1](k+h)ξ.

From Corollary 2.2 (1), we see that when using ϵk+h step size, the constant ϵ must be chosen very carefully (i.e., ϵ>1/α) to achieve the optimal O(ln(k)/k) convergence rate, otherwise the convergence rate is roughly O(ln(k)/kαϵ), which can be arbitrarily slow. In the case when ξ(0,1), the convergence rate is O(ln(k)/kξ), and it is independent of ϵ.

The above analysis indicates that our choice of step size should depend on how precise our estimate of the negative drift parameter α is. When our estimate to α is accurate, we should use ϵk=ϵk+h with ϵ>1/α so that the convergence rate is the optimal O(ln(k)/k). When our understanding to the system model is poor (therefore inaccurate estimate of α), we should use ϵk=ϵ(k+h)ξ with ξ chosen close to unity. In that case, we sacrifice the convergence rate for robustness.

Unlike almost sure convergence, where the usual requirement for step sizes are k=0ϵk= and k=0ϵk2< (which corresponds to ξ(1/2,1] in our case), we have convergence in mean square error for all ξ(0,1]. Same phenomenon has been observed in [5, 32], where they study linear SA and nonlinear SA with martingale difference noise.

2.4 Proof of Theorem 2.1

This subsection is dedicated to the proof of Theorem 2.1. Before going into details, we first provide some intuition.

Recall that the Lyapunov function V(θ)=θ-θ*2 can be used to show that θ* is the unique GES equilibrium point of ODE (3). To analyze the convergence rate of the iterates {θk} generated by Algorithm (2), naturally we want to use the Lyapunov function V() on {θk} to show something like

𝔼[V(θk+1)]-𝔼[V(θk)](-αϵk+e1)𝔼[V(θk)]+e2. (8)

Here on the right-hand side of Eq. (8), the -αϵk term corresponds to the negative drift of the ODE, which is provided by Assumption 2.2, and the two terms e1 and e2 account for the discretization error and the stochastic error in Algorithm (2). The discretization error can be handled using the properties of the function F(x,θ) (Assumption 2.1) and properly chosen step sizes (Condition 2.1). As for the stochastic error, since Markovian noise natually produces bias in the update, we need to show that 𝔼[F(Xk,θ)X0=x] converges to F¯(θ) fast enough for any θ, which implies that the overall cumulative bias is small. This result is provided by Assumption 2.3 and Claim 2.1. Once we show that both error terms are dominated by the drift term, i.e., e1=o(ϵk) and e2=o(ϵk), Eq. (8) can be recursively used to establish a finite-sample error bound of Algorithm (2).

Following the high level idea we stated above, we now prove Theorem 2.1. The proofs of all the technical lemmas we used can be found in the Appendices A.4A.8.

To begin with, we apply V(θ)=θ-θ*2 on the iterates θk of Algorithm (2). Moreover, to utilize the mixing time of the Markov chain, we take expectation conditioning on Xk-tk and θk-tk. For simplicity of notations, we use 𝔼k[] for 𝔼[Xk-tk,θk-tk] in the following. Using the update rule (2), we have for all ktk:

𝔼k[θk+1-θ*2]-𝔼k[θk-θ*2]
= 2𝔼k[θk-θ*,θk+1-θk]+𝔼k[θk+1-θk2]
= 2ϵk𝔼k[(θk-θ*)F¯(θk)](a)+2ϵk𝔼k[(θk-θ*)(F(Xk,θk)-F¯(θk))](b)
+ϵk2𝔼k[F(Xk,θk)2](c). (9)

The term (a) corresponds to the drift of the ODE (3), and we have by Assumption 2.2 that

(a)-2αϵk𝔼k[θk-θ*2].

The terms (b) and (c) correspond to the stochastic error and the discretization error respectively. What remains to show is that they are dominated by the term (a). We begin by bounding the term (c) in the following lemma.

Lemma 2.1.

(c)2L2ϵk2[𝔼k[θk-θ*2]+(θ*+1)2] for all ktk.

Observe that Lemma 2.1 implies that (c)=O(ϵk2)=o(ϵk), which is what we desire. We next turn to the term (b), to control it, we need the following two results.

Lemma 2.2.

Given precision δ>0, the following inequality holds for all xX, θRd, and ktδ: E[F(Xk,θ)X0=x]-F¯(θ)δ(θ+1).

Lemma 2.3.

For any k1<k2 satisfying ϵk1,k2-114L, the following two inequalities hold:

θk2-θk1 2Lϵk1,k2-1(θk1+1),
θk2-θk1 4Lϵk1,k2-1(θk2+1).

Lemma 2.2 utilizes the mixing time tδ=tmix(δ2L) to show that the bias in the update of Algorithm (2) is exponentially small after the Markov chain is sufficiently mixed. Lemma 2.3 controls the difference between θk1 and θk2 when |k2-k1| is not too large. With the help of Lemmas 2.2 and 2.3, we are now ready to bound the term (b) in the following lemma.

Lemma 2.4.

The following inequality holds for all k such that ϵk-tk,k-114L:

(b)112L2ϵkϵk-tk,k-1[𝔼k[θk-θ*2]+(θ*+1)2].

Substituting the upper bounds we obtained for the terms (a), (b), and (c) into Eq. (9), we obtain the following recursive relation between θk+1 and θk.

Lemma 2.5.

The following inequality holds for all k such that ϵk-tk,k-114L:

𝔼[θk+1-θ*2](1-2αϵk+114L2ϵkϵk-tk,k-1)𝔼[θk-θ*2]+114L2ϵkϵk-tk,k-1(θ*+1)2. (10)

Eq. (10) is of the desired recursive form (8). Therefore, as long as the drift term dominates the error terms, i.e., 2αϵk>114L2ϵkϵk-tk,k-1, we can use Eq. (10) to derive a finite-sample error bound of Algorithm (2).

Proof of Theorem 2.1.

When kK (see Condition 2.1 for the definition of K), we have by Eq. (10) that 𝔼[θk+1-θ*2](1-αϵk)𝔼[θk-θ*2]+β2ϵ^k, where ϵ^k and β2 are defined in Theorem 2.1. Recursively using the preceding inequality starting from K, we obtain

𝔼[θk-θ*2]𝔼[θK-θ*2]j=Kk-1(1-αϵj)+β2i=Kk-1ϵ^ij=i+1k-1(1-αϵj). (11)

To bound 𝔼[θK-θ*2], using Lemma 2.3, and our assumption that ϵ0,K-114L, we have 𝔼[θK-θ*2]𝔼[(θK-θ0+θ*-θ0)2]β1, which gives the desired finite-sample bound (7). ∎

In summary, we have stated and proved a finite-sample convergence bound for stochastic approximation algorithm (2). Moreover, we analyzed the convergence rates for using both constant and diminishing step sizes. In the following section, we will use these results to establish the convergence rates of the Q-learning with linear function approximation algorithm for solving the Reinforcement Learning problem.

3 Applications in Reinforcement Learning

3.1 Q-Learning with Linear Function Approximation

We begin by introducing the Markov Decision Processes and the Reinforcement Learning problem.

Definition 3.1 (Markov Decision Processes).

An infinite horizon discounted Markov decision process M is comprised by a tuple (S,A,P,R,γ), where S is a finite set of states, A is a finite set of actions, Pa(s,s) stands for the probability of transition from state s to state s under action a, R(s,a) is the reward when taking action a at state s, and γ(0,1) is the discount factor.

We assume that the reward is non-negative and is uniformly bounded such that (s,a)[0,rmax] for all state-action pairs (s,a). The underlying model of the Reinforcement Learning problem is essentially an MDP except that the transition probabilities and the reward function are unknown to the agent.

The goal of the agent is to find a strategy (called policy) of selecting actions based on the state of the environment such that the expected long-term discounted reward is maximized. More precisely, define the state-action value function (also known as the Q-function) of a policy π at state-action pairs (s,a) by

Qπ(s,a)=𝔼π[k=0γk(sk,ak)|s0=s,a0=a],(s,a)𝒮×𝒜,

where {ak}k1 is executed according to policy π. In words, Qπ(s,a) is the expected total reward the agent would receive if it starts at state s, first taking action a, and then following the policy π thereafter. We want to find an optimal policy π* such that its corresponding Q-function, denote by Q*, is such that Q*(s,a)Qπ(s,a) for all state-action pairs (s,a) and all policy π. A fundamental property of the function Q* is that, if one just select actions greedy based on Q*, then that is an optimal policy, i,e, π*(s)argmaxa𝒜Q*(s,a) for all state s. Therefore, solving the RL problem reduces to finding the optimal Q-function.

The Q-learning algorithm proposed in [36] is a popular approach for estimating the optimal Q-function. However, it is well-known that a major challenge in the Q-learning algorithm is the curse of dimensionality. That is, Q-learning becomes intractable when the number of state-action pairs is very large. Therefore, it was suggested that we approximate the optimal Q-function within a chosen function space which has much smaller dimension than |𝒮||𝒜| [4]. We next describe the approximation architecture.

Let {ϕi:𝒮×𝒜1id} be a set of basis functions (called features). Define the feature matrix Φ|𝒮||𝒜|×d by Φ=[ϕ1,ϕ2,,ϕd]. Then the linear subspace 𝒲 spanned by the features {ϕi} can be written as 𝒲={Q~θ=Φθθd}. We will use 𝒲 as our approximating function space, and the goal here is to find θ* such that Q~θ* best represents Q*.

Using the notations above, we now present the Q-learning algorithm under linear function approximation [4, 19]. Let {(sk,ak)} be a trajectory generated by applying some policy π to the system model, where π is called the behavior policy. Then, the parameter θ of the approximation Q~θ is iteratively updated by:

θk+1=θk+ϵkϕ(sk,ak)((sk,ak)+γmaxa𝒜ϕ(sk+1,a)θk-ϕ(sk,ak)θk), (12)

where {ϵk} is a sequence of step sizes. Algorithm (12) can be viewed as a stochastic approximation algorithm for solving the equation:

𝔼μ[ϕ(s,a)((s,a)+γmaxa𝒜ϕ(s,a)θ-ϕ(s,a)θ)]=0, (13)

where μ is the stationary distribution of the Markov chain {sk} under policy π. Under some mild conditions, Eq. (13) is equivalent to a so-called projected Bellman’s equation (see [19] for more details). In the special case where the feature matrix Φ is a identity matrix with dimension |𝒮||𝒜|, Algorithm (12) reduces to the tabular Q-learning algorithm [36], and Eq. (13) becomes the regular Bellman’s equation [4].

In general, Eq. (13) may not necessarily admit a solution [9], and the iteration in (12) may diverge [1]. However, it was shown in [19, 16] that under an assumption on the behavior policy π, θk converges to the solution of Eq. (13), denoted by θ*, almost surely. In this paper, we focus on establishing the finite-sample convergence guarantee of Algorithm (12). We begin by stating our main assumptions.

Assumption 3.1.

The features {ϕi}1id are linearly independent and are normalized so that ϕ(s,a)1 for all state-action pairs (s,a).

Assumption 3.1 is indeed without loss of generality since we can disregard dependent features. Moreover, since the state and action spaces are finite, we can always make sure max(s,a)ϕ(s,a)1 by proper scaling.

Assumption 3.2.

The Markov chain {sk}k0 induced by the behavior policy π is irreducible and aperiodic, with unique stationary distribution μ.

Assumption 3.2 essentially requires that the behavior policy π has enough exploration, which is necessary in studying value-based RL algorithms.

Assumption 3.3.

Eq. (13) has a unique solution, denoted by θ*, and there exists κ>0 such that the following inequality holds for all θRd:

γ2𝔼μ[maxa𝒜(ϕ(s,a)θ)2]-𝔼μ[(ϕ(s,a)θ)2]-κθ2. (14)

We make Assumption 3.3 and especially Eq. (14) on the behavior policy π to ensure the stability of Algorithm (12), which is in the same spirit as the conditions proposed in [19, 16]. A detailed discussion about this assumption is presented in Subsection 3.3.

3.2 Finite-Sample Convergence Bounds for Q-Learning with Linear Function Approximation

To begin with, we rewrite Algorithm (12) in the form of stochastic iterative algorithm (2). Let Xk=(sk,ak,sk+1) for all k0. It is clear that {Xk} is a Markov chain with finite state space 𝒳={(s,a,s)s𝒮,π(a|s)>0,Pa(s,s)>0}. Define

F(x,θ)=F(s,a,s,θ):=ϕ(s,a)((s,a)+γmaxa𝒜ϕ(s,a)θ-ϕ(s,a)θ). (15)

With F(x,θ) being defined above, Algorithm (12) can be written as:

θk+1=θk+ϵkF(Xk,θk). (16)

Let F¯(θ)=𝔼μ[F(X,θ)], where μ is the stationary distribution of the Markov chain {sk} under policy π. Wee see that F¯(θ)=0 is exactly the target equation (13) of Algorithm (12).

To apply Theorem 2.1, we need to show that Assumptions 2.1-2.3 are naturally satisfied in the context of Q-learning under Assumptions 3.1-3.3, which is provided by the following proposition. See Appendix B.1 for its proof.

Proposition 3.1.

Consider Q-learning algorithm (12). Suppose that Assumptions 3.1-3.3 are satisfied, then we have the following results.

  1. (1)

    The Markov chain {Xk} with state space 𝒳 is irreducible and aperiodic.

  2. (2)

    The function F(x,θ) defined in (15) is Lipschitz continuous with respect to θ uniformly in x, and M=1+γ+rmax is a valid Lipschitz constant. Moreover, we have Mmaxx𝒳F(x,0).

  3. (3)

    The equation F¯(θ)=0 has a unique solution θ*, and we have (θ-θ*)F¯(θ)-κ2θ-θ*2 for all θd, where κ is given in Assumption 3.3.

Similarly as in Section 2, given precision δ>0, we define the mixing time tδ=tmix(δ2M), where tmix() is for the Markov chain {Xk}, and M is the Lipschitz constant given in Proposition 3.1. Moreover, there exists a constant M1 that depends only on the Markov chain {Xk} and the Lipschitz constant M such that tδM1(ln(1/δ)+1) for any δ>0.

We next use Theorem 2.1 to derive finite-sample error bounds of Algorithm (12) for different choice of step sizes. We begin with the result for using constant step sizes, which directly follows from Proposition 3.1 and Corollary 2.1. Let η1=(θ0+θ0-θ*+1)2, and η2=114M2(θ*+1)2.

Theorem 3.1.

Consider {θk} of Algorithm (12). Suppose that Assumptions 3.1-3.3 are satisfied, and ϵkϵ with ϵ chosen such that ϵtϵκ228M2. Then we have for all ktϵ:

𝔼[θk-θ*2]η1(1-κϵ/2)k-tϵ+2η2ϵtϵ/κ.

Theorem 3.1 is qualitatively similar to Corollary 2.1 in that the iterates of Q-learning converges exponentially fast to a ball centered at θ*, and the size of the ball is proportional to ϵtϵ. Consider the special case where {(sk,ak)} is sampled in an i.i.d. manner. Since tϵ=1 for any ϵ>0, the size of the ball is only proportional to the step size ϵ. Thus, we see that the Markovian nature of the noise produces an additional mixing time factor. This agrees with results in [28, 5], where the popular TD-learning with linear function approximation algorithm was studied.

We next present the result for using diminishing step sizes. For simplicity, we only present the case of the optimal convergence rate in Corollary 2.2. Let ϵk=ϵk+h with ϵ>2/κ, where h is chosen properly so that there exists a minimal K>0 such that ϵ0,K-114M and ϵk-tk,k-1<κ228M2 for all kK. The following result directly follows from Proposition 3.1 and Corollary 2.2.

Theorem 3.2.

Consider {θk} of Algorithm (12). Suppose that Assumptions 3.1-3.3 are satisfied, and ϵk is chosen as above. Then we have for all kK:

𝔼[θk-θ*2]η1(K+hk+h)κϵ2+16eη2ϵ2M1κϵ-2[ln(k+hϵ)+1]k+h.

Similarly as in Corollary 2.2, Theorem 3.2 suggests that the convergence rate is roughly O(ln(k)/k) with properly chosen diminishing step sizes, and the additional ln(k) factor is a consequence of the Markovian noise.

3.3 Discussion about Assumption 3.3 on the Behavior Policy

In this subsection, we take a closer look at Assumption 3.3 and Eq. (14), which are essential for the stability of the Q-learning with linear function approximation algorithm. First note that Eq. (14) is equivalent to

γ2𝔼μ[maxa𝒜(ϕ(s,a)θ)2]<𝔼μ[(ϕ(s,a)θ)2],θ0. (17)

The direction Eq. (14) implying Eq. (17) is trivial. As for the other direction, let

κ=-maxθ:θ=1{γ2𝔼μ[maxa𝒜(ϕ(s,a)θ)2]-𝔼μ[(ϕ(s,a)θ)2]}.

By Weierstrass extreme value theorem [24], κ is well-defined and strictly positive, which immediately gives Eq. (14).

We next analyze condition (17). We first look at the terms inside the expectation on both sides of the inequality. On the right-hand side, the quantity (ϕ(s,a)θ)2 can be interpreted as Q~θ2(s,a1) with the action a1 chosen according to the behavior policy π. On the left-hand side, the term maxa𝒜(ϕ(s,a)θ)2 is essentially Q~θ2(s,a2) where a2 is chosen greedily with respect to Q~θ2. Therefore, it is clear that

𝔼μ[maxa𝒜(ϕ(s,a)θ)2]𝔼μ[(ϕ(s,a)θ)2],θd.

To meet condition (17), besides the presence of γ2, the behavior policy π should be somehow close to the policy induced greedily by Q~θ2 for any non-zero θ, which is a relatively restrictive requirement.

Similar assumptions on the behavior policy were also proposed in [19, 16]. Although the exact form of the conditions are different, they all follow the same spirit. That is, with a chosen Lyapunov function, the condition should enable us to show that the corresponding ODE

θ˙(t)=F¯(θ(t)) (18)

of the Q-learning algorithm (12) is GAS. We next briefly compare our condition to those proposed in [19] and [16]. The condition in [19] (their Eq. (7)) implies

2γ2𝔼μ[(maxa𝒜ϕ(s,a)θ)2]<𝔼μ[(ϕ(s,a)θ)2] (19)

for all nonzero θ 11 1 The factor of 2 appears to be missing in [19]..The right-hand side is the same for both Eqs. (19) and (17). On the left-hand side, Eq. (19) has an additional factor of 2, and the square is outside the max operator. Although they are similar, our condition and the condition proposed in [19] do not imply each other. As for the condition proposed in [16], while it is not clear if it is less restrictive than ours, it is shown that the condition in [16] implies the condition in [19] under more restrictive assumptions. However, [16] assumes i.i.d. sampling, and studies only the asymptotic convergence rather than finite-sample error bounds.

We next analyze how the features {ϕi} and the behavior policy π impact condition (17) in the following two examples.

Example 3.1.

Consider the case when d=1. That is, there is only one feature ϕ1, and the weight θ is a scalar on the real line. Condition (17) reduces to

γ2𝔼μ[maxa𝒜ϕ(s,a)2]<𝔼μ[ϕ(s,a)2]. (20)

Let rπ=Eμ[ϕ(s,a)R(s,a)],

h+=𝔼μ[γϕ(s,a)maxa𝒜ϕ(s,a)-ϕ(s,a)2],𝑎𝑛𝑑h-=𝔼μ[γϕ(s,a)mina𝒜ϕ(s,a)-ϕ(s,a)2].

Then we have the following result, whose proof can be found in Appendix B.2.

Proposition 3.2.

Eq. (20) implies h+<0 and h-<0, and the following statements regarding the relation between the stability of ODE (18) and the sign of h+ and h- hold:

ODE (18) is GAS if and only if {h+<0,h-<0,when rπ=0,h+<0,h-0,when rπ>0,h+0,h-<0,when rπ<0.

Proposition 3.2 implies that the condition (20) is “almost necessary" for the GAS of ODE (18). Moreover, it is clear from Eq. (20) that when d=1, there always exists a behavior policy π such that Eq. (20) is satisfied, e.g. π(s)argmaxaAϕ(s,a)2 is a feasible behavior policy.

Example 3.2.

We next consider another extreme case where we have d=|S||A|, i.e., there is no dimension reduction at all. We show in the following proposition that, in the full-dimension case, condition (17) is feasible in terms of the behavior π only when the discount factor γ is sufficiently small. See Appendix B.3 for its proof.

Proposition 3.3.

When γ21/m, condition (17) is infeasible for any behavior policy π.

We now compare the results for the two extreme cases, i.e., d=1 and d=|𝒮||𝒜|. We see that in the uni-dimensional case, Eq. (17) implies a condition which is almost sufficient and necessary for the GAS of the equilibrium θ* to the ODE (18). Moreover, there always exists a behavior policy π satisfying (17). However, in the full-dimensional case, condition (17) is infeasible in terms of the behavior policy π when γ21/m, which is usually the case in practice where γ is chosen close to unity to ensure enough exploration.

In the next section, we present numerical experiments to demonstrate the sufficiency of condition (17), as well as the resulting convergence rate of the Q-learning with linear function approximation algorithm.

3.4 Numerical Experiments

To present the numerical experiments, we need to rewrite condition (17) in a different way. Defining

δ(π)=min{θ:θ=1}𝔼μ[(ϕ(s,a)θ)2]𝔼μ[maxa𝒜(ϕ(s,a)θ)2], (21)

we see that condition (17) is equivalent to δ(π)>γ2. One way to compute δ(π) is presented in Appendix B.4.

In our simulation, we consider the divergent example of Q-learning with linear function approximation introduced in [1], where there are 7 states and 2 actions, and the number of features is 14. See [1] or Appendix B.5 for more details about this example. Note that since |𝒮||𝒜|=d in this example, it belongs to the full-dimensional case discussed in Subsection 3.3, and is essentially a change of basis. Surprisingly, even in this case, the Q-learning algorithm (12) can diverge.

To demonstrate the effectiveness of condition (17) for the stability of Q-learning, in our first set of simulations, the reward function is set to zero. Since the reward function is identically zero, Q* is zero, implying θ* is zero. Because of this structure, it is possible for the Q-learning algorithm to converge even when constant step size is used. We choose the behavior policy π which takes each action with equal probability. It turns out that δ(π)0.5, giving the threshold for γ to satisfy Eq. (17) being δ(π)1/20.7. In the numerical experiments, we choose constant step size ϵ=0.01, discount factor γ{0.7,0.9,0.97}, and plot θk as a function of the number of iterations k in Figure 2. Here, θk converges when γ=0.7,0.9, but diverges when γ=0.97. This demonstrates that condition (17) is sufficient but not necessary for convergence. This also shows that by ensuring (17), the counter example from [1] can be made to converge.

To show the exponential convergence rate for using constant step size, we consider the convergence of θk when γ=0.7 given in Figure 2, where we plot ln𝔼[θk2] as a function on the number of iterations k. In this case, θk seems to converge exponentially to 0, which agrees with Corollary 3.1.

Figure 1: Convergence of Q-learning with linear function approximation for different discount factor γ
Figure 2: Exponentially fast convergence of Q-learning with linear function approximation for γ=0.7
Figure 3: Convergence rate for diminishing step sizes
Figure 4: Convergence rate for diminishing step sizes: the asymptotic behavior

In our second set of experiments, we illustrate the convergence rate of Q-learning with linear function approximation for using diminishing step sizes ϵk=ϵ/kξ. We use the same MDP model, but the reward is sampled independently from a uniform distribution on (0,1) for all state-action pairs. The behavior policy is still to take each action with equal probability. The constant κ given in Eq. (14) is estimated by numerical optimization, and the discount factor γ is set to be 0.7. In figure 4, we plot the quantity 𝔼[θk-θ*2] as a function of k for ξ{0.4,0.6,0.8,1}. In the case where ξ=1, the constant coefficient ϵ is chosen such that κϵ2 in order to achieve the optimal convergence rate. We see that the iterates converge for all ξ(0,1]. Moreover, the larger the value of ξ is, the faster θk converges.

To further demonstrate the convergence rate, we plot the quantity ln𝔼[θk-θ*2] as a function of lnk in Figure 4 and look at its asymptotic behavior. We see that the slope is approximately -ξ, which agrees with Corollary 3.2.

4 Conclusion

In this paper we establish finite-sample convergence guarantee for a general nonlinear stochastic approximation algorithm with Markovian noise. The result is used to derive the finite-sample error bounds of the Q-learning with linear function approximation algorithm for solving the RL problem. Since such an algorithm is known to diverge in general, we study it under a condition on the behavior policy that ensures stability. Sufficiency of this condition on the behavior policy, as well as the rate of convergence of Q-learning is also verified numerically in the context of a well-known counter example.

Appendix A Proof of All Technical Results in Section 2

A.1 Proof of Claim 2.1

Since the finite state Markov chain {Xk} is irreducible and aperiodic, we have by Theorem 4.9 in [17] that there exists C>0 and ρ(0,1) such that dmax(k)Cρk for all k0. Note that the inequality Cρkδ2L holds when kln(1/δ)+ln(2CL)ln(1/ρ). Therefore, we have

tδ=tmix(δ2L)ln(1/δ)+ln(2CL)ln(1/ρ)L1(ln(1/δ)+1),

where L1=max(1,ln(2CL))/ln(1/ρ).

A.2 Proof of Corollary 3.1

Let K=tϵ. Since ϵ is chosen such that ϵtϵα114L2, it is easy to verify that Condition 2.1 is satisfied. Moreover, we have

  1. (1)

    i=tϵk-1(1-αϵi)=(1-αϵ)k-tϵ

  2. (2)

    i=tϵk-1ϵ^ij=i+1k-1(1-αϵj)ϵ2tϵi=0(1-αϵ)i=ϵtϵα.

The result then follows from the above two observations.

A.3 Proof of Corollary 3.2

We first show that when ϵk=ϵ(k+h)ξ (ϵ>0, ξ(0,1]) with properly chosen h, there exists K>0 such that Condition 2.1 is satisfied.

It is enough to verify Condition 2.1 (2). Note that ϵ0,K-114L and ϵk-tk,k-1α114L2 for all kK are satisfied when KtK and ϵ0,K-1α114L2. Using Claim 2.1 and ϵkϵhξ for all k0, we have

KtK KL1[ln(K+h)ξϵ+1] (22)
ϵ0,K-1α114L2 ϵKhξα114L2. (23)

Due to the logarithmic function in Eq. (22), it is easy to reason the existence of h and K satisfying Eqs. (22) and (23). In particular, denote N=max(0,L1(ln(1+α114L2)+1-lnϵ)), Eqs. (22) and (23) are satisfied when

h(228ϵL2(L1+N)ln(L1+N)α)1/ξ,

and K=αhξ114ϵL2.

We next plug in the expression of the step sizes into the finite-sample bound (7). We first simplify ϵ^k in the following. Using Claim 2.1 and we have

ϵ^ktkϵk2=ϵk-tk,k-1tkϵkϵk-tkϵk(k+h)ξ(k+h-L1(ξln(k+h)-lnϵ+1))ξ1 as k.

Assume with out loss of generality that ϵ^ktkϵk22 for all kK, we have:

𝔼[θk-θ*2]
β1j=Kk-1(1-αϵj)+β2i=Kk-1ϵ^ij=i+1k-1(1-αϵj)
β1j=Kk-1(1-αϵj)+β2i=Kk-12tiϵi2j=i+1k-1(1-αϵj)
β1j=Kk-1(1-αϵj)+2β2tki=Kk-1ϵi2j=i+1k-1(1-αϵj)
β1j=Kk-1(1-αϵ(j+h)ξ)A1+2β2ϵ2L1[ln(k+hϵ)+1]i=Kk-11(i+h)2ξj=i+1k-1(1-αϵ(j+h)ξ)A2. (24)

We now evaluate the terms A1 and A2. For the term A1, using the fact that ab+1f(x)𝑑xk=abf(k)a-1bf(x)𝑑x for all non-increasing function f(x), we have

j=Kk-1(1-αϵ(j+h)ξ) exp(-αϵj=Kk-11(j+h)ξ) (x+1ex for all x)
exp(-αϵKk1(x+h)ξ𝑑x)
= {(K+hk+h)αϵ,ξ=1,exp[-αϵ1-ξ((k+h)1-ξ-(K+h)1-ξ)],ξ(0,1). (25)

We next evaluate the term A2. When ξ=1, using the same line of analysis for evaluating A1, i.e., bounding the summation by its corresponding integral, we have

A2 =i=Kk-11(i+h)2j=i+1k-1(1-αϵj+h)
i=Kk-11(i+h)2(i+1+hk+h)αϵ
=1(k+h)αϵi=Kk-11(i+1+h)2-αϵ(i+1+hi+h)2
4(k+h)αϵi=Kk-11(i+1+h)2-αϵA3.

The upper bound for the term A3 depends on the value αϵ. Specifically, we have the following results.

  1. (1)

    When αϵ(0,1), we have

    A3K-1k-11(x+1+h)2-αϵ𝑑x11-αϵ.
  2. (2)

    When αϵ=1, we have

    A3K-1k-11x+1+h𝑑xln(k+hK+h).
  3. (3)

    When αϵ(1,2), we have

    A3K-1k-11(x+1+h)2-αϵ𝑑x1αϵ-1(k+h)αϵ-1.
  4. (4)

    When αϵ=2, we have

    A3=k-K.
  5. (5)

    When αϵ(2,), we have

    A3 Kk1(x+1+h)2-αϵ𝑑x
    1αϵ-1[(k+1+h)αϵ-1-(K+1+h)αϵ-1]
    1αϵ-1(k+1+h)αϵ-1
    eαϵ-1(k+h)αϵ-1.

Combine the above five cases and we have

A24(k+h)αϵA3{41-αϵ1(k+h)αϵ,αϵ<1,4k+hln(k+hK+h),αϵ=1,4eαϵ-11k+h,αϵ>1.

We next consider the case when ξ(0,1). Let {uk}kK be a sequence defined by

uK=0,uk+1=(1-αϵ(k+h)ξ)uk+1(k+h)2ξ.

It is easy to verify that uk=A2. We next use induction to show that when K+h[2ξ/(αϵ)]1/(1-ξ), we have uk2αϵ1(k+h)ξ for all kK.

Since uK=02αϵ1(K+h)ξ, we have the base case. Now suppose uk2αϵ1(k+h)ξ for some kK. Using the definition of uk and the induction hypothesis, we have

2αϵ1(k+1+h)ξ-uk+1= 2αϵ1(k+1+h)ξ-(1-αϵ(k+h)ξ)uk-1(k+h)2ξ
2αϵ1(k+1+h)ξ-(1-αϵ(k+h)ξ)2αϵ1(k+h)ξ
-1(k+h)2ξ
= 2αϵ1(k+h)2ξ[αϵ2-(k+h)ξ(1-(k+hk+1+h)ξ)]
2αϵ1(k+h)2ξ[αϵ2-ξ(k+h)1-ξ] (26)
0, (27)

where (26) follows from

(k+1k+1+h)ξ =[(1+1k+h)k+h]-ξ/(k+h)
e-ξ/(k+h)
1-ξk+h,

and (27) follows from k+hK+h[2ξ/(αϵ)]1/(1-ξ).

Now we have provided upper bounds for the terms A1 and A2 for different values of ξ and ϵ. Using them in Eq. (24) and we finally obtain the finite-sample bound appeared in Corollary 2.2.

A.4 Proof of Lemma 2.1

Using Eq. (4) and we have for all ktk:

𝔼k[F(Xk,θk)2]L2𝔼k[(θk+1)2]2L2[𝔼k[θk-θ*2]+(θ*+1)2].

A.5 Proof of Lemma 2.2

Using Eq. (4), Definition 2.1, and the definition of the maximal distance dmax(k), we have for all x, θ, and ktδ:

𝔼[F(Xk,θ)X0=x]-F¯(θ)= y𝒳(Pk(x,y)-μ(y))F(y,θ)
y𝒳|Pk(x,y)-μ(y)|F(y,θ)
L(θ+1)y𝒳|Pk(x,y)-μ(y)|
2L(θ+1)Pk(x,)-μTV
2Ldmax(k)(θ+1)
δ(θ+1).

A.6 Proof of Lemma 2.3

For any k1<k2, we first upper bound θt for any t[k1,k2]. Using triangle inequality and Eq. (4), we have

θt+1-θtθt+1-θtLϵt(θt+1), (28)

which gives (θt+1+1)(Lϵt+1)(θt+1). Recursively applying the preceding inequality, then using the fact that 1+xex for all x, we have for all t[k1,k2]:

θt+1j=k1t-1(Lϵj+1)(θk1+1)exp(Lϵk1,k2-1)(θk1+1).

Since ex1+2x for all x[0,1/2] and ϵk1,k2-114L, we further obtain

θt+1(1+2Lϵk1,k2-1)(θk1+1)2(θk1+1).

Note that the previous inequality holds for all t[k1,k2]. Then we have by Eq. (28) that

θk2-θk1t=k1k2-1θt+1-θt2Lϵk1,k2-1(θk1+1).

Moreover, since

θk2-θk12Lϵk1,k2-1(θk1+1)2Lϵk1,k2-1(θk2-θk1+θk2+1)

and ϵk1,k2-114L, we have θk2-θk14Lϵk1,k2-1(θk2+1).

A.7 Proof of Lemma 2.4

We begin by decomposing the left-hand side of the desired inequality in the following way:

𝔼k[(θk-θ*)(F(Xk,θk)-F¯(θk))]
= 𝔼k[(θk-θk-tk)(F(Xk,θk)-F¯(θk))](T1)+𝔼k[(θk-tk-θ*)(F(Xk,θk)-F¯(θk))](T2).

We first consider the term (T1). Since ϵk-tk,k-114L, Lemma 2.3 is applicable for k1=k-tk and k2=k. Therefore, we have by Lemma 2.3 that:

(T1)= 𝔼k[(θk-θk-tk)(F(Xk,θk)-F¯(θk))]
𝔼k[θk-θk-tkF(Xk,θk)-F¯(θk)]
𝔼k[θk-θk-tk(F(Xk,θk)+F¯(θk))]
2L𝔼k[θk-θk-tk(θk+1)]
8L2ϵk-tk,k-1𝔼k[(θk+1)2]
16L2ϵk-tk,k-1(𝔼k[θk-θ*2]+(θ*+1)2). (29)

For the term (T2), Assumption 2.1 gives

|(θk-tk-θ*)(F(Xk,θk)-F¯(θk))-(θk-tk-θ*)(F(Xk,θk-tk)-F¯(θk-tk))|
|(θk-tk-θ*)(F(Xk,θk)-F(Xk,θk-tk))|+|(θk-tk-θ*)(F¯(θk)-F¯(θk-tk))|
θk-tk-θ*(F(Xk,θk)-F(Xk,θk-tk)+F¯(θk)-F¯(θk-tk))
2Lθk-tk-θ*θk-θk-tk.

Thus, we have

(T2)= 𝔼k[(θk-tk-θ*)(F(Xk,θk)-F¯(θk))]
𝔼k[(θk-tk-θ*)(F(Xk,θk-tk)-F¯(θk-tk))]+2Lθk-tk-θ*𝔼k[θk-θk-tk]
= (θk-tk-θ*)(𝔼k[F(Xk,θk-tk)]-F¯(θk-tk))+2Lθk-tk-θ*𝔼k[θk-θk-tk]
θk-tk-θ*𝔼k[F(Xk,θk-tk)]-F¯(θk-tk)(T2,1)+2Lθk-tk-θ*𝔼k[θk-θk-tk](T2,2).

Consider the term (T2,1). Using Lemma 2.2, we have

𝔼k[F(Xk,θk-tk)]-F¯(θk-tk)ϵk(θk-tk+1).

Therefore, we obtain

(T2,1)ϵk𝔼k[(θk-tk+1)θk-tk-θ*].

Since Lemma 2.3 together with our assumption that ϵk-tk,k-114L gives

θk-θk-tk4Lϵk-tk,k-1(θk+1)(θk+1),

we have

θk-tk-θ*(θk-tk+1)
(θk-θk-tk+θk-θ*)(θk-θk-tk+θk-θ*+θ*+1)
(θk+θk-θ*+1)(θk+θk-θ*+θ*+2)
(2θk-θ*+θ*+1)(2θk-θ*+2θ*+2)
4(θk-θ*+θ*+1)2
8[θk-θ*2+(θ*+1)2].

Hence we can further control the term (T2,1) by

(T2,1)8ϵk[𝔼k[θk-θ*2]+(θ*+1)2]. (30)

As for the term (T2,2), similarly Lemma 2.3 gives

(T2,2) =2Lθk-tk-θ*𝔼k[θk-θk-tk]
8L2ϵk-tk,k-1𝔼k[θk-tk-θ*(θk+1)]
8L2ϵk-tk,k-1𝔼k[(θk-θk-tk+θk-θ*)(θk+1)]
16L2ϵk-tk,k-1𝔼k[(θk-θ*+θ*+1)2]
32L2ϵk-tk,k-1𝔼k[[θk-θ*2]+(θ*+1)2]. (31)

Using Eqs. (30) and (31) and the fact that ϵkϵk-tk,k-1, we have

(T2)(T2,1)+(T2,2)40L2ϵk-tk,k-1[𝔼k[θk-θ*2]+(θ*+1)2]. (32)

Using Eqs. (29) and (32), we finally obtain

(b)= 2ϵk𝔼k[(θk-θ*)(F(Xk,θk)-F¯(θk))]
= 2ϵk[(T1)+(T2)]
112L2ϵkϵk-tk,k-1[𝔼k[θk-θ*2]+(θ*+1)2].

A.8 Proof of Lemma 2.5

Substituting the upper bounds we obtained for the terms (a), (b), and (c) into Eq. (9), we have

𝔼k[θk+1-θ*2]-𝔼k[θk-θ*2] (-2αϵk+112L2ϵkϵk-tk,k-1+2L2ϵk2)𝔼k[θk-θ*]
+(112L2ϵkϵk-tk,k-1+2L2ϵk2)(θ*+1)2
(-2αϵk+114L2ϵkϵk-tk,k-1)𝔼k[θk-θ*]
+114L2ϵkϵk-tk,k-1(θ*+1)2,

where in the last line we used ϵkϵk-tk,k-1. The result then follows by taking the total expectation on both side of the previous inequality.

Appendix B Proof of All Technical Results in Section 3

B.1 Proof of Proposition 3.1

  1. (1)

    Consider two arbitrary states x1=(s1,a1,s1),x2=(s2,a2,s2)𝒳. Since {sk} is irreducible, there exists k>0 such that Pk(s1,s2)>0. Hence we have

    Pk+1(x1,x2)=Pk(s1,s2)π(a2|s2)Pa2(s2,s2)>0.

    It follows that {Xk} is irreducible.

    To show {Xk} is aperiodic, assume for a contradiction that {Xk} is periodic with period T2. Since {Xk} is irreducible, every state in 𝒳 has the same period. Therefore, for any x=(s,a,s)𝒳, we must have Pk(x,x)=0 for all k not divisible by T. However, we have for any k not divisible by T:

    Pk(s,s) =s𝒮Pk-1(s,s)P(s,s)
    =s𝒮a𝒜Pk-1(s,s)π(a|s)Pa(s,s)
    =s𝒮a𝒜Pk((s,a,s),(s,a,s)) (33)
    =0. (34)

    To see (33), since {sk} is a Markov chain, we have

    Pk((s,a,s),(s,a,s))= (sk=s,ak=a,sk+1=s|s0=s,a0=a,s1=s)
    = (sk=s,ak=a,sk+1=s|s1=s)
    = Pk-1(s,s)π(a|s)Pa(s,s).

    Therefore, Eq. (34) shows that the period of s is at least T, which is a contradiction. Hence the Markov chain {Xk} must be aperiodic.

  2. (2)

    Using Cauchy-Schwarz inequality, and our assumption that ϕ(s,a)1 for all state-action pairs, we have for any θ1,θ2 and x:

    F(x,θ1)-F(x,θ2)= ϕ(s,a)((s,a)+γmaxa1𝒜ϕ(s,a1)θ1-ϕ(s,a)θ1)
    -ϕ(s,a)((s,a)+γmaxa2𝒜ϕ(s,a2)θ2-ϕ(s,a)θ2)
    γϕ(s,a)(maxa1𝒜ϕ(s,a1)θ1-maxa2𝒜ϕ(s,a2)θ2)
    +ϕ(s,a)ϕ(s,a)(θ1-θ2)
    γ|maxa1𝒜ϕ(s,a1)θ1-maxa2𝒜ϕ(s,a2)θ2|+θ1-θ2.

    Since

    |maxa1𝒜ϕ(s,a1)θ1-maxa2𝒜ϕ(s,a1)θ2| maxa𝒜|ϕ(s,a)(θ1-θ2)| (35)
    maxa𝒜ϕ(s,a)θ1-θ2
    θ1-θ2,

    we have for any θ1,θ2 and x:

    F(x,θ1)-F(x,θ2) γ|maxa1𝒜ϕ(s,a1)θ1-maxa2𝒜ϕ(s,a2)θ2|+θ1-θ2
    (γ+1)θ1-θ2.

    Therefore, the quantity M=1+γ+rmax is a valid Lipschitz constant. Moreover, since F(x,0)=ϕ(s,a)(s,a)rmax for any (s,a), we have Mmaxx𝒳F(x,0) as required.

  3. (3)

    Using the fact that F¯(θ*)=0, we have

    (θ-θ*)(F¯(θ)-F¯(θ*))
    = γ(θ-θ*)𝔼μ[ϕ(s,a)(maxa1𝒜ϕ(s,a1)θ-maxa2𝒜ϕ(s,a2)θ*)]
    -𝔼μ[(ϕ(s,a)(θ-θ*))2]
    γ𝔼μ[|ϕ(s,a)(θ-θ*)|maxa𝒜|ϕ(s,a)(θ-θ*)|]-𝔼μ[(ϕ(s,a)(θ-θ*))2] (36)
    γ𝔼μ[(ϕ(s,a)(θ-θ*))2]𝔼μ[maxa𝒜(ϕ(s,a)(θ-θ*))2]
    -𝔼μ[(ϕ(s,a)(θ-θ*))2]. (37)

    Eq. (36) follows from Eq. (35). Eq. (37) follows from the fact that when s is sampled from the stationary distribution μ, its successor state s has the same distribution. For simplicity of notations, denote A=𝔼μ[(ϕ(s,a)(θ-θ*))2] and B=𝔼μ[maxa𝒜(ϕ(s,a)(θ-θ*))2]. Since Assumption 3.3 gives γ2B2-A2-κθ-θ*2, we have

    (θ-θ*)F¯(θ)γAB-A2=γ2B2-A2γB/A+1-κ2θ-θ*2.

B.2 Proof of Proposition 3.2

We first show that Eq. (20) implies h+<0, and h-<0. Note that Jensen’s inequality implies

𝔼μ[maxa𝒜ϕ(s,a)2]= 𝔼μ{max[(maxa𝒜ϕ(s,a))2,(mina𝒜ϕ(s,a))2]}
max{𝔼μ[(maxa𝒜ϕ(s,a))2],𝔼μ[(mina𝒜ϕ(s,a))2]}. (38)

Thus, using Eq. (20), we have

h+= 𝔼μ[γϕ(s,a)maxa𝒜ϕ(s,a)]-𝔼μ[ϕ(s,a)2]
= 𝔼μ[γϕ(s,a)maxa𝒜ϕ(s,a)]-𝔼μ[ϕ(s,a)2]𝔼μ[ϕ(s,a)2]
< 𝔼μ[γϕ(s,a)maxa𝒜ϕ(s,a)]-γ𝔼μ[maxa𝒜ϕ(s,a)2]𝔼μ[ϕ(s,a)2]
0,

where the last inequality follows from Cauchy-Schwarz inquality and the fact that s and s are equal in distribution if sμ(). Similarly, we also have h-<0.

We next show the three statements hold in Proposition 3.2. By definition of h+ and h-, in uni-dimensional case, ODE (18) can be equivalently written as

θ˙(t)={h+θ(t)+rπ,θ(t)0h-θ(t)+rπ,θ(t)<0.

In the case where rπ=0, it is easy to see that ODE (18) is globally asymptotically stable if and only if h+,h-<0. Now we without loss of generality assume rπ>0, the proof for the other case is entirely similar.

  • Sufficiency: We first note that θ*=-rπ/h+>0. Let V(θ)=12(θ-θ*)2 be a candidate Lyapunov function. It is clear that V(θ)0 for all θ, and V(θ)=0 if and only if θ=θ*. Moreover,

    V˙(θ(t)) =(θ(t)-θ*)θ˙(t)
    ={h+(θ(t)-θ*)2,θ(t)0(θ(t)-θ*)(h-θ(t)-h+θ*),θ(t)<0.

    It is clear that V˙(θ(t))<0 when θ(t)[0,θ*)(θ*,). For θ(t)<0, since θ(t)-θ*<0, h+θ*=-rπ<0, and h-θ(t)0, we must also have V˙(θ(t))<0. Therefore, the time derivative of the Lyapunov function V(θ) along the trajectory of ODE (18) is strictly negative when θ(t)θ*. It then follows from the Lyapunov stability theorem [12, 14] that θ* is globally asymptotically stable.

  • Necessity: We prove by contradiction. Suppose that the equilibrium point θ* is globally asymptotically stable, but h+0 or h->0.

    If h+0, when θ(0)>max(0,θ*), we have θ˙(t)=h+θ(t)+rπrπ>0. It follows that θ(t)>θ(0)>θ* for all t0, which contradict to the fact that θ* is a globally asymptotically stable equilibrium point.

    Similarly, if h->0, when θ(0)<min(θ*,-(1+rπ)/h-), we have θ˙(t)=h-θ(t)+rπ-1<0. It follows that θ(t)<θ(0)<θ* for all t0, which also contradict to the fact θ* being globally asymptotically stable.

B.3 Proof of Proposition 3.3

Since d=|𝒮||𝒜|, there are total number of |𝒮||𝒜| feature vectors, and the feature matrix Φ is a full-rank square matrix with dimension |𝒮||𝒜|. Define

Θs,a=span({ϕ(s,a)(s,a)𝒮×𝒜,(s,a)(s,a)}).

Note that Θs,a exists for all state-action pairs since Φ is full rank. Now for a given state-action pair (s,a), let θΘs,a (θ0), Eq. (17) implies

γ2μ(s)(ϕ(s,a)θ)2<μ(s)π(a|s)(ϕ(s,a)θ)2,

which further gives γ2<π(a|s). Therefore, by running though all state-action pairs, we have γ2<min(s,a)𝒮×𝒜π(a|s)1m. Thus, if γ21/m, there is no behavior policy π that satisfies condition (17).

B.4 Computing δ(π) Given in Eq. (21)

We here present one way to compute δ(π) for an MDP with a chosen policy π when the underlying model is known. Before that, the following definitions are needed.

Definition B.1.

Let Σμ,π:=ΦDμ,πΦRd×d, where Dμ,πR|S||A|×|S||A| is a diagonal matrix with diagonal entries {μ(s)π(a|s)}(s,a)S×A, and Φ is the feature matrix.

Definition B.2.

Let B=AnRn be the set of all deterministic policies.

Definition B.3.

Let DμR|S|×|S| be a diagonal matrix with diagonal entries {μ(s)}sS, and let Σμ,b:=ΦbDμΦbRd×d, where ΦbR|S|×d (bB) is such that

Φb=[ϕ(s1,b1),ϕ(s2,b2),,ϕ(sn,bn)].

We now compute δ(π) given in the following lemma.

Lemma B.1.

Let δ(π) be defined in (21) and let λmax(M) be the largest eigenvalue of a positive semi-definite matrix M. Then we have

δ(π)=minb[1/λmax(Σμ,π-1/2Σμ,bΣμ,π-1/2)].
Proof of Lemma B.1.

Recall our definition for δ(π):

δ(π)=minθ0s𝒮μ(s)a𝒜π(a|s)(ϕ(s,a)θ)2s𝒮μ(s)maxa𝒜(ϕ(s,a)θ)2 (39)

Let f(θ) be the numerator, we have

f(θ)=s𝒮μ(s)a𝒜π(a|s)(ϕ(s,a)θ)2=θΦDμ,πΦθ=θΣμ,πθ.

Since the diagonal entries of Dμ,π are all positive, and Φ is full column rank, Σμ,π is symmetric and positive definite. To represent the denominator of (39) in a similar form, let

g(θ,b)=i=1nμ(si)(ϕ(si,bi)θ)2=θΦbDμΦbθ=θΣμ,bθ,where b.

Since the columns of Φb can be dependent, Σμ,b is in general only symmetric and positive semi-definite. With the definition of f(θ) and g(θ,b), δ(π) can be represented as

δ(π)=minθ0f(θ)maxbg(θ,b)=minθ0minbf(θ)g(θ,b)=minbminθ0f(θ)g(θ,b).

Now since Σμ,π is positive definite, Σμ,π1/2 and Σμ,π-1/2 are both well-defined and positive definite, we have

minθ0f(θ)g(θ,b) =[maxθ0g(θ,b)f(θ)]-1
=[maxθ0θΣμ,bθθΣμ,πθ]-1
=[(maxx0Σμ,b1/2Σμ,π-1/2xx)2]-1
=1λmax(Σμ,π-1/2Σμ,bΣμ,π-1/2),

where the function λmax() returns the largest eigenvalue. It follows that

δ(π)=minb[1/λmax(Σμ,π-1/2Σμ,bΣμ,π-1/2)].

B.5 The MDP Model Used in Numerical Experiments

Our numerical experiments adopt the MDP model of the classical divergent example of Q-learning with linear function approximation introduced in [1]. Consider the infinite-horizon seven-state, two-action MDP shown in Figure 5. The dashed action takes the system to one of the six upper states with equal probability, whereas the solid action takes the system to the seventh state with probability one. The behavior policy π selects the dashed and solid actions with equal probability. Consider estimating the Q-function under the linear parameterization indicated by the expression showing along each arrow in Figure 5. For example, the estimated value of state 1 taking the solid action is θ0+2θ1, where the subscript corresponds to the component of the overall weight vector θ14. It is easy to verify that the feature matrix Φ in this example is a square matrix with full rank.

Figure 5: Baird’s counter-example [1]

References

  • Baird [1995] Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.
  • Beck and Srikant [2013] Beck, C. L. and Srikant, R. (2013). Improved upper bounds on the expected error in constant step-size Q-learning. In 2013 American Control Conference, pages 1926–1931. IEEE.
  • Benveniste et al. [2012] Benveniste, A., Métivier, M., and Priouret, P. (2012). Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media.
  • Bertsekas and Tsitsiklis [1996] Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-dynamic programming, volume 5. Athena Scientific Belmont, MA.
  • Bhandari et al. [2018] Bhandari, J., Russo, D., and Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. In Conference On Learning Theory, pages 1691–1692.
  • Borkar [2009] Borkar, V. S. (2009). Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer.
  • Borkar and Meyn [2000] Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.
  • Chen et al. [2020] Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874.
  • De Farias and Van Roy [2000] De Farias, D. P. and Van Roy, B. (2000). On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization theory and Applications, 105(3):589–608.
  • Duchi et al. [2012] Duchi, J. C., Agarwal, A., Johansson, M., and Jordan, M. I. (2012). Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578.
  • Even-Dar and Mansour [2003] Even-Dar, E. and Mansour, Y. (2003). Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25.
  • Haddad and Chellaboina [2011] Haddad, W. M. and Chellaboina, V. (2011). Nonlinear dynamical systems and control: a Lyapunov-based approach. Princeton University Press.
  • Jaakkola et al. [1994] Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems, pages 703–710.
  • Khalil and Grizzle [2002] Khalil, H. K. and Grizzle, J. W. (2002). Nonlinear systems, volume 3. Prentice hall Upper Saddle River, NJ.
  • Kober et al. [2013] Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
  • Lee and He [2019] Lee, D. and He, N. (2019). A unified switching system perspective and ode analysis of Q-learning algorithms. Preprint arXiv:1912.02270.
  • Levin and Peres [2017] Levin, D. A. and Peres, Y. (2017). Markov chains and mixing times, volume 107. American Mathematical Soc.
  • Li et al. [2020] Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020). Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Preprint arXiv:2006.03041.
  • Melo et al. [2008] Melo, F. S., Meyn, S. P., and Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on Machine learning, pages 664–671. ACM.
  • Nemirovski et al. [2009] Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609.
  • Puterman [2014] Puterman, M. L. (2014). Markov Decision Processes.: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
  • Qu and Wierman [2020] Qu, G. and Wierman, A. (2020). Finite-time analysis of asynchronous stochastic approximation and Q-learning. Preprint arXiv:2002.00260.
  • Robbins and Monro [1951] Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.
  • Rudin et al. [1964] Rudin, W. et al. (1964). Principles of mathematical analysis, volume 3. McGraw-hill New York.
  • Ryu and Boyd [2016] Ryu, E. K. and Boyd, S. (2016). Primer on monotone operator methods. Appl. Comput. Math, 15(1):3–43.
  • Shalev-Shwartz et al. [2016] Shalev-Shwartz, S., Shammah, S., and Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. Preprint arXiv:1610.03295.
  • Silver et al. [2017] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of go without human knowledge. Nature, 550(7676):354.
  • Srikant and Ying [2019] Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830.
  • Sutton [1988] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44.
  • Sutton and Barto [2018] Sutton, R. S. and Barto, A. G. (2018). Introduction to Reinforcement Learning. MIT Press Cambridge, 2 edition.
  • Tesauro et al. [2008] Tesauro, G., Das, R., Chan, H., Kephart, J., Levine, D., Rawson, F., and Lefurgy, C. (2008). Managing power consumption and performance of computing systems using reinforcement learning. In Advances in Neural Information Processing Systems, pages 1497–1504.
  • Thoppe and Borkar [2019] Thoppe, G. and Borkar, V. (2019). A concentration bound for stochastic approximation via alekseev’s formula. Stochastic Systems, 9(1):1–26.
  • Tsitsiklis [1994] Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine learning, 16(3):185–202.
  • Tsitsiklis and Van Roy [1997] Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-diffference learning with function approximation. In Advances in neural information processing systems, pages 1075–1081.
  • Wainwright [2019] Wainwright, M. J. (2019). Stochastic approximation with cone-contractive operators: Sharp -bounds for Q-learning. Preprint arXiv:1905.06265.
  • Watkins and Dayan [1992] Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine learning, 8(3-4):279–292.