Motivated by applications in Reinforcement Learning (RL), in this paper, westudy a nonlinear Stochastic Approximation (SA) algorithm under Markoviannoise, and derive its finite-sample convergence bounds. Our proof is based onthe Lyapunov drift arguments, and to handle the Markovian noise, we exploit thefast mixing of the underlying Markov chain. Our result is used to show the finite-sample bounds of the popular Q-learningwith linear function approximation algorithm for solving the RL problem. SinceQ-learning with linear function approximation may diverge in general, we studyit under a condition on the behavior policy that ensures the stability of thealgorithm. Due to the generality of our SA results, we do not need to make theunnatural assumption that the samples are i.i.d. (since they are Markovian),and do not require an additional projection step in the algorithm to maintainthe boundedness of the iterates.
Quick Read (beta)
Finite-Sample Analysis of Nonlinear Stochastic Approximation with Applications in
Motivated by applications in Reinforcement Learning (RL), in this paper, we study a nonlinear Stochastic Approximation (SA) algorithm under Markovian noise, and derive its finite-sample convergence bounds. Our proof is based on the Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain.
Our result is used to show the finite-sample bounds of the popular -learning with linear function approximation algorithm for solving the RL problem. Since -learning with linear function approximation may diverge in general, we study it under a condition on the behavior policy that ensures the stability of the algorithm. Due to the generality of our SA results, we do not need to make the unnatural assumption that the samples are i.i.d. (since they are Markovian), and do not require an additional projection step in the algorithm to maintain the boundedness of the iterates.
Reinforcement Learning (RL) is a framework to solve sequential decision-making problems by interacting with the environment . This approach has demonstrated tremendous successes for solving many practical problems in several different areas, such as robotics , power management , autonomous driving , and board games .
The RL problem can be modeled as a Markov Decision Process (MDP) and so reduces to solving the so-called Bellman’s equation [30, 21]. However, unlike in an MDP, since the environmental model is unknown in RL, such system of equations are usually solved using the Stochastic Approximation (SA) method, where the solution is iteratively estimated using noisy samples collected from the system [4, 23]. A typical example is the popular -learning algorithm proposed in , which can be viewed as a nonlinear SA algorithm with martingale difference noise (conditionally mean zero noise). However, when using -learning along with function approximation, the corresponding SA algorithm involves Markovian noise, and cannot be modeled by martingale difference. This motivates one to study the behavior of general nonlinear SA algorithms under Markovian sampling. This paper focuses on establishing finite-sample convergence guarantee for such SA algorithms, and apply it to the -learning with linear function approximation algorithm.
In general SA framework, usually the target equation can be written in the form , where is a random variable with distribution ( is unknown), and is a general nonlinear mapping. The SA algorithm for solving such an equation is then given by: , where is a sequence of samples of the random variable , and is the step size. Besides their applications in RL, SA algorithms of this form are widely used in optimization. For example, when for some cost function , the SA algorithm is known as the Stochastic Gradient Descent (SGD) method for minimizing [20, 25]. The behavior of such an SA algorithm, critically depends upon the following: (a) The way we collect the sample sequence , and (b) The properties of the function .
In existing literature, when the function is a linear function of , finite-sample convergence bounds can be derived for both i.i.d. sampling ( being i.i.d.), and Markovian sampling ( being a Markov chain with stationary distribution )[5, 28]. When is nonlinear, then finite-sample bounds are only derived in the case where represents the stochastic gradient of some convex function . Moreover, unlike i.i.d. sampling, when is a Markov chain, an artificial projection (onto a large enough ball) is introduced in the algorithm to ensure that the iterates are bounded . Such a projection step is not practical in RL because one needs to know the problem parameters to pick the projection set so that the solution lies in it. In this paper, we provide finite-sample convergence guarantees for general nonlinear stochastic approximation with Markovian noise without needing such an impractical projection step.
Recall that our motivation for studying nonlinear SA with Markovian sampling is to analyze the -learning with linear function approximation algorithm for solving the RL problem. However, unlike tabular -learning, -learning with linear function approximation may in general diverge . Therefore, besides the nonlinearity and the Markovian noise, another fundamental problem is to maintain the stability of the algorithm. In existing literature, it is usually done by placing a requirement on the behavior policy, i.e., the policy used to collect samples [19, 16]. We work with a similar condition in this paper, and provide detailed discussion about our proposed condition. We also empirically show how a well-known divergent counter-example of -learning with linear function approximation can be made to converge if our proposed condition is satisfied.
The main contribution of this paper are as follows.
We derive finite-sample convergence guarantee for nonlinear SA with Markovian noise and without needing an artificial projection step. Then we study its behavior under several common choices of step sizes. The idea is to apply a suitable Lyapunov function on the stochastic iterates, and show that in expectation it produces a negative drift. To handle the nonlinearity, we use the Lipchitz continuity of to show that can be bounded affinely by . To handle the Markovian noise, we exploit the geometric mixing of the underlying Markov chain to control the cumulative bias caused by the Markovian sampling.
We apply our results on SA to the -learning with linear function approximation algorithm in RL. Since it is known to diverge, we study it under a condition that involves the behavior policy, feature vectors and the discount factor. We show that when using constant step sizes, -learning achieves an exponential convergence rate to a ball centered at the solution of the so-called projected Bellman’s equation, and the size of the ball is proportional to the step size and the mixing time of the underlying Markov chain. When using diminishing step sizes with a proper decay rate, we show that the convergence rate is roughly . Moreover, we provide detailed discussion for our proposed condition on the behavior policy.
We present an empirical case study of our theoretical convergent results using numerical experiments on a well-known divergent counter-example of -learning from . In particular, we show that if our condition is satisfied, the algorithm converges, and the rates match the theoretical results we derived.
1.1 Related Work
Stochastic approximation methods, originally proposed in , have been widely used in reinforcement learning for solving the Bellman’s equation. Typical examples are the TD-learning algorithm for solving the policy evaluation problem , and the -learning algorithm for estimating the optimal state-action value function . In the tabular case (i.e., without using function approximation), they both can be modeled as an SA algorithm involving a contraction operator and martingale difference noise, and their asymptotic convergence has been established in [33, 13, 7, 6]. As for the finite-sample analysis, [2, 8, 35] studies the mean square error bounds of synchronous -learning, and [11, 22, 18] provides concentration results for asynchronous -learning.
When using RL algorithms along with linear function approximation, the noise cannot be modeled as a martingale difference, and one should directly study Markovian noise. In the case of TD-learning with linear function approximation, which is a linear SA, asymptotic convergence was established in  using an Ordinary Differential Equation (ODE) approach . As for the finite-sample convergence bounds, it was derived in  for both i.i.d. sampling and Markovian sampling. However, in the latter case, an additional projection step (onto a ball) is needed for the proof. Later in , this artificial projection step was successfully removed. When using -learning along with linear function approximation, it was shown in  that -learning does not necessarily converge. Its asymptotic convergence was proved in [19, 16] under an additional condition on the behavior policy. Under a similar condition, we derive its finite-sample bounds by exploiting the fast mixing of finite state Markov chains, which was a technique developed in  to study the convergence rate of linear SA with Markovian noise. However, note that -learning is a nonlinear SA even under linear function approximation. To establish the results, some natural properties of -learning and our stability condition on the behavior policy play important roles in our analysis.
2 Nonlinear Stochastic Approximation with Markovian Noise
2.1 Problem Setting
We are interested in the problem of solving for in the equation
where is a random vector with distribution , and is a finite set. The function is a general nonlinear mapping. When the distribution of is unknown, Eq. (1) cannot be solved analytically. Therefore, we consider solving the equation by appealing to the stochastic approximation method proposed in . Specifically, suppose that we can collect a sequence of samples of the random vector . Then, with initialization , the estimate of is iteratively updated by
Here is the step size sequence which we can choose.
To analyze the behavior of Algorithm (2), one popular approach is to compare the iterates with the trajectory of the following ODE:
Under some mild conditions on the sample sequence and the step sizes , it was shown in [7, 6] that converges to almost surely as long as the ODE (3) is globally asymptotically stable (GAS), with being its unique equilibrium point. Such a stability property of the ODE is usually analyzed by constructing a Lyapunov function , and studying its time derivative along the trajectory of the ODE.
Our goal is to obtain finite-sample bounds for the stochastic iterative algorithm (2). A natural approach is to use the ODE Lyapunov function to study the iterates of Algorithm (2). However, since Algorithm (2) is a discrete and stochastic counterpart of ODE (3), a major challenge is to handle the error caused by the discretization and the noise .
Recall that to study the stochastic approximation algorithm (2), the properties of the function and the nature of the noise sequence are important. We next present our main assumptions regarding these two aspects. Let be the -norm for vectors and induced -norm for matrices.
The function is Lipschitz continuous with respect to uniformly in , i.e., there exists such that for all , , and . Moreover, we assume that .
In the special case where is a linear function of as considered in [5, 28], i.e., , since the state-space is finite, Assumption 2.1 is automatically satisfied. Note that Assumption 2.1 also implies the Lipschitz continuity of the function , which is sufficient to study the nonlinear dynamical system (3), e.g., it guarantees that ODE (3) has a unique solution.
Although is in general a nonlinear function of , Assumption 2.1 restricts that the growth rate of both and can at most be affine in terms of . To see this, let and in Assumption 2.1. Then we have by triangle inequality and our assumption that
Moreover, using Jensen’s inequality and we further obtain
These properties for and essentially let us prove a finite-sample convergence bound akin to the case when is a linear function of .
The equation has a unique solution, denoted by , and there exists such that for all .
Assumption 2.2 is satisfied when is a strongly monotone function . Also, it can be viewed as an exponential dissipativeness property of the ODE (3) with a quadratic storage function . In fact, this assumption guarantees that is the unique globally exponentially stable (GES) equilibrium point of ODE (3). To see this, let be a candidate Lyapunov function. Then we have by Assumption 2.2 that
which implies for all . The quantity is called the negative drift, and we see that the larger is, the faster converges.
The sequence is an irreducible and aperiodic Markov chain on state space , with unique stationary distribution .
Assumption 2.3 is often made to study the asymptotic convergence of stochastic approximation under Markovian noise when the state space is finite; see for examples [6, 34, 9]. When compared to the case where is i.i.d., the main difference for being Markovian is that there is a bias in the update of Algorithm (2), i.e., . As will be seen later, Assumption 2.3 enables us to control the bias and show that it is not strong enough to alter the desired direction of the update.
In addition to these assumptions, the choice of the step size sequence is important. In order to state certain conditions on the step sizes we pick, we need to use the mixing time of the Markov chain . We begin with the definition of the total variation distance between probability distributions, which is used to introduce the mixing time.
The total variation distance between two probability distributions and on is defined by .
Denote as the distribution of with initial state , and let
be the maximal distance between and the stationary distribution . Then the mixing time of the Markov chain with precision is defined by . Intuitively, smaller mixing time means faster convergence of the Markov chain to its stationary distribution. Consider the extreme case where is i.i.d., we have for any .
Let , where is the Lipschitz constant given in Assumption 2.1. The mixing time is defined this way so that it accounts for both the Markov chain and the function (through its Lipschitz constant). We will later use to state our requirement on the step size sequence . Before that, we need the following property regarding , whose proof is presented in Appendix A.1.
There exists a constant that depends only on the Markov chain and the Lipschitz constant such that .
Claim 2.1 states that the mixing time grows at most affinely in terms of , which further implies . Since we will always choose , where is the step size, for simplicity, we will write for in the following. Now we state our condition on the step sizes . For simplicity of notations, denote .
The step size sequence satisfies the following conditions.
The step size sequence is non-increasing and .
There exists a minimal integer such that and for all .
Condition 2.1 (1) is standard . Condition 2.1 (2) is mainly used to control the discretization error and the stochastic error in Algorithm (2). In fact, since under Claim 2.1, we can show that Condition 2.1 (2) is satisfied when using either small enough constant step sizes or diminishing step sizes of the form for all , and , provided that is chosen properly.
2.3 Finite-Sample Convergence Bounds for Nonlinear SA
On the right-hand side of Eq. (7), the first term represents the bias due to the initial guess , and the second term captures the variance due to the noise. We will see later that the Markovian nature of the noise mainly impacts the second term.
Theorem 2.1 is one of our main contribution in that (1) The function is allowed to be nonlinear, (2) It holds when is sampled in a Markovian manner, and (3) No modifications on Algorithm (2) (e.g., adding a projection step) is needed to establish the result.
After establishing the finite-sample bound of Algorithm (2) in its general form, we next consider several common choices of step sizes, and see more explicitly what are the corresponding convergence rates. We begin by presenting the result for using constant step sizes, whose proof can be found in Appendix A.2.
Let for all , where is chosen such that . Then we have for all :
We see from Corollary 2.1 that when using constant step sizes, the finite-sample bound is composed by two terms. The first term converges to zero exponentially fast as the number of iterations tends to infinity, while the second term is a constant which is independent of . Geometrically, this means that the iterates of Algorithm (2) achieves an exponential convergence rate in expectation to a ball containing , with radius proportional to . Intuitively, constant step sizes efficiently eliminate the bias. However, since the noise is constantly added to the iterates without being progressively suppressed, the variance does not converge to zero.
We next consider using diminishing step sizes in Algorithm (2). Let , where , , and is chosen properly so that Condition 2.1 is satisfied for some . The verification of Condition 2.1 and the proof of the following result can be found in Appendix A.3.
Suppose is chosen as above, then we have the following finite-sample bounds.
When , we have for all :
When and , suppose in addition that , then we have for all :
From Corollary 2.2 (1), we see that when using step size, the constant must be chosen very carefully (i.e., ) to achieve the optimal convergence rate, otherwise the convergence rate is roughly , which can be arbitrarily slow. In the case when , the convergence rate is , and it is independent of .
The above analysis indicates that our choice of step size should depend on how precise our estimate of the negative drift parameter is. When our estimate to is accurate, we should use with so that the convergence rate is the optimal . When our understanding to the system model is poor (therefore inaccurate estimate of ), we should use with chosen close to unity. In that case, we sacrifice the convergence rate for robustness.
2.4 Proof of Theorem 2.1
This subsection is dedicated to the proof of Theorem 2.1. Before going into details, we first provide some intuition.
Recall that the Lyapunov function can be used to show that is the unique GES equilibrium point of ODE (3). To analyze the convergence rate of the iterates generated by Algorithm (2), naturally we want to use the Lyapunov function on to show something like
Here on the right-hand side of Eq. (8), the term corresponds to the negative drift of the ODE, which is provided by Assumption 2.2, and the two terms and account for the discretization error and the stochastic error in Algorithm (2). The discretization error can be handled using the properties of the function (Assumption 2.1) and properly chosen step sizes (Condition 2.1). As for the stochastic error, since Markovian noise natually produces bias in the update, we need to show that converges to fast enough for any , which implies that the overall cumulative bias is small. This result is provided by Assumption 2.3 and Claim 2.1. Once we show that both error terms are dominated by the drift term, i.e., and , Eq. (8) can be recursively used to establish a finite-sample error bound of Algorithm (2).
To begin with, we apply on the iterates of Algorithm (2). Moreover, to utilize the mixing time of the Markov chain, we take expectation conditioning on and . For simplicity of notations, we use for in the following. Using the update rule (2), we have for all :
The terms and correspond to the stochastic error and the discretization error respectively. What remains to show is that they are dominated by the term . We begin by bounding the term in the following lemma.
for all .
Observe that Lemma 2.1 implies that , which is what we desire. We next turn to the term , to control it, we need the following two results.
Given precision , the following inequality holds for all , , and : .
For any satisfying , the following two inequalities hold:
Lemma 2.2 utilizes the mixing time to show that the bias in the update of Algorithm (2) is exponentially small after the Markov chain is sufficiently mixed. Lemma 2.3 controls the difference between and when is not too large. With the help of Lemmas 2.2 and 2.3, we are now ready to bound the term in the following lemma.
The following inequality holds for all such that :
Substituting the upper bounds we obtained for the terms , , and into Eq. (9), we obtain the following recursive relation between and .
The following inequality holds for all such that :
Proof of Theorem 2.1.
In summary, we have stated and proved a finite-sample convergence bound for stochastic approximation algorithm (2). Moreover, we analyzed the convergence rates for using both constant and diminishing step sizes. In the following section, we will use these results to establish the convergence rates of the -learning with linear function approximation algorithm for solving the Reinforcement Learning problem.
3 Applications in Reinforcement Learning
3.1 -Learning with Linear Function Approximation
We begin by introducing the Markov Decision Processes and the Reinforcement Learning problem.
Definition 3.1 (Markov Decision Processes).
An infinite horizon discounted Markov decision process is comprised by a tuple , where is a finite set of states, is a finite set of actions, stands for the probability of transition from state to state under action , is the reward when taking action at state , and is the discount factor.
We assume that the reward is non-negative and is uniformly bounded such that for all state-action pairs . The underlying model of the Reinforcement Learning problem is essentially an MDP except that the transition probabilities and the reward function are unknown to the agent.
The goal of the agent is to find a strategy (called policy) of selecting actions based on the state of the environment such that the expected long-term discounted reward is maximized. More precisely, define the state-action value function (also known as the -function) of a policy at state-action pairs by
where is executed according to policy . In words, is the expected total reward the agent would receive if it starts at state , first taking action , and then following the policy thereafter. We want to find an optimal policy such that its corresponding -function, denote by , is such that for all state-action pairs and all policy . A fundamental property of the function is that, if one just select actions greedy based on , then that is an optimal policy, i,e, for all state . Therefore, solving the RL problem reduces to finding the optimal -function.
The -learning algorithm proposed in  is a popular approach for estimating the optimal -function. However, it is well-known that a major challenge in the -learning algorithm is the curse of dimensionality. That is, -learning becomes intractable when the number of state-action pairs is very large. Therefore, it was suggested that we approximate the optimal -function within a chosen function space which has much smaller dimension than . We next describe the approximation architecture.
Let be a set of basis functions (called features). Define the feature matrix by . Then the linear subspace spanned by the features can be written as . We will use as our approximating function space, and the goal here is to find such that best represents .
Using the notations above, we now present the -learning algorithm under linear function approximation [4, 19]. Let be a trajectory generated by applying some policy to the system model, where is called the behavior policy. Then, the parameter of the approximation is iteratively updated by:
where is a sequence of step sizes. Algorithm (12) can be viewed as a stochastic approximation algorithm for solving the equation:
where is the stationary distribution of the Markov chain under policy . Under some mild conditions, Eq. (13) is equivalent to a so-called projected Bellman’s equation (see  for more details). In the special case where the feature matrix is a identity matrix with dimension , Algorithm (12) reduces to the tabular -learning algorithm , and Eq. (13) becomes the regular Bellman’s equation .
In general, Eq. (13) may not necessarily admit a solution , and the iteration in (12) may diverge . However, it was shown in [19, 16] that under an assumption on the behavior policy , converges to the solution of Eq. (13), denoted by , almost surely. In this paper, we focus on establishing the finite-sample convergence guarantee of Algorithm (12). We begin by stating our main assumptions.
The features are linearly independent and are normalized so that for all state-action pairs .
Assumption 3.1 is indeed without loss of generality since we can disregard dependent features. Moreover, since the state and action spaces are finite, we can always make sure by proper scaling.
The Markov chain induced by the behavior policy is irreducible and aperiodic, with unique stationary distribution .
Assumption 3.2 essentially requires that the behavior policy has enough exploration, which is necessary in studying value-based RL algorithms.
Eq. (13) has a unique solution, denoted by , and there exists such that the following inequality holds for all :
3.2 Finite-Sample Convergence Bounds for -Learning with Linear Function Approximation
With being defined above, Algorithm (12) can be written as:
To apply Theorem 2.1, we need to show that Assumptions 2.1-2.3 are naturally satisfied in the context of -learning under Assumptions 3.1-3.3, which is provided by the following proposition. See Appendix B.1 for its proof.
The Markov chain with state space is irreducible and aperiodic.
The function defined in (15) is Lipschitz continuous with respect to uniformly in , and is a valid Lipschitz constant. Moreover, we have .
The equation has a unique solution , and we have for all , where is given in Assumption 3.3.
Similarly as in Section 2, given precision , we define the mixing time , where is for the Markov chain , and is the Lipschitz constant given in Proposition 3.1. Moreover, there exists a constant that depends only on the Markov chain and the Lipschitz constant such that for any .
We next use Theorem 2.1 to derive finite-sample error bounds of Algorithm (12) for different choice of step sizes. We begin with the result for using constant step sizes, which directly follows from Proposition 3.1 and Corollary 2.1. Let , and .
Theorem 3.1 is qualitatively similar to Corollary 2.1 in that the iterates of -learning converges exponentially fast to a ball centered at , and the size of the ball is proportional to . Consider the special case where is sampled in an i.i.d. manner. Since for any , the size of the ball is only proportional to the step size . Thus, we see that the Markovian nature of the noise produces an additional mixing time factor. This agrees with results in [28, 5], where the popular TD-learning with linear function approximation algorithm was studied.
We next present the result for using diminishing step sizes. For simplicity, we only present the case of the optimal convergence rate in Corollary 2.2. Let with , where is chosen properly so that there exists a minimal such that and for all . The following result directly follows from Proposition 3.1 and Corollary 2.2.
3.3 Discussion about Assumption 3.3 on the Behavior Policy
In this subsection, we take a closer look at Assumption 3.3 and Eq. (14), which are essential for the stability of the -learning with linear function approximation algorithm. First note that Eq. (14) is equivalent to
We next analyze condition (17). We first look at the terms inside the expectation on both sides of the inequality. On the right-hand side, the quantity can be interpreted as with the action chosen according to the behavior policy . On the left-hand side, the term is essentially where is chosen greedily with respect to . Therefore, it is clear that
To meet condition (17), besides the presence of , the behavior policy should be somehow close to the policy induced greedily by for any non-zero , which is a relatively restrictive requirement.
Similar assumptions on the behavior policy were also proposed in [19, 16]. Although the exact form of the conditions are different, they all follow the same spirit. That is, with a chosen Lyapunov function, the condition should enable us to show that the corresponding ODE
for all nonzero 11 1 The factor of 2 appears to be missing in ..The right-hand side is the same for both Eqs. (19) and (17). On the left-hand side, Eq. (19) has an additional factor of , and the square is outside the max operator. Although they are similar, our condition and the condition proposed in  do not imply each other. As for the condition proposed in , while it is not clear if it is less restrictive than ours, it is shown that the condition in  implies the condition in  under more restrictive assumptions. However,  assumes i.i.d. sampling, and studies only the asymptotic convergence rather than finite-sample error bounds.
We next analyze how the features and the behavior policy impact condition (17) in the following two examples.
Consider the case when . That is, there is only one feature , and the weight is a scalar on the real line. Condition (17) reduces to
Then we have the following result, whose proof can be found in Appendix B.2.
We next consider another extreme case where we have , i.e., there is no dimension reduction at all. We show in the following proposition that, in the full-dimension case, condition (17) is feasible in terms of the behavior only when the discount factor is sufficiently small. See Appendix B.3 for its proof.
When , condition (17) is infeasible for any behavior policy .
We now compare the results for the two extreme cases, i.e., and . We see that in the uni-dimensional case, Eq. (17) implies a condition which is almost sufficient and necessary for the GAS of the equilibrium to the ODE (18). Moreover, there always exists a behavior policy satisfying . However, in the full-dimensional case, condition (17) is infeasible in terms of the behavior policy when , which is usually the case in practice where is chosen close to unity to ensure enough exploration.
In the next section, we present numerical experiments to demonstrate the sufficiency of condition (17), as well as the resulting convergence rate of the -learning with linear function approximation algorithm.
3.4 Numerical Experiments
To present the numerical experiments, we need to rewrite condition (17) in a different way. Defining
In our simulation, we consider the divergent example of -learning with linear function approximation introduced in , where there are states and actions, and the number of features is . See  or Appendix B.5 for more details about this example. Note that since in this example, it belongs to the full-dimensional case discussed in Subsection 3.3, and is essentially a change of basis. Surprisingly, even in this case, the -learning algorithm (12) can diverge.
To demonstrate the effectiveness of condition (17) for the stability of -learning, in our first set of simulations, the reward function is set to zero. Since the reward function is identically zero, is zero, implying is zero. Because of this structure, it is possible for the -learning algorithm to converge even when constant step size is used. We choose the behavior policy which takes each action with equal probability. It turns out that , giving the threshold for to satisfy Eq. (17) being . In the numerical experiments, we choose constant step size , discount factor , and plot as a function of the number of iterations in Figure 2. Here, converges when , but diverges when . This demonstrates that condition (17) is sufficient but not necessary for convergence. This also shows that by ensuring (17), the counter example from  can be made to converge.
To show the exponential convergence rate for using constant step size, we consider the convergence of when given in Figure 2, where we plot as a function on the number of iterations . In this case, seems to converge exponentially to , which agrees with Corollary 3.1.
In our second set of experiments, we illustrate the convergence rate of -learning with linear function approximation for using diminishing step sizes . We use the same MDP model, but the reward is sampled independently from a uniform distribution on for all state-action pairs. The behavior policy is still to take each action with equal probability. The constant given in Eq. (14) is estimated by numerical optimization, and the discount factor is set to be . In figure 4, we plot the quantity as a function of for . In the case where , the constant coefficient is chosen such that in order to achieve the optimal convergence rate. We see that the iterates converge for all . Moreover, the larger the value of is, the faster converges.
In this paper we establish finite-sample convergence guarantee for a general nonlinear stochastic approximation algorithm with Markovian noise. The result is used to derive the finite-sample error bounds of the -learning with linear function approximation algorithm for solving the RL problem. Since such an algorithm is known to diverge in general, we study it under a condition on the behavior policy that ensures stability. Sufficiency of this condition on the behavior policy, as well as the rate of convergence of -learning is also verified numerically in the context of a well-known counter example.
Appendix A Proof of All Technical Results in Section 2
A.1 Proof of Claim 2.1
Since the finite state Markov chain is irreducible and aperiodic, we have by Theorem 4.9 in  that there exists and such that for all . Note that the inequality holds when . Therefore, we have
A.2 Proof of Corollary 3.1
Let . Since is chosen such that , it is easy to verify that Condition 2.1 is satisfied. Moreover, we have
The result then follows from the above two observations.
A.3 Proof of Corollary 3.2
We first show that when (, ) with properly chosen , there exists such that Condition 2.1 is satisfied.
Assume with out loss of generality that for all , we have:
We now evaluate the terms and . For the term , using the fact that for all non-increasing function , we have
|( for all )|
We next evaluate the term . When , using the same line of analysis for evaluating , i.e., bounding the summation by its corresponding integral, we have
The upper bound for the term depends on the value . Specifically, we have the following results.
When , we have
When , we have
When , we have
When , we have
When , we have
Combine the above five cases and we have
We next consider the case when . Let be a sequence defined by
It is easy to verify that . We next use induction to show that when , we have for all .
Since , we have the base case. Now suppose for some . Using the definition of and the induction hypothesis, we have
where (26) follows from
and (27) follows from .
A.4 Proof of Lemma 2.1
Using Eq. (4) and we have for all :
A.5 Proof of Lemma 2.2
A.6 Proof of Lemma 2.3
For any , we first upper bound for any . Using triangle inequality and Eq. (4), we have
which gives . Recursively applying the preceding inequality, then using the fact that for all , we have for all :
Since for all and , we further obtain
Note that the previous inequality holds for all . Then we have by Eq. (28) that
and , we have .
A.7 Proof of Lemma 2.4
We begin by decomposing the left-hand side of the desired inequality in the following way:
For the term (), Assumption 2.1 gives
Thus, we have
Consider the term . Using Lemma 2.2, we have
Therefore, we obtain
Since Lemma 2.3 together with our assumption that gives
Hence we can further control the term by
A.8 Proof of Lemma 2.5
Substituting the upper bounds we obtained for the terms , , and into Eq. (9), we have
where in the last line we used . The result then follows by taking the total expectation on both side of the previous inequality.
Appendix B Proof of All Technical Results in Section 3
B.1 Proof of Proposition 3.1
Consider two arbitrary states . Since is irreducible, there exists such that . Hence we have
It follows that is irreducible.
To show is aperiodic, assume for a contradiction that is periodic with period . Since is irreducible, every state in has the same period. Therefore, for any , we must have for all not divisible by . However, we have for any not divisible by :
To see (33), since is a Markov chain, we have
Therefore, Eq. (34) shows that the period of is at least , which is a contradiction. Hence the Markov chain must be aperiodic.
Using Cauchy-Schwarz inequality, and our assumption that for all state-action pairs, we have for any and :
we have for any and :
Therefore, the quantity is a valid Lipschitz constant. Moreover, since for any , we have as required.
B.2 Proof of Proposition 3.2
We first show that Eq. (20) implies , and . Note that Jensen’s inequality implies
Thus, using Eq. (20), we have
where the last inequality follows from Cauchy-Schwarz inquality and the fact that and are equal in distribution if . Similarly, we also have .
In the case where , it is easy to see that ODE (18) is globally asymptotically stable if and only if . Now we without loss of generality assume , the proof for the other case is entirely similar.
Sufficiency: We first note that . Let be a candidate Lyapunov function. It is clear that for all , and if and only if . Moreover,
It is clear that when . For , since , , and , we must also have . Therefore, the time derivative of the Lyapunov function along the trajectory of ODE (18) is strictly negative when . It then follows from the Lyapunov stability theorem [12, 14] that is globally asymptotically stable.
Necessity: We prove by contradiction. Suppose that the equilibrium point is globally asymptotically stable, but or .
If , when , we have . It follows that for all , which contradict to the fact that is a globally asymptotically stable equilibrium point.
Similarly, if , when , we have . It follows that for all , which also contradict to the fact being globally asymptotically stable.
B.3 Proof of Proposition 3.3
Since , there are total number of feature vectors, and the feature matrix is a full-rank square matrix with dimension . Define
Note that exists for all state-action pairs since is full rank. Now for a given state-action pair , let (), Eq. (17) implies
which further gives . Therefore, by running though all state-action pairs, we have . Thus, if , there is no behavior policy that satisfies condition (17).
B.4 Computing Given in Eq. (21)
We here present one way to compute for an MDP with a chosen policy when the underlying model is known. Before that, the following definitions are needed.
Let , where is a diagonal matrix with diagonal entries , and is the feature matrix.
Let be the set of all deterministic policies.
Let be a diagonal matrix with diagonal entries , and let , where () is such that
We now compute given in the following lemma.
Let be defined in (21) and let be the largest eigenvalue of a positive semi-definite matrix . Then we have
Proof of Lemma B.1.
Recall our definition for :
Let be the numerator, we have
Since the diagonal entries of are all positive, and is full column rank, is symmetric and positive definite. To represent the denominator of (39) in a similar form, let
Since the columns of can be dependent, is in general only symmetric and positive semi-definite. With the definition of and , can be represented as
Now since is positive definite, and are both well-defined and positive definite, we have
where the function returns the largest eigenvalue. It follows that
B.5 The MDP Model Used in Numerical Experiments
Our numerical experiments adopt the MDP model of the classical divergent example of -learning with linear function approximation introduced in . Consider the infinite-horizon seven-state, two-action MDP shown in Figure 5. The dashed action takes the system to one of the six upper states with equal probability, whereas the solid action takes the system to the seventh state with probability one. The behavior policy selects the dashed and solid actions with equal probability. Consider estimating the -function under the linear parameterization indicated by the expression showing along each arrow in Figure 5. For example, the estimated value of state taking the solid action is , where the subscript corresponds to the component of the overall weight vector . It is easy to verify that the feature matrix in this example is a square matrix with full rank.
- Baird  Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.
- Beck and Srikant  Beck, C. L. and Srikant, R. (2013). Improved upper bounds on the expected error in constant step-size -learning. In 2013 American Control Conference, pages 1926–1931. IEEE.
- Benveniste et al.  Benveniste, A., Métivier, M., and Priouret, P. (2012). Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media.
- Bertsekas and Tsitsiklis  Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-dynamic programming, volume 5. Athena Scientific Belmont, MA.
- Bhandari et al.  Bhandari, J., Russo, D., and Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. In Conference On Learning Theory, pages 1691–1692.
- Borkar  Borkar, V. S. (2009). Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer.
- Borkar and Meyn  Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.
- Chen et al.  Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874.
- De Farias and Van Roy  De Farias, D. P. and Van Roy, B. (2000). On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization theory and Applications, 105(3):589–608.
- Duchi et al.  Duchi, J. C., Agarwal, A., Johansson, M., and Jordan, M. I. (2012). Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578.
- Even-Dar and Mansour  Even-Dar, E. and Mansour, Y. (2003). Learning rates for -learning. Journal of Machine Learning Research, 5(Dec):1–25.
- Haddad and Chellaboina  Haddad, W. M. and Chellaboina, V. (2011). Nonlinear dynamical systems and control: a Lyapunov-based approach. Princeton University Press.
- Jaakkola et al.  Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems, pages 703–710.
- Khalil and Grizzle  Khalil, H. K. and Grizzle, J. W. (2002). Nonlinear systems, volume 3. Prentice hall Upper Saddle River, NJ.
- Kober et al.  Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
- Lee and He  Lee, D. and He, N. (2019). A unified switching system perspective and ode analysis of -learning algorithms. Preprint arXiv:1912.02270.
- Levin and Peres  Levin, D. A. and Peres, Y. (2017). Markov chains and mixing times, volume 107. American Mathematical Soc.
- Li et al.  Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020). Sample complexity of asynchronous -learning: Sharper analysis and variance reduction. Preprint arXiv:2006.03041.
- Melo et al.  Melo, F. S., Meyn, S. P., and Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on Machine learning, pages 664–671. ACM.
- Nemirovski et al.  Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609.
- Puterman  Puterman, M. L. (2014). Markov Decision Processes.: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
- Qu and Wierman  Qu, G. and Wierman, A. (2020). Finite-time analysis of asynchronous stochastic approximation and -learning. Preprint arXiv:2002.00260.
- Robbins and Monro  Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.
- Rudin et al.  Rudin, W. et al. (1964). Principles of mathematical analysis, volume 3. McGraw-hill New York.
- Ryu and Boyd  Ryu, E. K. and Boyd, S. (2016). Primer on monotone operator methods. Appl. Comput. Math, 15(1):3–43.
- Shalev-Shwartz et al.  Shalev-Shwartz, S., Shammah, S., and Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. Preprint arXiv:1610.03295.
- Silver et al.  Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of go without human knowledge. Nature, 550(7676):354.
- Srikant and Ying  Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830.
- Sutton  Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44.
- Sutton and Barto  Sutton, R. S. and Barto, A. G. (2018). Introduction to Reinforcement Learning. MIT Press Cambridge, 2 edition.
- Tesauro et al.  Tesauro, G., Das, R., Chan, H., Kephart, J., Levine, D., Rawson, F., and Lefurgy, C. (2008). Managing power consumption and performance of computing systems using reinforcement learning. In Advances in Neural Information Processing Systems, pages 1497–1504.
- Thoppe and Borkar  Thoppe, G. and Borkar, V. (2019). A concentration bound for stochastic approximation via alekseev’s formula. Stochastic Systems, 9(1):1–26.
- Tsitsiklis  Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and -learning. Machine learning, 16(3):185–202.
- Tsitsiklis and Van Roy  Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-diffference learning with function approximation. In Advances in neural information processing systems, pages 1075–1081.
- Wainwright  Wainwright, M. J. (2019). Stochastic approximation with cone-contractive operators: Sharp -bounds for -learning. Preprint arXiv:1905.06265.
- Watkins and Dayan  Watkins, C. J. and Dayan, P. (1992). -learning. Machine learning, 8(3-4):279–292.