Abstract
Motivated by applications in Reinforcement Learning (RL), in this paper, westudy a nonlinear Stochastic Approximation (SA) algorithm under Markoviannoise, and derive its finitesample convergence bounds. Our proof is based onthe Lyapunov drift arguments, and to handle the Markovian noise, we exploit thefast mixing of the underlying Markov chain. Our result is used to show the finitesample bounds of the popular Qlearningwith linear function approximation algorithm for solving the RL problem. SinceQlearning with linear function approximation may diverge in general, we studyit under a condition on the behavior policy that ensures the stability of thealgorithm. Due to the generality of our SA results, we do not need to make theunnatural assumption that the samples are i.i.d. (since they are Markovian),and do not require an additional projection step in the algorithm to maintainthe boundedness of the iterates.
Quick Read (beta)
FiniteSample Analysis of Nonlinear Stochastic Approximation with Applications in
Reinforcement Learning
Abstract
Motivated by applications in Reinforcement Learning (RL), in this paper, we study a nonlinear Stochastic Approximation (SA) algorithm under Markovian noise, and derive its finitesample convergence bounds. Our proof is based on the Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain.
Our result is used to show the finitesample bounds of the popular $Q$learning with linear function approximation algorithm for solving the RL problem. Since $Q$learning with linear function approximation may diverge in general, we study it under a condition on the behavior policy that ensures the stability of the algorithm. Due to the generality of our SA results, we do not need to make the unnatural assumption that the samples are i.i.d. (since they are Markovian), and do not require an additional projection step in the algorithm to maintain the boundedness of the iterates.
1 Introduction
Reinforcement Learning (RL) is a framework to solve sequential decisionmaking problems by interacting with the environment [30]. This approach has demonstrated tremendous successes for solving many practical problems in several different areas, such as robotics [15], power management [31], autonomous driving [26], and board games [27].
The RL problem can be modeled as a Markov Decision Process (MDP) and so reduces to solving the socalled Bellman’s equation [30, 21]. However, unlike in an MDP, since the environmental model is unknown in RL, such system of equations are usually solved using the Stochastic Approximation (SA) method, where the solution is iteratively estimated using noisy samples collected from the system [4, 23]. A typical example is the popular $Q$learning algorithm proposed in [36], which can be viewed as a nonlinear SA algorithm with martingale difference noise (conditionally mean zero noise). However, when using $Q$learning along with function approximation, the corresponding SA algorithm involves Markovian noise, and cannot be modeled by martingale difference. This motivates one to study the behavior of general nonlinear SA algorithms under Markovian sampling. This paper focuses on establishing finitesample convergence guarantee for such SA algorithms, and apply it to the $Q$learning with linear function approximation algorithm.
In general SA framework, usually the target equation can be written in the form ${\mathbb{E}}_{\mu}[F(X,\theta )]=0$, where $X$ is a random variable with distribution $\mu $ ($\mu $ is unknown), and $F$ is a general nonlinear mapping. The SA algorithm for solving such an equation is then given by: ${\theta}_{k+1}={\theta}_{k}+{\u03f5}_{k}F({X}_{k},{\theta}_{k})$, where $\{{X}_{k}\}$ is a sequence of samples of the random variable $X$, and ${\u03f5}_{k}$ is the step size. Besides their applications in RL, SA algorithms of this form are widely used in optimization. For example, when $F(x,\theta )=c\nabla f(\theta )+x$ for some cost function $f(\theta )$, the SA algorithm is known as the Stochastic Gradient Descent (SGD) method for minimizing $f(\theta )$ [20, 25]. The behavior of such an SA algorithm, critically depends upon the following: (a) The way we collect the sample sequence $\{{X}_{k}\}$, and (b) The properties of the function $F(x,\theta )$.
In existing literature, when the function $F(x,\theta )$ is a linear function of $\theta $, finitesample convergence bounds can be derived for both i.i.d. sampling (${X}_{k}\sim \mu $ being i.i.d.), and Markovian sampling ($\{{X}_{k}\}$ being a Markov chain with stationary distribution $\mu $)[5, 28]. When $F$ is nonlinear, then finitesample bounds are only derived in the case where $F$ represents the stochastic gradient of some convex function $f$. Moreover, unlike i.i.d. sampling, when $\{{X}_{k}\}$ is a Markov chain, an artificial projection (onto a large enough ball) is introduced in the algorithm to ensure that the iterates are bounded [10]. Such a projection step is not practical in RL because one needs to know the problem parameters to pick the projection set so that the solution ${\theta}^{*}$ lies in it. In this paper, we provide finitesample convergence guarantees for general nonlinear stochastic approximation with Markovian noise without needing such an impractical projection step.
Recall that our motivation for studying nonlinear SA with Markovian sampling is to analyze the $Q$learning with linear function approximation algorithm for solving the RL problem. However, unlike tabular $Q$learning, $Q$learning with linear function approximation may in general diverge [1]. Therefore, besides the nonlinearity and the Markovian noise, another fundamental problem is to maintain the stability of the algorithm. In existing literature, it is usually done by placing a requirement on the behavior policy, i.e., the policy used to collect samples [19, 16]. We work with a similar condition in this paper, and provide detailed discussion about our proposed condition. We also empirically show how a wellknown divergent counterexample of $Q$learning with linear function approximation can be made to converge if our proposed condition is satisfied.
The main contribution of this paper are as follows.

•
We derive finitesample convergence guarantee for nonlinear SA with Markovian noise and without needing an artificial projection step. Then we study its behavior under several common choices of step sizes. The idea is to apply a suitable Lyapunov function on the stochastic iterates, and show that in expectation it produces a negative drift. To handle the nonlinearity, we use the Lipchitz continuity of $F$ to show that $\parallel F(x,\theta )\parallel $ can be bounded affinely by $\parallel \theta \parallel $. To handle the Markovian noise, we exploit the geometric mixing of the underlying Markov chain to control the cumulative bias caused by the Markovian sampling.

•
We apply our results on SA to the $Q$learning with linear function approximation algorithm in RL. Since it is known to diverge, we study it under a condition that involves the behavior policy, feature vectors and the discount factor. We show that when using constant step sizes, $Q$learning achieves an exponential convergence rate to a ball centered at the solution of the socalled projected Bellman’s equation, and the size of the ball is proportional to the step size and the mixing time of the underlying Markov chain. When using diminishing step sizes with a proper decay rate, we show that the convergence rate is roughly $O(\mathrm{ln}(k)/k)$. Moreover, we provide detailed discussion for our proposed condition on the behavior policy.

•
We present an empirical case study of our theoretical convergent results using numerical experiments on a wellknown divergent counterexample of $Q$learning from [1]. In particular, we show that if our condition is satisfied, the algorithm converges, and the rates match the theoretical results we derived.
1.1 Related Work
Stochastic approximation methods, originally proposed in [23], have been widely used in reinforcement learning for solving the Bellman’s equation. Typical examples are the TDlearning algorithm for solving the policy evaluation problem [29], and the $Q$learning algorithm for estimating the optimal stateaction value function [36]. In the tabular case (i.e., without using function approximation), they both can be modeled as an SA algorithm involving a contraction operator and martingale difference noise, and their asymptotic convergence has been established in [33, 13, 7, 6]. As for the finitesample analysis, [2, 8, 35] studies the mean square error bounds of synchronous $Q$learning, and [11, 22, 18] provides concentration results for asynchronous $Q$learning.
When using RL algorithms along with linear function approximation, the noise cannot be modeled as a martingale difference, and one should directly study Markovian noise. In the case of TDlearning with linear function approximation, which is a linear SA, asymptotic convergence was established in [34] using an Ordinary Differential Equation (ODE) approach [3]. As for the finitesample convergence bounds, it was derived in [5] for both i.i.d. sampling and Markovian sampling. However, in the latter case, an additional projection step (onto a ball) is needed for the proof. Later in [28], this artificial projection step was successfully removed. When using $Q$learning along with linear function approximation, it was shown in [1] that $Q$learning does not necessarily converge. Its asymptotic convergence was proved in [19, 16] under an additional condition on the behavior policy. Under a similar condition, we derive its finitesample bounds by exploiting the fast mixing of finite state Markov chains, which was a technique developed in [28] to study the convergence rate of linear SA with Markovian noise. However, note that $Q$learning is a nonlinear SA even under linear function approximation. To establish the results, some natural properties of $Q$learning and our stability condition on the behavior policy play important roles in our analysis.
2 Nonlinear Stochastic Approximation with Markovian Noise
2.1 Problem Setting
We are interested in the problem of solving for ${\theta}^{*}\in {\mathbb{R}}^{d}$ in the equation
$\overline{F}(\theta ):={\mathbb{E}}_{\mu}[F(X,\theta )]=0,$  (1) 
where $X\in \mathcal{X}$ is a random vector with distribution $\mu $, and $\mathcal{X}$ is a finite set. The function $F:\mathcal{X}\times {\mathbb{R}}^{d}\mapsto {\mathbb{R}}^{d}$ is a general nonlinear mapping. When the distribution $\mu $ of $\mathcal{X}$ is unknown, Eq. (1) cannot be solved analytically. Therefore, we consider solving the equation by appealing to the stochastic approximation method proposed in [23]. Specifically, suppose that we can collect a sequence of samples $\{{X}_{k}\}$ of the random vector $X$. Then, with initialization ${\theta}_{0}$, the estimate ${\theta}_{k}$ of ${\theta}^{*}$ is iteratively updated by
${\theta}_{k+1}={\theta}_{k}+{\u03f5}_{k}F({X}_{k},{\theta}_{k}).$  (2) 
Here $\{{\u03f5}_{k}\}$ is the step size sequence which we can choose.
To analyze the behavior of Algorithm (2), one popular approach is to compare the iterates $\{{\theta}_{k}\}$ with the trajectory of the following ODE:
$\dot{\theta}(t)=\overline{F}(\theta (t)).$  (3) 
Under some mild conditions on the sample sequence $\{{X}_{k}\}$ and the step sizes $\{{\u03f5}_{k}\}$, it was shown in [7, 6] that ${\theta}_{k}$ converges to ${\theta}^{*}$ almost surely as long as the ODE (3) is globally asymptotically stable (GAS), with ${\theta}^{*}$ being its unique equilibrium point. Such a stability property of the ODE is usually analyzed by constructing a Lyapunov function $V(\theta )$, and studying its time derivative along the trajectory of the ODE.
Our goal is to obtain finitesample bounds for the stochastic iterative algorithm (2). A natural approach is to use the ODE Lyapunov function to study the iterates of Algorithm (2). However, since Algorithm (2) is a discrete and stochastic counterpart of ODE (3), a major challenge is to handle the error caused by the discretization and the noise $\{{X}_{k}\}$.
2.2 Assumptions
Recall that to study the stochastic approximation algorithm (2), the properties of the function $F(x,\theta )$ and the nature of the noise sequence $\{{X}_{k}\}$ are important. We next present our main assumptions regarding these two aspects. Let $\parallel \cdot \parallel $ be the ${\mathrm{\ell}}_{2}$norm for vectors and induced $2$norm for matrices.
Assumption 2.1.
The function $F\mathit{}\mathrm{(}x\mathrm{,}\theta \mathrm{)}$ is Lipschitz continuous with respect to $\theta $ uniformly in $x$, i.e., there exists $L\mathrm{>}\mathrm{0}$ such that $\mathrm{\parallel}F\mathit{}\mathrm{(}x\mathrm{,}{\theta}_{\mathrm{1}}\mathrm{)}\mathrm{}F\mathit{}\mathrm{(}x\mathrm{,}{\theta}_{\mathrm{2}}\mathrm{)}\mathrm{\parallel}\mathrm{\le}L\mathit{}\mathrm{\parallel}{\theta}_{\mathrm{1}}\mathrm{}{\theta}_{\mathrm{2}}\mathrm{\parallel}$ for all ${\theta}_{\mathrm{1}}$, ${\theta}_{\mathrm{2}}$, and $x$. Moreover, we assume that $L\mathrm{\ge}{\mathrm{max}}_{x\mathrm{\in}\mathrm{X}}\mathit{}\mathrm{\parallel}F\mathit{}\mathrm{(}x\mathrm{,}\mathrm{0}\mathrm{)}\mathrm{\parallel}$.
In the special case where $F(x,\theta )$ is a linear function of $\theta $ as considered in [5, 28], i.e., $F(x,\theta )=A(x)\theta +b(x)$, since the statespace $\mathcal{X}$ is finite, Assumption 2.1 is automatically satisfied. Note that Assumption 2.1 also implies the Lipschitz continuity of the function $\overline{F}$, which is sufficient to study the nonlinear dynamical system (3), e.g., it guarantees that ODE (3) has a unique solution.
Although $F(x,\theta )$ is in general a nonlinear function of $\theta $, Assumption 2.1 restricts that the growth rate of both $\parallel F(x,\theta )\parallel $ and $\parallel \overline{F}(\theta )\parallel $ can at most be affine in terms of $\parallel \theta \parallel $. To see this, let ${\theta}_{1}=\theta $ and ${\theta}_{2}=0$ in Assumption 2.1. Then we have by triangle inequality and our assumption $L\ge {\mathrm{max}}_{x\in \mathcal{X}}\parallel F(x,0)\parallel $ that
$\parallel F(x,\theta )\parallel \le L\parallel \theta \parallel +\parallel F(x,0)\parallel \le L(\parallel \theta \parallel +1).$  (4) 
Moreover, using Jensen’s inequality and we further obtain
$\parallel \overline{F}(\theta )\parallel \le {\mathbb{E}}_{\mu}[\parallel F(X,\theta )\parallel ]\le L(\parallel \theta \parallel +1).$  (5) 
These properties for $F(x,\theta )$ and $\overline{F}(\theta )$ essentially let us prove a finitesample convergence bound akin to the case when $F(x,\theta )$ is a linear function of $\theta $.
Assumption 2.2.
The equation $\overline{F}\mathit{}\mathrm{(}\theta \mathrm{)}\mathrm{=}\mathrm{0}$ has a unique solution, denoted by ${\theta}^{\mathrm{*}}$, and there exists $\alpha \mathrm{>}\mathrm{0}$ such that ${\mathrm{(}\theta \mathrm{}{\theta}^{\mathrm{*}}\mathrm{)}}^{\mathrm{\top}}\mathit{}\overline{F}\mathit{}\mathrm{(}\theta \mathrm{)}\mathrm{\le}\mathrm{}\alpha \mathit{}{\mathrm{\parallel}\theta \mathrm{}{\theta}^{\mathrm{*}}\mathrm{\parallel}}^{\mathrm{2}}$ for all $\theta \mathrm{\in}{\mathrm{R}}^{d}$.
Assumption 2.2 is satisfied when $F$ is a strongly monotone function [25]. Also, it can be viewed as an exponential dissipativeness property of the ODE (3) with a quadratic storage function [12]. In fact, this assumption guarantees that ${\theta}^{*}$ is the unique globally exponentially stable (GES) equilibrium point of ODE (3). To see this, let $V(\theta )={\parallel \theta {\theta}^{*}\parallel}^{2}$ be a candidate Lyapunov function. Then we have by Assumption 2.2 that
$\frac{d}{dt}}V(\theta (t))=2{(\theta (t){\theta}^{*})}^{\top}\dot{\theta}(t)\le 2\alpha V(\theta (t)),$  (6) 
which implies $V(\theta (t))\le V(\theta (0)){e}^{2\alpha t}$ for all $t\ge 0$. The quantity $\alpha $ is called the negative drift, and we see that the larger $\alpha $ is, the faster $\theta (t)$ converges.
Assumption 2.3.
The sequence $\mathrm{\{}{X}_{k}\mathrm{\}}$ is an irreducible and aperiodic Markov chain on state space $\mathrm{X}$, with unique stationary distribution $\mu $.
Assumption 2.3 is often made to study the asymptotic convergence of stochastic approximation under Markovian noise when the state space is finite; see for examples [6, 34, 9]. When compared to the case where $\{{X}_{k}\}$ is i.i.d., the main difference for $\{{X}_{k}\}$ being Markovian is that there is a bias in the update of Algorithm (2), i.e., $\mathbb{E}[F({X}_{k},\theta )\mid {X}_{0}=x]\ne \overline{F}(\theta )$. As will be seen later, Assumption 2.3 enables us to control the bias and show that it is not strong enough to alter the desired direction of the update.
In addition to these assumptions, the choice of the step size sequence $\{{\u03f5}_{k}\}$ is important. In order to state certain conditions on the step sizes we pick, we need to use the mixing time of the Markov chain $\{{X}_{k}\}$. We begin with the definition of the total variation distance between probability distributions, which is used to introduce the mixing time.
Definition 2.1.
The total variation distance between two probability distributions ${\nu}_{\mathrm{1}}$ and ${\nu}_{\mathrm{2}}$ on $\mathrm{X}$ is defined by ${\mathrm{\parallel}{\nu}_{\mathrm{1}}\mathrm{}{\nu}_{\mathrm{2}}\mathrm{\parallel}}_{\text{\mathit{T}\mathit{V}}}\mathrm{=}\frac{\mathrm{1}}{\mathrm{2}}\mathit{}{\mathrm{\sum}}_{x\mathrm{\in}\mathrm{X}}\mathrm{}{\nu}_{\mathrm{1}}\mathit{}\mathrm{(}x\mathrm{)}\mathrm{}{\nu}_{\mathrm{2}}\mathit{}\mathrm{(}x\mathrm{)}\mathrm{}$.
Denote ${P}^{k}(x,\cdot )$ as the distribution of ${X}_{k}$ with initial state $x$, and let
${d}_{\mathrm{max}}(k):=\underset{x\in \mathcal{X}}{\mathrm{max}}{\parallel {P}^{k}(x,\cdot )\mu \parallel}_{\text{TV}}$ 
be the maximal distance between ${P}^{k}(x,\cdot )$ and the stationary distribution $\mu $. Then the mixing time of the Markov chain $\{{X}_{k}\}$ with precision $\delta $ is defined by ${t}_{\text{mix}}(\delta )=\mathrm{min}\{k:{d}_{\mathrm{max}}(k)\le \delta \}$. Intuitively, smaller mixing time means faster convergence of the Markov chain $\{{X}_{k}\}$ to its stationary distribution. Consider the extreme case where $\{{X}_{k}\}$ is i.i.d., we have ${t}_{\text{mix}}(\delta )=1$ for any $\delta >0$.
Let ${t}_{\delta}={t}_{\text{mix}}(\frac{\delta}{2L})$, where $L$ is the Lipschitz constant given in Assumption 2.1. The mixing time ${t}_{\delta}$ is defined this way so that it accounts for both the Markov chain $\{{X}_{k}\}$ and the function $F(x,\theta )$ (through its Lipschitz constant). We will later use ${t}_{\delta}$ to state our requirement on the step size sequence $\{{\u03f5}_{k}\}$. Before that, we need the following property regarding ${t}_{\delta}$, whose proof is presented in Appendix A.1.
Claim 2.1.
There exists a constant ${L}_{\mathrm{1}}\mathrm{>}\mathrm{0}$ that depends only on the Markov chain $\mathrm{\{}{X}_{k}\mathrm{\}}$ and the Lipschitz constant $L$ such that ${t}_{\delta}\mathrm{\le}{L}_{\mathrm{1}}\mathit{}\mathrm{(}\mathrm{ln}\mathit{}\mathrm{(}\mathrm{1}\mathrm{/}\delta \mathrm{)}\mathrm{+}\mathrm{1}\mathrm{)}$.
Claim 2.1 states that the mixing time ${t}_{\delta}$ grows at most affinely in terms of $\mathrm{ln}(1/\delta )$, which further implies ${lim}_{\delta \to 0}\delta {t}_{\delta}=0$. Since we will always choose $\delta ={\u03f5}_{k}$, where ${\u03f5}_{k}$ is the step size, for simplicity, we will write ${t}_{k}$ for ${t}_{{\u03f5}_{k}}$ in the following. Now we state our condition on the step sizes $\{{\u03f5}_{k}\}$. For simplicity of notations, denote ${\u03f5}_{i,j}={\sum}_{k=i}^{j}{\u03f5}_{k}$.
Condition 2.1.
The step size sequence $\mathrm{\{}{\u03f5}_{k}\mathrm{\}}$ satisfies the following conditions.

(1)
The step size sequence $\{{\u03f5}_{k}\}$ is nonincreasing and ${\u03f5}_{0}\in (0,1)$.

(2)
There exists a minimal integer $K>0$ such that ${\u03f5}_{0,K1}\le \frac{1}{4L}$ and $$ for all $k\ge K$.
Condition 2.1 (1) is standard [4]. Condition 2.1 (2) is mainly used to control the discretization error and the stochastic error in Algorithm (2). In fact, since ${lim}_{\u03f5\to 0}\u03f5{t}_{\u03f5}=0$ under Claim 2.1, we can show that Condition 2.1 (2) is satisfied when using either small enough constant step sizes or diminishing step sizes of the form ${\u03f5}_{k}=\frac{\u03f5}{{(k+h)}^{\xi}}$ for all $\u03f5>0$, and $\xi \in (0,1]$, provided that $h$ is chosen properly.
2.3 FiniteSample Convergence Bounds for Nonlinear SA
In this subsection, we state our main results. We begin with the finitesample convergence bounds of Algorithm (2), whose proof is presented in Subsection 2.4.
Theorem 2.1.
Consider $\mathrm{\{}{\theta}_{k}\mathrm{\}}$ of Algorithm (2). Suppose that Assumptions 2.1 – 2.3 are satisfied, and $\mathrm{\{}{\u03f5}_{k}\mathrm{\}}$ satisfies Condition 2.1. Then we have for all $k\mathrm{\ge}K$:
$\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]\le {\beta}_{1}{\displaystyle \prod _{j=K}^{k1}}(1\alpha {\u03f5}_{j})+{\beta}_{2}{\displaystyle \sum _{i=K}^{k1}}{\widehat{\u03f5}}_{i}{\displaystyle \prod _{j=i+1}^{k1}}(1\alpha {\u03f5}_{j}),$  (7) 
where ${\beta}_{\mathrm{1}}\mathrm{=}{\mathrm{(}\mathrm{\parallel}{\theta}_{\mathrm{0}}\mathrm{\parallel}\mathrm{+}\mathrm{\parallel}{\theta}_{\mathrm{0}}\mathrm{}{\theta}^{\mathrm{*}}\mathrm{\parallel}\mathrm{+}\mathrm{1}\mathrm{)}}^{\mathrm{2}}$, ${\beta}_{\mathrm{2}}\mathrm{=}\mathrm{114}\mathit{}{L}^{\mathrm{2}}\mathit{}{\mathrm{(}\mathrm{\parallel}{\theta}^{\mathrm{*}}\mathrm{\parallel}\mathrm{+}\mathrm{1}\mathrm{)}}^{\mathrm{2}}$, and ${\widehat{\u03f5}}_{i}\mathrm{=}{\u03f5}_{i}\mathit{}{\u03f5}_{i\mathrm{}{t}_{i}\mathrm{,}i\mathrm{}\mathrm{1}}$.
On the righthand side of Eq. (7), the first term represents the bias due to the initial guess ${\theta}_{0}$, and the second term captures the variance due to the noise. We will see later that the Markovian nature of the noise mainly impacts the second term.
Theorem 2.1 is one of our main contribution in that (1) The function $F(x,\theta )$ is allowed to be nonlinear, (2) It holds when $\{{X}_{k}\}$ is sampled in a Markovian manner, and (3) No modifications on Algorithm (2) (e.g., adding a projection step) is needed to establish the result.
After establishing the finitesample bound of Algorithm (2) in its general form, we next consider several common choices of step sizes, and see more explicitly what are the corresponding convergence rates. We begin by presenting the result for using constant step sizes, whose proof can be found in Appendix A.2.
Corollary 2.1.
Let ${\u03f5}_{k}\mathrm{=}\u03f5$ for all $k\mathrm{\ge}\mathrm{0}$, where $\u03f5$ is chosen such that $\u03f5\mathit{}{t}_{\u03f5}\mathrm{\le}\frac{\alpha}{\mathrm{114}\mathit{}{L}^{\mathrm{2}}}$. Then we have for all $k\mathrm{\ge}{t}_{\u03f5}$:
$\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]\le {\beta}_{1}{(1\alpha \u03f5)}^{k{t}_{\u03f5}}+{\beta}_{2}\u03f5{t}_{\u03f5}/\alpha .$ 
We see from Corollary 2.1 that when using constant step sizes, the finitesample bound is composed by two terms. The first term converges to zero exponentially fast as the number of iterations tends to infinity, while the second term is a constant which is independent of $k$. Geometrically, this means that the iterates of Algorithm (2) achieves an exponential convergence rate in expectation to a ball containing ${\theta}^{*}$, with radius proportional to $\u03f5{t}_{\u03f5}$. Intuitively, constant step sizes efficiently eliminate the bias. However, since the noise is constantly added to the iterates without being progressively suppressed, the variance does not converge to zero.
We next consider using diminishing step sizes in Algorithm (2). Let ${\u03f5}_{k}=\frac{\u03f5}{{(k+h)}^{\xi}}$, where $\u03f5>0$, $\xi \in (0,1]$, and $h$ is chosen properly so that Condition 2.1 is satisfied for some $K>0$. The verification of Condition 2.1 and the proof of the following result can be found in Appendix A.3.
Corollary 2.2.
Suppose $\mathrm{\{}{\u03f5}_{k}\mathrm{\}}$ is chosen as above, then we have the following finitesample bounds.

(1)
When $\xi =1$, we have for all $k\ge K$:
$\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]\le \{\begin{array}{cc}{\beta}_{1}{\left({\displaystyle \frac{K+h}{k+h}}\right)}^{\alpha \u03f5}+{\displaystyle \frac{8{\beta}_{2}{\u03f5}^{2}{L}_{1}}{1\alpha \u03f5}}{\displaystyle \frac{[\mathrm{ln}\left(\frac{k+h}{\u03f5}\right)+1]}{{(k+h)}^{\alpha \u03f5}}},\hfill & \alpha \u03f5\in (0,1),\hfill \\ {\beta}_{1}\left({\displaystyle \frac{K+h}{k+h}}\right)+8{\beta}_{2}{\u03f5}^{2}{L}_{1}{\displaystyle \frac{\mathrm{ln}(\frac{k+h}{K+h})[\mathrm{ln}\left(\frac{k+h}{\u03f5}\right)+1]}{k+h}},\hfill & \alpha \u03f5=1,\hfill \\ {\beta}_{1}{\left({\displaystyle \frac{K+h}{k+h}}\right)}^{\alpha \u03f5}+{\displaystyle \frac{8e{\beta}_{2}{\u03f5}^{2}{L}_{1}}{\alpha \u03f51}}{\displaystyle \frac{\left[\mathrm{ln}\left(\frac{k+h}{\u03f5}\right)+1\right]}{k+h}},\hfill & \alpha \u03f5\in (1,\mathrm{\infty}).\hfill \end{array}$ 
(2)
When $\xi \in (0,1)$ and $\u03f5>0$, suppose in addition that $K+h\ge {[2\xi /(\alpha \u03f5)]}^{1/(1\xi )}$, then we have for all $k\ge K$:
$\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]\le {\beta}_{1}\mathrm{exp}\left[{\displaystyle \frac{\alpha \u03f5}{1\xi}}\left({(k+h)}^{1\xi}{(K+h)}^{1\xi}\right)\right]+{\displaystyle \frac{4{\beta}_{2}{\u03f5}^{2}{L}_{1}}{\alpha \u03f5}}{\displaystyle \frac{[\mathrm{ln}\left(\frac{k+h}{\u03f5}\right)+1]}{{(k+h)}^{\xi}}}.$
From Corollary 2.2 (1), we see that when using $\frac{\u03f5}{k+h}$ step size, the constant $\u03f5$ must be chosen very carefully (i.e., $\u03f5>1/\alpha $) to achieve the optimal $O(\mathrm{ln}(k)/k)$ convergence rate, otherwise the convergence rate is roughly $O(\mathrm{ln}(k)/{k}^{\alpha \u03f5})$, which can be arbitrarily slow. In the case when $\xi \in (0,1)$, the convergence rate is $O(\mathrm{ln}(k)/{k}^{\xi})$, and it is independent of $\u03f5$.
The above analysis indicates that our choice of step size should depend on how precise our estimate of the negative drift parameter $\alpha $ is. When our estimate to $\alpha $ is accurate, we should use ${\u03f5}_{k}=\frac{\u03f5}{k+h}$ with $\u03f5>1/\alpha $ so that the convergence rate is the optimal $O(\mathrm{ln}(k)/k)$. When our understanding to the system model is poor (therefore inaccurate estimate of $\alpha $), we should use ${\u03f5}_{k}=\frac{\u03f5}{{(k+h)}^{\xi}}$ with $\xi $ chosen close to unity. In that case, we sacrifice the convergence rate for robustness.
Unlike almost sure convergence, where the usual requirement for step sizes are ${\sum}_{k=0}^{\mathrm{\infty}}{\u03f5}_{k}=\mathrm{\infty}$ and $$ (which corresponds to $\xi \in (1/2,1]$ in our case), we have convergence in mean square error for all $\xi \in (0,1]$. Same phenomenon has been observed in [5, 32], where they study linear SA and nonlinear SA with martingale difference noise.
2.4 Proof of Theorem 2.1
This subsection is dedicated to the proof of Theorem 2.1. Before going into details, we first provide some intuition.
Recall that the Lyapunov function $V(\theta )={\parallel \theta {\theta}^{*}\parallel}^{2}$ can be used to show that ${\theta}^{*}$ is the unique GES equilibrium point of ODE (3). To analyze the convergence rate of the iterates $\{{\theta}_{k}\}$ generated by Algorithm (2), naturally we want to use the Lyapunov function $V(\cdot )$ on $\{{\theta}_{k}\}$ to show something like
$\mathbb{E}[V({\theta}_{k+1})]\mathbb{E}[V({\theta}_{k})]\le (\alpha {\u03f5}_{k}+{e}_{1})\mathbb{E}[V({\theta}_{k})]+{e}_{2}.$  (8) 
Here on the righthand side of Eq. (8), the $\alpha {\u03f5}_{k}$ term corresponds to the negative drift of the ODE, which is provided by Assumption 2.2, and the two terms ${e}_{1}$ and ${e}_{2}$ account for the discretization error and the stochastic error in Algorithm (2). The discretization error can be handled using the properties of the function $F(x,\theta )$ (Assumption 2.1) and properly chosen step sizes (Condition 2.1). As for the stochastic error, since Markovian noise natually produces bias in the update, we need to show that $\mathbb{E}[F({X}_{k},\theta )\mid {X}_{0}=x]$ converges to $\overline{F}(\theta )$ fast enough for any $\theta $, which implies that the overall cumulative bias is small. This result is provided by Assumption 2.3 and Claim 2.1. Once we show that both error terms are dominated by the drift term, i.e., ${e}_{1}=o({\u03f5}_{k})$ and ${e}_{2}=o({\u03f5}_{k})$, Eq. (8) can be recursively used to establish a finitesample error bound of Algorithm (2).
Following the high level idea we stated above, we now prove Theorem 2.1. The proofs of all the technical lemmas we used can be found in the Appendices A.4 – A.8.
To begin with, we apply $V(\theta )={\parallel \theta {\theta}^{*}\parallel}^{2}$ on the iterates ${\theta}_{k}$ of Algorithm (2). Moreover, to utilize the mixing time of the Markov chain, we take expectation conditioning on ${X}_{k{t}_{k}}$ and ${\theta}_{k{t}_{k}}$. For simplicity of notations, we use ${\mathbb{E}}_{k}[\cdot ]$ for $\mathbb{E}[\cdot \mid {X}_{k{t}_{k}},{\theta}_{k{t}_{k}}]$ in the following. Using the update rule (2), we have for all $k\ge {t}_{k}$:
${\mathbb{E}}_{k}[{\parallel {\theta}_{k+1}{\theta}^{*}\parallel}^{2}]{\mathbb{E}}_{k}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]$  
$=$  $2{\mathbb{E}}_{k}[\u27e8{\theta}_{k}{\theta}^{*},{\theta}_{k+1}{\theta}_{k}\u27e9]+{\mathbb{E}}_{k}[{\parallel {\theta}_{k+1}{\theta}_{k}\parallel}^{2}]$  
$=$  $\underset{(a)}{\underset{\u23df}{2{\u03f5}_{k}{\mathbb{E}}_{k}[{({\theta}_{k}{\theta}^{*})}^{\top}\overline{F}({\theta}_{k})]}}+\underset{(b)}{\underset{\u23df}{2{\u03f5}_{k}{\mathbb{E}}_{k}[{({\theta}_{k}{\theta}^{*})}^{\top}(F({X}_{k},{\theta}_{k})\overline{F}({\theta}_{k}))]}}$  
$+\underset{(c)}{\underset{\u23df}{{\u03f5}_{k}^{2}{\mathbb{E}}_{k}[{\parallel F({X}_{k},{\theta}_{k})\parallel}^{2}]}}.$  (9) 
The term $(a)$ corresponds to the drift of the ODE (3), and we have by Assumption 2.2 that
$(a)\le 2\alpha {\u03f5}_{k}{\mathbb{E}}_{k}\left[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}\right].$ 
The terms $(b)$ and $(c)$ correspond to the stochastic error and the discretization error respectively. What remains to show is that they are dominated by the term $(a)$. We begin by bounding the term $(c)$ in the following lemma.
Lemma 2.1.
$(c)\le 2{L}^{2}{\u03f5}_{k}^{2}\left[{\mathbb{E}}_{k}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]+{(\parallel {\theta}^{*}\parallel +1)}^{2}\right]$ for all $k\mathrm{\ge}{t}_{k}$.
Observe that Lemma 2.1 implies that $(c)=O({\u03f5}_{k}^{2})=o({\u03f5}_{k})$, which is what we desire. We next turn to the term $(b)$, to control it, we need the following two results.
Lemma 2.2.
Given precision $\delta \mathrm{>}\mathrm{0}$, the following inequality holds for all $x\mathrm{\in}\mathrm{X}$, $\theta \mathrm{\in}{\mathrm{R}}^{d}$, and $k\mathrm{\ge}{t}_{\delta}$: $\mathrm{\parallel}\mathrm{E}\mathrm{[}F\mathrm{(}{X}_{k}\mathrm{,}\theta \mathrm{)}\mathrm{\mid}{X}_{\mathrm{0}}\mathrm{=}x\mathrm{]}\mathrm{}\overline{F}\mathrm{(}\theta \mathrm{)}\mathrm{\parallel}\mathrm{\le}\delta \mathrm{(}\mathrm{\parallel}\theta \mathrm{\parallel}\mathrm{+}\mathrm{1}\mathrm{)}$.
Lemma 2.3.
For any $$ satisfying ${\u03f5}_{{k}_{\mathrm{1}}\mathrm{,}{k}_{\mathrm{2}}\mathrm{}\mathrm{1}}\mathrm{\le}\frac{\mathrm{1}}{\mathrm{4}\mathit{}L}$, the following two inequalities hold:
$\parallel {\theta}_{{k}_{2}}{\theta}_{{k}_{1}}\parallel $  $\le 2L{\u03f5}_{{k}_{1},{k}_{2}1}(\parallel {\theta}_{{k}_{1}}\parallel +1),$  
$\parallel {\theta}_{{k}_{2}}{\theta}_{{k}_{1}}\parallel $  $\le 4L{\u03f5}_{{k}_{1},{k}_{2}1}(\parallel {\theta}_{{k}_{2}}\parallel +1).$ 
Lemma 2.2 utilizes the mixing time ${t}_{\delta}={t}_{\text{mix}}(\frac{\delta}{2L})$ to show that the bias in the update of Algorithm (2) is exponentially small after the Markov chain is sufficiently mixed. Lemma 2.3 controls the difference between ${\theta}_{{k}_{1}}$ and ${\theta}_{{k}_{2}}$ when ${k}_{2}{k}_{1}$ is not too large. With the help of Lemmas 2.2 and 2.3, we are now ready to bound the term $(b)$ in the following lemma.
Lemma 2.4.
The following inequality holds for all $k$ such that ${\u03f5}_{k\mathrm{}{t}_{k}\mathrm{,}k\mathrm{}\mathrm{1}}\mathrm{\le}\frac{\mathrm{1}}{\mathrm{4}\mathit{}L}$:
$(b)\le 112{L}^{2}{\u03f5}_{k}{\u03f5}_{k{t}_{k},k1}\left[{\mathbb{E}}_{k}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]+{(\parallel {\theta}^{*}\parallel +1)}^{2}\right].$ 
Substituting the upper bounds we obtained for the terms $(a)$, $(b)$, and $(c)$ into Eq. (9), we obtain the following recursive relation between ${\theta}_{k+1}$ and ${\theta}_{k}$.
Lemma 2.5.
The following inequality holds for all $k$ such that ${\u03f5}_{k\mathrm{}{t}_{k}\mathrm{,}k\mathrm{}\mathrm{1}}\mathrm{\le}\frac{\mathrm{1}}{\mathrm{4}\mathit{}L}$:
$\mathbb{E}[{\parallel {\theta}_{k+1}{\theta}^{*}\parallel}^{2}]\le (12\alpha {\u03f5}_{k}+114{L}^{2}{\u03f5}_{k}{\u03f5}_{k{t}_{k},k1})\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]+114{L}^{2}{\u03f5}_{k}{\u03f5}_{k{t}_{k},k1}{(\parallel {\theta}^{*}\parallel +1)}^{2}.$  (10) 
Eq. (10) is of the desired recursive form (8). Therefore, as long as the drift term dominates the error terms, i.e., $2\alpha {\u03f5}_{k}>114{L}^{2}{\u03f5}_{k}{\u03f5}_{k{t}_{k},k1}$, we can use Eq. (10) to derive a finitesample error bound of Algorithm (2).
Proof of Theorem 2.1.
When $k\ge K$ (see Condition 2.1 for the definition of $K$), we have by Eq. (10) that $\mathbb{E}[{\parallel {\theta}_{k+1}{\theta}^{*}\parallel}^{2}]\le (1\alpha {\u03f5}_{k})\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]+{\beta}_{2}{\widehat{\u03f5}}_{k}$, where ${\widehat{\u03f5}}_{k}$ and ${\beta}_{2}$ are defined in Theorem 2.1. Recursively using the preceding inequality starting from $K$, we obtain
$\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]\le \mathbb{E}[{\parallel {\theta}_{K}{\theta}^{*}\parallel}^{2}]{\displaystyle \prod _{j=K}^{k1}}(1\alpha {\u03f5}_{j})+{\beta}_{2}{\displaystyle \sum _{i=K}^{k1}}{\widehat{\u03f5}}_{i}{\displaystyle \prod _{j=i+1}^{k1}}(1\alpha {\u03f5}_{j}).$  (11) 
To bound $\mathbb{E}[{\parallel {\theta}_{K}{\theta}^{*}\parallel}^{2}]$, using Lemma 2.3, and our assumption that ${\u03f5}_{0,K1}\le \frac{1}{4L}$, we have $\mathbb{E}[{\parallel {\theta}_{K}{\theta}^{*}\parallel}^{2}]\le \mathbb{E}[{(\parallel {\theta}_{K}{\theta}_{0}\parallel +\parallel {\theta}^{*}{\theta}_{0}\parallel )}^{2}]\le {\beta}_{1}$, which gives the desired finitesample bound (7). ∎
In summary, we have stated and proved a finitesample convergence bound for stochastic approximation algorithm (2). Moreover, we analyzed the convergence rates for using both constant and diminishing step sizes. In the following section, we will use these results to establish the convergence rates of the $Q$learning with linear function approximation algorithm for solving the Reinforcement Learning problem.
3 Applications in Reinforcement Learning
3.1 $Q$Learning with Linear Function Approximation
We begin by introducing the Markov Decision Processes and the Reinforcement Learning problem.
Definition 3.1 (Markov Decision Processes).
An infinite horizon discounted Markov decision process $\mathrm{M}$ is comprised by a tuple $\mathrm{(}\mathrm{S}\mathrm{,}\mathrm{A}\mathrm{,}P\mathrm{,}\mathrm{R}\mathrm{,}\gamma \mathrm{)}$, where $\mathrm{S}$ is a finite set of states, $\mathrm{A}$ is a finite set of actions, ${P}_{a}\mathit{}\mathrm{(}s\mathrm{,}{s}^{\mathrm{\prime}}\mathrm{)}$ stands for the probability of transition from state $s$ to state ${s}^{\mathrm{\prime}}$ under action $a$, $\mathrm{R}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}$ is the reward when taking action $a$ at state $s$, and $\gamma \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$ is the discount factor.
We assume that the reward is nonnegative and is uniformly bounded such that $\mathcal{R}(s,a)\in [0,{r}_{\mathrm{max}}]$ for all stateaction pairs $(s,a)$. The underlying model of the Reinforcement Learning problem is essentially an MDP except that the transition probabilities and the reward function are unknown to the agent.
The goal of the agent is to find a strategy (called policy) of selecting actions based on the state of the environment such that the expected longterm discounted reward is maximized. More precisely, define the stateaction value function (also known as the $Q$function) of a policy $\pi $ at stateaction pairs $(s,a)$ by
${Q}_{\pi}(s,a)={\mathbb{E}}_{\pi}\left[{\displaystyle \sum _{k=0}^{\mathrm{\infty}}}{\gamma}^{k}\mathcal{R}({s}_{k},{a}_{k})\right{s}_{0}=s,{a}_{0}=a],\forall (s,a)\in \mathcal{S}\times \mathcal{A},$ 
where ${\{{a}_{k}\}}_{k\ge 1}$ is executed according to policy $\pi $. In words, ${Q}_{\pi}(s,a)$ is the expected total reward the agent would receive if it starts at state $s$, first taking action $a$, and then following the policy $\pi $ thereafter. We want to find an optimal policy ${\pi}^{*}$ such that its corresponding $Q$function, denote by ${Q}^{*}$, is such that ${Q}^{*}(s,a)\ge {Q}_{\pi}(s,a)$ for all stateaction pairs $(s,a)$ and all policy $\pi $. A fundamental property of the function ${Q}^{*}$ is that, if one just select actions greedy based on ${Q}^{*}$, then that is an optimal policy, i,e, ${\pi}^{*}(s)\in \mathrm{arg}{\mathrm{max}}_{a\in \mathcal{A}}{Q}^{*}(s,a)$ for all state $s$. Therefore, solving the RL problem reduces to finding the optimal $Q$function.
The $Q$learning algorithm proposed in [36] is a popular approach for estimating the optimal $Q$function. However, it is wellknown that a major challenge in the $Q$learning algorithm is the curse of dimensionality. That is, $Q$learning becomes intractable when the number of stateaction pairs is very large. Therefore, it was suggested that we approximate the optimal $Q$function within a chosen function space which has much smaller dimension than $\mathcal{S}\mathcal{A}$ [4]. We next describe the approximation architecture.
Let $\{{\varphi}_{i}:\mathcal{S}\times \mathcal{A}\mapsto \mathbb{R}\mid 1\le i\le d\}$ be a set of basis functions (called features). Define the feature matrix $\mathrm{\Phi}\in {\mathbb{R}}^{\mathcal{S}\mathcal{A}\times d}$ by $\mathrm{\Phi}=[{\varphi}_{1},{\varphi}_{2},\mathrm{\dots},{\varphi}_{d}]$. Then the linear subspace $\mathcal{W}$ spanned by the features $\{{\varphi}_{i}\}$ can be written as $\mathcal{W}=\{{\stackrel{~}{Q}}_{\theta}=\mathrm{\Phi}\theta \mid \theta \in {\mathbb{R}}^{d}\}$. We will use $\mathcal{W}$ as our approximating function space, and the goal here is to find ${\theta}^{*}$ such that ${\stackrel{~}{Q}}_{{\theta}^{*}}$ best represents ${Q}^{*}$.
Using the notations above, we now present the $Q$learning algorithm under linear function approximation [4, 19]. Let $\{({s}_{k},{a}_{k})\}$ be a trajectory generated by applying some policy $\pi $ to the system model, where $\pi $ is called the behavior policy. Then, the parameter $\theta $ of the approximation ${\stackrel{~}{Q}}_{\theta}$ is iteratively updated by:
${\theta}_{k+1}={\theta}_{k}+{\u03f5}_{k}\varphi ({s}_{k},{a}_{k})(\mathcal{R}({s}_{k},{a}_{k})+\gamma \underset{a\in \mathcal{A}}{\mathrm{max}}\varphi {({s}_{k+1},a)}^{\top}{\theta}_{k}\varphi {({s}_{k},{a}_{k})}^{\top}{\theta}_{k}),$  (12) 
where $\{{\u03f5}_{k}\}$ is a sequence of step sizes. Algorithm (12) can be viewed as a stochastic approximation algorithm for solving the equation:
${\mathbb{E}}_{\mu}[\varphi (s,a)(\mathcal{R}(s,a)+\gamma \underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}^{\prime})}^{\top}\theta \varphi {(s,a)}^{\top}\theta )]=0,$  (13) 
where $\mu $ is the stationary distribution of the Markov chain $\{{s}_{k}\}$ under policy $\pi $. Under some mild conditions, Eq. (13) is equivalent to a socalled projected Bellman’s equation (see [19] for more details). In the special case where the feature matrix $\mathrm{\Phi}$ is a identity matrix with dimension $\mathcal{S}\mathcal{A}$, Algorithm (12) reduces to the tabular $Q$learning algorithm [36], and Eq. (13) becomes the regular Bellman’s equation [4].
In general, Eq. (13) may not necessarily admit a solution [9], and the iteration in (12) may diverge [1]. However, it was shown in [19, 16] that under an assumption on the behavior policy $\pi $, ${\theta}_{k}$ converges to the solution of Eq. (13), denoted by ${\theta}^{*}$, almost surely. In this paper, we focus on establishing the finitesample convergence guarantee of Algorithm (12). We begin by stating our main assumptions.
Assumption 3.1.
The features ${\mathrm{\{}{\varphi}_{i}\mathrm{\}}}_{\mathrm{1}\mathrm{\le}i\mathrm{\le}d}$ are linearly independent and are normalized so that $\mathrm{\parallel}\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\parallel}\mathrm{\le}\mathrm{1}$ for all stateaction pairs $\mathrm{(}s\mathrm{,}a\mathrm{)}$.
Assumption 3.1 is indeed without loss of generality since we can disregard dependent features. Moreover, since the state and action spaces are finite, we can always make sure ${\mathrm{max}}_{(s,a)}\parallel \varphi (s,a)\parallel \le 1$ by proper scaling.
Assumption 3.2.
The Markov chain ${\mathrm{\{}{s}_{k}\mathrm{\}}}_{k\mathrm{\ge}\mathrm{0}}$ induced by the behavior policy $\pi $ is irreducible and aperiodic, with unique stationary distribution $\mu $.
Assumption 3.2 essentially requires that the behavior policy $\pi $ has enough exploration, which is necessary in studying valuebased RL algorithms.
Assumption 3.3.
Eq. (13) has a unique solution, denoted by ${\theta}^{\mathrm{*}}$, and there exists $\kappa \mathrm{>}\mathrm{0}$ such that the following inequality holds for all $\theta \mathrm{\in}{\mathrm{R}}^{d}$:
${\gamma}^{2}{\mathbb{E}}_{\mu}[\underset{a\in \mathcal{A}}{\mathrm{max}}{(\varphi {(s,a)}^{\top}\theta )}^{2}]{\mathbb{E}}_{\mu}[{(\varphi {(s,a)}^{\top}\theta )}^{2}]\le \kappa \parallel \theta {\parallel}^{2}.$  (14) 
3.2 FiniteSample Convergence Bounds for $Q$Learning with Linear Function Approximation
To begin with, we rewrite Algorithm (12) in the form of stochastic iterative algorithm (2). Let ${X}_{k}=({s}_{k},{a}_{k},{s}_{k+1})$ for all $k\ge 0$. It is clear that $\{{X}_{k}\}$ is a Markov chain with finite state space $\mathcal{X}=\{(s,a,{s}^{\prime})\mid s\in \mathcal{S},\pi (as)>0,{P}_{a}(s,{s}^{\prime})>0\}$. Define
$F(x,\theta )=F(s,a,{s}^{\prime},\theta ):=\varphi (s,a)(\mathcal{R}(s,a)+\gamma \underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}^{\prime})}^{\top}\theta \varphi {(s,a)}^{\top}\theta ).$  (15) 
With $F(x,\theta )$ being defined above, Algorithm (12) can be written as:
${\theta}_{k+1}={\theta}_{k}+{\u03f5}_{k}F({X}_{k},{\theta}_{k}).$  (16) 
Let $\overline{F}(\theta )={\mathbb{E}}_{\mu}[F(X,\theta )]$, where $\mu $ is the stationary distribution of the Markov chain $\{{s}_{k}\}$ under policy $\pi $. Wee see that $\overline{F}(\theta )=0$ is exactly the target equation (13) of Algorithm (12).
To apply Theorem 2.1, we need to show that Assumptions 2.12.3 are naturally satisfied in the context of $Q$learning under Assumptions 3.13.3, which is provided by the following proposition. See Appendix B.1 for its proof.
Proposition 3.1.
Consider $Q$learning algorithm (12). Suppose that Assumptions 3.13.3 are satisfied, then we have the following results.

(1)
The Markov chain $\{{X}_{k}\}$ with state space $\mathcal{X}$ is irreducible and aperiodic.

(2)
The function $F(x,\theta )$ defined in (15) is Lipschitz continuous with respect to $\theta $ uniformly in $x$, and $M=1+\gamma +{r}_{\mathrm{max}}$ is a valid Lipschitz constant. Moreover, we have $M\ge {\mathrm{max}}_{x\in \mathcal{X}}\parallel F(x,0)\parallel $.

(3)
The equation $\overline{F}(\theta )=0$ has a unique solution ${\theta}^{*}$, and we have ${(\theta {\theta}^{*})}^{\top}\overline{F}(\theta )\le \frac{\kappa}{2}{\parallel \theta {\theta}^{*}\parallel}^{2}$ for all $\theta \in {\mathbb{R}}^{d}$, where $\kappa $ is given in Assumption 3.3.
Similarly as in Section 2, given precision $\delta >0$, we define the mixing time ${t}_{\delta}={t}_{\text{mix}}(\frac{\delta}{2M})$, where ${t}_{\text{mix}}(\cdot )$ is for the Markov chain $\{{X}_{k}\}$, and $M$ is the Lipschitz constant given in Proposition 3.1. Moreover, there exists a constant ${M}_{1}$ that depends only on the Markov chain $\{{X}_{k}\}$ and the Lipschitz constant $M$ such that ${t}_{\delta}\le {M}_{1}(\mathrm{ln}(1/\delta )+1)$ for any $\delta >0$.
We next use Theorem 2.1 to derive finitesample error bounds of Algorithm (12) for different choice of step sizes. We begin with the result for using constant step sizes, which directly follows from Proposition 3.1 and Corollary 2.1. Let ${\eta}_{1}={(\parallel {\theta}_{0}\parallel +\parallel {\theta}_{0}{\theta}^{*}\parallel +1)}^{2}$, and ${\eta}_{2}=114{M}^{2}{(\parallel {\theta}^{*}\parallel +1)}^{2}$.
Theorem 3.1.
Consider $\mathrm{\{}{\theta}_{k}\mathrm{\}}$ of Algorithm (12). Suppose that Assumptions 3.13.3 are satisfied, and ${\u03f5}_{k}\mathrm{\equiv}\u03f5$ with $\u03f5$ chosen such that $\u03f5\mathit{}{t}_{\u03f5}\mathrm{\le}\frac{\kappa}{\mathrm{228}\mathit{}{M}^{\mathrm{2}}}$. Then we have for all $k\mathrm{\ge}{t}_{\u03f5}$:
$\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]\le {\eta}_{1}{\left(1\kappa \u03f5/2\right)}^{k{t}_{\u03f5}}+2{\eta}_{2}\u03f5{t}_{\u03f5}/\kappa .$ 
Theorem 3.1 is qualitatively similar to Corollary 2.1 in that the iterates of $Q$learning converges exponentially fast to a ball centered at ${\theta}^{*}$, and the size of the ball is proportional to $\u03f5{t}_{\u03f5}$. Consider the special case where $\{({s}_{k},{a}_{k})\}$ is sampled in an i.i.d. manner. Since ${t}_{\u03f5}=1$ for any $\u03f5>0$, the size of the ball is only proportional to the step size $\u03f5$. Thus, we see that the Markovian nature of the noise produces an additional mixing time factor. This agrees with results in [28, 5], where the popular TDlearning with linear function approximation algorithm was studied.
We next present the result for using diminishing step sizes. For simplicity, we only present the case of the optimal convergence rate in Corollary 2.2. Let ${\u03f5}_{k}=\frac{\u03f5}{k+h}$ with $\u03f5>2/\kappa $, where $h$ is chosen properly so that there exists a minimal ${K}^{\prime}>0$ such that ${\u03f5}_{0,{K}^{\prime}1}\le \frac{1}{4M}$ and $$ for all $k\ge {K}^{\prime}$. The following result directly follows from Proposition 3.1 and Corollary 2.2.
Theorem 3.2.
Consider $\mathrm{\{}{\theta}_{k}\mathrm{\}}$ of Algorithm (12). Suppose that Assumptions 3.13.3 are satisfied, and ${\u03f5}_{k}$ is chosen as above. Then we have for all $k\mathrm{\ge}{K}^{\mathrm{\prime}}$:
$\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]\le {\eta}_{1}{\left({\displaystyle \frac{{K}^{\prime}+h}{k+h}}\right)}^{\frac{\kappa \u03f5}{2}}+{\displaystyle \frac{16e{\eta}_{2}{\u03f5}^{2}{M}_{1}}{\kappa \u03f52}}{\displaystyle \frac{\left[\mathrm{ln}\left(\frac{k+h}{\u03f5}\right)+1\right]}{k+h}}.$ 
3.3 Discussion about Assumption 3.3 on the Behavior Policy
In this subsection, we take a closer look at Assumption 3.3 and Eq. (14), which are essential for the stability of the $Q$learning with linear function approximation algorithm. First note that Eq. (14) is equivalent to
$$  (17) 
The direction Eq. (14) implying Eq. (17) is trivial. As for the other direction, let
$\kappa =\underset{\theta :\parallel \theta \parallel =1}{\mathrm{max}}\{{\gamma}^{2}{\mathbb{E}}_{\mu}[\underset{a\in \mathcal{A}}{\mathrm{max}}{(\varphi {(s,a)}^{\top}\theta )}^{2}]{\mathbb{E}}_{\mu}[{(\varphi {(s,a)}^{\top}\theta )}^{2}]\}.$ 
By Weierstrass extreme value theorem [24], $\kappa $ is welldefined and strictly positive, which immediately gives Eq. (14).
We next analyze condition (17). We first look at the terms inside the expectation on both sides of the inequality. On the righthand side, the quantity ${(\varphi {(s,a)}^{\top}\theta )}^{2}$ can be interpreted as ${\stackrel{~}{Q}}_{\theta}^{2}(s,{a}_{1})$ with the action ${a}_{1}$ chosen according to the behavior policy $\pi $. On the lefthand side, the term ${\mathrm{max}}_{a\in \mathcal{A}}{(\varphi {(s,a)}^{\top}\theta )}^{2}$ is essentially ${\stackrel{~}{Q}}_{\theta}^{2}(s,{a}_{2})$ where ${a}_{2}$ is chosen greedily with respect to ${\stackrel{~}{Q}}_{\theta}^{2}$. Therefore, it is clear that
${\mathbb{E}}_{\mu}[\underset{a\in \mathcal{A}}{\mathrm{max}}{(\varphi {(s,a)}^{\top}\theta )}^{2}]\ge {\mathbb{E}}_{\mu}[{(\varphi {(s,a)}^{\top}\theta )}^{2}],\forall \theta \in {\mathbb{R}}^{d}.$ 
To meet condition (17), besides the presence of ${\gamma}^{2}$, the behavior policy $\pi $ should be somehow close to the policy induced greedily by ${\stackrel{~}{Q}}_{\theta}^{2}$ for any nonzero $\theta $, which is a relatively restrictive requirement.
Similar assumptions on the behavior policy were also proposed in [19, 16]. Although the exact form of the conditions are different, they all follow the same spirit. That is, with a chosen Lyapunov function, the condition should enable us to show that the corresponding ODE
$\dot{\theta}(t)=\overline{F}(\theta (t))$  (18) 
of the $Q$learning algorithm (12) is GAS. We next briefly compare our condition to those proposed in [19] and [16]. The condition in [19] (their Eq. (7)) implies
$$  (19) 
for all nonzero $\theta $ ^{1}^{1} 1 The factor of 2 appears to be missing in [19]..The righthand side is the same for both Eqs. (19) and (17). On the lefthand side, Eq. (19) has an additional factor of $2$, and the square is outside the max operator. Although they are similar, our condition and the condition proposed in [19] do not imply each other. As for the condition proposed in [16], while it is not clear if it is less restrictive than ours, it is shown that the condition in [16] implies the condition in [19] under more restrictive assumptions. However, [16] assumes i.i.d. sampling, and studies only the asymptotic convergence rather than finitesample error bounds.
We next analyze how the features $\{{\varphi}_{i}\}$ and the behavior policy $\pi $ impact condition (17) in the following two examples.
Example 3.1.
Consider the case when $d\mathrm{=}\mathrm{1}$. That is, there is only one feature ${\varphi}_{\mathrm{1}}$, and the weight $\theta $ is a scalar on the real line. Condition (17) reduces to
$$  (20) 
Let ${r}_{\pi}\mathrm{=}{\mathrm{E}}_{\mu}\mathit{}\mathrm{[}\varphi \mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathit{}\mathrm{R}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{]}$,
${h}^{+}={\mathbb{E}}_{\mu}[\gamma \varphi (s,a)\underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{max}}\varphi ({s}^{\prime},{a}^{\prime})\varphi {(s,a)}^{2}],\text{\mathit{a}\mathit{n}\mathit{d}}\mathit{\hspace{1em}}{h}^{}={\mathbb{E}}_{\mu}[\gamma \varphi (s,a)\underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{min}}\varphi ({s}^{\prime},{a}^{\prime})\varphi {(s,a)}^{2}].$ 
Then we have the following result, whose proof can be found in Appendix B.2.
Proposition 3.2.
Proposition 3.2 implies that the condition (20) is “almost necessary" for the GAS of ODE (18). Moreover, it is clear from Eq. (20) that when $d\mathrm{=}\mathrm{1}$, there always exists a behavior policy $\pi $ such that Eq. (20) is satisfied, e.g. $\pi \mathit{}\mathrm{(}s\mathrm{)}\mathrm{\in}\mathrm{arg}\mathit{}{\mathrm{max}}_{a\mathrm{\in}\mathrm{A}}\mathit{}\varphi \mathit{}{\mathrm{(}s\mathrm{,}a\mathrm{)}}^{\mathrm{2}}$ is a feasible behavior policy.
Example 3.2.
We next consider another extreme case where we have $d\mathrm{=}\mathrm{}\mathrm{S}\mathrm{}\mathit{}\mathrm{}\mathrm{A}\mathrm{}$, i.e., there is no dimension reduction at all. We show in the following proposition that, in the fulldimension case, condition (17) is feasible in terms of the behavior $\pi $ only when the discount factor $\gamma $ is sufficiently small. See Appendix B.3 for its proof.
Proposition 3.3.
When ${\gamma}^{\mathrm{2}}\mathrm{\ge}\mathrm{1}\mathrm{/}m$, condition (17) is infeasible for any behavior policy $\pi $.
We now compare the results for the two extreme cases, i.e., $d=1$ and $d=\mathcal{S}\mathcal{A}$. We see that in the unidimensional case, Eq. (17) implies a condition which is almost sufficient and necessary for the GAS of the equilibrium ${\theta}^{*}$ to the ODE (18). Moreover, there always exists a behavior policy $\pi $ satisfying $(\mathit{\text{17}})$. However, in the fulldimensional case, condition (17) is infeasible in terms of the behavior policy $\pi $ when ${\gamma}^{2}\ge 1/m$, which is usually the case in practice where $\gamma $ is chosen close to unity to ensure enough exploration.
In the next section, we present numerical experiments to demonstrate the sufficiency of condition (17), as well as the resulting convergence rate of the $Q$learning with linear function approximation algorithm.
3.4 Numerical Experiments
To present the numerical experiments, we need to rewrite condition (17) in a different way. Defining
$\delta (\pi )=\underset{\{\theta :\parallel \theta \parallel =1\}}{\mathrm{min}}{\displaystyle \frac{{\mathbb{E}}_{\mu}[{(\varphi {(s,a)}^{\top}\theta )}^{2}]}{{\mathbb{E}}_{\mu}[{\mathrm{max}}_{a\in \mathcal{A}}{(\varphi {(s,a)}^{\top}\theta )}^{2}]}},$  (21) 
we see that condition (17) is equivalent to $\delta (\pi )>{\gamma}^{2}$. One way to compute $\delta (\pi )$ is presented in Appendix B.4.
In our simulation, we consider the divergent example of $Q$learning with linear function approximation introduced in [1], where there are $7$ states and $2$ actions, and the number of features is $14$. See [1] or Appendix B.5 for more details about this example. Note that since $\mathcal{S}\mathcal{A}=d$ in this example, it belongs to the fulldimensional case discussed in Subsection 3.3, and is essentially a change of basis. Surprisingly, even in this case, the $Q$learning algorithm (12) can diverge.
To demonstrate the effectiveness of condition (17) for the stability of $Q$learning, in our first set of simulations, the reward function is set to zero. Since the reward function is identically zero, ${Q}^{*}$ is zero, implying ${\theta}^{*}$ is zero. Because of this structure, it is possible for the $Q$learning algorithm to converge even when constant step size is used. We choose the behavior policy $\pi $ which takes each action with equal probability. It turns out that $\delta (\pi )\approx 0.5$, giving the threshold for $\gamma $ to satisfy Eq. (17) being $\delta {(\pi )}^{1/2}\approx 0.7$. In the numerical experiments, we choose constant step size $\u03f5=0.01$, discount factor $\gamma \in \{0.7,0.9,0.97\}$, and plot $\parallel {\theta}_{k}\parallel $ as a function of the number of iterations $k$ in Figure 2. Here, ${\theta}_{k}$ converges when $\gamma =0.7,0.9$, but diverges when $\gamma =0.97$. This demonstrates that condition (17) is sufficient but not necessary for convergence. This also shows that by ensuring (17), the counter example from [1] can be made to converge.
To show the exponential convergence rate for using constant step size, we consider the convergence of ${\theta}_{k}$ when $\gamma =0.7$ given in Figure 2, where we plot $\mathrm{ln}\mathbb{E}[{\parallel {\theta}_{k}\parallel}^{2}]$ as a function on the number of iterations $k$. In this case, ${\theta}_{k}$ seems to converge exponentially to $0$, which agrees with Corollary 3.1.
In our second set of experiments, we illustrate the convergence rate of $Q$learning with linear function approximation for using diminishing step sizes ${\u03f5}_{k}=\u03f5/{k}^{\xi}$. We use the same MDP model, but the reward is sampled independently from a uniform distribution on $(0,1)$ for all stateaction pairs. The behavior policy is still to take each action with equal probability. The constant $\kappa $ given in Eq. (14) is estimated by numerical optimization, and the discount factor $\gamma $ is set to be $0.7$. In figure 4, we plot the quantity $\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]$ as a function of $k$ for $\xi \in \{0.4,0.6,0.8,1\}$. In the case where $\xi =1$, the constant coefficient $\u03f5$ is chosen such that $\kappa \u03f5\ge 2$ in order to achieve the optimal convergence rate. We see that the iterates converge for all $\xi \in (0,1]$. Moreover, the larger the value of $\xi $ is, the faster ${\theta}_{k}$ converges.
To further demonstrate the convergence rate, we plot the quantity $\mathrm{ln}\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]$ as a function of $\mathrm{ln}k$ in Figure 4 and look at its asymptotic behavior. We see that the slope is approximately $\xi $, which agrees with Corollary 3.2.
4 Conclusion
In this paper we establish finitesample convergence guarantee for a general nonlinear stochastic approximation algorithm with Markovian noise. The result is used to derive the finitesample error bounds of the $Q$learning with linear function approximation algorithm for solving the RL problem. Since such an algorithm is known to diverge in general, we study it under a condition on the behavior policy that ensures stability. Sufficiency of this condition on the behavior policy, as well as the rate of convergence of $Q$learning is also verified numerically in the context of a wellknown counter example.
Appendix A Proof of All Technical Results in Section 2
A.1 Proof of Claim 2.1
Since the finite state Markov chain $\{{X}_{k}\}$ is irreducible and aperiodic, we have by Theorem 4.9 in [17] that there exists $C>0$ and $\rho \in (0,1)$ such that ${d}_{\mathrm{max}}(k)\le C{\rho}^{k}$ for all $k\ge 0$. Note that the inequality $C{\rho}^{k}\le \frac{\delta}{2L}$ holds when $k\ge \frac{\mathrm{ln}(1/\delta )+\mathrm{ln}(2CL)}{\mathrm{ln}(1/\rho )}$. Therefore, we have
${t}_{\delta}={t}_{\text{mix}}\left({\displaystyle \frac{\delta}{2L}}\right)\le {\displaystyle \frac{\mathrm{ln}(1/\delta )+\mathrm{ln}(2CL)}{\mathrm{ln}(1/\rho )}}\le {L}_{1}(\mathrm{ln}(1/\delta )+1),$ 
where ${L}_{1}=\mathrm{max}(1,\mathrm{ln}(2CL))/\mathrm{ln}(1/\rho )$.
A.2 Proof of Corollary 3.1
Let $K={t}_{\u03f5}$. Since $\u03f5$ is chosen such that $\u03f5{t}_{\u03f5}\le \frac{\alpha}{114{L}^{2}}$, it is easy to verify that Condition 2.1 is satisfied. Moreover, we have

(1)
${\prod}_{i={t}_{\u03f5}}^{k1}(1\alpha {\u03f5}_{i})={(1\alpha \u03f5)}^{k{t}_{\u03f5}}$

(2)
${\sum}_{i={t}_{\u03f5}}^{k1}{\widehat{\u03f5}}_{i}{\prod}_{j=i+1}^{k1}(1\alpha {\u03f5}_{j})\le {\u03f5}^{2}{t}_{\u03f5}{\sum}_{i=0}^{\mathrm{\infty}}{(1\alpha \u03f5)}^{i}=\frac{\u03f5{t}_{\u03f5}}{\alpha}$.
The result then follows from the above two observations.
A.3 Proof of Corollary 3.2
We first show that when ${\u03f5}_{k}=\frac{\u03f5}{{(k+h)}^{\xi}}$ ($\u03f5>0$, $\xi \in (0,1]$) with properly chosen $h$, there exists $K>0$ such that Condition 2.1 is satisfied.
It is enough to verify Condition 2.1 (2). Note that ${\u03f5}_{0,K1}\le \frac{1}{4L}$ and ${\u03f5}_{k{t}_{k},k1}\le \frac{\alpha}{114{L}^{2}}$ for all $k\ge K$ are satisfied when $K\ge {t}_{K}$ and ${\u03f5}_{0,K1}\le \frac{\alpha}{114{L}^{2}}$. Using Claim 2.1 and ${\u03f5}_{k}\le \frac{\u03f5}{{h}^{\xi}}$ for all $k\ge 0$, we have
$K\ge {t}_{K}$  $\u27f8$  $K\ge {L}_{1}\left[\mathrm{ln}{\displaystyle \frac{{(K+h)}^{\xi}}{\u03f5}}+1\right]$  (22)  
${\u03f5}_{0,K1}\le {\displaystyle \frac{\alpha}{114{L}^{2}}}$  $\u27f8$  $\frac{\u03f5K}{{h}^{\xi}}}\le {\displaystyle \frac{\alpha}{114{L}^{2}}}.$  (23) 
Due to the logarithmic function in Eq. (22), it is easy to reason the existence of $h$ and $K$ satisfying Eqs. (22) and (23). In particular, denote $N=\mathrm{max}(0,{L}_{1}(\mathrm{ln}(1+\frac{\alpha}{114{L}^{2}})+1\mathrm{ln}\u03f5))$, Eqs. (22) and (23) are satisfied when
$h\ge {\left({\displaystyle \frac{228\u03f5{L}^{2}({L}_{1}+N)\mathrm{ln}({L}_{1}+N)}{\alpha}}\right)}^{1/\xi},$ 
and $K=\frac{\alpha {h}^{\xi}}{114\u03f5{L}^{2}}$.
We next plug in the expression of the step sizes into the finitesample bound (7). We first simplify ${\widehat{\u03f5}}_{k}$ in the following. Using Claim 2.1 and we have
$\frac{{\widehat{\u03f5}}_{k}}{{t}_{k}{\u03f5}_{k}^{2}}}={\displaystyle \frac{{\u03f5}_{k{t}_{k},k1}}{{t}_{k}{\u03f5}_{k}}}\le {\displaystyle \frac{{\u03f5}_{k{t}_{k}}}{{\u03f5}_{k}}}\le {\displaystyle \frac{{(k+h)}^{\xi}}{{(k+h{L}_{1}(\xi \mathrm{ln}(k+h)\mathrm{ln}\u03f5+1))}^{\xi}}}\to 1\text{as}k\to \mathrm{\infty}.$ 
Assume with out loss of generality that $\frac{{\widehat{\u03f5}}_{k}}{{t}_{k}{\u03f5}_{k}^{2}}\le 2$ for all $k\ge K$, we have:
$\mathbb{E}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]$  
$\le $  ${\beta}_{1}{\displaystyle \prod _{j=K}^{k1}}\left(1\alpha {\u03f5}_{j}\right)+{\beta}_{2}{\displaystyle \sum _{i=K}^{k1}}{\widehat{\u03f5}}_{i}{\displaystyle \prod _{j=i+1}^{k1}}\left(1\alpha {\u03f5}_{j}\right)$  
$\le $  ${\beta}_{1}{\displaystyle \prod _{j=K}^{k1}}\left(1\alpha {\u03f5}_{j}\right)+{\beta}_{2}{\displaystyle \sum _{i=K}^{k1}}2{t}_{i}{\u03f5}_{i}^{2}{\displaystyle \prod _{j=i+1}^{k1}}\left(1\alpha {\u03f5}_{j}\right)$  
$\le $  ${\beta}_{1}{\displaystyle \prod _{j=K}^{k1}}\left(1\alpha {\u03f5}_{j}\right)+2{\beta}_{2}{t}_{k}{\displaystyle \sum _{i=K}^{k1}}{\u03f5}_{i}^{2}{\displaystyle \prod _{j=i+1}^{k1}}\left(1\alpha {\u03f5}_{j}\right)$  
$\le $  ${\beta}_{1}\underset{{A}_{1}}{\underset{\u23df}{{\displaystyle \prod _{j=K}^{k1}}\left(1{\displaystyle \frac{\alpha \u03f5}{{(j+h)}^{\xi}}}\right)}}+2{\beta}_{2}{\u03f5}^{2}{L}_{1}\left[\mathrm{ln}\left({\displaystyle \frac{k+h}{\u03f5}}\right)+1\right]\underset{{A}_{2}}{\underset{\u23df}{{\displaystyle \sum _{i=K}^{k1}}{\displaystyle \frac{1}{{(i+h)}^{2\xi}}}{\displaystyle \prod _{j=i+1}^{k1}}\left(1{\displaystyle \frac{\alpha \u03f5}{{(j+h)}^{\xi}}}\right)}}.$  (24) 
We now evaluate the terms ${A}_{1}$ and ${A}_{2}$. For the term ${A}_{1}$, using the fact that ${\int}_{a}^{b+1}f(x)\mathit{d}x\le {\sum}_{k=a}^{b}f(k)\le {\int}_{a1}^{b}f(x)\mathit{d}x$ for all nonincreasing function $f(x)$, we have
$\prod _{j=K}^{k1}}\left(1{\displaystyle \frac{\alpha \u03f5}{{(j+h)}^{\xi}}}\right)\le $  $\mathrm{exp}\left(\alpha \u03f5{\displaystyle \sum _{j=K}^{k1}}{\displaystyle \frac{1}{{(j+h)}^{\xi}}}\right)$  ($x+1\le {e}^{x}$ for all $x\in \mathbb{R}$)  
$\le $  $\mathrm{exp}\left(\alpha \u03f5{\displaystyle {\int}_{K}^{k}}{\displaystyle \frac{1}{{(x+h)}^{\xi}}}\mathit{d}x\right)$  
$=$  $\{\begin{array}{cc}{\left({\displaystyle \frac{K+h}{k+h}}\right)}^{\alpha \u03f5},\hfill & \xi =1,\hfill \\ \mathrm{exp}\left[{\displaystyle \frac{\alpha \u03f5}{1\xi}}\left({(k+h)}^{1\xi}{(K+h)}^{1\xi}\right)\right],\hfill & \xi \in (0,1).\hfill \end{array}$  (25) 
We next evaluate the term ${A}_{2}$. When $\xi =1$, using the same line of analysis for evaluating ${A}_{1}$, i.e., bounding the summation by its corresponding integral, we have
${A}_{2}$  $={\displaystyle \sum _{i=K}^{k1}}{\displaystyle \frac{1}{{(i+h)}^{2}}}{\displaystyle \prod _{j=i+1}^{k1}}\left(1{\displaystyle \frac{\alpha \u03f5}{j+h}}\right)$  
$\le {\displaystyle \sum _{i=K}^{k1}}{\displaystyle \frac{1}{{(i+h)}^{2}}}{\left({\displaystyle \frac{i+1+h}{k+h}}\right)}^{\alpha \u03f5}$  
$={\displaystyle \frac{1}{{(k+h)}^{\alpha \u03f5}}}{\displaystyle \sum _{i=K}^{k1}}{\displaystyle \frac{1}{{(i+1+h)}^{2\alpha \u03f5}}}{\left({\displaystyle \frac{i+1+h}{i+h}}\right)}^{2}$  
$\le {\displaystyle \frac{4}{{(k+h)}^{\alpha \u03f5}}}\underset{{A}_{3}}{\underset{\u23df}{{\displaystyle \sum _{i=K}^{k1}}{\displaystyle \frac{1}{{(i+1+h)}^{2\alpha \u03f5}}}}}.$ 
The upper bound for the term ${A}_{3}$ depends on the value $\alpha \u03f5$. Specifically, we have the following results.

(1)
When $\alpha \u03f5\in (0,1)$, we have
${A}_{3}\le {\displaystyle {\int}_{K1}^{k1}}{\displaystyle \frac{1}{{(x+1+h)}^{2\alpha \u03f5}}}\mathit{d}x\le {\displaystyle \frac{1}{1\alpha \u03f5}}.$ 
(2)
When $\alpha \u03f5=1$, we have
${A}_{3}\le {\displaystyle {\int}_{K1}^{k1}}{\displaystyle \frac{1}{x+1+h}}\mathit{d}x\le \mathrm{ln}\left({\displaystyle \frac{k+h}{K+h}}\right).$ 
(3)
When $\alpha \u03f5\in (1,2)$, we have
${A}_{3}\le {\displaystyle {\int}_{K1}^{k1}}{\displaystyle \frac{1}{{(x+1+h)}^{2\alpha \u03f5}}}\mathit{d}x\le {\displaystyle \frac{1}{\alpha \u03f51}}{(k+h)}^{\alpha \u03f51}.$ 
(4)
When $\alpha \u03f5=2$, we have
${A}_{3}=kK.$ 
(5)
When $\alpha \u03f5\in (2,\mathrm{\infty})$, we have
${A}_{3}$ $\le {\displaystyle {\int}_{K}^{k}}{\displaystyle \frac{1}{{(x+1+h)}^{2\alpha \u03f5}}}\mathit{d}x$ $\le {\displaystyle \frac{1}{\alpha \u03f51}}\left[{(k+1+h)}^{\alpha \u03f51}{(K+1+h)}^{\alpha \u03f51}\right]$ $\le {\displaystyle \frac{1}{\alpha \u03f51}}{(k+1+h)}^{\alpha \u03f51}$ $\le {\displaystyle \frac{e}{\alpha \u03f51}}{(k+h)}^{\alpha \u03f51}.$
Combine the above five cases and we have
$$ 
We next consider the case when $\xi \in (0,1)$. Let ${\{{u}_{k}\}}_{k\ge K}$ be a sequence defined by
${u}_{K}=0,{u}_{k+1}=\left(1{\displaystyle \frac{\alpha \u03f5}{{(k+h)}^{\xi}}}\right){u}_{k}+{\displaystyle \frac{1}{{(k+h)}^{2\xi}}}.$ 
It is easy to verify that ${u}_{k}={A}_{2}$. We next use induction to show that when $K+h\ge {[2\xi /(\alpha \u03f5)]}^{1/(1\xi )}$, we have ${u}_{k}\le \frac{2}{\alpha \u03f5}\frac{1}{{(k+h)}^{\xi}}$ for all $k\ge K$.
Since ${u}_{K}=0\le \frac{2}{\alpha \u03f5}\frac{1}{{(K+h)}^{\xi}}$, we have the base case. Now suppose ${u}_{k}\le \frac{2}{\alpha \u03f5}\frac{1}{{(k+h)}^{\xi}}$ for some $k\ge K$. Using the definition of ${u}_{k}$ and the induction hypothesis, we have
$\frac{2}{\alpha \u03f5}}{\displaystyle \frac{1}{{(k+1+h)}^{\xi}}}{u}_{k+1}=$  $\frac{2}{\alpha \u03f5}}{\displaystyle \frac{1}{{(k+1+h)}^{\xi}}}\left(1{\displaystyle \frac{\alpha \u03f5}{{(k+h)}^{\xi}}}\right){u}_{k}{\displaystyle \frac{1}{{(k+h)}^{2\xi}}$  
$\ge $  $\frac{2}{\alpha \u03f5}}{\displaystyle \frac{1}{{(k+1+h)}^{\xi}}}\left(1{\displaystyle \frac{\alpha \u03f5}{{(k+h)}^{\xi}}}\right){\displaystyle \frac{2}{\alpha \u03f5}}{\displaystyle \frac{1}{{(k+h)}^{\xi}}$  
${\displaystyle \frac{1}{{(k+h)}^{2\xi}}}$  
$=$  $\frac{2}{\alpha \u03f5}}{\displaystyle \frac{1}{{(k+h)}^{2\xi}}}\left[{\displaystyle \frac{\alpha \u03f5}{2}}{(k+h)}^{\xi}\left(1{\left({\displaystyle \frac{k+h}{k+1+h}}\right)}^{\xi}\right)\right]$  
$\ge $  $\frac{2}{\alpha \u03f5}}{\displaystyle \frac{1}{{(k+h)}^{2\xi}}}\left[{\displaystyle \frac{\alpha \u03f5}{2}}{\displaystyle \frac{\xi}{{(k+h)}^{1\xi}}}\right]$  (26)  
$\ge $  $0,$  (27) 
where (26) follows from
${\left({\displaystyle \frac{k+1}{k+1+h}}\right)}^{\xi}$  $={\left[{\left(1+{\displaystyle \frac{1}{k+h}}\right)}^{k+h}\right]}^{\xi /(k+h)}$  
$\ge {e}^{\xi /(k+h)}$  
$\ge 1{\displaystyle \frac{\xi}{k+h}},$ 
and (27) follows from $k+h\ge K+h\ge {[2\xi /(\alpha \u03f5)]}^{1/(1\xi )}$.
A.4 Proof of Lemma 2.1
Using Eq. (4) and we have for all $k\ge {t}_{k}$:
${\mathbb{E}}_{k}[{\parallel F({X}_{k},{\theta}_{k})\parallel}^{2}]\le {L}^{2}{\mathbb{E}}_{k}[{(\parallel {\theta}_{k}\parallel +1)}^{2}]\le 2{L}^{2}\left[{\mathbb{E}}_{k}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]+{(\parallel {\theta}^{*}\parallel +1)}^{2}\right].$ 
A.5 Proof of Lemma 2.2
Using Eq. (4), Definition 2.1, and the definition of the maximal distance ${d}_{\mathrm{max}}(k)$, we have for all $x$, $\theta $, and $k\ge {t}_{\delta}$:
$\parallel \mathbb{E}[F({X}_{k},\theta )\mid {X}_{0}=x]\overline{F}(\theta )\parallel =$  $\parallel {\displaystyle \sum _{y\in \mathcal{X}}}({P}^{k}(x,y)\mu (y))F(y,\theta )\parallel $  
$\le $  $\sum _{y\in \mathcal{X}}}{P}^{k}(x,y)\mu (y)\parallel F(y,\theta )\parallel $  
$\le $  $L(\parallel \theta \parallel +1){\displaystyle \sum _{y\in \mathcal{X}}}{P}^{k}(x,y)\mu (y)$  
$\le $  $2L(\parallel \theta \parallel +1){\parallel {P}^{k}(x,\cdot )\mu \parallel}_{\text{TV}}$  
$\le $  $2L{d}_{\mathrm{max}}(k)(\parallel \theta \parallel +1)$  
$\le $  $\delta (\parallel \theta \parallel +1).$ 
A.6 Proof of Lemma 2.3
For any $$, we first upper bound $\parallel {\theta}_{t}\parallel $ for any $t\in [{k}_{1},{k}_{2}]$. Using triangle inequality and Eq. (4), we have
$\parallel {\theta}_{t+1}\parallel \parallel {\theta}_{t}\parallel \le \parallel {\theta}_{t+1}{\theta}_{t}\parallel \le L{\u03f5}_{t}(\parallel {\theta}_{t}\parallel +1),$  (28) 
which gives $(\parallel {\theta}_{t+1}\parallel +1)\le (L{\u03f5}_{t}+1)(\parallel {\theta}_{t}\parallel +1)$. Recursively applying the preceding inequality, then using the fact that $1+x\le {e}^{x}$ for all $x\in \mathbb{R}$, we have for all $t\in [{k}_{1},{k}_{2}]$:
$\parallel {\theta}_{t}\parallel +1\le {\displaystyle \prod _{j={k}_{1}}^{t1}}(L{\u03f5}_{j}+1)(\parallel {\theta}_{{k}_{1}}\parallel +1)\le \mathrm{exp}(L{\u03f5}_{{k}_{1},{k}_{2}1})(\parallel {\theta}_{{k}_{1}}\parallel +1).$ 
Since ${e}^{x}\le 1+2x$ for all $x\in [0,1/2]$ and ${\u03f5}_{{k}_{1},{k}_{2}1}\le \frac{1}{4L}$, we further obtain
$\parallel {\theta}_{t}\parallel +1\le (1+2L{\u03f5}_{{k}_{1},{k}_{2}1})(\parallel {\theta}_{{k}_{1}}\parallel +1)\le 2(\parallel {\theta}_{{k}_{1}}\parallel +1).$ 
Note that the previous inequality holds for all $t\in [{k}_{1},{k}_{2}]$. Then we have by Eq. (28) that
$\parallel {\theta}_{{k}_{2}}{\theta}_{{k}_{1}}\parallel \le {\displaystyle \sum _{t={k}_{1}}^{{k}_{2}1}}\parallel {\theta}_{t+1}{\theta}_{t}\parallel \le 2L{\u03f5}_{{k}_{1},{k}_{2}1}(\parallel {\theta}_{{k}_{1}}\parallel +1).$ 
Moreover, since
$\parallel {\theta}_{{k}_{2}}{\theta}_{{k}_{1}}\parallel \le 2L{\u03f5}_{{k}_{1},{k}_{2}1}(\parallel {\theta}_{{k}_{1}}\parallel +1)\le 2L{\u03f5}_{{k}_{1},{k}_{2}1}(\parallel {\theta}_{{k}_{2}}{\theta}_{{k}_{1}}\parallel +\parallel {\theta}_{{k}_{2}}\parallel +1)$ 
and ${\u03f5}_{{k}_{1},{k}_{2}1}\le \frac{1}{4L}$, we have $\parallel {\theta}_{{k}_{2}}{\theta}_{{k}_{1}}\parallel \le 4L{\u03f5}_{{k}_{1},{k}_{2}1}(\parallel {\theta}_{{k}_{2}}\parallel +1)$.
A.7 Proof of Lemma 2.4
We begin by decomposing the lefthand side of the desired inequality in the following way:
${\mathbb{E}}_{k}[{({\theta}_{k}{\theta}^{*})}^{\top}(F({X}_{k},{\theta}_{k})\overline{F}({\theta}_{k}))]$  
$=$  $\underset{({T}_{1})}{\underset{\u23df}{{\mathbb{E}}_{k}[{({\theta}_{k}{\theta}_{k{t}_{k}})}^{\top}(F({X}_{k},{\theta}_{k})\overline{F}({\theta}_{k}))]}}+\underset{({T}_{2})}{\underset{\u23df}{{\mathbb{E}}_{k}[{({\theta}_{k{t}_{k}}{\theta}^{*})}^{\top}(F({X}_{k},{\theta}_{k})\overline{F}({\theta}_{k}))]}}.$ 
We first consider the term $({T}_{1})$. Since ${\u03f5}_{k{t}_{k},k1}\le \frac{1}{4L}$, Lemma 2.3 is applicable for ${k}_{1}=k{t}_{k}$ and ${k}_{2}=k$. Therefore, we have by Lemma 2.3 that:
$({T}_{1})=$  ${\mathbb{E}}_{k}[{({\theta}_{k}{\theta}_{k{t}_{k}})}^{\top}(F({X}_{k},{\theta}_{k})\overline{F}({\theta}_{k}))]$  
$\le $  ${\mathbb{E}}_{k}[\parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel \parallel F({X}_{k},{\theta}_{k})\overline{F}({\theta}_{k})\parallel ]$  
$\le $  ${\mathbb{E}}_{k}[\parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel (\parallel F({X}_{k},{\theta}_{k})\parallel +\parallel \overline{F}({\theta}_{k})\parallel )]$  
$\le $  $2L{\mathbb{E}}_{k}[\parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel (\parallel {\theta}_{k}\parallel +1)]$  
$\le $  $8{L}^{2}{\u03f5}_{k{t}_{k},k1}{\mathbb{E}}_{k}\left[{(\parallel {\theta}_{k}\parallel +1)}^{2}\right]$  
$\le $  $16{L}^{2}{\u03f5}_{k{t}_{k},k1}({\mathbb{E}}_{k}\left[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}\right]+{(\parallel {\theta}^{*}\parallel +1)}^{2}).$  (29) 
For the term (${T}_{2}$), Assumption 2.1 gives
${({\theta}_{k{t}_{k}}{\theta}^{*})}^{\top}(F({X}_{k},{\theta}_{k})\overline{F}({\theta}_{k})){({\theta}_{k{t}_{k}}{\theta}^{*})}^{\top}(F({X}_{k},{\theta}_{k{t}_{k}})\overline{F}({\theta}_{k{t}_{k}}))$  
$\le $  ${({\theta}_{k{t}_{k}}{\theta}^{*})}^{\top}(F({X}_{k},{\theta}_{k})F({X}_{k},{\theta}_{k{t}_{k}}))+{({\theta}_{k{t}_{k}}{\theta}^{*})}^{\top}(\overline{F}({\theta}_{k})\overline{F}({\theta}_{k{t}_{k}}))$  
$\le $  $\parallel {\theta}_{k{t}_{k}}{\theta}^{*}\parallel (\parallel F({X}_{k},{\theta}_{k})F({X}_{k},{\theta}_{k{t}_{k}})\parallel +\parallel \overline{F}({\theta}_{k})\overline{F}({\theta}_{k{t}_{k}})\parallel )$  
$\le $  $2L\parallel {\theta}_{k{t}_{k}}{\theta}^{*}\parallel \parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel .$ 
Thus, we have
$({T}_{2})=$  ${\mathbb{E}}_{k}[{({\theta}_{k{t}_{k}}{\theta}^{*})}^{\top}(F({X}_{k},{\theta}_{k})\overline{F}({\theta}_{k}))]$  
$\le $  ${\mathbb{E}}_{k}[{({\theta}_{k{t}_{k}}{\theta}^{*})}^{\top}(F({X}_{k},{\theta}_{k{t}_{k}})\overline{F}({\theta}_{k{t}_{k}}))]+2L\parallel {\theta}_{k{t}_{k}}{\theta}^{*}\parallel {\mathbb{E}}_{k}[\parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel ]$  
$=$  ${({\theta}_{k{t}_{k}}{\theta}^{*})}^{\top}({\mathbb{E}}_{k}[F({X}_{k},{\theta}_{k{t}_{k}})]\overline{F}({\theta}_{k{t}_{k}}))+2L\parallel {\theta}_{k{t}_{k}}{\theta}^{*}\parallel {\mathbb{E}}_{k}[\parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel ]$  
$\le $  $\underset{({T}_{2,1})}{\underset{\u23df}{\parallel {\theta}_{k{t}_{k}}{\theta}^{*}\parallel \parallel {\mathbb{E}}_{k}[F({X}_{k},{\theta}_{k{t}_{k}})]\overline{F}({\theta}_{k{t}_{k}})\parallel}}+\underset{({T}_{2,2})}{\underset{\u23df}{2L\parallel {\theta}_{k{t}_{k}}{\theta}^{*}\parallel {\mathbb{E}}_{k}[\parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel ]}}.$ 
Consider the term $({T}_{2,1})$. Using Lemma 2.2, we have
$\parallel {\mathbb{E}}_{k}[F({X}_{k},{\theta}_{k{t}_{k}})]\overline{F}({\theta}_{k{t}_{k}})\parallel \le {\u03f5}_{k}(\parallel {\theta}_{k{t}_{k}}\parallel +1).$ 
Therefore, we obtain
$({T}_{2,1})\le {\u03f5}_{k}{\mathbb{E}}_{k}[(\parallel {\theta}_{k{t}_{k}}\parallel +1)\parallel {\theta}_{k{t}_{k}}{\theta}^{*}\parallel ].$ 
Since Lemma 2.3 together with our assumption that ${\u03f5}_{k{t}_{k},k1}\le \frac{1}{4L}$ gives
$\parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel \le 4L{\u03f5}_{k{t}_{k},k1}(\parallel {\theta}_{k}\parallel +1)\le (\parallel {\theta}_{k}\parallel +1),$ 
we have
$\parallel {\theta}_{k{t}_{k}}{\theta}^{*}\parallel (\parallel {\theta}_{k{t}_{k}}\parallel +1)$  
$\le $  $(\parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel +\parallel {\theta}_{k}{\theta}^{*}\parallel )(\parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel +\parallel {\theta}_{k}{\theta}^{*}\parallel +\parallel {\theta}^{*}\parallel +1)$  
$\le $  $(\parallel {\theta}_{k}\parallel +\parallel {\theta}_{k}{\theta}^{*}\parallel +1)(\parallel {\theta}_{k}\parallel +\parallel {\theta}_{k}{\theta}^{*}\parallel +\parallel {\theta}^{*}\parallel +2)$  
$\le $  $(2\parallel {\theta}_{k}{\theta}^{*}\parallel +\parallel {\theta}^{*}\parallel +1)(2\parallel {\theta}_{k}{\theta}^{*}\parallel +2\parallel {\theta}^{*}\parallel +2)$  
$\le $  $4{(\parallel {\theta}_{k}{\theta}^{*}\parallel +\parallel {\theta}^{*}\parallel +1)}^{2}$  
$\le $  $8[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}+{(\parallel {\theta}^{*}\parallel +1)}^{2}].$ 
Hence we can further control the term $({T}_{2,1})$ by
$({T}_{2,1})\le 8{\u03f5}_{k}\left[{\mathbb{E}}_{k}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]+{(\parallel {\theta}^{*}\parallel +1)}^{2}\right].$  (30) 
As for the term $({T}_{2,2})$, similarly Lemma 2.3 gives
$({T}_{2,2})$  $=2L\parallel {\theta}_{k{t}_{k}}{\theta}^{*}\parallel {\mathbb{E}}_{k}[\parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel ]$  
$\le 8{L}^{2}{\u03f5}_{k{t}_{k},k1}{\mathbb{E}}_{k}[\parallel {\theta}_{k{t}_{k}}{\theta}^{*}\parallel (\parallel {\theta}_{k}\parallel +1)]$  
$\le 8{L}^{2}{\u03f5}_{k{t}_{k},k1}{\mathbb{E}}_{k}[(\parallel {\theta}_{k}{\theta}_{k{t}_{k}}\parallel +\parallel {\theta}_{k}{\theta}^{*}\parallel )(\parallel {\theta}_{k}\parallel +1)]$  
$\le 16{L}^{2}{\u03f5}_{k{t}_{k},k1}{\mathbb{E}}_{k}[{(\parallel {\theta}_{k}{\theta}^{*}\parallel +\parallel {\theta}^{*}\parallel +1)}^{2}]$  
$\le 32{L}^{2}{\u03f5}_{k{t}_{k},k1}{\mathbb{E}}_{k}[[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]+{(\parallel {\theta}^{*}\parallel +1)}^{2}].$  (31) 
Using Eqs. (30) and (31) and the fact that ${\u03f5}_{k}\le {\u03f5}_{k{t}_{k},k1}$, we have
$({T}_{2})\le ({T}_{2,1})+({T}_{2,2})40{L}^{2}{\u03f5}_{k{t}_{k},k1}\left[{\mathbb{E}}_{k}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]+{(\parallel {\theta}^{*}\parallel +1)}^{2}\right].$  (32) 
Using Eqs. (29) and (32), we finally obtain
$(b)=$  $2{\u03f5}_{k}{\mathbb{E}}_{k}[{({\theta}_{k}{\theta}^{*})}^{\top}(F({X}_{k},{\theta}_{k})\overline{F}({\theta}_{k}))]$  
$=$  $2{\u03f5}_{k}[({T}_{1})+({T}_{2})]$  
$\le $  $112{L}^{2}{\u03f5}_{k}{\u03f5}_{k{t}_{k},k1}\left[{\mathbb{E}}_{k}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]+{(\parallel {\theta}^{*}\parallel +1)}^{2}\right].$ 
A.8 Proof of Lemma 2.5
Substituting the upper bounds we obtained for the terms $(a)$, $(b)$, and $(c)$ into Eq. (9), we have
${\mathbb{E}}_{k}[{\parallel {\theta}_{k+1}{\theta}^{*}\parallel}^{2}]{\mathbb{E}}_{k}[{\parallel {\theta}_{k}{\theta}^{*}\parallel}^{2}]\le $  $(2\alpha {\u03f5}_{k}+112{L}^{2}{\u03f5}_{k}{\u03f5}_{k{t}_{k},k1}+2{L}^{2}{\u03f5}_{k}^{2}){\mathbb{E}}_{k}[\parallel {\theta}_{k}{\theta}^{*}\parallel ]$  
$+(112{L}^{2}{\u03f5}_{k}{\u03f5}_{k{t}_{k},k1}+2{L}^{2}{\u03f5}_{k}^{2}){(\parallel {\theta}^{*}\parallel +1)}^{2}$  
$\le $  $(2\alpha {\u03f5}_{k}+114{L}^{2}{\u03f5}_{k}{\u03f5}_{k{t}_{k},k1}){\mathbb{E}}_{k}[\parallel {\theta}_{k}{\theta}^{*}\parallel ]$  
$+114{L}^{2}{\u03f5}_{k}{\u03f5}_{k{t}_{k},k1}{(\parallel {\theta}^{*}\parallel +1)}^{2},$ 
where in the last line we used ${\u03f5}_{k}\le {\u03f5}_{k{t}_{k},k1}$. The result then follows by taking the total expectation on both side of the previous inequality.
Appendix B Proof of All Technical Results in Section 3
B.1 Proof of Proposition 3.1

(1)
Consider two arbitrary states ${x}_{1}=({s}_{1},{a}_{1},{s}_{1}^{\prime}),{x}_{2}=({s}_{2},{a}_{2},{s}_{2}^{\prime})\in \mathcal{X}$. Since $\{{s}_{k}\}$ is irreducible, there exists $k>0$ such that ${P}^{k}({s}_{1}^{\prime},{s}_{2})>0$. Hence we have
${P}^{k+1}({x}_{1},{x}_{2})={P}^{k}({s}_{1}^{\prime},{s}_{2})\pi ({a}_{2}{s}_{2}){P}_{{a}_{2}}({s}_{2},{s}_{2}^{\prime})>0.$ It follows that $\{{X}_{k}\}$ is irreducible.
To show $\{{X}_{k}\}$ is aperiodic, assume for a contradiction that $\{{X}_{k}\}$ is periodic with period $T\ge 2$. Since $\{{X}_{k}\}$ is irreducible, every state in $\mathcal{X}$ has the same period. Therefore, for any $x=(s,a,{s}^{\prime})\in \mathcal{X}$, we must have ${P}^{k}(x,x)=0$ for all $k$ not divisible by $T$. However, we have for any $k$ not divisible by $T$:
${P}^{k}({s}^{\prime},{s}^{\prime})$ $={\displaystyle \sum _{s\in \mathcal{S}}}{P}^{k1}({s}^{\prime},s)P(s,{s}^{\prime})$ $={\displaystyle \sum _{s\in \mathcal{S}}}{\displaystyle \sum _{a\in \mathcal{A}}}{P}^{k1}({s}^{\prime},s)\pi (as){P}_{a}(s,{s}^{\prime})$ $={\displaystyle \sum _{s\in \mathcal{S}}}{\displaystyle \sum _{a\in \mathcal{A}}}{P}^{k}((s,a,{s}^{\prime}),(s,a,{s}^{\prime}))$ (33) $=0.$ (34) To see (33), since $\{{s}_{k}\}$ is a Markov chain, we have
${P}^{k}((s,a,{s}^{\prime}),(s,a,{s}^{\prime}))=$ $\mathbb{P}({s}_{k}=s,{a}_{k}=a,{s}_{k+1}={s}^{\prime}{s}_{0}=s,{a}_{0}=a,{s}_{1}={s}^{\prime})$ $=$ $\mathbb{P}({s}_{k}=s,{a}_{k}=a,{s}_{k+1}={s}^{\prime}{s}_{1}={s}^{\prime})$ $=$ ${P}^{k1}({s}^{\prime},s)\pi (as){P}_{a}(s,{s}^{\prime}).$ Therefore, Eq. (34) shows that the period of ${s}^{\prime}$ is at least $T$, which is a contradiction. Hence the Markov chain $\{{X}_{k}\}$ must be aperiodic.

(2)
Using CauchySchwarz inequality, and our assumption that $\parallel \varphi (s,a)\parallel \le 1$ for all stateaction pairs, we have for any ${\theta}_{1},{\theta}_{2}$ and $x$:
$\parallel F(x,{\theta}_{1})F(x,{\theta}_{2})\parallel =$ $\parallel \varphi (s,a)(\mathcal{R}(s,a)+\gamma \underset{{a}_{1}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{1})}^{\top}{\theta}_{1}\varphi {(s,a)}^{\top}{\theta}_{1})$ $\varphi (s,a)(\mathcal{R}(s,a)+\gamma \underset{{a}_{2}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{2})}^{\top}{\theta}_{2}\varphi {(s,a)}^{\top}{\theta}_{2})\parallel $ $\le $ $\gamma \parallel \varphi (s,a)(\underset{{a}_{1}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{1})}^{\top}{\theta}_{1}\underset{{a}_{2}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{2})}^{\top}{\theta}_{2})\parallel $ $+\parallel \varphi (s,a)\varphi {(s,a)}^{\top}({\theta}_{1}{\theta}_{2})\parallel $ $\le $ $\gamma \underset{{a}_{1}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{1})}^{\top}{\theta}_{1}\underset{{a}_{2}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{2})}^{\top}{\theta}_{2}+\parallel {\theta}_{1}{\theta}_{2}\parallel .$ Since
$\underset{{a}_{1}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{1})}^{\top}{\theta}_{1}\underset{{a}_{2}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{1})}^{\top}{\theta}_{2}\le $ $\underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}^{\prime})}^{\top}({\theta}_{1}{\theta}_{2})$ (35) $\le $ $\underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{max}}\parallel \varphi ({s}^{\prime},{a}^{\prime})\parallel \parallel {\theta}_{1}{\theta}_{2}\parallel $ $\le $ $\parallel {\theta}_{1}{\theta}_{2}\parallel ,$ we have for any ${\theta}_{1},{\theta}_{2}$ and $x$:
$\parallel F(x,{\theta}_{1})F(x,{\theta}_{2})\parallel \le $ $\gamma \underset{{a}_{1}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{1})}^{\top}{\theta}_{1}\underset{{a}_{2}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{2})}^{\top}{\theta}_{2}+\parallel {\theta}_{1}{\theta}_{2}\parallel $ $\le $ $(\gamma +1)\parallel {\theta}_{1}{\theta}_{2}\parallel .$ Therefore, the quantity $M=1+\gamma +{r}_{\mathrm{max}}$ is a valid Lipschitz constant. Moreover, since $\parallel F(x,0)\parallel =\parallel \varphi (s,a)\mathcal{R}(s,a)\parallel \le {r}_{\mathrm{max}}$ for any $(s,a)$, we have $M\ge {\mathrm{max}}_{x\in \mathcal{X}}\parallel F(x,0)\parallel $ as required.

(3)
Using the fact that $\overline{F}({\theta}^{*})=0$, we have
${(\theta {\theta}^{*})}^{\top}(\overline{F}(\theta )\overline{F}({\theta}^{*}))$ $=$ $\gamma {(\theta {\theta}^{*})}^{\top}{\mathbb{E}}_{\mu}[\varphi (s,a)(\underset{{a}_{1}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{1})}^{\top}\theta \underset{{a}_{2}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}_{2})}^{\top}{\theta}^{*})]$ ${\mathbb{E}}_{\mu}[{(\varphi {(s,a)}^{\top}(\theta {\theta}^{*}))}^{2}]$ $\le $ $\gamma {\mathbb{E}}_{\mu}[\varphi {(s,a)}^{\top}(\theta {\theta}^{*})\underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{max}}\varphi {({s}^{\prime},{a}^{\prime})}^{\top}(\theta {\theta}^{*})]{\mathbb{E}}_{\mu}[{(\varphi {(s,a)}^{\top}(\theta {\theta}^{*}))}^{2}]$ (36) $\le $ $\gamma \sqrt{{\mathbb{E}}_{\mu}[{(\varphi {(s,a)}^{\top}(\theta {\theta}^{*}))}^{2}]}\sqrt{{\mathbb{E}}_{\mu}[\underset{a\in \mathcal{A}}{\mathrm{max}}{(\varphi {(s,a)}^{\top}(\theta {\theta}^{*}))}^{2}]}$ ${\mathbb{E}}_{\mu}[{(\varphi {(s,a)}^{\top}(\theta {\theta}^{*}))}^{2}].$ (37) Eq. (36) follows from Eq. (35). Eq. (37) follows from the fact that when $s$ is sampled from the stationary distribution $\mu $, its successor state ${s}^{\prime}$ has the same distribution. For simplicity of notations, denote $A=\sqrt{{\mathbb{E}}_{\mu}[{(\varphi {(s,a)}^{\top}(\theta {\theta}^{*}))}^{2}]}$ and $B=\sqrt{{\mathbb{E}}_{\mu}[{\mathrm{max}}_{a\in \mathcal{A}}{(\varphi {(s,a)}^{\top}(\theta {\theta}^{*}))}^{2}]}$. Since Assumption 3.3 gives ${\gamma}^{2}{B}^{2}{A}^{2}\le \kappa {\parallel \theta {\theta}^{*}\parallel}^{2}$, we have
${(\theta {\theta}^{*})}^{\top}\overline{F}(\theta )\le \gamma AB{A}^{2}={\displaystyle \frac{{\gamma}^{2}{B}^{2}{A}^{2}}{\gamma B/A+1}}\le {\displaystyle \frac{\kappa}{2}}{\parallel \theta {\theta}^{*}\parallel}^{2}.$
B.2 Proof of Proposition 3.2
We first show that Eq. (20) implies $$, and $$. Note that Jensen’s inequality implies
${\mathbb{E}}_{\mu}[\underset{a\in \mathcal{A}}{\mathrm{max}}\varphi {(s,a)}^{2}]=$  ${\mathbb{E}}_{\mu}\left\{\mathrm{max}[{(\underset{a\in \mathcal{A}}{\mathrm{max}}\varphi (s,a))}^{2},{(\underset{a\in \mathcal{A}}{\mathrm{min}}\varphi (s,a))}^{2}]\right\}$  
$\ge $  $\mathrm{max}\{{\mathbb{E}}_{\mu}[{(\underset{a\in \mathcal{A}}{\mathrm{max}}\varphi (s,a))}^{2}],{\mathbb{E}}_{\mu}[{(\underset{a\in \mathcal{A}}{\mathrm{min}}\varphi (s,a))}^{2}]\}.$  (38) 
Thus, using Eq. (20), we have
${h}^{+}=$  ${\mathbb{E}}_{\mu}[\gamma \varphi (s,a)\underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{max}}\varphi ({s}^{\prime},{a}^{\prime})]{\mathbb{E}}_{\mu}[\varphi {(s,a)}^{2}]$  
$=$  ${\mathbb{E}}_{\mu}[\gamma \varphi (s,a)\underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{max}}\varphi ({s}^{\prime},{a}^{\prime})]\sqrt{{\mathbb{E}}_{\mu}[\varphi {(s,a)}^{2}]{\mathbb{E}}_{\mu}[\varphi {(s,a)}^{2}]}$  
$$  ${\mathbb{E}}_{\mu}[\gamma \varphi (s,a)\underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{max}}\varphi ({s}^{\prime},{a}^{\prime})]\gamma \sqrt{{\mathbb{E}}_{\mu}[\underset{a\in \mathcal{A}}{\mathrm{max}}\varphi {(s,a)}^{2}]{\mathbb{E}}_{\mu}[\varphi {(s,a)}^{2}]}$  
$\le $  $0,$ 
where the last inequality follows from CauchySchwarz inquality and the fact that ${s}^{\prime}$ and $s$ are equal in distribution if $s\sim \mu (\cdot )$. Similarly, we also have $$.
We next show the three statements hold in Proposition 3.2. By definition of ${h}^{+}$ and ${h}^{}$, in unidimensional case, ODE (18) can be equivalently written as
$$ 
In the case where ${r}_{\pi}=0$, it is easy to see that ODE (18) is globally asymptotically stable if and only if $$. Now we without loss of generality assume ${r}_{\pi}>0$, the proof for the other case is entirely similar.

•
Sufficiency: We first note that ${\theta}^{*}={r}_{\pi}/{h}^{+}>0$. Let $V(\theta )=\frac{1}{2}{(\theta {\theta}^{*})}^{2}$ be a candidate Lyapunov function. It is clear that $V(\theta )\ge 0$ for all $\theta \in \mathbb{R}$, and $V(\theta )=0$ if and only if $\theta ={\theta}^{*}$. Moreover,
$\dot{V}(\theta (t))$ $=(\theta (t){\theta}^{*})\dot{\theta}(t)$ $$ It is clear that $$ when $\theta (t)\in [0,{\theta}^{*})\cup ({\theta}^{*},\mathrm{\infty})$. For $$, since $$, $$, and ${h}^{}\theta (t)\ge 0$, we must also have $$. Therefore, the time derivative of the Lyapunov function $V(\theta )$ along the trajectory of ODE (18) is strictly negative when $\theta (t)\ne {\theta}^{*}$. It then follows from the Lyapunov stability theorem [12, 14] that ${\theta}^{*}$ is globally asymptotically stable.

•
Necessity: We prove by contradiction. Suppose that the equilibrium point ${\theta}^{*}$ is globally asymptotically stable, but ${h}^{+}\ge 0$ or ${h}^{}>0$.
If ${h}^{+}\ge 0$, when $\theta (0)>\mathrm{max}(0,{\theta}^{*})$, we have $\dot{\theta}(t)={h}^{+}\theta (t)+{r}_{\pi}\ge {r}_{\pi}>0$. It follows that $\theta (t)>\theta (0)>{\theta}^{*}$ for all $t\ge 0$, which contradict to the fact that ${\theta}^{*}$ is a globally asymptotically stable equilibrium point.
Similarly, if ${h}^{}>0$, when $$, we have $$. It follows that $$ for all $t\ge 0$, which also contradict to the fact ${\theta}^{*}$ being globally asymptotically stable.
B.3 Proof of Proposition 3.3
Since $d=\mathcal{S}\mathcal{A}$, there are total number of $\mathcal{S}\mathcal{A}$ feature vectors, and the feature matrix $\mathrm{\Phi}$ is a fullrank square matrix with dimension $\mathcal{S}\mathcal{A}$. Define
${\mathrm{\Theta}}_{s,a}=\text{span}{\left(\{\varphi ({s}^{\prime},{a}^{\prime})\mid ({s}^{\prime},{a}^{\prime})\in \mathcal{S}\times \mathcal{A},({s}^{\prime},{a}^{\prime})\ne (s,a)\}\right)}^{\u27c2}.$ 
Note that ${\mathrm{\Theta}}_{s,a}$ exists for all stateaction pairs since $\mathrm{\Phi}$ is full rank. Now for a given stateaction pair $(s,a)$, let $\theta \in {\mathrm{\Theta}}_{s,a}$ ($\theta \ne 0$), Eq. (17) implies
$$ 
which further gives $$. Therefore, by running though all stateaction pairs, we have $$. Thus, if ${\gamma}^{2}\ge 1/m$, there is no behavior policy $\pi $ that satisfies condition (17).
B.4 Computing $\delta (\pi )$ Given in Eq. (21)
We here present one way to compute $\delta (\pi )$ for an MDP with a chosen policy $\pi $ when the underlying model is known. Before that, the following definitions are needed.
Definition B.1.
Let ${\mathrm{\Sigma}}_{\mu \mathrm{,}\pi}\mathrm{:=}{\mathrm{\Phi}}^{\mathrm{\top}}\mathit{}{D}_{\mu \mathrm{,}\pi}\mathit{}\mathrm{\Phi}\mathrm{\in}{\mathrm{R}}^{d\mathrm{\times}d}$, where ${D}_{\mu \mathrm{,}\pi}\mathrm{\in}{\mathrm{R}}^{\mathrm{}\mathrm{S}\mathrm{}\mathit{}\mathrm{}\mathrm{A}\mathrm{}\mathrm{\times}\mathrm{}\mathrm{S}\mathrm{}\mathit{}\mathrm{}\mathrm{A}\mathrm{}}$ is a diagonal matrix with diagonal entries ${\mathrm{\{}\mu \mathit{}\mathrm{(}s\mathrm{)}\mathit{}\pi \mathit{}\mathrm{(}a\mathrm{}s\mathrm{)}\mathrm{\}}}_{\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{\in}\mathrm{S}\mathrm{\times}\mathrm{A}}$, and $\mathrm{\Phi}$ is the feature matrix.
Definition B.2.
Let $\mathrm{B}\mathrm{=}{\mathrm{A}}^{n}\mathrm{\subseteq}{\mathrm{R}}^{n}$ be the set of all deterministic policies.
Definition B.3.
Let ${D}_{\mu}\mathrm{\in}{\mathrm{R}}^{\mathrm{}\mathrm{S}\mathrm{}\mathrm{\times}\mathrm{}\mathrm{S}\mathrm{}}$ be a diagonal matrix with diagonal entries ${\mathrm{\{}\mu \mathit{}\mathrm{(}s\mathrm{)}\mathrm{\}}}_{s\mathrm{\in}\mathrm{S}}$, and let ${\mathrm{\Sigma}}_{\mu \mathrm{,}b}\mathrm{:=}{\mathrm{\Phi}}_{b}^{\mathrm{\top}}\mathit{}{D}_{\mu}\mathit{}{\mathrm{\Phi}}_{b}\mathrm{\in}{\mathrm{R}}^{d\mathrm{\times}d}$, where ${\mathrm{\Phi}}_{b}\mathrm{\in}{\mathrm{R}}^{\mathrm{}\mathrm{S}\mathrm{}\mathrm{\times}d}$ ($b\mathrm{\in}\mathrm{B}$) is such that
${\mathrm{\Phi}}_{b}^{\top}=[\varphi ({s}_{1},{b}_{1}),\varphi ({s}_{2},{b}_{2}),\mathrm{\cdots},\varphi ({s}_{n},{b}_{n})].$ 
We now compute $\delta (\pi )$ given in the following lemma.
Lemma B.1.
Let $\delta \mathit{}\mathrm{(}\pi \mathrm{)}$ be defined in (21) and let ${\lambda}_{\mathrm{max}}\mathit{}\mathrm{(}M\mathrm{)}$ be the largest eigenvalue of a positive semidefinite matrix $M$. Then we have
$\delta (\pi )=\underset{b\in \mathcal{B}}{\mathrm{min}}\left[1/{\lambda}_{\mathrm{max}}({\mathrm{\Sigma}}_{\mu ,\pi}^{1/2}{\mathrm{\Sigma}}_{\mu ,b}{\mathrm{\Sigma}}_{\mu ,\pi}^{1/2})\right].$ 
Proof of Lemma B.1.
Recall our definition for $\delta (\pi )$:
$\delta (\pi )=\underset{\theta \ne 0}{\mathrm{min}}{\displaystyle \frac{{\sum}_{s\in \mathcal{S}}\mu (s){\sum}_{a\in \mathcal{A}}\pi (as){(\varphi {(s,a)}^{\top}\theta )}^{2}}{{\sum}_{s\in \mathcal{S}}\mu (s){\mathrm{max}}_{a\in \mathcal{A}}{(\varphi {(s,a)}^{\top}\theta )}^{2}}}\cdot $  (39) 
Let $f(\theta )$ be the numerator, we have
$f(\theta )={\displaystyle \sum _{s\in \mathcal{S}}}\mu (s){\displaystyle \sum _{a\in \mathcal{A}}}\pi (as){(\varphi {(s,a)}^{\top}\theta )}^{2}={\theta}^{\top}{\mathrm{\Phi}}^{\top}{D}_{\mu ,\pi}\mathrm{\Phi}\theta ={\theta}^{\top}{\mathrm{\Sigma}}_{\mu ,\pi}\theta .$ 
Since the diagonal entries of ${D}_{\mu ,\pi}$ are all positive, and $\mathrm{\Phi}$ is full column rank, ${\mathrm{\Sigma}}_{\mu ,\pi}$ is symmetric and positive definite. To represent the denominator of (39) in a similar form, let
$g(\theta ,b)={\displaystyle \sum _{i=1}^{n}}\mu ({s}_{i}){(\varphi {({s}_{i},{b}_{i})}^{\top}\theta )}^{2}={\theta}^{\top}{\mathrm{\Phi}}_{b}^{\top}{D}_{\mu}{\mathrm{\Phi}}_{b}\theta ={\theta}^{\top}{\mathrm{\Sigma}}_{\mu ,b}\theta ,\text{where}b\in \mathcal{B}.$ 
Since the columns of ${\mathrm{\Phi}}_{b}$ can be dependent, ${\mathrm{\Sigma}}_{\mu ,b}$ is in general only symmetric and positive semidefinite. With the definition of $f(\theta )$ and $g(\theta ,b)$, $\delta (\pi )$ can be represented as
$\delta (\pi )=\underset{\theta \ne 0}{\mathrm{min}}{\displaystyle \frac{f(\theta )}{{\mathrm{max}}_{b\in \mathcal{B}}g(\theta ,b)}}=\underset{\theta \ne 0}{\mathrm{min}}\underset{b\in \mathcal{B}}{\mathrm{min}}{\displaystyle \frac{f(\theta )}{g(\theta ,b)}}=\underset{b\in \mathcal{B}}{\mathrm{min}}\underset{\theta \ne 0}{\mathrm{min}}{\displaystyle \frac{f(\theta )}{g(\theta ,b)}}.$ 
Now since ${\mathrm{\Sigma}}_{\mu ,\pi}$ is positive definite, ${\mathrm{\Sigma}}_{\mu ,\pi}^{1/2}$ and ${\mathrm{\Sigma}}_{\mu ,\pi}^{1/2}$ are both welldefined and positive definite, we have
$\underset{\theta \ne 0}{\mathrm{min}}{\displaystyle \frac{f(\theta )}{g(\theta ,b)}}$  $={\left[\underset{\theta \ne 0}{\mathrm{max}}{\displaystyle \frac{g(\theta ,b)}{f(\theta )}}\right]}^{1}$  
$={\left[\underset{\theta \ne 0}{\mathrm{max}}{\displaystyle \frac{{\theta}^{\top}{\mathrm{\Sigma}}_{\mu ,b}\theta}{{\theta}^{\top}{\mathrm{\Sigma}}_{\mu ,\pi}\theta}}\right]}^{1}$  
$={\left[{\left(\underset{x\ne 0}{\mathrm{max}}{\displaystyle \frac{\parallel {\mathrm{\Sigma}}_{\mu ,b}^{1/2}{\mathrm{\Sigma}}_{\mu ,\pi}^{1/2}x\parallel}{\parallel x\parallel}}\right)}^{2}\right]}^{1}$  
$={\displaystyle \frac{1}{{\lambda}_{\mathrm{max}}({\mathrm{\Sigma}}_{\mu ,\pi}^{1/2}{\mathrm{\Sigma}}_{\mu ,b}{\mathrm{\Sigma}}_{\mu ,\pi}^{1/2})}},$ 
where the function ${\lambda}_{\mathrm{max}}(\cdot )$ returns the largest eigenvalue. It follows that
$\delta (\pi )=\underset{b\in \mathcal{B}}{\mathrm{min}}[1/{\lambda}_{\mathrm{max}}({\mathrm{\Sigma}}_{\mu ,\pi}^{1/2}{\mathrm{\Sigma}}_{\mu ,b}{\mathrm{\Sigma}}_{\mu ,\pi}^{1/2})].$ 
∎
B.5 The MDP Model Used in Numerical Experiments
Our numerical experiments adopt the MDP model of the classical divergent example of $Q$learning with linear function approximation introduced in [1]. Consider the infinitehorizon sevenstate, twoaction MDP shown in Figure 5. The dashed action takes the system to one of the six upper states with equal probability, whereas the solid action takes the system to the seventh state with probability one. The behavior policy $\pi $ selects the dashed and solid actions with equal probability. Consider estimating the $Q$function under the linear parameterization indicated by the expression showing along each arrow in Figure 5. For example, the estimated value of state $1$ taking the solid action is ${\theta}_{0}+2{\theta}_{1}$, where the subscript corresponds to the component of the overall weight vector $\theta \in {\mathbb{R}}^{14}$. It is easy to verify that the feature matrix $\mathrm{\Phi}$ in this example is a square matrix with full rank.
References
 Baird [1995] Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.
 Beck and Srikant [2013] Beck, C. L. and Srikant, R. (2013). Improved upper bounds on the expected error in constant stepsize $Q$learning. In 2013 American Control Conference, pages 1926–1931. IEEE.
 Benveniste et al. [2012] Benveniste, A., Métivier, M., and Priouret, P. (2012). Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media.
 Bertsekas and Tsitsiklis [1996] Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neurodynamic programming, volume 5. Athena Scientific Belmont, MA.
 Bhandari et al. [2018] Bhandari, J., Russo, D., and Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. In Conference On Learning Theory, pages 1691–1692.
 Borkar [2009] Borkar, V. S. (2009). Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer.
 Borkar and Meyn [2000] Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.
 Chen et al. [2020] Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finitesample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874.
 De Farias and Van Roy [2000] De Farias, D. P. and Van Roy, B. (2000). On the existence of fixed points for approximate value iteration and temporaldifference learning. Journal of Optimization theory and Applications, 105(3):589–608.
 Duchi et al. [2012] Duchi, J. C., Agarwal, A., Johansson, M., and Jordan, M. I. (2012). Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578.
 EvenDar and Mansour [2003] EvenDar, E. and Mansour, Y. (2003). Learning rates for $Q$learning. Journal of Machine Learning Research, 5(Dec):1–25.
 Haddad and Chellaboina [2011] Haddad, W. M. and Chellaboina, V. (2011). Nonlinear dynamical systems and control: a Lyapunovbased approach. Princeton University Press.
 Jaakkola et al. [1994] Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems, pages 703–710.
 Khalil and Grizzle [2002] Khalil, H. K. and Grizzle, J. W. (2002). Nonlinear systems, volume 3. Prentice hall Upper Saddle River, NJ.
 Kober et al. [2013] Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
 Lee and He [2019] Lee, D. and He, N. (2019). A unified switching system perspective and ode analysis of $Q$learning algorithms. Preprint arXiv:1912.02270.
 Levin and Peres [2017] Levin, D. A. and Peres, Y. (2017). Markov chains and mixing times, volume 107. American Mathematical Soc.
 Li et al. [2020] Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020). Sample complexity of asynchronous $Q$learning: Sharper analysis and variance reduction. Preprint arXiv:2006.03041.
 Melo et al. [2008] Melo, F. S., Meyn, S. P., and Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on Machine learning, pages 664–671. ACM.
 Nemirovski et al. [2009] Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609.
 Puterman [2014] Puterman, M. L. (2014). Markov Decision Processes.: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
 Qu and Wierman [2020] Qu, G. and Wierman, A. (2020). Finitetime analysis of asynchronous stochastic approximation and $Q$learning. Preprint arXiv:2002.00260.
 Robbins and Monro [1951] Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.
 Rudin et al. [1964] Rudin, W. et al. (1964). Principles of mathematical analysis, volume 3. McGrawhill New York.
 Ryu and Boyd [2016] Ryu, E. K. and Boyd, S. (2016). Primer on monotone operator methods. Appl. Comput. Math, 15(1):3–43.
 ShalevShwartz et al. [2016] ShalevShwartz, S., Shammah, S., and Shashua, A. (2016). Safe, multiagent, reinforcement learning for autonomous driving. Preprint arXiv:1610.03295.
 Silver et al. [2017] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of go without human knowledge. Nature, 550(7676):354.
 Srikant and Ying [2019] Srikant, R. and Ying, L. (2019). Finitetime error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830.
 Sutton [1988] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44.
 Sutton and Barto [2018] Sutton, R. S. and Barto, A. G. (2018). Introduction to Reinforcement Learning. MIT Press Cambridge, 2 edition.
 Tesauro et al. [2008] Tesauro, G., Das, R., Chan, H., Kephart, J., Levine, D., Rawson, F., and Lefurgy, C. (2008). Managing power consumption and performance of computing systems using reinforcement learning. In Advances in Neural Information Processing Systems, pages 1497–1504.
 Thoppe and Borkar [2019] Thoppe, G. and Borkar, V. (2019). A concentration bound for stochastic approximation via alekseev’s formula. Stochastic Systems, 9(1):1–26.
 Tsitsiklis [1994] Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and $Q$learning. Machine learning, 16(3):185–202.
 Tsitsiklis and Van Roy [1997] Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporaldiffference learning with function approximation. In Advances in neural information processing systems, pages 1075–1081.
 Wainwright [2019] Wainwright, M. J. (2019). Stochastic approximation with conecontractive operators: Sharp ${\mathrm{\ell}}_{\mathrm{\infty}}$bounds for $Q$learning. Preprint arXiv:1905.06265.
 Watkins and Dayan [1992] Watkins, C. J. and Dayan, P. (1992). $Q$learning. Machine learning, 8(34):279–292.