Abstract
Motivated by applications in Reinforcement Learning (RL), in this paper, westudy a nonlinear Stochastic Approximation (SA) algorithm under Markoviannoise, and derive its finite-sample convergence bounds. Our proof is based onthe Lyapunov drift arguments, and to handle the Markovian noise, we exploit thefast mixing of the underlying Markov chain. Our result is used to show the finite-sample bounds of the popular Q-learningwith linear function approximation algorithm for solving the RL problem. SinceQ-learning with linear function approximation may diverge in general, we studyit under a condition on the behavior policy that ensures the stability of thealgorithm. Due to the generality of our SA results, we do not need to make theunnatural assumption that the samples are i.i.d. (since they are Markovian),and do not require an additional projection step in the algorithm to maintainthe boundedness of the iterates.