### Abstract

In this work, we consider the popular tree-based search strategy within theframework of reinforcement learning, the Monte Carlo Tree Search (MCTS), in thecontext of infinite-horizon discounted cost Markov Decision Process (MDP).While MCTS is believed to provide an approximate value function for a givenstate with enough simulations, the claimed proof in the seminal works isincomplete. This is due to the fact that the variant, the Upper ConfidenceBound for Trees (UCT), analyzed in prior works utilizes "logarithmic" bonusterm for balancing exploration and exploitation within the tree-based search,following the insights from stochastic multi-arm bandit (MAB) literature. Ineffect, such an approach assumes that the regret of the underlying recursivelydependent non-stationary MABs concentrates around their mean exponentially inthe number of steps, which is unlikely to hold as pointed out in literature,even for stationary MABs. As the key contribution of this work, we establishpolynomial concentration property of regret for a class of non-stationary MAB.This in turn establishes that the MCTS with appropriate polynomial rather thanlogarithmic bonus term in UCB has the claimed property. Using this as abuilding block, we argue that MCTS, combined with nearest neighbor supervisedlearning, acts as a "policy improvement" operator: it iteratively improvesvalue function approximation for all states, due to combining with supervisedlearning, despite evaluating at only finitely many states. In effect, weestablish that to learn an $\varepsilon$ approximation of the value functionwith respect to $\ell_\infty$ norm, MCTS combined with nearest neighborrequires a sample size scaling as$\widetilde{O}\big(\varepsilon^{-(d+4)}\big)$, where $d$ is the dimension ofthe state space. This is nearly optimal due to a minimax lower bound of$\widetilde{\Omega}\big(\varepsilon^{-(d+2)}\big)$.