### Abstract

Inspired by the success of AlphaGo Zero (AGZ) which utilizes Monte Carlo TreeSearch (MCTS) with Supervised Learning via Neural Network to learn the optimalpolicy and value function, we focus on establishing formally that such anapproach indeed finds the optimal solution asymptotically, as well asestablishing non-asymptotic guarantees. We shall focus on infinite-horizondiscounted Markov Decision Process (MDP) to establish the results. To startwith, this requires establishing that for any given query state, MCTS providesan approximate value function for the state with enough simulation steps ofMDP. This property of MCTS was claimed in the literature, but the proof in theseminal works is incomplete. As an important contribution of this work, weestablish the correctness of MCTS with appropriate polynomial bonus term inUCB. In the process, we establish polynomial concentration properties of regretfor non-stationary Multi-Arm Bandits (MAB), which might be of interest in itsown right. Interestingly enough, AGZ utilizes a polynomial form of MCTS assuggested by our result. Using the above result, we argue that MCTS, combinedwith expressive enough supervised learning techniques, finds the optimal valueat nearly minimax optimal rate. Specifically, when using the nearest neighborsupervised learning, we show that MCTS acts as a "policy improvement" operator:it has a natural "bootstrapping" property to improve value functionapproximation for all states, due to combining with supervised learning,despite evaluating at only finitely many states. To learn an $\epsilon$approximation of the value function with respect to $\ell_\infty$ norm, MCTScombined with nearest neighbor requires a sample size scaling as$\tilde{O}(\epsilon^{-(d+4)})$, where $d$ is the dimension of the state space.This is nearly optimal due to a minimax lower bound of $\tilde{\Omega}(\epsilon^{-(d+2)}).$