Non-Asymptotic Analysis of Monte Carlo Tree Search

  • 2020-01-13 15:22:39
  • Devavrat Shah, Qiaomin Xie, Zhi Xu
  • 0

Abstract

In this work, we consider the popular tree-based search strategy within theframework of reinforcement learning, the Monte Carlo Tree Search (MCTS), in thecontext of infinite-horizon discounted cost Markov Decision Process (MDP).While MCTS is believed to provide an approximate value function for a givenstate with enough simulations, the claimed proof in the seminal works isincomplete. This is due to the fact that the variant, the Upper ConfidenceBound for Trees (UCT), analyzed in prior works utilizes "logarithmic" bonusterm for balancing exploration and exploitation within the tree-based search,following the insights from stochastic multi-arm bandit (MAB) literature. Ineffect, such an approach assumes that the regret of the underlying recursivelydependent non-stationary MABs concentrates around their mean exponentially inthe number of steps, which is unlikely to hold as pointed out in literature,even for stationary MABs. As the key contribution of this work, we establishpolynomial concentration property of regret for a class of non-stationary MAB.This in turn establishes that the MCTS with appropriate polynomial rather thanlogarithmic bonus term in UCB has the claimed property. Using this as abuilding block, we argue that MCTS, combined with nearest neighbor supervisedlearning, acts as a "policy improvement" operator: it iteratively improvesvalue function approximation for all states, due to combining with supervisedlearning, despite evaluating at only finitely many states. In effect, weestablish that to learn an $\varepsilon$ approximation of the value functionwith respect to $\ell_\infty$ norm, MCTS combined with nearest neighborrequires a sample size scaling as$\widetilde{O}\big(\varepsilon^{-(d+4)}\big)$, where $d$ is the dimension ofthe state space. This is nearly optimal due to a minimax lower bound of$\widetilde{\Omega}\big(\varepsilon^{-(d+2)}\big)$.

 

Quick Read (beta)


[10pt]

Non-Asymptotic Analysis of Monte Carlo Tree Search

Devavrat Shah

LIDS, Massachusetts Institute of Technology, Cambridge, MA 02139, [email protected]u

Qiaomin Xie

ORIE, Cornell University, Ithaca, NY 14853, [email protected]

Zhi Xu

LIDS, Massachusetts Institute of Technology, Cambridge, MA 02139, [email protected]

In this work, we consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo Tree Search (MCTS), in the context of infinite-horizon discounted cost Markov Decision Process (MDP). While MCTS is believed to provide an approximate value function for a given state with enough simulations, cf. Kocsis and Szepesvári (2006), Kocsis et al. (2006), the claimed proof of this property is incomplete. This is due to the fact that the variant of MCTS, the Upper Confidence Bound for Trees (UCT), analyzed in prior works utilizes “logarithmic” bonus term for balancing exploration and exploitation within the tree-based search, following the insights from stochastic multi-arm bandit (MAB) literature, cf. Agrawal (1995), Auer et al. (2002). In effect, such an approach assumes that the regret of the underlying recursively dependent non-stationary MABs concentrates around their mean exponentially in the number of steps, which is unlikely to hold as pointed out in Audibert et al. (2009), even for stationary MABs.

As the key contribution of this work, we establish polynomial concentration property of regret for a class of non-stationary Multi-Arm Bandits (MAB). This in turn establishes that the MCTS with appropriate polynomial rather than logarithmic bonus term in UCB has the claimed property of Kocsis and Szepesvári (2006), Kocsis et al. (2006). Interestingly enough, empirically successful approaches (cf. Silver et al. (2017b)) utilize a similar polynomial form of MCTS as suggested by our result. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a “policy improvement” operator, i.e., it iteratively improves value function approximation for all states, due to combining with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an ε approximation of the value function with respect to norm, MCTS combined with nearest neighbor requires a sample size scaling as O~(ε-(d+4)), where d is the dimension of the state space. This is nearly optimal due to a minimax lower bound of Ω~(ε-(d+2)), suggesting the strength of the variant of MCTS we propose here and our resulting analysis.thanks: An extended abstract of this paper is accepted for presentation at ACM SIGMETRICS 2020.

Key words: Monte Carlo Tree Search, Non Stationary Multi-Arm Bandit, Reinforcement Learning

 

Monte Carlo Tree Search (MCTS) is a search framework for finding optimal decisions, based on the search tree built by random sampling of the decision space (Browne et al. 2012). MCTS has been widely used in sequential decision makings that have a tree representation, exemplified by games and planning problems. Since MCTS was first introduced, many variations and enhancements have been proposed. Recently, MCTS has been combined with deep neural networks for reinforcement learning, achieving remarkable success for games of Go (Silver et al. 2016, 2017b), chess and shogi (Silver et al. 2017a). In particular, AlphaGo Zero (AGZ) (Silver et al. 2017b) employs supervised learning to learn a policy/value function (represented by a neural network) based on samples generated via MCTS; the neural network is recursively used to estimate the value of leaf nodes in the next iteration of MCTS for simulation guidance.

Despite the wide application and empirical success of MCTS, there is only limited work on theoretical guarantees of MCTS and its variants. One exception is the work of Kocsis and Szepesvári (2006) and Kocsis et al. (2006), which propose running tree search by applying the Upper Confidence Bound algorithm — originally designed for stochastic multi-arm bandit (MAB) problems (Agrawal 1995, Auer et al. 2002) — to each node of the tree. This leads to the so-called UCT (Upper Confidence Bounds for Trees) algorithm, which is one of the popular forms of MCTS. In Kocsis and Szepesvári (2006), certain asymptotic optimality property of UCT is claimed. The proof therein is, however, incomplete, as we discuss in greater details in Section id1. More importantly, UCT as suggested in Kocsis and Szepesvári (2006) requires exponential concentration of regret for the underlying non-stationary MAB, which is unlikely to hold in general even for stationary MAB as pointed out in Audibert et al. (2009).

Indeed, rigorous analysis of MCTS is subtle, even though its asymptotic convergence may seem natural. A key challenge is that the tree policy (e.g., UCT) for selecting actions typically needs to balance exploration and exploitation, so the random sampling process at each node is non-stationary (non-uniform) across multiple simulations. A more severe difficulty arises due to the hierarchical/iterative structure of tree search, which induces complicated probabilistic dependency between a node and the nodes within its sub-tree. Specifically, as part of simulation within MCTS, at each intermediate node (or state), the action is chosen based on the outcomes of the past simulation steps within the sub-tree of the node in consideration. Such strong dependencies across time (i.e., depending on the history) and space (i.e., depending on the sub-trees downstream) among nodes makes the analysis non-trivial.

The goal of this paper is to provide a rigorous theoretical foundation for MCTS. In particular, we are interested in the following:

  • What is the appropriate form of MCTS for which the asymptotic convergence property claimed in the literature (cf. Kocsis and Szepesvári (2006), Kocsis et al. (2006)) holds?

  • Can we rigorously establish the “strong policy improvement” property of MCTS when combined with supervised learning as observed in the literature (e.g., in Silver et al. (2017b))? If yes, what is the quantitative form of it?

  • Does supervised learning combined with MCTS lead to the optimal policy, asymptotically? If so, what is its finite-sample (non-asymptotic) performance?

As the main contribution of this work, we provide affirmative answers to all of the above questions. In what follows, we provide a brief overview of our contributions and results.

Non-stationary MAB and recursive polynomial concentration. In stochastic Multi Arm Bandit (MAB), the goal is to discover amongst finitely many actions (or arms), the one with the best average reward while choosing as few non-optimal actions as possible in the process. The rewards for any given arm are assumed to be independent and identically distributed (i.i.d.). The usual exponential concentration for such i.i.d. and hence stationary processes leads to UCB algorithm with a logarithmic bonus term: at each time, choose action with maximal index (ties broken arbitrarily), where the index of an arm is defined as the empirical mean reward plus constant times logt/s, where t is the total number of trials so far and st is the number of times the particular action is chosen in these t trials.

The goal in the Monte Carlo Tree Search (MCTS) is very similar to the MAB setup described above – choose an action at a given query state that gives the best average reward. However, the reward depends on the future actions. Therefore, to determine the best action for the given state, one has to take future actions into account and MCTS does this by simulating future via effectively expanding all possible future actions recursively in the form of (decision-like) trees. In essence, the optimal action at the root of such a tree is determined by finding optimal path in the tree. And determining this optimal path requires solving multiple MABs, one per each intermediate node within the tree. Apart from the MABs associated with the lowest layer of the tree, all the MABs associated with the intermediate nodes turn out to have rewards that are the rewards generated by MAB algorithms for nodes downstream. This creates complicated, hierarchically inter-dependent MABs.

To determine the appropriate, UCB-like index algorithm for each node of the MCTS tree, it is essential to understand the concentration property of the rewards, i.e., concentration of regret for MABs associated with nodes downstream. While the rewards at leaf level may enjoy exponential concentration due to independence, the regret of any algorithm even for such an MAB is unlikely to have exponential concentration in general, cf. Audibert et al. (2009), Salomon and Audibert (2011). Further, the MAB of our interest has non-stationary rewards due to strong dependence across hierarchy. Indeed, an oversight of this complication led Kocsis and Szepesvári (2006), Kocsis et al. (2006) to suggest UCT inspired by the standard UCB algorithm for MABs with stationary, independent rewards.

As an important contribution of this work, we formulate an appropriate form of non-stationary MAB which correctly models the MAB at each of the node in the MCTS tree. For such a non-stationary MAB, we define UCB algorithm with appropriate index and under which we establish appropriate concentration of the induced regret. This, in turn, allows us to recursively define the UCT algorithm for MCTS by appropriately defining index for each of the node-action within the MCTS tree. Here we provide a brief summary.

Given [K]={1,,K} actions or arms, let Xi,t denote the reward generated by playing arm i[K] for the tth time. Let empirical mean over n trials for arm i be X¯i,n=1nt=1nXi,t, and let μi,n=𝔼[X¯i,n] be its expectation. Suppose μi,nμi as n for all i[K] and let there exist constants, β>1, ξ>0, and 1/2η<1 such that for every z1 and every integer n1,

(|nX¯i,n-nμi|nηz) βzξ.

Note that for i.i.d. bounded rewards, above holds for η=1/2 for any finite ξ due to exponential concentration. We propose to utilize the UCB algorithm where at time t, the arm It is chosen according to

Itargmaxi[K]{X¯i,Ti(t-1)+Bt-1,Ti(t-1)}, (1)

where Ti(t)=l=1t𝕀{Il=i} is the number of times arm i has been played, up to (including) time t, and the bias or bonus term Bt,s is defined as

Bt,s=β1/ξtη(1-η)s1-η.

Let μ*=maxi[K]μi and let X¯n denote the empirical average of the rewards collected. Then, we establish that 𝔼[X¯n] converges to μ*, and that for every n1 and every z1, a similar polynomial concentration holds:

(|nX¯n-nμ*|nηz) βzξ,

where ξ=ξη(1-η)-1, and β>1 is a large enough constant. The precise statement can be found as Theorem 3 in Section id1.

Corrected UCT for MCTS and non-asymptotic analysis. For MCTS, as discussed above, the leaf nodes have rewards that can be viewed as generated per standard stationary MAB. Therefore, the rewards for each arm (or action) at the leaf level in MCTS satisfy the required concentration property with η=1/2 due to independence. Hence, from our result for non-stationary MAB, we immediately obtain that we can recursively apply the UCB algorithm per (1) at each level in the MCTS with η=1/2 and appropriately adjusted constants β and ξ. In effect, we obtain modified UCT where the bias or bonus term Bt,s scales as t1/4/s1/2. This is in constrast to Bt,s scaling as logt/s in the standard UCB as well as UCT suggested in the literature, cf. Kocsis and Szepesvári (2006), Kocsis et al. (2006).

By recursively applying the convergence and concentration property of the non-stationary MAB for the resulting algorithm for MCTS, we establish that for any query state s of the MDP, using n simulations of the MCTS, we can obtain a value function estimation within error δε0+O(n-1/2), if we start with a value function estimation for all the leaf nodes within error ε0 for some δ<1 (independent of n, dependent on depth of MCTS tree). That is, MCTS is indeed asymptotically correct as was conjectured in the prior literature. For details, see Theorem 1 in Section id1.

MCTS with supervised learning, strong policy improvement, and near optimality. The result stated above for MCTS implies its “bootstrapping” property – if we start with a value function estimation for all state within error ε, then MCTS can produce estimation of value function for a given query state within error less than ε with enough simulations. By coupling such improved estimations of value function for a number of query states, combined with expressive enough supervised learning, one can hope to generalize such improved estimations of value function for all states. That is, MCTS coupled with supervised learning can be “strong policy improvement operator”.

Indeed, this is precisely what we establish by utilizing nearest neighbor supervised learning. Specifically, we establish that with O~(1ε4+d) number of samples, MCTS with nearest neighbor finds an ε approximation of the optimal value function with respect to -norm; here d is the dimension of the state space. This is nearly optimal in view of a minimax lower bound of Ω~(1ε2+d) (Shah and Xie 2018). For details, see Theorem 2 in Section id1.

An Implication. As mentioned earlier, the modified UCT policy per our result suggests using bias or bonus term Bt,s that scales as t1/4/s1/2 at each node within the MCTS. Interestingly enough, the empirical results of AGZ are obtained by utilizing Bt,s that scales as t1/2/s. This is qualitatively similar to what our results suggests and in contrast to the classical UCT.

Reinforcement learning aims to approximate the optimal value function and policy directly from experimental data. A variety of algorithms have been developed, including model-based approaches, model-free approaches like tabular Q-learning  (Watkins and Dayan 1992), and parametric approximation such as linear architectures (Sutton 1988). More recent work approximates the value function/policy by deep neural networks (Mnih et al. 2015, Schulman et al. 2015, 2017, Silver et al. 2017b, Yang et al. 2019), which can be trained using temporal-difference learning or Q-learning (Van Hasselt et al. 2016, Mnih et al. 2016, 2013).

MCTS is an alternative approach, which as discussed, estimates the (optimal) value of states by building a search tree from Monte-Carlo simulations (Kocsis and Szepesvári 2006, Chang et al. 2005, Coulom 2006, Browne et al. 2012). Kocsis and Szepesvári (2006) and Kocsis et al. (2006) argue for the asymptotic convergence of MCTS with standard UCT. However, the proof is incomplete (Szepesvári 2019). A key step towards proving the claimed result is to show the convergence and concentration properties of the regret for UCB under non-stationary reward distributions. In particular, to establish an exponential concentration of regret (Theorem 5, (Kocsis et al. 2006)), Lemma 14 is applied. However, it requires conditional independence of {Zi} sequence, which does not hold, hence making the conclusion of exponential concentration questionable. Therefore, the proof of the main result (Theorem 7, (Kocsis et al. 2006)), which applies Theorem 5 with an inductive argument, is incorrect as stated.

In fact, it may be infeasible to prove Theorem 5 in (Kocsis et al. 2006) as it was stated. For example, the work of Audibert et al. (2009) shows that for bandit problems, the regret under UCB concentrates around its expectation polynomially and not exponentially as desired in Kocsis et al. (2006). Further, Salomon and Audibert (2011) prove that for any strategy that does not use the knowledge of time horizon, it is infeasible to improve this polynomial concentration and establish exponential concentration. Our result is consistent with these fundamental bound of stationary MAB — we establish polynomial concentration of regret for non-stationary MAB, which plays a crucial role in our analysis of MCTS. Also see the work Munos et al. (2014) for a discussion of the issues with logarithmic bonus terms for tree search.

While we focus on UCT in this paper, we note that there are other variants of MCTS developed for a diverse range of applications. The work of Coquelin and Munos (2007) introduces flat UCB in order to improve the worst case regret bounds of UCT. Schadd et al. (2008) modifies MCTS for single-player games by adding to the standard UCB formula a term that captures the possible deviation of the node. In the work by Sturtevant (2008), a variant of MCTS is introduced for multi-player games by adopting the maxn idea. In addition to turn-based games like Go and Chess, MCTS has also been applied to real-time games (e.g., Ms. PacMan, Tron and Starcraft) and nondeterministic games with imperfect information. The applications of MCTS go beyond games, and appear in areas such as optimization, scheduling and other decision-making problems. We refer to the survey on MCTS by Browne et al. (2012) for other variations and applications.

It has become popular recently to combine MCTS with deep neural networks, which serve to approximate the value function and/or policy (Silver et al. 2016, 2017b, 2017a). For instance, in AlphaGo Zero, MCTS uses the neural network to query the value of leaf nodes for simulation guidance; the neural network is then updated with sample data generated by MCTS-based policy and used in tree search in the next iteration. Azizzadenesheli et al. (2018) develop generative adversarial tree search that generates roll-outs with a learned GAN-based dynamic model and reward predictor, while using MCTS for planning over the simulated samples and a deep Q-network to query the Q-value of leaf nodes.

In terms of theoretical results, the closest work to our paper is Jiang et al. (2018), where they also consider a batch, MCTS-based reinforcement learning algorithm, which is a variant of AlphaGo Zero algorithm. The key algorithmic difference from ours lies in the leaf-node evaluator of the search tree: they use a combination of an estimated value function and an estimated policy. The latest observations at the root node are then used to update the value and policy functions (leaf-node evaluator) for the next iteration. They also give a finite sample analysis. However, their result and ours are quite different: in their analysis, the sample complexity of MCTS, as well as the approximation power of value/policy architectures, are imposed as an assumption; here we prove an explicit finite-sample bound for MCTS and characterize the non-asymptotic error prorogation under MCTS with non-parametric regression for leaf-node evaluation. Therefore, they do not establish “strong policy improvement” property of the MCTS.

Two other closely related papers are Teraoka et al. (2014) and Kaufmann and Koolen (2017), which study a simplified MCTS for two-player zero-sum games. There, the goal is to identify the best action of the root in a given game tree. For each leaf node, a stochastic oracle is provided to generate i.i.d. samples for the true reward. Teraoka et al. (2014) give a high probability bound on the number of oracle calls needed for obtaining ε-accurate score at the root. The more recent paper Kaufmann and Koolen (2017) develops refined, instance-dependent sample complexity bounds. Compared to classical MCTS (e.g., UCT), both the setting and the algorithms in the above papers are simpler: the game tree is given in advance, rather than being built gradually through samples; the algorithm proposed in Teraoka et al. (2014) operates on the tree in a bottom-up fashion with uniform sampling at the leaf nodes. As a result, the analysis is significantly simpler and it is unclear whether the techniques can be extended to analyze other variants of MCTS.

It is important to mention the work of Chang et al. (2005) that explores the idea of using UCB for adaptive sampling in MDPs. The approximate value computed by the algorithm is shown to converge to the optimal value. We remark that their algorithm is different from the algorithm we analyze in this paper. In particular, their algorithm proceeds in a depth-first, recursive manner, and hence involves using UCB for a stationary MAB at each node. In contrast, the UCT algorithm we study involves non-stationary MABs, hence our analysis is significantly different from theirs. We refer the readers to the work by Kocsis and Szepesvári (2006) and Coulom (2006) for further discussion of this difference. Another related work by Kearns et al. (2002) studies a sparse sampling algorithm for large MDPs. This algorithm is also different from the MCTS family we analyze in this paper. Recently, Efroni et al. (2018) study multiple-step lookahead policies in reinforcement learning, which can be implemented via MCTS.

Section id1 describes the setting of Markov Decision Process considered in this work. Section id1 describes the Monte Carlo Tree Search algorithm and the main result about its non-asymptotic analysis. Section id1 describes a reinforcement learning method that combines the Monte Carlo Tree Search with nearest neighbor supervised learning. It describes the finite-sample analysis of the method for finding ε approximate value function with respect to norm. Section id1 introduces a form of non-stationary multi-arm bandit and an upper confidence bound policy for it. For this setting, we present the concentration of induced regret which serves as a key result for establishing the property of MCTS. The proofs of all the technical results are delegated to Sections id1 - id1 and Appendices.

We consider the setup of discrete-time discounted Markov decision process (MDP). An MDP is described by a five-tuple (𝒮,𝒜,𝒫,,γ), where 𝒮 is the set of states, 𝒜 is the set of actions, 𝒫𝒫(s|s,a) is the Markovian transition kernel, (s,a) is a random reward function, and γ(0,1) is a discount factor. At each time step, the system is in some state s𝒮. When an action a𝒜 is taken, the state transits to a next state s𝒮 according to the transition kernel 𝒫 and an immediate reward is generated according to (s,a).

A stationary policy π(a|s) gives the probability of performing action a𝒜 given the current state s𝒮. The value function for each state s𝒮 under policy π, denoted by Vπ(s), is defined as the expected discounted sum of rewards received following the policy π from initial state s, i.e.,

Vπ(s)=𝔼π[t=0γt(st,at)|s0=s].

The goal is to find an optimal policy π* that maximizes the value from each initial state. The optimal value function V* is defined as V*(s)=Vπ*(s)=supπVπ(s), s𝒮. It is well understood that such an optimal policy exists in reasonable generality. In this paper, we restrict our attention to the MDPs with the following assumptions:

Assumption 1 (MDP Regularity)

(A1.) The action space A is a finite set and the state space S is a compact subset of d dimensional set; without loss of generality, let S=[0,1]d; (A2.) The immediate rewards are random variables, uniformly bounded such that R(s,a)[-Rmax,Rmax],sS,aA for some Rmax>0; (A3.) The state transitions are deterministic, i.e. PP(s|s,a){0,1} for all s,sS,aA.

Define β1/(1-γ) and VmaxβRmax. Since all the rewards are bounded by Rmax, it is easy to see that the absolute value of the value function for any state under any policy is bounded by Vmax (Even-Dar and Mansour 2004, Strehl et al. 2006).

On Deterministic Transition. We note that the deterministic transition in MDP should not be viewed as restriction or assumption. Traditional AI game research has been focused on deterministic games with a tree representation. It is this context within which historically MCTS was introduced, has been extensively studied and utilized in practice (Browne et al. 2012). This includes the recent successes of MCTS in Go (Silver et al. 2017b), Chess (Silver et al. 2017a) and Atari games (Guo et al. 2014). There is a long theoretical literature on the analysis of MCTS and related methods (Browne et al. 2012, Hren and Munos 2008, Munos et al. 2014, Bartlett et al. 2019) that considers deterministic transition. The principled extension of MCTS algorithm itself as well as theoretical results similar to ours for the stochastic setting are important future work.

A classical approach to find optimal value function, V*, is an iterative approach called value function iteration. The Bellman equation characterizes the optimal value function as

V*(s) =maxa𝒜(𝔼[(s,a)]+γV*(sa)), (2)

where sa𝒮 is the notation to denote the state reached by applying action a on state s. Under Assumption 1, the transitions are deterministic and hence sa represents a single, deterministic state rather than a random state.

The value function iteration effectively views (2) as a fixed point equation and tries to find a solution to it through a natural iteration. Precisely, let V(t)() be the value function estimation in iteration t with V(0) being arbitrarily initialized. Then, for t0, for all s𝒮,

V(t+1)(s) =maxa𝒜(𝔼[(s,a)]+γV(t)(sa)). (3)

It is well known (cf. Bertsekas (2017)) that value iteration is contractive with respect to norm for all γ<1. Specifically, for t0, we have

V(t+1)-V* γV(t)-V*. (4)

Monte Carlo Tree Search (MCTS) has been quite popular recently in many of the reinforcement learning tasks. In effect, given a state s𝒮 and a value function estimate V^, it attempts to run the value function iteration for a fixed number of steps, say H, to evaluate V(H)(s) starting with V(0)=V^ per (3). This, according to (4), would provide an estimate within error γHV^-V* – an excellent estimate of V*(s) if H is large enough. The goal is to perform computation for value function iteration necessary to evaluate V(H) for state s only and not necessarily for all states as required by traditional value function iteration. MCTS achieves this by simply ‘unrolling’ the associated ‘computation tree’. Another challenge that MCTS overcomes is the fact that value function iteration as in (3) assumes knowledge of model so that it can compute 𝔼[(,)] for any state-action pair. But in reality, rewards are observed through samples, not a direct access to 𝔼[(,)]. MCTS tries to utilize the samples in a careful manner to obtain accurate estimation for V(H)(s) over the computation tree suggested by the value function iteration as discussed above. The concern of careful use of samples naturally connects it to multi-arm bandit like setting.

Next, we present a detailed description of the MCTS algorithm in Section id1. This can be viewed as a correction of the algorithm presented in Kocsis and Szepesvári (2006), Kocsis et al. (2006). We state its theoretical property in Section id1.

We provide details of a specific form of MCTS, which replaces the logarithmic bonus term of UCT with a polynomial one. Overall, we fix the search tree to be of depth H. Similar to most literature on this topic, it uses a variant of the Upper Confidence Bound (UCB) algorithm to select an action at each stage. At a leaf node (i.e., a state at depth H), we use the current value oracle V^ to evaluate its value. Note that since we consider deterministic transitions, consequently, the tree is fixed once the root node (state) is chosen, and we use the notation sa to denote the next state after taking action a at state s. Each edge represents a state-action pair, while each node represents a state. For clarity, we use superscript to distinguish quantities related to different depth. The pseudo-code for the MCTS procedure is given in Algorithm 1, and Figure 1 shows the structure of the search tree and related notation.

{algorithm}

[!ht] Fixed-Depth Monte Carlo Tree Search \[email protected]@algorithmic[1] \STATEInput: (1) current value oracle V^, root node s(0) and search depth H;
(2) number of MCTS simulations n;
(3) algorithmic constants, {α(i)}i=1H, {β(i)}i=1H, {ξ(i)}i=1H and {η(i)}i=1H. \STATEInitialization: for each depth h, initialize the cumulative node value v~(h)(s)=0 and visit count N(h)(s)=0 for every node s and initialize the cumulative edge value q(h)(s,a)=0. \FOReach MCTS simulation t=1,2,,n \STATE/*  Simulation:  select actions until reaching depth H*/ \FORdepth h=0,1,2,,H-1 \STATEat state s(h) of depth h, select an action (edge) according to

a(h+1)=argmaxa𝒜q(h+1)(s(h),a)+γv~(h+1)(s(h)a)N(h+1)(s(h)a)+(β(h+1))1/ξ(h+1)(N(h)(s(h)))α(h+1)/ξ(h+1)(N(h+1)(s(h)a))1-η(h+1), (5)

where dividing by zero is assumed to be +. \STATEupon taking the action a(h+1), receive a random reward r(h+1)(s(h),a(h+1)) and transit to a new state s(h+1) at depth h+1. \ENDFOR\STATE/*  Evaluation:  call value oracle for leaf nodes*/ \STATEreach s(H) at depth H, call the current value oracle and let v~(H)(s(H))=V^(s(H)). \STATE/*  Update Statistics:  quantities on the search path*/ \FORdepth h=0,1,2,,H-1 \STATEupdate statistics of nodes and edges that are on the search path of current simulation:

visit count: N(h+1)(s(h+1))=N(h+1)(s(h+1))+1
edge value: q(h+1)(s(h),a(h+1))=q(h+1)(s(h),a(h+1))+r(h+1)
node value: v~(h)(s(h))=v~(h)(s(h))+r(h+1)+γr(h+2)++γH-1-hr(H)+γH-hv~(H)(s(H))
\ENDFOR\ENDFOR\STATE

Output: average of the value for the root node v~(0)(s(0))/n.

In Algorithm 1, there are certain sequences of algorithmic parameters required, namely, α, β, ξ and η. The choices for these constants will become clear in our non-asymptotic analysis. At a higher level, the constants for the last layer (i.e., depth H), α(H), β(H), ξ(H) and η(H) depend on the properties of the leaf nodes, while the rest are recursively determined by the constants one layer below.

Figure 1: Notation and a sample simulation path of MCTS (thick lines).

We note that in selecting action a(h+1) at each depth h (i.e., Line 6 of Algorithm 1), the upper confidence term is polynomial in n while a typical UCB algorithm would be logarithmic in n, where n is the number of visits to the corresponding state thus far. The logarithmic factor in the original UCB algorithm was motivated by the exponential tail probability bounds. In our case, it turns out that exponential tail bounds for each layer seems to be infeasible without further structural assumptions. As mentioned in Section id1, prior work (Audibert et al. 2009, Salomon and Audibert 2011) has justified the polynomial concentration of the regret for the classical UCB in stochastic (independent rewards) multi-arm bandit setting. This implies that the concentration at intermediate depth (i.e., depth less than H) is at most polynomial. Indeed, we will prove these polynomial concentration bounds even for non-stationary (dependent, non-stationary rewards) multi-arm bandit that show up in MCTS and discuss separately in Section id1.

Now, we state the following result on the non-asymptotic performance of the MCTS as described above.

Theorem 1

Consider an MDP satisfying Assumption 1. Let H1, and for 1/2η<1, let

η(h) =η(H)η,h[H], (6)
α(h) =η(1-η)(α(h+1)-1),h[H-1], (7)
ξ(h) =α(h+1)-1,h[H-1]. (8)

Suppose that a large enough ξ(H) is chosen such that α(1)>2. Then, there exist corresponding constants {β(i)}i=1H such that for each query state sS, the following claim holds for the output V^n(s) of MCTS with n simulations:

|𝔼[V^n(s)]-V*(s)|γHε0+O(nη-1),

where ε0=V^-V* with V^ being the estimate of V* utilized by the MCTS algorithm for leaf nodes.

Since η[1/2,1), Theorem 1 implies a best case convergence rate of O(n-1/2) by setting η=1/2. With these parameter choices, the bias term in the upper confidence bound (line 6 of Algorithm 1) scales as (N(h)(s(h)))1/4/N(h+1)(s(h)a), that is, in the form of t1/4/S as mentioned in the introduction, where tN(h)(s(h)) is the number of times that state s(h) at depth h has been visited, and SN(h+1)(s(h)a) is the number of times action a has been selected at state s(h).

Recently, MCTS has been utilized prominently in various empirical successes of reinforcement learning including AlphaGo Zero (AGZ). Here, MCTS is combined with expressive supervised learning method to iteratively improve the policy as well as the value function estimation. In effect, MCTS combined with supervised learning acts as a “policy improvement” operator.

Intuitively, MCTS produces an improved estimation of value function for a given state of interest, starting with a given estimation of value function by “unrolling” the “computation tree” associated with value function iteration. And MCTS achieves this using observations obtained through simulations. Establishing this improvement property rigorously was the primary goal of Section id1. Now, given such improved estimation of value function for finitely many states, a good supervised learning method can learn to generalize such an improvement to all states. If so, this is like performing value function iteration, but using simulations. Presenting such a policy and establishing such guarantees is the crux of this section.

To that end, we present a reinforcement learning method that combines MCTS with nearest neighbor supervised learning. For this method, we establish that indeed, with sufficient number of samples, the resulting policy improves the value function estimation just like value function iteration. Using this, we provide a finite-sample analysis for learning the optimal value function within a given tolerance. We find it nearly matching a minimax lower bound in Shah and Xie (2018) which we recall in Section 2, and thus establishes near minimax optimality of such a reinforcement learning method.

Here we describe the policy to produce estimation of optimal value function V*. Similar approach can be applied to obtain estimation of policy as well. Let V(0) be the initial estimation of V*, and for simplicity, let V(0)()=0. We describe a policy that iterates between use of MCTS and supervised learning to iteratively obtain estimation V() for 1, so that iteratively better estimation of V* is produced as increases. To that end, for 1,

  • For appropriately sampled states S={si}i=1m, apply MCTS to obtain improved estimations of value function {V^()(si)}i=1m using V(-1) to evaluate leaf nodes during simulations.

  • Using {(si,V^()(si)}i=1m with a variant of nearest neighbor supervised learning with parameter δ(0,1), produce estimation V() of the optimal value function.

For completeness, the pseudo-code is provided in Algorithm 1. {algorithm}[htp] Reinforcement Learning Policy \[email protected]@algorithmic[1] \STATEInput: initial value function oracle V(0)(s)=0, s𝒮 \FORl=1,2,,L \STATE/*  improvement via MCTS  */ \STATEuniformly and independently sample states S={si}i=1m. \FOReach sampled state si \STATEapply the MCTS algorithm, which takes as inputs the current value oracle V(l-1), the depth H(l), the number of simulation nl, and the root node si, and outputs an improved estimate for V*(si):

V^(l)(si)=MCTS(V(l-1),H(l),nl,si) (9)
\ENDFOR\STATE

/*  supervised learning  */ \STATEwith the collected data 𝒟(l)={(si,V^(l)(si))}i=1ml, build a new value oracle V(l) via a nearest neighbor regression with parameter δl :

V(l)(s)=Nearest Neigbhor(𝒟(l),δl,s),s𝒮. (10)
\ENDFOR\STATE

Output: final value oracle V(L).

For simplicity, we shall utilize the following variant of the nearest neighbor supervised learning parametrized by δ(0,1). Given state space 𝒮=[0,1]d, we wish to cover it with minimal (up to scaling) number of balls of radius δ (with respect to 2-norm). To that end, since 𝒮=[0,1]d, one such construction is where we have balls of radius δ with centers being {(θ1,θ2,,θd):θ1,,θd𝒬(δ)} where

𝒬(δ)={12δi:i,0i2δ}{1-12δi:i,0i2δ}.

Let the collection of these balls be denoted by c1,,cK(δ,d) with K(δ,d)=|𝒬(δ)|. It is easy to verify that 𝒮i[K(δ,d)]ci, K(δ,d)=Θ(δ-d) and Cdδd𝗏olume(ci𝒮)Cdδd for strictly positive constants Cd,Cd that depends on d but not δ. For any s𝒮, let j(s)=min{j:scj}. Given observations {(si,V^()(si)}i=1m, we produce an estimate V()(s) for all s𝒮 as follows:

V()(s) ={i:sicj(s)V^()(si)|{i:sicj(s)}|,if|{i:sicj(s)}|0,0otherwise. (11)

For finite-sample analysis of the proposed reinforcement learning policy, we make the following structural assumption about the MDP. Specifically, we assume that the optimal value function (i.e., true regression function) is smooth in some sense. We note that some form of smoothness assumption for MDPs with continuous state/action space is typical for guarantee. The Lipschitz continuous assumption stated below is natural and representative in the literature on MDPs with continuous state spaces, cf.  Munos et al. (2014), Dufour and Prieto-Rumeau (2012, 2013) and Bertsekas (1975).

Assumption 2 (Smoothness)

The optimal value function V*:SR satisfies Lipschitz continuity with parameter C, i.e., s,sS=[0,1]d, |V*(s)-V*(s)|Cs-s2.

Now we state the result characterizing the performance of the reinforcement learning policy described above.

Theorem 2

Let Assumptions 1 and 2 hold. Let ε>0 be a given error tolerance. Then, for L=Θ(logεVmax), with appropriately chosen m,δ for [L] as well as parameters in MCTS, the reinforcement learning algorithm produces estimation of value function V(L) such that

𝔼[sups𝒮|V(L)(s)-V*(s)|]ε,

by selecting m states uniformly at random in S within iteration . This, in total, requires T number of state transitions (or samples), where

T =O(ε-(4+d)(log1ε)5).

Leveraging the the minimax lower bound for the problem of non-parametric regression (Tsybakov 2009, Stone 1982), recent work Shah and Xie (2018) establishes a lower bound on the sample complexity for reinforcement learning algorithms for general MDPs without additional structural assumptions. Indeed the lower bound also holds for MDPs with deterministic transitions (the proof is provided in Appendix id1), which is stated in the following proposition.

Proposition 1

Given an algorithm, let VT be the estimation of V* after T samples of state transitions for the given MDP. Then, for each ε(0,1), there exists an instance of deterministic MDP such that in order to achieve P[V^T-V*<ε]1-ε, it must be that

TCdε-(d+2)log(ε-1),

where C>0 is a constant independent of the algorithm.

Proposition 1 states that for any policy to learn the optimal value function within ε approximation error, the number of samples required must scale as Ω~(ε-(2+d)). Theorem 2 implies that the sample complexity of the proposed algorithm scales as O~(ε-(4+d)) (omitting the logarithmic factor). Hence, in terms of the dependence on the dimension, the proposed algorithm is nearly optimal. Optimizing the dependence of the sample complexity on other parameters is an important direction for future work.

We introduce a class of non-stationary multi-arm bandit (MAB) problems, which will play a crucial role in analyzing the MCTS algorithm. To that end, let there be K1 arms or actions of interest. Let Xi,t denote the random reward obtained by playing the arm i[K] for the tth time with t1. Let X¯i,n=1nt=1nXi,t denote the empirical average of playing arm i for n times, and let μi,n=𝔼[X¯i,n] be its expectation. For each arm i[K], the reward Xi,t is bounded in [-R,R] for some R>0, and we assume that the reward sequence, {Xi,t:t1}, is a non-stationary process satisfying the following convergence and concentration properties:

  1. A.

    (Convergence) the expectation μi,n converges to a value μi, i.e.,

    μi=limn𝔼[X¯i,n]. (12)
  2. B.

    (Concentration) there exist three constants, β>1, ξ>0, and 1/2η<1 such that for every z1 and every integer n1,

    (nX¯i,n-nμi nηz)βzξ,(nX¯i,n-nμi-nηz)βzξ. (13)

Consider applying the following variant of Upper Confidence Bound (UCB) algorithm to the above non-stationary MAB. Define upper confidence bound for arm or action i when it is played s times in total of ts time steps as

Ui,s,t =X¯i,s+Bt,s, (14)

where Bt,s is the “bonus term”. Denote by It the arm played at time t1. Then,

Itargmaxi[K]{X¯i,Ti(t-1)+Bt-1,Ti(t-1)}, (15)

where Ti(t)=l=1t𝕀{Il=i} is the number of times arm i has been played, up to (including) time t. We shall make specific selection of the bonus or bias term Bt,s as

Bt,s=β1/ξtα/ξs1-η. (16)

A tie is broken arbitrarily when selecting an arm. In the above, α>0 is a tuning parameter that controls the exploration and exploitation trade-off. Let μ*=maxi[K]μi be the optimal value with respect to the converged expectation, and i*argmaxi[K]μi be the corresponding optimal arm. We assume that the optimal arm is unique. Let δi*,n=μi*,n-μi*, which measures how fast the mean of the optimal non-stationary arm converges. For simplicity, quantities related to the optimal arm i* will be simply denoted with subscript *, e.g., δ*,n=δi*,n. Finally, denote by Δmin=mini[K],ii*Δi the gap between the optimal arm and the second optimal arm with notation Δi=μ*-μi.

Let X¯n1ni=1KTi(n)X¯i,Ti(n) denote the empirical average under the UCB algorithm (15). Then, X¯n satisfies the following convergence and concentration properties.

Theorem 3

Consider a non-stationary MAB satisfying (12) and (13). Suppose that algorithm (15) is applied with parameter α such that ξη(1-η)α<ξ(1-η) and α>2. Then, the following holds:

  1. A.

    Convergence:

    |𝔼[X¯n]-μ*| |δ*,n|+2R(K-1)((2Δminβ1/ξ)11-ηnαξ(1-η)+2α-2+1)n.
  2. B.

    Concentration: there exist constants, β>1 and ξ>0 and 1/2η<1 such that for every n1 and every z1,

    (nX¯n-nμ* nηz)βzξ,(nX¯n-nμ*-nηz)βzξ,

    where η=αξ(1-η), ξ=α-1, β depends on R,K,Δmin,β, ξ,α,η.

We establish the convergence and concentration properties of the variant of the Upper Confidence Bound algorithm described in Section id1 and specified through (14), (15) and (16).

Establishing the Convergence Property. We define a useful notation

Φ(n,δ)=nη(βδ)1/ξ. (17)

We begin with a useful lemma, which shows that the probability that a non-optimal arm or action has a large upper confidence is polynomially small. Proof is provided in Section id1.

Lemma 1

Let i[K],ii* be a sub-optimal arm and define

Ai(t) minu{Φ(u,t-α)uΔi2}=(2Δiβ1/ξtα/ξ)11-η. (18)

For each s and t such that, Ai(t)st, we have

(Ui,s,t>μ*)t-α.

Lemma 1 implies that as long as each arm is played enough, the sub-optimal ones become less likely to be selected. This allows us to upper bound the expected number of sub-optimal plays as follows.

Lemma 2

Let i[K],ii*, then

𝔼[Ti(n)](2Δiβ1/ξ)11-ηnαξ(1-η)+2α-2+1.

Proof of Lemma 2 is deferred to Section id1.

Completing Proof of Convergence. By the triangle inequality,

|μ*-𝔼[X¯n]|=|μ*-μ*,n|+|μ*,n-𝔼[X¯n]|=|δ*,n|+|μ*,n-𝔼[X¯n]|.

The second term can be bounded as follows:

n|μ*,n-𝔼[X¯n]|
=|𝔼[t=1nXi*,t]-𝔼[i=1KTi(n)X¯i,Ti(n)]|
|𝔼[t=1nXi*,t]-𝔼[T*(n)X¯i*,T*(n)]|+|𝔼[i=1,ii*KTi(n)X¯i,Ti(n)]|
=|𝔼[t=T*(n)+1nXi*,t]|+|𝔼[i=1,ii*KTi(n)X¯i,Ti(n)]|. (19)

Recall that the reward sequences are assumed to be bounded in [-R,R]. Therefore, the first term of (19) can be bounded as follows:

|𝔼[t=T*(n)+1nXi*,t]|𝔼[t=T*(n)+1n|Xi*,t|]R𝔼[i=1,ii*KTi(n)].

The second term can also be bounded as:

|𝔼[i=1,ii*KTi(n)X¯i,Ti(n)]| R𝔼[i=1,ii*KTi(n)].

Hence, we obtain that

|μ*-𝔼[X¯n]|=|δ*,n|+|μ*,n-𝔼[X¯n]||δ*,n|+2R𝔼[i=1,ii*KTi(n)]n.

Combining the above bounds and Lemma 2 yields the desired convergence result in Theorem 3.

Establishing the Concentration Property. Having proved the convergence property, the next step is to show that a similar concentration property (cf. (13)) also holds for X¯n. We aim to precisely capture the relationship between the original constants assumed in the assumption and the new constants obtained for X¯n. To begin with, recall the definition of Ai(t) in Lemma 1 and define

A(t) =maxi[K]Ai(t)=(2Δminβ1/ξ)11-ηtαξ(1-η). (20)

It can be checked that replacing β with any larger number still makes the concentration inequalities (13) hold. Without loss of generality, we hence let β be large enough so that 2Δminβ1/ξ>1. We further denote by Np the first time such that tA(t), i.e.,

Np =min{t1:tA(t)}=Θ((2ξβΔminξ)1ξ(1-η)-α). (21)

We first state the following concentration property, which will be further refined to match the desired form in Theorem 3. We defer the proof to Section id1.

Lemma 3

For any nNp and x1, let r0=nη+2R(K-1)(3+A(n)). Then,

(nX¯n-nμ*r0x)βxξ+2(K-1)(α-1)((1+A(n))x)α-1,
(nX¯n-nμ*-r0x)βxξ+2(K-1)(α-1)((1+A(n))x)α-1.

Lemma 3 confirms that indeed, as n becomes large, the average X¯n also satisfies certain concentration inequalities. However, the particular form of concentration in Theorem 3 does not quite match the form of concentration in Theorem 3 which we conclude next.

Completing Proof of Concentration Property. Let Np be a constant defined as follows:

Np=min{t1:tA(t) and 2RA(t)tη+2R(4K-3)}.

Recall the definition of A(t) and that αξη(1-η) and α<ξ(1-η). Hence, Np is guaranteed to exist. In addition, note that by definition, NpNp. For each nNp,

2RK(2Δminβ1/ξ)11-ηnαξ(1-η) =2RK[(2Δminβ1/ξ)11-ηnαξ(1-η)+1-1]
2RKA(n)-2RK
=2R(K-1)A(n)+2RA(n)-2RK
2R(K-1)A(n)+nη+2R(4K-3)-2RK
=2R(K-1)(A(n)+3)+nη=r0

Now, let us apply Lemma 3: for every nNp and x1, we have

(nX¯n-nμ*nαξ(1-η)[2RK(2Δminβ1/ξ)11-η]x) (nX¯n-nμ*r0x)
βxξ+2(K-1)(α-1)((1+A(n))x)α-1
2max(β,2(K-1)(α-1)(1+A(Np))α-1)xα-1, (22)

where the last inequality follows because nNp and A(n) is a non-decreasing function. In addition, since α<ξ(1-η)<ξ, we have α-1<ξ. For convenience, we define a constant

c12RK(2Δminβ1/ξ)11-η. (23)

Equivalently, by a change of variable, i.e., letting z=c1x, then for every nNp and z1, we obtain that

(nX¯n-nμ*nαξ(1-η)z) 2c1α-1max(β,2(K-1)(α-1)(1+A(Np))α-1)zα-1. (24)

The above inequality holds because: (1) if zc1, then (24) directly follows from (22); (2) if 1zc1, then the R.H.S. of (24) is at least 1 (by assumption, β>1) and the inequality trivially holds. The concentration inequality, i.e., Eq. (24), is now almost the same as the desired form in Theorem 3. The only difference is that it only holds for nNp. This is not hard to resolve. The easiest approach, which we show in the following, is to refine the constants to ensure that when 1n<Np, Eq. (24) is trivially true. To this end, we note that |nX¯n-nμ*|2Rn. For each 1n<Np, there is a corresponding z¯(n) such that nαξ(1-η)z¯(n)=2Rn. That is,

z¯(n)2Rn1-αξ(1-η),1n<Np.

This then implies that for each 1n<Np, the following inequality trivially holds:

(nX¯n-nμ*nαξ(1-η)z)z¯(n)α-1zα-1,z1.

To see why, note that for each 1n<Np: (1) if zz¯(n), then nαξ(1-η)z2Rn and the above probability should be 0. Hence, any positive number on the R.H.S. makes the inequality trivially true; (2) if 1z<z¯(n), the R.H.S. is at least 1, which again makes the inequality hold. For convenience, define

c2max1n<Npz¯(n)=2R(Np-1)1-αξ(1-η). (25)

Then, it is easy to see that for every n1 and every z1, we have

(nX¯n-nμ*nηz)βzξ,

where the constants are given by

η =αξ(1-η), (26)
ξ =α-1, (27)
β =max{c2, 2c1α-1max(β,2(K-1)(α-1)(1+A(Np))α-1)}. (28)

Finally, notice that since αξη(1-η) and α<ξ(1-η), we have 1/2ηη<1. Note that per (23), c1 depends on R,K,Δmin,β,ξ and η. In addition, c2 depends on R,K,Δmin,β,ξ,α,η and Np depends on R,K,Δmin,β,ξ,α,η. Therefore, β depends on R,K,Δmin,β, ξ,α,η. The other direction follows exactly the same reasoning, and this completes the proof of Theorem 3.

By the choice of Ai(t), s and t, we have Bt,s=Φ(s,t-α)sΦ(Ai(t),t-α)Ai(t)Δi2. Therefore,

(Ui,s,t>μ*) =(X¯i,s+Bt,s>μ*)
=(X¯i,s-μi>Δi-Bt,s)
(X¯i,s-μi>Bt,s) Δi2Bt,s
t-α. by concentration (13).

If a sub-optimal arm i is chosen at time t+1, i.e., It+1=i, then at least one of the following two equations must be true: with notation T*()=Ti*(),

Ui*,T*(t),t μ*, (29)
Ui,Ti(t),t >μ*. (30)

Indeed, if both inequalities are false, we have Ui*,T*(t),t>μ*Ui,Ti(t),t, which is a contradiction to It+1=i. We now use this fact to prove Lemma 2.

Case 1: n>Ai(n). Note that such n exists because Ai(n) grows with a polynomial order O(nαξ(1-η)) and α<ξ(1-η), i.e., Ai(n)=o(n). Then,

Ti(n) =t=0n-1𝕀{It+1=i}=(a)1+t=Kn-1𝕀{It+1=i}
=1+t=Kn-1(𝕀{It+1=i,Ti(t)<Ai(n)}+𝕀{It+1=i,Ti(t)Ai(n)})
Ai(n)+t=Kn-1𝕀{It+1=i,Ti(t)Ai(n)},

where equality (a) follows from the fact that Bt,s= if s=0.

To analyze the above summation, we note that from (29) and (30),

𝕀{It+1=i,Ti(t)Ai(n)} 𝕀{Ui*,T*(t),tμ* or Ui,Ti(t),t>μ*,Ti(t)Ai(n)}
𝕀{Ui,Ti(t),t>μ*,Ti(t)Ai(n)}+𝕀{Ui*,T*(t),tμ*,Ti(t)Ai(n)}
𝕀{Ui,Ti(t),t>μ*,Ti(t)Ai(n)}+𝕀{Ui*,T*(t),tμ*}
=𝕀{s:Ai(n)st, s.t. Ui,s,t>μ*}+𝕀{s*:1s*t, s.t. Ui*,s*,tμ*}.

To summarize, we have proved that

𝔼[Ti(n)] Ai(n)+t=Ai(n)n-1((29) or (30) is true, and Ti(t)Ai(n))
Ai(n)+t=Ai(n)n-1[(s:Ai(n)st, s.t. Ui,s,t>μ*E1)+(s*:1s*t, s.t. Ui*,s*,tμ*E2)]. (31)

To complete the proof of Lemma 2, it suffices to bound the probabilities of the two events E1 and E2. To this end, we use a union bound:

(E1) s=Ai(n)t(Ui,s,t>μ*)(a)s=Ai(n)tt-αtt-α=t1-α,

where the step (a) follows from Ai(n)Ai(t) and Lemma 1. We bound (E2) in a similar way:

(E2) s*=1t(Ui*,s*,tμ*)=s*=1t(X¯i*,s*+Bt,s*μ*)(a)s*=1tt-αt1-α,

where step (a) follows from concentration (cf. (13)). By substituting the bounds of (E1) and (E2) into (31), we have:

𝔼[Ti(n)] Ai(n)+t=Ai(n)n-12t1-α
Ai(n)+Ai(n)-12t1-α𝑑t α>2
=Ai(n)+2(Ai(n)-1)2-αα-2
Ai(n)+2α-2
(2Δiβ1/ξ)11-ηnαξ(1-η)+2α-2+1.

Case 2: nAi(n). Note that if n is such that nAi(n), then the above bound trivially holds because Ti(n)nAi(n). This completes the proof of Lemma 2.

We first prove one direction, namely, (nμ*-nX¯nr0x). The other direction follows the similar steps, and we will comment on that at the end of this proof. The general idea underlying the proof is to rewrite the quantity nμ*-nX¯n as sums of terms that can be bounded using previous lemmas or assumptions. To begin with, note that

nμ*-nX¯n =nμ*-i=1KTi(n)X¯i,Ti(n)
=nμ*-t=1T*(n)Xi*,t-ii*Ti(n)X¯i,Ti(n)
=nμ*-t=1nXi*,t+t=T*(n)+1nXi*,t-ii*t=1Ti(n)Xi,t
nμ*-t=1nXi*,t+2Rii*Ti(n),

because Xi,t[-R,R] for all i,t. Therefore, we have

(nμ*-nX¯nr0x) (nμ*-t=1nXi*,t+2Rii*Ti(n)r0x)
(nμ*-t=1nXi*,tnηx)+ii*(Ti(n)(3+A(n))x), (32)

where the last inequality follows from the union bound.

To prove the theorem, we now bound the two terms in (32). By our concentration assumption, we can upper bound the first term as follows:

(nμ*-t=1nXi*,tnηx)βxξ. (33)

Next, we bound each term in the summation of (32). Fix n and a sub-optimal edge i. Let u be an integer satisfying uA(n). For any τ, consider the following two events:

E1 ={For each integer t[u,n], we have Ui,u,tτ},
E2 ={For each integer s[1,n-u], we have Ui*,s,u+s>τ}.

As a first step, we want to show that

E1E2Ti(n)u. (34)

To this end, let us condition on both events E1 and E2. Recall that Bt,s is non-decreasing with respect to t. Then, for each s such that 1sn-u, and each t such that u+stn, it holds that

Ui*,s,t=X¯i*,s+Bt,sX¯i*,s+Bu+s,s=Ui*,s,u+s>τUi,u,t.

This implies that Ti(n)u. To see why, suppose that Ti(n)>u and denote by t the first time that arm i has been played u times, i.e., t=min{t:tn,Ti(t)=u}. Note that by definition tu+T*(t). Hence, for any time t such that t<tn, the above inequality implies that Ui*,T*(t),t>Ui,u,t. That is, i* always has a higher upper confidence bound than i, and arm i will not be selected, i.e., arm i will not be played the (u+1)-th time. This contradicts our assumption that Ti(n)>u, and hence we have the inequality Ti(n)u.

To summarize, we have established the fact that E1E2Ti(n)u. As a result, we have:

{Ti(n)>u} (E1cE2c)=({t:utn s.t. Ui,u,t>τ}{s:1sn-u, s.t. Ui*,s,u+sτ}).

Using union bound, we obtain that

(Ti(n)>u)t=un(Ui,u,t>τ)+s=1n-u(Ui*,s,u+sτ). (35)

Note that for the above bound, we are free to choose u and τ as long as uA(n). To connect with our goal (cf. (32)), in the following, we set u=(1+A(n))x+1 (recall that x1) and τ=μ* to bound (Ti(n)>u). Since uA(n)Ai(n), by Lemma 1, we have

t=un(Ui,u,t>μ*)t=unt-αu-1t-αdt =(u-1)1-αα-1
=((1+A(n))x)1-αα-1((1+A(n))x)1-αα-1.

As for the second summation in the R.H.S. of (35), we have that

s=1n-u(Ui*,s,u+sτ) =s=1n-u(Ui*,s,u+sμ*)
=s=1n-u(X¯i*,s+Bu+s,sμ*)
s=1n-u(s+u)-α=t=1+unt-α
u-1t-α𝑑t=(u-1)1-αα-1((1+A(n))x)1-αα-1,

where the first inequality follows from the concentration property, cf. (13). Combining the above inequalities and note that (3+A(n))x>(1+A(n))x+1:

(Ti(n)(3+A(n))x)(Ti(n)>u)2((1+A(n))x)1-αα-1. (36)

Substituting (33) and (36) into (32), we obtain

(nμ*-nX¯nr0x)βxξ+ii*2((1+A(n))x)1-αα-1,

which is the desired inequality in Lemma 3.

To complete the proof, we need to consider the other direction, i.e., (nX¯n-nμ*r0x). The proof is almost identical. Note that

nX¯n-nμ* =i=1KTi(n)X¯i,Ti(n)-nμ*
=t=1nXi*,t-nμ*-t=T*(n)+1nXi*,t+ii*t=1Ti(n)Xi,t
t=1nXi*,t-nμ*+2Rii*Ti(n),

because Xi,t[-R,R] for all i,t. Therefore,

(nX¯n-nμ*r0x) (t=1nXi*,t-nμ*+2Rii*Ti(n)r0x)
(t=1nXi*,t-nμ*nηx)+ii*(Ti(n)(3+Ai(n))x).

The desired inequality then follows exactly from the same reasoning of our previous proof.

In this section, we give a complete analysis for the fixed-depth Monte Carlo Tree Search (MCTS) algorithm illustrated in Algorithm 1 and prove Theorem 1. In effect, as discussed in Section id1, one can view a depth-H MCTS as a simulated version of H steps value function iterations. Given the current value function proxy V^, let V(H)() be the value function estimation after H steps of value function iteration starting with the proxy V^. Then, we prove the result in two parts. First, we argue that due to the MCTS sampling process, the mean of the empirical estimation of value function at the query node s, or the root node of MCTS tree, is within O(nη-1) of V(H)(s) after n simulations, with the given proxy V^ being the input to the MCTS algorithm. Second, we argue that V(H)(s) is within γHV^-V*γHε0 of the optimal value function. Putting this together leads to Theorem 1.

We start by a preliminary probabilistic lemma in Section id1 that will be useful throughout. Sections 4 and id1 argue the first part of the proof as explained above. Section id1 provides proof of the second part. And Section id1 concludes the proof of Theorem 1.

We state the following probabilistic lemma that is useful throughout. Proof can be found in Section id1.

Lemma 4

Consider real-valued random variables Xi,Yi for i1 such that Xs are independent and identically distributed taking values in [-B,B] for some B>0, Xs are independent of Ys, and Ys satisfy

  1. A.

    Convergence: for n1, with notation Y¯n=1n(i=1nYi),

    limn𝔼[Y¯n] =μY.
  2. B.

    Concentration: there exist constants, β>1, ξ>0, 1/2η<1 such that for n1 and z1,

    (nY¯n-nμYnηz)βzξ,(nY¯n-nμY-nηz)βzξ.

Let Zi=Xi+ρYi for some ρ>0. Then, Zs satisfy

  1. A.

    Convergence: for n1, with notation Z¯n=1n(i=1nZi), and μX=𝔼[X1],

    limn𝔼[Z¯n] =μX+ρμY.
  2. B.

    Concentration: there exist constant β>1 depending upon ρ,ξ,β and B, such that for n1 and z1,

    (nZ¯n-n(μX+ρμY)nηz)βzξ,(nZ¯n-n(μX+ρμY)-nηz)βzξ.

The goal is to understand the empirical reward observed at the query node for MCTS or the root node of the MCTS tree. In particular, we argue that the mean of the empirical reward at the root node of the MCTS tree is within O(nη-1) of the mean reward obtained at it assuming access to infinitely many samples. We start by analyzing the reward collected at the nodes that are at leaf level H and level H-1.

The nodes at leaf level, i.e., level H, are children of nodes at level H-1 in the MCTS tree. Let there be nH-1 nodes at level H-1 corresponding to states s1,H-1,,snH-1,H-1𝒮. Consider node i[nH-1] at level H-1, corresponding to state si,H-1. As part of the algorithm, whenever this node is visited, one of the K feasible actions is taken. When an action a[K] is taken, the node sH=si,H-1a, at the leaf level H is reached. This results in reward at node si,H-1 (at level H-1) being equal to (si,H-1,a)+γv~(H)(sH). Here, for each s𝒮 and a[K], the reward (s,a) is an independent, bounded random variable taking value in [-Rmax,Rmax] with distribution dependent on s,a; v~(H)() is the input of value function proxy to the MCTS algorithm denoted as V^(), and γ[0,1) is the discount factor. Recall that ε0=V^-V* and V*Vmax. Therefore, v~(H)=V^Vmax+ε0, and the reward collected at node si,H-1 by following any action is bounded, in absolute value, by R~max(H-1)=Rmax+γ(Vmax+ε0).

As part of the MCTS algorithm as described in (5), when node si,H-1 is visited for the t+1 time with t0, the action taken is

argmaxa𝒜 {1uaj=1ua(r(si,H-1,a)(j)+γv~(H)(si,H-1a)(j))+(β(H))1/ξ(H)(t)α(H)/ξ(H)(ua)1-η(H)},

where uat is the number of times action a has been chosen thus far at state si,H-1 in the t visits so far, r(si,H-1,a)(j) is the jth sample of random variable per distribution (si,H-1,a), and v~(H)(si,H-1a)(j) is the reward evaluated at leaf node si,H-1a for the jth time. Note that for all j, the reward evaluated at leaf node si,H-1a is the same and equals to v~(H)(), the input value function proxy for the algorithm. When ua=0, we use notation to represent quantity inside the argmax. The net discounted reward collected by node si,H-1 during its total of t1 visits is simply the sum of rewards obtained by selecting the actions per the policy – which includes the reward associated with taking an action and the evaluation of v~(H)() for appropriate leaf node, discounted by γ. In effect, at each node si,H-1, we are using the UCB policy described in Section id1 with parameters α(H),β(H),ξ(H),η(H) with K possible actions, where the rewards collected by playing any of these K actions each time is simply the summation of bounded independent and identical (for a given action) random variable and a deterministic evaluation. By applying Lemma 4, where Xs correspond to independent rewards, ρ=γ, and Ys correspond to deterministic evaluations of v~(H)(), we obtain that for given ξ(H)>0 and η(H)[12,1), there exists β(H) such that the collected rewards at si,H-1 (i.e., sum of i.i.d. reward and deterministic evaluations) satisfy the convergence property cf. (12) and concentration property cf. (13) stated in Section id1. Therefore, by an application of Theorem 3, we conclude Lemma 5 stated below. We define some notations first:

μa(H-1)(si,H-1) =𝔼[(si,H-1,a)]+γv~(H)(si,H-1a),
μ*(H-1)(si,H-1) =maxa[K]μa(H-1)(si,H-1)
a*(H-1)(si,H-1) argmaxa[K]μa(H-1)(si,H-1) (37)
Δmin(H-1)(si,H-1) =μ*(H-1)(si,H-1)-maxaa*(H-1)(si,H-1)μa(H-1)(si,H-1).

We shall assume that the maximizer in the set argmaxa[K]μa(H-1)(si,H-1) is unique, i.e. Δmin(H-1)(si,H-1)>0. And further note that all rewards belong to [-R~max(H-1),R~max(H-1)].

Lemma 5

Consider a node corresponding to state si,H-1 at level H-1 within the MCTS for i[nH-1]. Let v~(H-1)(si,H-1)n be the total discounted reward collected at si,H-1 during n1 visits of it, to one of its K leaf nodes under the UCB policy. Then, for the choice of appropriately large β(H)>0, for a given ξ(H)>0, η(H)[12,1) and α(H)>2, we have

  1. A.

    Convergence:

    |𝔼[1nv~(H-1)(si,H-1)n]-μ*(H-1)(si,H-1)|
    2R~max(H-1)(K-1)((2(β(H))1ξ(H)Δmin(H-1)(si,H-1))11-η(H)nα(H)ξ(H)(1-η(H))+2α(H)-2+1)n.
  2. B.

    Concentration: there exist constants, β>1 and ξ>0 and 1/2η<1 such that for every n1 and every z1,

    (v~(H-1)(si,H-1)n-nμ*(H-1)(si,H-1)nηz)βzξ,
    (v~(H-1)(si,H-1)n-nμ*(H-1)(si,H-1)-nηz)βzξ,

    where η=α(H)ξ(H)(1-η(H)), ξ=α(H)-1, and β is a large enough constant that is function of parameters α(H),β(H),ξ(H), η(H),R~max(H-1),K,Δmin(H-1)(si,H-1).

Let Δmin(H-1)=mini[nH-1]Δmin(H-1)(si,H-1). Then, the rate of convergence for each node si,H-1, i[nH-1] can be uniformly simplified as

δn(H-1)= 2R~max(H-1)(K-1)((2(β(H))1ξ(H)Δmin(H-1))11-η(H)nα(H)ξ(H)(1-η(H))+2α(H)-2+1)n
=  Θ(nα(H)ξ(H)(1-η(H))-1)
=(a)  O(nη-1),

where (a) holds since α(H)=ξ(H)(1-η(H))η(H),η(H)=η. It is worth remarking that μ*(H-1)(si,H-1), as defined in (37), is precisely the value function estimation for si,H-1 at the end of one step of value iteration starting with V^.

Lemma 5 suggests that the necessary assumption of Theorem 3, i.e. (12) and (13), are satisfied by v~n(H-1) for each node or state at level H-1, with α(H-1),ξ(H-1),η(H-1) as defined per relationship (6) - (8) and with appropriately defined large enough constant β(H-1). We shall argue that result similar to Lemma 5, but for node at level H-2, continues to hold with parameters α(H-2),ξ(H-2),η(H-2) as defined per relationship (6) - (8) and with appropriately defined large enough constant β(H-2). And similar argument will continue to apply going from level h to h-1 for all hH-1. That is, we shall assume that the necessary assumption of Theorem 3, i.e. (12) and (13), holds for v~(h)(), for all nodes at level h with α(h),ξ(h),η(h) as defined per relationship (6) - (8) and with appropriately defined large enough constant β(h), and then argue that such holds for nodes at level h-1 as well. This will, using mathematical induction, allow us to prove the results for all h1.

To that end, consider any node at level h-1. Let there be nh-1 nodes at level h-1 corresponding to states s1,h-1,,snh-1,h-1𝒮. Consider a node corresponding to state si,h-1 at level h-1 within the MCTS for i[nh-1]. As part of the algorithm, whenever this node is visited, one of the K feasible action is taken. When an action a[K] is taken, the node sh=si,h-1a, at the level h is reached. This results in reward at node si,h-1 at level h-1 being equal to (si,h-1,a)+γv~(h)(sh). As noted before, (s,a) is an independent, bounded valued random variable while v~(h)() is effectively collected by following a path all the way to the leaf level. Inductively, we assume that v~(h)() satisfies the convergence and concentration property for each node or state at level h, with α(h),ξ(h),η(h) as defined per relationship (6) - (8) and with appropriately defined large enough constant β(h). Therefore, by an application of Lemma 4, it follows that this combined reward continues to satisfy (12) and (13), with α(h),ξ(h),η(h) as defined per relationship (6) - (8) and with a large enough constant which we shall denote as β(h). These constants are used by the MCTS policy. By an application of Theorem 3, we can obtain the following Lemma 6 regarding the convergence and concentration properties for the reward sequence collected at node si,h-1 at level h-1. Similar to the notation in Eq. (37), let

μa(h-1)(si,h-1) =𝔼[(si,h-1,a)]+γμ*(h)(si,h-1a)
μ*(h-1)(si,h-1) =maxa[K]μa(h-1)(si,h-1)
a*(h-1)(si,h-1) argmaxa[K]μa(h-1)(si,h-1) (38)
Δmin(h-1)(si,h-1) =μ*(h-1)(si,h-1)-maxaa*(h-1)(si,h-1)μa(h-1)(si,h-1).

Again, we shall assume that the maximizer in the set argmaxa[K]μa(h-1)(si,h-1) is unique, i.e. Δmin(h-1)(si,h-1)>0. Define R~max(h-1)=Rmax+γR~max(h), where R~(H)=Vmax+ε0. Note that all rewards collected at level h-1 belong to [-R~max(h-1),R~max(h-1)].

Lemma 6

Consider a node corresponding to state si,h-1 at level h-1 within the MCTS for i[nh-1]. Let v~(h-1)(si,h-1)n be the total discounted reward collected at si,h-1 during n1 visits. Then, for the choice of appropriately large β(h)>0, for a given ξ(h)>0, η(h)[12,1) and α(h)>2, we have

  1. A.

    Convergence:

    |𝔼[1nv~(h-1)(si,h-1)n]-μ*(h-1)(si,h-1)|
    2R~max(h-1)(K-1)((2(β(h))1ξ(h)Δmin(h-1)(si,h-1))11-η(h)nα(h)ξ(h)(1-η(h))+2α(h)-2+1)n.
  2. B.

    Concentration: there exist constants, β>1 and ξ>0 and 1/2η<1 such that for n1, z1,

    (v~(h-1)(si,h-1)n-nμ*(h-1)(si,h-1)nηz)βzξ,
    (v~(h-1)(si,h-1)n-nμ*(h-1)(si,h-1)-nηz)βzξ,

    where η=α(h)ξ(h)(1-η(h)), ξ=α(h)-1, and β is a large enough constant that is function of parameters α(h),β(h),ξ(h), η(h),R~max(h-1),K,Δmin(h-1)(si,h-1).

As before, let us define Δmin(h-1)=mini[nh-1]Δmin(h-1)(si,h-1). Similarly, we can show that for every node si,h-1, i[nh-1], the rate of convergence in Lemma 6 can be uniformly simplified as

δn(h-1) =2R~max(h-1)(K-1)((2(β(h))1ξ(h)Δmin(h-1))11-η(h)nα(h)ξ(h)(1-η(h))+2α(h)-2+1)n
 =Θ(nα(h)ξ(h)(1-η(h))-1)=O(nη-1),

where the last equality holds as α(h)=ξ(h)(1-η(h))η(h) and η(h)=η. Again, it is worth remarking, inductively, that μ*(h-1)(si,h-1) is precisely the value function estimation for si,h-1 at the end of H-h+1 steps of value iteration starting with V^.

Remark (Recursive Relation among Parameters). With the above development, we are ready to elaborate our choice of parameters in Theorem 1, defined recursively via Eqs. (6)-(8). In essence, those parameter requirements originate from our analysis of the non-stationary MAB, i.e., Theorem 3. Recall that from our previous analysis, the key to establish the MCTS guarantee is to recursively argue the convergence and the polynomial concentration properties at each level; that is, we recursively solve the non-stationary MAB problem at each level. In order to do so, we apply our result on the non-stationary MAB (Theorem 3) recursively at each level. Importantly, recall that Theorem 3 only holds when ξη(1-η)α<ξ(1-η) and α>2, under which it leads to the recursive conclusions η=αξ(1-η) and ξ=α-1. Using our notation with superscript indicating the levels, this means that apart from the parameters at the leaf level (level H) which could be freely chosen, we must choose parameters of other levels recursively so that the following conditions hold:

α(h)>2,ξ(h)η(h)(1-η(h))α(h)<ξ(h)(1-η(h)),
ξ(h)=α(h+1)-1 and η(h)=α(h+1)ξ(h+1)(1-η(h+1)).

It is not hard to see that the conditions in Theorem 1 guarantee the above. There might be other sequences of parameters satisfying the requirements, but our particular choice gives cleaner analysis as presented in this paper.

We now move to the second part of the proof. The value function iteration improves the estimation of optimal value function by iterating Bellman equation. In effect, the MCTS tree is “unrolling” H steps of such an iteration. Precisely, let V(h)() denote the value function after h iterations starting with V(0)=V^. By definition, for any h0 and s𝒮,

V(h+1)(s) =maxa[K](𝔼[(s,a)]+γV(h)(sa)). (39)

Recall that value iteration is contractive with respect to norm (cf. Bertsekas (2017)). That is, for any h0,

V(h+1)-V* γV(h)-V*. (40)

As remarked earlier, μ*(h-1)(si,h-1), the mean reward collected at node si,h-1 for i[nh-1] for any h1, is precisely V(H-h+1)(si,h-1) starting with V(0)=V^, the input to MCTS policy. Therefore, the mean reward collected at root node s(0) of the MCTS tree satisfies μ*(0)(s(0))=V(H)(s(0)). Using (40), we obtain the following Lemma.

Lemma 7

The mean reward collected under the MCTS policy at root note s(0), μ*(0)(s(0)), starting with input value function proxy V^ is such that

|μ*(0)(s(0))-V*(s(0))| γHV^-V*. (41)

In summary, using Lemma 6, we conclude that the recursive relationship going from level h to h-1 holds for all h1 with level 0 being the root. At root s(0), the query state that is input to the MCTS policy, we have that after n total simulations of MCTS, the empirical average of the rewards over these n trial, 1nv~(0)(s0)n is such that (using the fact that α(0)=ξ(0)(1-η(0))η(0))

|𝔼[1nv~(0)(s0)n]-μ*(0)| =O(nα(0)ξ(0)(1-η(0))-1)=O(nη-1), (42)

where μ*(0) is the value function estimation for s(0) after H iterations of value function iteration starting with V^. By Lemma 7, we have

|μ*(0)-V*(s(0))| γHε0, (43)

since ε0=V^-V*. Combining (42) and (43),

|𝔼[1nv~(0)(s0)n]-V*(s(0))| γHε0+O(nη-1). (44)

This concludes the proof of Theorem 1.

The convergence property, limn𝔼[Z¯n]=μX+ρμY, follows simply by linearity of expectation. For concentration, consider the following: since Xs are i.i.d. bounded random variables taking value in [-B,B], by Hoeffding’s inequality (Hoeffding 1963), we have that for t0,

(nX¯n-nμXnt)exp(-t2n2B2), (45)
(nX¯n-nμX-nt)exp(-t2n2B2).

Therefore,

(nZ¯n-n(μX+ρμY)nηz) (nX¯n-nμXnηz2)+(nY¯n-nμYnηz2ρ)
exp(-z2n2η-18B2)+β2ξρξzξ
βzξ, (46)

where β is a large enough constant depending upon ρ,ξ,β and B. The other-side of the inequality follows similarly. This completes the proof.

First, we establish a useful property of nearest neighbor supervised learning presented in Section id1. This is stated in Section id1. We will use it, along with the guarantees obtained for MCTS in Theorem 1 to establish Theorem 2 in Section id1. Throughout, we shall assume the setup of Theorem 2.

Let δ(0,1) be given. As stated in Section id1, let K(δ,d)=Θ(δ-d) be the collection of balls of radius δ, say ci,i[K(δ,d)], so that they cover 𝒮, i.e. 𝒮i[K(ε,d)]ci. Also, by construction, each of these balls have intersection with 𝒮 whose volume is at least Cdδd. Let S={si:i[N]} denote N state samples from 𝒮 uniformly at random and independent of each other. For each state s𝒮, let V:𝒮[-Vmax,Vmax] be such that |𝔼[V(s)]-V*(s)|Δ. Let the nearest neighbor supervised learning described in Section id1 produce estimate V^:𝒮 using labeled data points (si,V(si))i[N]. Then, we claim the following guarantee. Proof can be found in Section id1.

Lemma 8

Under the above described setup, as long as N32max(1,δ-2Vmax2)Cd-1δ-dlogK(δ,d)δ, i.e., N=Ω(dδ-d-2logδ-1),

𝔼[sups𝒮|V^(s)-V*(s)|] Δ+(C+1)δ+4Vmaxδ2K(δ,d). (47)

Using Theorem 1 and Lemma 8, we complete the proof of Theorem 2 under appropriate choice of algorithmic parameters. We start by setting some notation.

To that end, the algorithm as described in Section id1 iterates between MCTS and supervised learning. In particular, let 1 denote the iteration index. Let m be the number of states that are sampled uniformly at random, independently, over 𝒮 in this iteration, denoted as S()={si():i[m]}. Let V(-1) be the input of value function from prior iteration, using which the MCTS algorithm with n simulations obtains improved estimates of value function for states in S() denoted as V^()(si()),i[m]. Using (si(),V^()(si()))i[m], the nearest neighbor supervised learning as described above with balls of appropriate radius δ(0,1) produces estimate V() for all states in 𝒮. Let () denote the smallest σ-algebra containing all information pertaining to the algorithm (both MCTS and supervised learning). Define the error under MCTS in iteration as

εmcts() =𝔼[sups𝒮|𝔼[V^()(s)|(-1)]-V*(s)|]. (48)

And, the error for supervised learning in iteration as

θsl() =sups𝒮|V()(s)-V*(s)|,andεsl()=𝔼[θsl()]. (49)

Recall that in the beginning, we set V(0)(s)=0 for all s𝒮. Since V*()[-Vmax,Vmax], we have that εsl(0)Vmax. Further, it is easy to see that if the leaf estimates (i.e., the output of the supervised learning from the previous iteration) is bounded in [-Vmax,Vmax], then the output of the MCTS algorithm is always bounded in [-Vmax,Vmax]. That is, since V(0)(s)=0 and the nearest neighbor supervised learning produces estimate V(l) via simple averaging, inductively, the output of the MCTS algorithm is always bounded in [-Vmax,Vmax] throughout every iteration.

With the notation as set up above, it follows that for a given δ(0,1) with m satisfying condition of Lemma 8, i.e. m=Ω(dδ-d-2logδ-1), and with the nearest neighbor supervised learning using δ radius balls for estimation, we have the following recursion:

εsl() εmcts()+(C+1)δ+4Vmaxδ2K(δ,d)εmcts()+Cδ, (50)

where C is a large enough constant, since δ2K(δ,d)=Θ(dδd+2) which is O(δ) for all δ(0,1). By Theorem 1, for iteration +1 that uses the output of supervised learning estimate, V(), as the input to the MCTS algorithm, we obtain

|𝔼[V^(+1)(s)|()]-V*(s)|γH(+1)𝔼[θsl()|()]+O(n+1η-1),s𝒮, (51)

where η[1/2,1) is the constant utilized by MCTS with fixed height of tree being H(+1). This then implies that

εmcts(+1) =𝔼[sups𝒮|𝔼[V^(+1)(s)|()]-V*(s)|]
γH(+1)𝔼[𝔼[θsl()|()]]+O(n+1η-1)
γH(+1)(εmcts()+Cδ)+O(n+1η-1). (52)

Denote by λ(εVmax)1/L. Note that since the final desired error ε should be less than Vmax (otherwise, the problem is trivial by just outputing 0 as the final estimates for all the states), we have λ<1. Let us set the algorithmic parameters for MCTS and nearest neighbor supervised learning as follows: for each 1,

H()=logγλ8,δ=3Vmax4Cλ,n=κl(8Vmaxλ)11-η, (53)

where κl>0 is a sufficiently large constant such that O(nη-1)=Vmax8λ. Substituting these values into Eq. (52) yields

εmcts(+1)=𝔼[sups𝒮|𝔼[V^(+1)(s)|()]-V*(s)|]λ8εmcts()+7Vmax32λ+1.

Note that by (51) and (53), and the fact that εsl(0)Vmax, we have

εmcts(1)λ8εsl(0)+λ8Vmaxλ4Vmax.

It then follows inductively that

εmcts() λ-1εmcts(1)=Vmax4λ.

As for the supervised learning oracle, s𝒮, Eq. (50) implies

𝔼[sups𝒮|V()(s)-V*(s)|]εmcts()+3Vmax4λVmaxλ.

This implies that

𝔼[sups𝒮|V(L)(s)-V*(s)|]VmaxλL=ε.

We now calculate the sample complexity, i.e., the total number of state transitions required for the algorithm. During the -th iteration, each query of MCTS oracle requires n simulations. Recall that the number of querying MCTS oracle, i.e., the size of training set 𝒮() for the nearest neighbor supervised step, should satisfy m=Ω(dδ-d-2logδ-1) (cf. Lemma 8). From Eq. (53), we have

H()=c0logλ-1,δ()=c1λ,andn=c2λ-/(1-η),

where c0,c1,c2, are constants independent of λ and . Note that each simulation of MCTS samples H() state transitions. Hence, the number of state transitions at the -th iteration is given by

M() =mnH().

Therefore, the total number of state transitions after L iterations is

l=1LM() ==1LmnH()=O(ε-(2+1/(1-η)+d)(log1ε)5).

That is, for optimal choice of η=1/2, the total number of state transitions is O(ε-(4+d)(log1ε)5).

Given N samples si,i[N] that are sampled independently and uniformly at random over 𝒮, and given the fact that each ball ci,i[K(δ,d)] has at least Cdδd volume shared with 𝒮, each of the sample falls within a given ball with probability at least Cdδd. Let Ni,i[K(δ,d)] denote the number of samples amongst N samples in ball ci.

Now the number of samples falling in any given ball is lower bounded by a Binomial random variable with parameter N,Cdδd. By Chernoff bound for Binomial variable with parameter n,p, we have that

(B(n,p)np/2) exp(-np8).

Therefore, with an application of union bound, each ball has at least 0.5CdδdN samples with probability at least 1-K(δ,d)exp(-CdδdN/8). That is, for N=32max(1,δ-2Vmax2)Cd-1δ-d[log(K(δ,d)+logδ-1], each ball has at least Γ=16max(1,δ-2Vmax2)(logK(δ,d)+logδ-1) samples with probability at least 1-δ2K(δ,d). Define event

1 ={Ni16max(1,δ-2Vmax2)(logK(δ,d)+logδ-1),i[K(δ,d)]}.

Then

(1c)δ2K(δ,d).

Now, for any s𝒮, the nearest neighbor supervised learning described in Section id1 produces estimate V^(s) equal to the average value of observations for samples falling in ball cj(s). Let Nj(s) denote the number of samples in ball cj(s). To that end,

|V^(s)-V*(s)| =|1Nj(s)(i:sicj(s)V(si)-V*(s))|
=|1Nj(s)(i:sicj(s)V(si)-𝔼[V(si)])|+|1Nj(s)(i:sicj(s)𝔼[V(si)]-V*(si))|
  +|1Nj(s)(i:sicj(s)V*(si)-V*(s))|.

For the first term, since for each sicj(s), V(si) is produced using independent randomness via MCTS, and since the output V(si) is a bounded random variable, using Hoeffding’s inequality, it follows that

(|1Nj(s)(i:sicj(s)V(si)-𝔼[V(si)])|Δ1) 2exp(-Nj(s)Δ128Vmax2).

The second term is no more than Δ due to the guarantee given by MCTS as assumed in the setup. And finally, the third term is no more than Cδ due to Lipschitzness of V*. To summarize, with probability at least 1-2exp(-Nj(s)Δ128Vmax2), we have that

|V^(s)-V*(s)| Δ1+Δ+Cδ.

As can be noticed, the algorithm produces the same estimate for all s𝒮 such that they map to the same ball. And there are K(δ,d) such balls. Therefore, using union bound, it follows that with probability at least 1-2K(δ,d)exp(-(mini[K(δ,d)]Ni)Δ128Vmax2),

sups𝒮|V^(s)-V*(s)| Δ1+Δ+Cδ.

Under event 1, mini[K(δ,d)]Ni16max(1,δ-2Vmax2)(logK(δ,d)+logδ-1). Therefore, under event 1, by choosing Δ1=δ, we have

sups𝒮|V^(s)-V*(s)| Δ+(C+1)δ,

with probability at least 1-2δ2K(δ,d). When event 1 does not hold or the above does not hold, we have trivial error bound of 2Vmax on the error. Therefore, we conclude that

𝔼[sups𝒮|V^(s)-V*(s)|] Δ+(C+1)δ+4Vmaxδ2K(δ,d).

In this paper, we introduce a correction of the popular Monte Carlo Tree Search (MCTS) policy for improved value function estimation for a given state, using an existing value function estimation for the entire state space. This correction was obtained through careful, rigorous analysis of a non-stationary Multi-Arm Bandit where rewards are dependent and non-stationary. In particular, we analyzed a variant of the classical Upper Confidence Bound policy for such an MAB. Using this as a building block, we establish rigorous performance guarantees for the corrected version of MCTS proposed in this work. This, to the best of our knowledge, is the first mathematically correct analysis of the UCT policy despite its popularity since it has been proposed in literature (Kocsis and Szepesvári 2006, Kocsis et al. 2006). We further establish that the proposed MCTS policy, when combined with nearest neighbor supervised learning, leads to near optimal sample complexity for obtaining estimation of value function within a given tolerance, where the optimality is in the minimax sense. This suggests the tightness of our analysis as well as the utility of the MCTS policy.

We take a note that much of this work was inspired by the success of AlphaGo Zero (AGZ) which utilizes MCTS combined with supervised learning. Interestingly enough, the correction of MCTS suggested by our analysis is qualitatively similar to the version of MCTS utilized by AGZ as reported in practice. This seeming coincidence may suggest further avenue for practical utility of versions of the MCTS proposed in this work and is an interesting direction for future work.

References

  • Agrawal (1995) Agrawal R (1995) Sample mean based index policies by o (log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27(4):1054–1078.
  • Audibert et al. (2009) Audibert JY, Munos R, Szepesvári C (2009) Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science 410(19):1876–1902.
  • Auer et al. (2002) Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2-3):235–256.
  • Azizzadenesheli et al. (2018) Azizzadenesheli K, Yang B, Liu W, Brunskill E, Lipton ZC, Anandkumar A (2018) Sample-efficient deep RL with generative adversarial tree search. CoRR abs/1806.05780.
  • Bartlett et al. (2019) Bartlett P, Gabillon V, Healey J, Valko M (2019) Scale-free adaptive planning for deterministic dynamics & discounted rewards. International Conference on Machine Learning, 495–504.
  • Bertsekas (1975) Bertsekas D (1975) Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control 20(3):415–419.
  • Bertsekas (2017) Bertsekas D (2017) Dynamic Programming and Optimal Control (Athena Scientific).
  • Browne et al. (2012) Browne CB, Powley E, Whitehouse D, Lucas SM, Cowling PI, Rohlfshagen P, Tavener S, Perez D, Samothrakis S, Colton S (2012) A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games 4(1):1–43.
  • Chang et al. (2005) Chang HS, Fu MC, Hu J, Marcus SI (2005) An adaptive sampling algorithm for solving markov decision processes. Operations Research 53(1):126–139.
  • Coquelin and Munos (2007) Coquelin PA, Munos R (2007) Bandit algorithms for tree search. arXiv preprint cs/0703062 .
  • Coulom (2006) Coulom R (2006) Efficient selectivity and backup operators in monte-carlo tree search. International conference on computers and games, 72–83 (Springer).
  • Dufour and Prieto-Rumeau (2012) Dufour F, Prieto-Rumeau T (2012) Approximation of Markov decision processes with general state space. Journal of Mathematical Analysis and applications 388(2):1254–1267.
  • Dufour and Prieto-Rumeau (2013) Dufour F, Prieto-Rumeau T (2013) Finite linear programming approximations of constrained discounted Markov decision processes. SIAM Journal on Control and Optimization 51(2):1298–1324.
  • Efroni et al. (2018) Efroni Y, Dalal G, Scherrer B, Mannor S (2018) Multiple-step greedy policies in approximate and online reinforcement learning. Advances in Neural Information Processing Systems, 5244–5253.
  • Even-Dar and Mansour (2004) Even-Dar E, Mansour Y (2004) Learning rates for Q-learning. JMLR 5.
  • Guo et al. (2014) Guo X, Singh S, Lee H, Lewis RL, Wang X (2014) Deep learning for real-time atari game play using offline monte-carlo tree search planning. Advances in neural information processing systems, 3338–3346.
  • Hoeffding (1963) Hoeffding W (1963) Probability inequalities for sums of bounded random variables. Journal of Americal Statistics Association 58.
  • Hren and Munos (2008) Hren JF, Munos R (2008) Optimistic planning of deterministic systems. European Workshop on Reinforcement Learning, 151–164 (Springer).
  • Jiang et al. (2018) Jiang DR, Ekwedike E, Liu H (2018) Feedback-based tree search for reinforcement learning. International conference on machine learning.
  • Kaufmann and Koolen (2017) Kaufmann E, Koolen WM (2017) Monte-carlo tree search by best arm identification. Advances in Neural Information Processing Systems, 4897–4906.
  • Kearns et al. (2002) Kearns M, Mansour Y, Ng AY (2002) A sparse sampling algorithm for near-optimal planning in large markov decision processes. Machine learning 49(2-3):193–208.
  • Kocsis and Szepesvári (2006) Kocsis L, Szepesvári C (2006) Bandit based monte-carlo planning. European conference on machine learning, 282–293 (Springer).
  • Kocsis et al. (2006) Kocsis L, Szepesvári C, Willemson J (2006) Improved monte-carlo search. Univ. Tartu, Estonia, Tech. Rep .
  • Mnih et al. (2016) Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. International conference on machine learning, 1928–1937.
  • Mnih et al. (2013) Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 .
  • Mnih et al. (2015) Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al. (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529.
  • Munos et al. (2014) Munos R, et al. (2014) From bandits to monte-carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends® in Machine Learning 7(1):1–129.
  • Salomon and Audibert (2011) Salomon A, Audibert JY (2011) Deviations of stochastic bandit regret. International Conference on Algorithmic Learning Theory, 159–173 (Springer).
  • Schadd et al. (2008) Schadd MPD, Winands MHM, van den Herik HJ, Chaslot GMJB, Uiterwijk JWHM (2008) Single-player monte-carlo tree search. van den Herik HJ, Xu X, Ma Z, Winands MHM, eds., Computers and Games, 1–12 (Berlin, Heidelberg: Springer Berlin Heidelberg).
  • Schulman et al. (2015) Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. International Conference on Machine Learning, 1889–1897.
  • Schulman et al. (2017) Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .
  • Shah and Xie (2018) Shah D, Xie Q (2018) Q-learning with nearest neighbors. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds., Advances in Neural Information Processing Systems 31, 3115–3125 (Curran Associates, Inc.).
  • Silver et al. (2016) Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, et al. (2016) Mastering the game of go with deep neural networks and tree search. nature 529(7587):484–489.
  • Silver et al. (2017a) Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T, et al. (2017a) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815 .
  • Silver et al. (2017b) Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, et al. (2017b) Mastering the game of go without human knowledge. Nature 550(7676):354.
  • Stone (1982) Stone CJ (1982) Optimal global rates of convergence for nonparametric regression. The Annals of Statistics 1040–1053.
  • Strehl et al. (2006) Strehl AL, Li L, Wiewiora E, Langford J, Littman ML (2006) Pac model-free reinforcement learning. Proceedings of the 23rd international conference on Machine learning, 881–888 (ACM).
  • Sturtevant (2008) Sturtevant NR (2008) An analysis of uct in multi-player games. van den Herik HJ, Xu X, Ma Z, Winands MHM, eds., Computers and Games, 37–49 (Berlin, Heidelberg: Springer Berlin Heidelberg).
  • Sutton (1988) Sutton RS (1988) Learning to predict by the methods of temporal differences. Machine learning 3(1):9–44.
  • Szepesvári (2019) Szepesvári C (2019) Personal communication.
  • Teraoka et al. (2014) Teraoka K, Hatano K, Takimoto E (2014) Efficient sampling method for monte carlo tree search problem. IEICE TRANSACTIONS on Information and Systems 97(3):392–398.
  • Tsybakov (2009) Tsybakov AB (2009) Introduction to Nonparametric Estimation. Springer Series in Statistics (Springer).
  • Van Hasselt et al. (2016) Van Hasselt H, Guez A, Silver D (2016) Deep reinforcement learning with double q-learning. AAAI, volume 2, 5 (Phoenix, AZ).
  • Watkins and Dayan (1992) Watkins CJ, Dayan P (1992) Q-learning. Machine learning 8(3-4):279–292.
  • Yang et al. (2019) Yang Y, Zhang G, Xu Z, Katabi D (2019) Harnessing structures for value-based planning and reinforcement learning. arXiv preprint arXiv:1909.12255 .

The recent work Shah and Xie (2018) establishes a lower bound on the sample complexity for reinforcement learning algorithms on MDPs. We follow a similar argument to establish a lower bound on the sample complexity for MDPs with deterministic transitions. We provide the proof for completeness. The key idea is to connect the problem of estimating the value function to the problem of non-parametric regression, and then leveraging known minimax lower bound for the latter. In particular, we show that a class of non-parametric regression problem can be embedded in an MDP with deterministic transitions, so any algorithm for the latter can be used to solve the former. Prior work on non-parametric regression (Tsybakov 2009, Stone 1982) establishes that a certain number of observations is necessary to achieve a given accuracy using any algorithms, hence leading to a corresponding necessary condition for the sample size of estimating the value function in an MDP problem. We now provide the details.

Step 1. Non-parametric regression. Consider the following non-parametric regression problem: Let 𝒮:=[0,1]d and assume that we have T data pairs (x1,y1),,(xT,yT) such that conditioned on x1,,xn, the random variables y1,,yn are independent and satisfy

𝔼[yt|xt]=f(xt),xt𝒮 (54)

where f:𝒮 is the unknown regression function. Suppose that the conditional distribution of yt given xt=x is a Bernoulli distribution with mean f(x). We also assume that f is 1-Lipschitz continuous with respect to the Euclidean norm, i.e.,

|f(x)-f(x0)||x-x0|,x,x0𝒮.

Let be the collection of all 1-Lipschitz continuous function on 𝒳, i.e.,

={h|h is a 1-Lipschitz function on 𝒮},

The goal is to estimate f given the observations (x1,y1),,(xT,yT) and the prior knowledge that f.

It is easy to verify that the above problem is a special case of the non-parametric regression problem considered in the work by Stone (1982) (in particular, Example 2 therein). Let f^T denote an arbitrary (measurable) estimator of f based on the training samples (x1,y1),,(xT,yT). By Theorem 1 in Stone (1982), we have the following result: there exists a c>0 such that

limTinff^Tsupf(f^T-fc(logTT)12+d)=1, (55)

where infimum is over all possible estimators f^T. Translating this result to the non-asymptotic regime, we obtain the following theorem.

Theorem 4

Under the above stated assumptions, for any number δ(0,1), there exits c>0 and Tδ such that

inff^Tsupf(f^T-fc(logTT)12+d)δ,for all TTδ.

Step 2. MDP with deterministic transitions. Consider a class of discrete-time discounted MDPs (𝒮,𝒜,𝒫,r,γ), where

𝒮=[0,1]d,
𝒜 is finite,
for each(x,a),there exists a unique x𝒮such that𝒫(x|x,a)=1,
r(x,a)=r(x) for all a,
γ=0.

In words, the transition is deterministic, the expected reward is independent of the action taken and the current state, and only immediate reward matters.

Let Rt be the observed reward at step t. We assume that given xt, the random variable Rt is independent of (x1,,xt-1), and follows a Bernoulli distribution Bernoulli(r(xt)). The expected reward function r() is assumed to be 1-Lipschitz and bounded. It is easy to see that for all x𝒮, a𝒜,

V*(x) =r(x). (56)

Step 3. Reduction from regression to MDP. Given a non-parametric regression problem as described in Step 1, we may reduce it to the problem of estimating the value function V* of the MDP described in Step 2. To do this, we set

r(x) =f(x),x𝒮

and

Rt =yt,t=1,2,,T.

In this case, it follows from equations (56) that the value function is given by V*=f. Moreover, the expected reward function r() is 1-Lipschitz, so the assumptions of the MDP in Step 2 are satisfied. This reduction shows that the MDP problem is at least as hard as the nonparametric regression problem, so a lower bound for the latter is also a lower bound for the former.

Applying Theorem 4 yields the following result: for any number δ(0,1), there exist some numbers c>0 and Tδ>0, such that

infV^TsupV*[V^T-V*c(logTT)12+d]δ,for all TTδ.

Consequently, for any reinforcement learning algorithm V^T and any sufficiently small ε>0, there exists an MDP problem with deterministic transitions such that in order to achieve

[V^T-V*<ε]1-δ,

one must have

TCd(1ε)2+dlog(1ε),

where C>0 is a constant. The statement of Proposition 1 follows by selecting δ=12.