How to Combine Tree-Search Methods in Reinforcement Learning

  • 2018-09-06 06:40:08
  • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
  • 13


Finite-horizon lookahead policies are abundantly used in ReinforcementLearning and demonstrate impressive empirical success. Usually, the lookaheadpolicies are implemented with specific planning methods such as Monte CarloTree Search (e.g. in AlphaZero). Referring to the planning problem as treesearch, a reasonable practice in these implementations is to back up the valueonly at the leaves while the information obtained at the root is not leveragedother than for updating the policy. Here, we question the potency of thisapproach.Namely, the latter procedure is non-contractive in general, and itsconvergence is not guaranteed. Our proposed enhancement is straightforward andsimple: use the return from the optimal tree path to back up the values at thedescendants of the root. This leads to a \gamma^h-contracting procedure, where\gamma is the discount factor and $h$ is the tree depth. To establish ourresults, we first introduce a notion called multiple-step greedy consistency.We then provide convergence rates for two algorithmic instantiations of theabove enhancement in the presence of noise injected to both the tree searchstage and value estimation stage.


Introduction (beta)



Conclusion (beta)