On-line Policy Improvement using Monte-Carlo Search

Abstract

We present a Monte-Carlo simulation algorithm for real-time policyimprovement of an adaptive controller. In the Monte-Carlo simulation, thelong-term expected reward of each possible action is statistically measured,using the initial policy to make decisions in each step of the simulation. Theaction maximizing the measured expected reward is then taken, resulting in animproved policy. Our algorithm is easily parallelizable and has beenimplemented on the IBM SP1 and SP2 parallel-RISC supercomputers. We have obtained promising initial results in applying this algorithm to thedomain of backgammon. Results are reported for a wide variety of initialpolicies, ranging from a random policy to TD-Gammon, an extremely strongmulti-layer neural network. In each case, the Monte-Carlo algorithm gives asubstantial reduction, by as much as a factor of 5 or more, in the error rateof the base players. The algorithm is also potentially useful in many otheradaptive control applications in which it is possible to simulate theenvironment.

Quick Read (beta)

loading the full paper ...