On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning

Abstract

A simple and natural algorithm for reinforcement learning (RL) is Monte CarloExploring Starts (MCES), where the Q-function is estimated by averaging theMonte Carlo returns, and the policy is improved by choosing actions thatmaximize the current estimate of the Q-function. Exploration is performed by"exploring starts", that is, each episode begins with a randomly chosen stateand action, and then follows the current policy to the terminal state. In theclassic book on RL by Sutton & Barto (2018), it is stated that establishingconvergence for the MCES algorithm is one of the most important remaining opentheoretical problems in RL. However, the convergence question for MCES turnsout to be quite nuanced. Bertsekas & Tsitsiklis (1996) provide acounter-example showing that the MCES algorithm does not necessarily converge.Tsitsiklis (2002) further shows that if the original MCES algorithm is modifiedso that the Q-function estimates are updated at the same rate for allstate-action pairs, and the discount factor is strictly less than one, then theMCES algorithm converges. In this paper we make headway with the original andmore efficient MCES algorithm given in Sutton & Barto (1998), establishingalmost sure convergence for Optimal Policy Feed-Forward MDPs, which are MDPswhose states are not revisited within any episode when using an optimal policy.Such MDPs include a large class of environments such as all deterministicenvironments and all episodic environments with a timestep or any monotonicallychanging values as part of the state. Different from the previous proofs usingstochastic approximations, we introduce a novel inductive approach, which isvery simple and only makes use of the strong law of large numbers.

Quick Read (beta)

loading the full paper ...