### Abstract

A simple and natural algorithm for reinforcement learning is Monte CarloExploring States (MCES), where the Q-function is estimated by averaging theMonte Carlo returns, and the policy is improved by choosing actions thatmaximize the current estimate of the Q-function. Exploration is performed by"exploring starts", that is, each episode begins with a randomly chosen stateand action and then follows the current policy. Establishing convergence forthis algorithm has been an open problem for more than 20 years. We make headwaywith this problem by proving convergence for Optimal Policy Feed-Forward MDPs,which are MDPs whose states are not revisited within any episode for an optimalpolicy. Such MDPs include all deterministic environments (including CliffWalking and other gridworld examples) and a large class of stochasticenvironments (including Blackjack). The convergence results presented here makeprogress for this long-standing open problem in reinforcement learning.