### Abstract

Many model-based reinforcement learning (RL) algorithms can be viewed ashaving two phases that are iteratively implemented: a learning phase where themodel is approximately learned and a planning phase where the learned model isused to derive a policy. In the case of standard MDPs, the learning problem canbe solved using either value iteration or policy iteration. However, in thecase of zero-sum Markov games, there is no efficient policy iterationalgorithm; e.g., it has been shown in Hansen et al. (2013) that one has tosolve Omega(1/(1-alpha)) MDPs, where alpha is the discount factor, to implementthe only known convergent version of policy iteration. Another algorithm forMarkov zero-sum games, called naive policy iteration, is easy to implement butis only provably convergent under very restrictive assumptions. Prior attemptsto fix naive policy iteration algorithm have several limitations. Here, we showthat a simple variant of naive policy iteration for games converges, andconverges exponentially fast. The only addition we propose to naive policyiteration is the use of lookahead in the policy improvement phase. This isappealing because lookahead is anyway often used in RL for games. We furthershow that lookahead can be implemented efficiently in linear Markov games,which are the counterpart of the linear MDPs and have been the subject of muchattention recently. We then consider multi-agent reinforcement learning whichuses our algorithm in the planning phases, and provide sample and timecomplexity bounds for such an algorithm.