Abstract
Model-based algorithms---algorithms that decouple learning of the model andplanning given the model---are widely used in reinforcement learning practiceand theoretically shown to achieve optimal sample efficiency for single-agentreinforcement learning in Markov Decision Processes (MDPs). However, formulti-agent reinforcement learning in Markov games, the current best knownsample complexity for model-based algorithms is rather suboptimal and comparesunfavorably against recent model-free approaches. In this paper, we present asharp analysis of model-based self-play algorithms for multi-agent Markovgames. We design an algorithm \emph{Optimistic Nash Value Iteration} (Nash-VI)for two-player zero-sum Markov games that is able to output an$\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/\epsilon^2)$episodes of game playing, where $S$ is the number of states, $A,B$ are thenumber of actions for the two players respectively, and $H$ is the horizonlength. This is the first algorithm that matches the information-theoreticlower bound $\Omega(H^3S(A+B)/\epsilon^2)$ except for a $\min\{A,B\}$ factor,and compares favorably against the best known model-free algorithm if$\min\{A,B\}=o(H^3)$. In addition, our Nash-VI outputs a single Markov policywith optimality guarantee, while existing sample-efficient model-freealgorithms output a nested mixture of Markov policies that is in generalnon-Markov and rather inconvenient to store and execute. We further adapt ouranalysis to designing a provably efficient task-agnostic algorithm for zero-sumMarkov games, and designing the first line of provably sample-efficientalgorithms for multi-player general-sum Markov games.