Abstract
We investigate the problem of best-policy identification in discounted MarkovDecision Processes (MDPs) when the learner has access to a generative model.The objective is to devise a learning algorithm returning the best policy asearly as possible. We first derive a problem-specific lower bound of the samplecomplexity satisfied by any learning algorithm. This lower bound corresponds toan optimal sample allocation that solves a non-convex program, and hence, ishard to exploit in the design of efficient algorithms. We then provide a simpleand tight upper bound of the sample complexity lower bound, whose correspondingnearly-optimal sample allocation becomes explicit. The upper bound depends onspecific functionals of the MDP such as the sub-optimality gaps and thevariance of the next-state value function, and thus really captures thehardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), analgorithm tracking this nearly-optimal allocation, and provide asymptoticguarantees for its sample complexity (both almost surely and in expectation).The advantages of KLB-TS against state-of-the-art algorithms are discussed andillustrated numerically.