Abstract
The theory of reinforcement learning has focused on two fundamental problems:achieving low regret, and identifying $\epsilon$-optimal policies. While asimple reduction allows one to apply a low-regret algorithm to obtain an$\epsilon$-optimal policy and achieve the worst-case optimal rate, it isunknown whether low-regret algorithms can obtain the instance-optimal rate forpolicy identification. We show this is not possible -- there exists afundamental tradeoff between achieving low regret and identifying an$\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure ofinstance-dependent sample complexity for PAC tabular reinforcement learningwhich explicitly accounts for the attainable state visitation distributions inthe underlying MDP. We then propose and analyze a novel, planning-basedalgorithm which attains this sample complexity -- yielding a complexity whichscales with the suboptimality gaps and the "reachability" of a state. We showour algorithm is nearly minimax optimal, and on several examples that ourinstance-dependent sample complexity offers significant improvements overworst-case bounds.