Abstract
One of the challenges in online reinforcement learning (RL) is that the agentneeds to trade off the exploration of the environment and the exploitation ofthe samples to optimize its behavior. Whether we optimize for regret, samplecomplexity, state-space coverage or model estimation, we need to strike adifferent exploration-exploitation trade-off. In this paper, we propose totackle the exploration-exploitation problem following a decoupled approachcomposed of: 1) An "objective-specific" algorithm that (adaptively) prescribeshow many samples to collect at which states, as if it has access to agenerative model (i.e., a simulator of the environment); 2) An"objective-agnostic" sample collection exploration strategy responsible forgenerating the prescribed samples as fast as possible. Building on recentmethods for exploration in the stochastic shortest path problem, we firstprovide an algorithm that, given as input the number of samples $b(s,a)$ neededin each state-action pair, requires $\tilde{O}(B D + D^{3/2} S^2 A)$ time stepsto collect the $B=\sum_{s,a} b(s,a)$ desired samples, in any unknowncommunicating MDP with $S$ states, $A$ actions and diameter $D$. Then we showhow this general-purpose exploration algorithm can be paired with"objective-specific" strategies that prescribe the sample requirements totackle a variety of settings -- e.g., model estimation, sparse rewarddiscovery, goal-free cost-free exploration in communicating MDPs -- for whichwe obtain improved or novel sample complexity guarantees.