Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

Abstract

Strong worst-case performance bounds for episodic reinforcement learningexist but fortunately in practice RL algorithms perform much better than suchbounds would predict. Algorithms and theory that provide strongproblem-dependent bounds could help illuminate the key features of what makes aRL problem hard and reduce the barrier to using RL algorithms in practice. As astep towards this we derive an algorithm for finite horizon discrete MDPs andassociated analysis that both yields state-of-the art worst-case regret boundsin the dominant terms and yields substantially tighter bounds if the RLenvironment has small environmental norm, which is a function of the varianceof the next-state value functions. An important benefit of our algorithmic isthat it does not require apriori knowledge of a bound on the environmentalnorm. As a result of our analysis, we also help address an open learning theoryquestion~\cite{jiang2018open} about episodic MDPs with a constant upper-boundon the sum of rewards, providing a regret bound with no $H$-dependence in theleading term that scales a polynomial function of the number of episodes.

Quick Read (beta)

loading the full paper ...