Abstract
We present an optimistic Q-learning algorithm for regret minimization inaverage reward reinforcement learning under an additional assumption on theunderlying MDP that for all policies, the expected time to visit some frequentstate $s_0$ is finite and upper bounded by $H$. Our setting strictlygeneralizes the episodic setting and is significantly less restrictive than theassumption of bounded hitting time {\it for all states} made by most previousliterature on model-free algorithms in average reward settings. We demonstratea regret bound of $\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are thenumbers of states and actions, and $T$ is the horizon. A key technical noveltyof our work is to introduce an $\overline{L}$ operator defined as $\overline{L}v = \frac{1}{H} \sum_{h=1}^H L^h v$ where $L$ denotes the Bellman operator. Weshow that under the given assumption, the $\overline{L}$ operator has a strictcontraction (in span) even in the average reward setting. Our algorithm designthen uses ideas from episodic Q-learning to estimate and apply this operatoriteratively. Therefore, we provide a unified view of regret minimization inepisodic and non-episodic settings that may be of independent interest.