Abstract
We present an optimistic Q-learning algorithm for regret minimization inaverage reward reinforcement learning under an additional assumption on theunderlying MDP that for all policies, the time to visit some frequent state$s_0$ is finite and upper bounded by $H$, either in expectation or withconstant probability. Our setting strictly generalizes the episodic setting andis significantly less restrictive than the assumption of bounded hitting time\textit{for all states} made by most previous literature on model-freealgorithms in average reward settings. We demonstrate a regret bound of$\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are the numbers of states andactions, and $T$ is the horizon. A key technical novelty of our work is theintroduction of an $\overline{L}$ operator defined as $\overline{L} v =\frac{1}{H} \sum_{h=1}^H L^h v$ where $L$ denotes the Bellman operator. Underthe given assumption, we show that the $\overline{L}$ operator has a strictcontraction (in span) even in the average-reward setting where the discountfactor is $1$. Our algorithm design uses ideas from episodic Q-learning toestimate and apply this operator iteratively. Thus, we provide a unified viewof regret minimization in episodic and non-episodic settings, which may be ofindependent interest.