A Tale of Sampling and Estimation in Discounted Reinforcement Learning

Abstract

The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation, or the policy gradient in policy optimization. In practice, these estimates are produced through finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is mostly unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study of the pitfalls of episodic sampling and of how to do it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis of a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process brings compelling statistical properties compared with the alternative estimators, as it matches the lower bound without requiring a careful tuning of the episode horizon.
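To make the two sampling procedures concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a hypothetical 3-state Markov reward process (the matrix `P`, initial distribution `MU`, function `F`, and `GAMMA` are all illustrative), and it reads "sampling from the discounted kernel" as running a single chain that, at every step, resets to the initial distribution with probability 1 - γ and otherwise follows the original transition kernel. The stationary distribution of that reset chain is the γ-discounted state distribution, so a plain average of `F` along it targets the discounted mean with no horizon to tune, whereas the finite-horizon estimator carries a truncation bias that depends on the chosen horizon.

```python
import random

# Hypothetical 3-state Markov reward process (all numbers are illustrative).
P = [[0.9, 0.1, 0.0],
     [0.0, 0.8, 0.2],
     [0.3, 0.0, 0.7]]
MU = [1.0, 0.0, 0.0]      # initial-state distribution
F = [0.0, 1.0, 2.0]       # function whose discounted mean we estimate
GAMMA = 0.9               # discount factor


def draw(probs, rng):
    """Sample an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def exact_discounted_mean(n_terms=2000):
    """(1 - gamma) * sum_t gamma^t * E[f(s_t)], truncated at n_terms."""
    dist, total, weight = MU[:], 0.0, 1.0 - GAMMA
    for _ in range(n_terms):
        total += weight * sum(p * f for p, f in zip(dist, F))
        # Push the state distribution one step forward: dist <- dist @ P.
        dist = [sum(dist[i] * P[i][j] for i in range(3)) for j in range(3)]
        weight *= GAMMA
    return total


def finite_horizon_estimate(n_samples, horizon, rng):
    """Average truncated discounted sums over episodes of length `horizon`."""
    n_episodes = n_samples // horizon
    acc = 0.0
    for _ in range(n_episodes):
        s, disc = draw(MU, rng), 1.0 - GAMMA
        for _ in range(horizon):
            acc += disc * F[s]
            s, disc = draw(P[s], rng), disc * GAMMA
    return acc / n_episodes


def discounted_kernel_estimate(n_samples, rng):
    """Average f along one chain that resets to MU with probability 1 - gamma."""
    s, acc = draw(MU, rng), 0.0
    for _ in range(n_samples):
        acc += F[s]
        reset = rng.random() < 1.0 - GAMMA
        s = draw(MU, rng) if reset else draw(P[s], rng)
    return acc / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print("exact            ", exact_discounted_mean())
    print("finite horizon   ", finite_horizon_estimate(200_000, 50, rng))
    print("discounted kernel", discounted_kernel_estimate(200_000, rng))
```

Both estimators are given the same budget of 200,000 transitions here. Shrinking `horizon` in `finite_horizon_estimate` makes the truncation bias visible, while the reset-based estimator has no such parameter.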
