Two-Sample Testing in Reinforcement Learning

Abstract

Value-based reinforcement-learning algorithms have shown strong performances in games, robotics, and other real-world applications. The most popular sample-based method is $Q$-Learning. A $Q$-value is the expected return for a state-action pair when following a particular policy, and the algorithm subsequently performs updates by adjusting the current $Q$-value towards the observed reward and the maximum of the $Q$-values of the next state. The procedure introduces maximization bias, and solutions like Double $Q$-Learning have been considered. We frame the bias problem statistically and consider it an instance of estimating the maximum expected value (MEV) of a set of random variables. We propose the $T$-Estimator (TE) based on two-sample testing for the mean. The TE flexibly interpolates between over- and underestimation by adjusting the level of significance of the underlying hypothesis tests. A generalization termed $K$-Estimator (KE) obeys the same bias and variance bounds as the TE while relying on a nearly arbitrary kernel function. Using the TE and the KE, we introduce modifications of $Q$-Learning and its neural network analog, the Deep $Q$-Network. The proposed estimators and algorithms are thoroughly tested and validated on a diverse set of tasks and environments, illustrating the performance potential of the TE and KE.
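
As a rough illustration of the update rule and the maximization bias described in the abstract, the following minimal Python sketch may help. It is not the paper's implementation and does not implement the TE or KE. The function names (`q_learning_update`, `simulate_bias`), the step size `alpha`, the discount `gamma`, and the toy setting (independent standard-normal variables whose true means are all zero, so the true MEV is 0) are assumptions chosen for illustration. The second estimator is a simple sample-splitting variant in the spirit of Double $Q$-Learning.

```python
import random
import statistics


def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning step.

    Moves Q(s, a) towards the target r + gamma * max_a' Q(s', a').
    `q` is a list of per-state lists of action values (an assumed layout).
    """
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


def simulate_bias(num_vars=10, num_samples=10, trials=2000, seed=0):
    """Compare two estimators of the maximum expected value (true MEV = 0 here).

    Returns the average over trials of
      - the maximum of the sample means (tends to overestimate), and
      - a sample-splitting estimate: pick the argmax on one half of the data,
        evaluate that variable's mean on the other half.
    """
    rng = random.Random(seed)
    max_est, double_est = [], []
    half = num_samples // 2
    for _ in range(trials):
        # All variables have true mean 0 and unit variance (toy assumption).
        samples = [[rng.gauss(0.0, 1.0) for _ in range(num_samples)]
                   for _ in range(num_vars)]
        max_est.append(max(statistics.fmean(s) for s in samples))

        means_a = [statistics.fmean(s[:half]) for s in samples]
        means_b = [statistics.fmean(s[half:]) for s in samples]
        best = max(range(num_vars), key=means_a.__getitem__)
        double_est.append(means_b[best])
    return statistics.fmean(max_est), statistics.fmean(double_est)


if __name__ == "__main__":
    q_table = [[0.0, 0.0], [1.0, 2.0]]
    print(q_learning_update(q_table, 0, 1, 1.0, 1, alpha=0.5, gamma=0.9))
    print(simulate_bias())
```

In this toy setting the maximum of the sample means is clearly positive on average although every true mean is zero, while the sample-splitting estimate stays close to zero. When the true means differ, the sample-splitting estimate tends to underestimate instead, which is the over- versus underestimation trade-off the abstract refers to.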
