A Cramér Distance perspective on Non-crossing Quantile Regression in Distributional Reinforcement Learning

Abstract

Distributional reinforcement learning (DRL) extends the value-based approach by using a deep convolutional network to approximate the full distribution over future returns instead of the mean only, providing a richer signal that leads to improved performance. Quantile-based methods like QR-DQN project arbitrary distributions onto a parametric subset of staircase distributions by minimizing the 1-Wasserstein distance. However, because of biases in its gradients, the quantile regression loss is used for training instead: it has the same minimizer and enjoys unbiased gradients. Recently, monotonicity constraints on the quantiles have been shown to improve the performance of QR-DQN for uncertainty-based exploration strategies. The contribution of this work is in the setting of fixed quantile levels and is twofold. First, we prove that the Cramér distance yields a projection that coincides with the 1-Wasserstein one and that, under monotonicity constraints, the squared Cramér and the quantile regression losses yield collinear gradients, shedding light on the connection between these important elements of DRL. Second, we propose a novel non-crossing neural architecture that allows good training performance using a novel algorithm to compute the Cramér distance, yielding significant improvements over QR-DQN in a number of games of the standard Atari 2600 benchmark.
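To make the objects named in the abstract concrete, the following is a minimal sketch in plain Python of the quantile regression (pinball) loss, the squared Cramér distance between two staircase distributions, and one way to enforce non-crossing quantiles. It is not the architecture or the Cramér algorithm proposed in the paper. The function names, the midpoint quantile levels, and the softplus-gap parameterization are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def midpoint_levels(n: int) -> List[float]:
    """Fixed quantile levels (2i + 1) / (2n) for i = 0..n-1 (assumed choice)."""
    return [(2 * i + 1) / (2 * n) for i in range(n)]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)), always strictly positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def non_crossing(base: float, raw_gaps: Sequence[float]) -> List[float]:
    """Map unconstrained outputs to strictly increasing quantiles.

    Hypothetical parameterization: the first quantile is `base`, and each
    later quantile adds a positive softplus gap to the previous one.
    """
    quantiles = [base]
    for gap in raw_gaps:
        quantiles.append(quantiles[-1] + softplus(gap))
    return quantiles


def quantile_loss(quantiles: Sequence[float], levels: Sequence[float],
                  samples: Sequence[float]) -> float:
    """Pinball loss, summed over quantiles and averaged over samples."""
    total = 0.0
    for theta, tau in zip(quantiles, levels):
        for z in samples:
            u = z - theta
            total += u * (tau - (1.0 if u < 0 else 0.0))
    return total / len(samples)


def squared_cramer(p_atoms: Sequence[float], q_atoms: Sequence[float]) -> float:
    """Exact integral of (F_P - F_Q)^2 for two equal-weight Dirac mixtures.

    Sweeps the merged atoms in sorted order; between consecutive atoms the
    CDF difference is constant, so the integral is a finite sum.
    """
    events = sorted(
        [(x, 1.0 / len(p_atoms)) for x in p_atoms]
        + [(x, -1.0 / len(q_atoms)) for x in q_atoms]
    )
    cdf_diff = 0.0
    total = 0.0
    for (x, weight), (x_next, _) in zip(events, events[1:]):
        cdf_diff += weight
        total += cdf_diff * cdf_diff * (x_next - x)
    return total
```

For example, `squared_cramer([0.0, 1.0], [0.5, 1.5])` integrates a CDF gap of 0.5 over two intervals of length 0.5 each, and `non_crossing(0.0, [-1.0, 0.0, 2.0])` returns four increasing values regardless of the signs of the raw gaps.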
