Munchausen Reinforcement Learning

Abstract

Bootstrapping is a core mechanism in Reinforcement Learning (RL). Most algorithms, based on temporal differences, replace the true value of a transiting state by their current estimate of this value. Yet, another estimate could be leveraged to bootstrap RL: the current policy. Our core contribution stands in a very simple idea: adding the scaled log-policy to the immediate reward. We show that slightly modifying Deep Q-Network (DQN) in that way provides an agent that is competitive with distributional methods on Atari games, without making use of distributional RL, n-step returns or prioritized replay. To demonstrate the versatility of this idea, we also use it together with an Implicit Quantile Network (IQN). The resulting agent outperforms Rainbow on Atari, setting a new state of the art with very few modifications to the original algorithm. To add to this empirical study, we provide strong theoretical insights on what happens under the hood: implicit Kullback-Leibler regularization and an increase of the action-gap.
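
To make the core idea concrete, the following is a minimal sketch of the reward modification only: the scaled log-policy of the action taken is added to the immediate reward. The abstract does not specify how the policy is obtained from the Q-values, so the sketch assumes a softmax policy with temperature `tau`. The function names, the scaling factor `alpha`, the clipping bound `clip_min`, and all default values are illustrative assumptions, not values taken from the text above.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    return x - np.log(np.sum(np.exp(x), axis=-1, keepdims=True))


def munchausen_reward(q_values, action, reward,
                      alpha=0.9, tau=0.03, clip_min=-1.0):
    """Return the reward augmented with the scaled log-policy.

    Assumptions (not stated in the abstract):
      * the policy is softmax(q_values / tau);
      * the log-policy term is clipped to [clip_min, 0] so that a
        near-zero action probability cannot produce an unbounded penalty.
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    log_pi = log_softmax(q_values / tau)
    # tau * log pi(a|s) is always <= 0, so the bonus only lowers the reward.
    bonus = alpha * np.clip(tau * log_pi[action], clip_min, 0.0)
    return reward + bonus


if __name__ == "__main__":
    q = [10.0, 0.0]  # hypothetical Q-values for a two-action state
    print(munchausen_reward(q, action=0, reward=1.0))  # near-greedy action
    print(munchausen_reward(q, action=1, reward=1.0))  # unlikely action
```

Under these assumptions, an action the current policy considers likely keeps its reward almost unchanged, while an unlikely action is penalized up to `alpha * |clip_min|`. In a DQN-style agent the augmented reward would replace the plain reward when forming the regression target; the rest of the target is not shown here.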
