RUDDER: Return Decomposition for Delayed Rewards

Abstract

We propose a novel reinforcement learning approach for finite Markov decisionprocesses (MDPs) with delayed rewards. In this work, biases of temporaldifference (TD) estimates are proved to be corrected only exponentially slowlyin the number of delay steps. Furthermore, variances of Monte Carlo (MC)estimates are proved to increase the variance of other estimates, the number ofwhich can exponentially grow in the number of delay steps. We introduce RUDDER,a return decomposition method, which creates a new MDP with same optimalpolicies as the original MDP but with redistributed rewards that have largelyreduced delays. If the return decomposition is optimal, then the new MDP doesnot have delayed rewards and TD estimates are unbiased. In this case, therewards track Q-values so that the future expected reward is always zero. Weexperimentally confirm our theoretical results on bias and variance of TD andMC estimates. On artificial tasks with different lengths of reward delays, weshow that RUDDER is exponentially faster than TD, MC, and MC Tree Search(MCTS). RUDDER outperforms rainbow, A3C, DDQN, Distributional DQN, DuelingDDQN, Noisy DQN, and Prioritized DDQN on the delayed reward Atari game Venturein only a fraction of the learning time. RUDDER considerably improves thestate-of-the-art on the delayed reward Atari game Bowling in much less learningtime. Source code is available at https://github.com/ml-jku/baselines-rudder,with demonstration videos at https://goo.gl/EQerZV.

Quick Read (beta)

loading the full paper ...