Synthetic Returns for Long-Term Credit Assignment

Abstract

Since the earliest days of reinforcement learning, the workhorse method for assigning credit to actions over time has been temporal-difference (TD) learning, which propagates credit backward timestep-by-timestep. This approach suffers when delays between actions and rewards are long and when intervening unrelated events contribute variance to long-term returns. We propose state-associative (SA) learning, where the agent learns associations between states and arbitrarily distant future rewards, then propagates credit directly between the two. In this work, we use SA-learning to model the contribution of past states to the current reward. With this model we can predict each state's contribution to the far future, a quantity we call "synthetic returns". TD-learning can then be applied to select actions that maximize these synthetic returns (SRs). We demonstrate the effectiveness of augmenting agents with SRs across a range of tasks on which TD-learning alone fails. We show that the learned SRs are interpretable: they spike for states that occur after critical actions are taken. Finally, we show that our IMPALA-based SR agent solves Atari Skiing -- a game with a lengthy reward delay that posed a major hurdle to deep-RL agents -- 25 times faster than the published state-of-the-art.
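
The idea of modelling the contribution of past states to the current reward can be illustrated with a toy sketch. The code below is not the paper's architecture or loss: it assumes a purely linear contribution model fitted by least squares, a hypothetical environment in which a delayed reward equals the number of visits to one "key" state type plus unrelated noise, and illustrative names such as KEY_STATE and visit_counts. Under those assumptions, the fitted per-state contribution plays the role of a synthetic return.

```python
import numpy as np

rng = np.random.default_rng(0)
num_episodes, steps_before_reward, num_state_types = 2000, 10, 4
KEY_STATE = 2  # hypothetical "critical" state type (an assumption of this toy)

# Each row is one episode: the state types visited before the delayed reward.
states = rng.integers(0, num_state_types, size=(num_episodes, steps_before_reward))

# Delayed reward: +1 per visit to KEY_STATE, plus noise standing in for
# unrelated events that add variance to the return.
reward = (states == KEY_STATE).sum(axis=1) + rng.normal(0.0, 0.1, size=num_episodes)

# Linear contribution model: reward ~ sum over past states of c(s), c(s) = w[s].
# Summing one-hot state features over an episode gives per-type visit counts.
visit_counts = np.stack(
    [(states == s).sum(axis=1) for s in range(num_state_types)], axis=1
).astype(float)

# Least-squares fit of the per-state contributions.
w, *_ = np.linalg.lstsq(visit_counts, reward, rcond=None)

# In this toy setting, the learned contribution of each state type serves
# as its synthetic return.
synthetic_return = w
print(np.round(synthetic_return, 2))
```

In this sketch the learned contribution is large only for the key state type, which mirrors the claim that SRs spike around critical events. An agent could then treat that value as an immediate reward signal for TD-learning rather than waiting for the delayed reward.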
