Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning

Abstract

Recent advances in deep reinforcement learning algorithms have shown great potential and success in solving many challenging real-world problems, including the game of Go and robotic applications. Usually, these algorithms need a carefully designed reward function to guide training at each time step. However, in the real world, it is non-trivial to design such a reward function, and the only signal available is usually obtained at the end of a trajectory, also known as the episodic reward or return. In this work, we introduce a new algorithm for temporal credit assignment, which learns to decompose the episodic return back to each time step in the trajectory using deep neural networks. With this learned reward signal, the learning efficiency can be substantially improved for episodic reinforcement learning. In particular, we find that expressive language models such as the Transformer can be adopted for learning the importance and the dependency of states in the trajectory, therefore providing high-quality and interpretable learned reward signals. We have performed extensive experiments on a set of MuJoCo continuous locomotive control tasks with only episodic returns and demonstrated the effectiveness of our algorithm.
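
To make the idea of decomposing an episodic return concrete, here is a minimal sketch and not the algorithm proposed in this work: it swaps the deep network (for example a Transformer) for a plain linear reward model fitted by least squares, and it runs on synthetic data. The function names, the linear model, and the toy trajectories are all assumptions made for illustration. The only constraint it encodes is the one described above: the learned per-step rewards of a trajectory should sum to that trajectory's episodic return.

```python
import numpy as np


def fit_linear_return_decomposition(trajectories, episodic_returns):
    """Fit w so that sum_t (s_t . w) approximates each episodic return.

    Hypothetical helper: a linear stand-in for the learned reward model.
    Each trajectory is an array of shape (T, state_dim).
    """
    # For a linear model, the sum of per-step rewards equals w . (sum of states),
    # so the fit reduces to ordinary least squares on the summed states.
    summed_states = np.stack([traj.sum(axis=0) for traj in trajectories])
    targets = np.asarray(episodic_returns, dtype=float)
    w, *_ = np.linalg.lstsq(summed_states, targets, rcond=None)
    return w


def per_step_rewards(trajectory, w):
    """Redistribute the return: one learned reward per time step."""
    return trajectory @ w


# Synthetic data (assumption): only the episodic return is observed.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
trajectories = [rng.normal(size=(int(rng.integers(5, 15)), 3)) for _ in range(50)]
episodic_returns = [float((traj @ true_w).sum()) for traj in trajectories]

w = fit_linear_return_decomposition(trajectories, episodic_returns)
rewards = per_step_rewards(trajectories[0], w)
```

In this toy setting the per-step rewards in `rewards` could serve as a dense training signal for any standard policy-optimization method in place of the single end-of-episode return. A sequence model would differ in that each step's reward can depend on the other states in the trajectory, which a per-state linear model cannot capture.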
