Abstract
In value-based reinforcement learning (RL), unlike in supervised learning, the agent faces not a single, stationary, approximation problem, but a sequence of value prediction problems. Each time the policy improves, the nature of the problem changes, shifting both the distribution of states and their values. In this paper we take a novel perspective, arguing that the value prediction problems faced by an RL agent should not be addressed in isolation, but rather as a single, holistic, prediction problem. An RL algorithm generates a sequence of policies that, at least approximately, improve towards the optimal policy. We explicitly characterize the associated sequence of value functions and call it the value-improvement path. Our main idea is to approximate the value-improvement path holistically, rather than to solely track the value function of the current policy. Specifically, we discuss the impact that this holistic view of RL has on representation learning. We demonstrate that a representation that spans the past value-improvement path will also provide an accurate value approximation for future policy improvements. We use this insight to better understand existing approaches to auxiliary tasks and to propose new ones. To test our hypothesis empirically, we augmented a standard deep RL agent with an auxiliary task of learning the value-improvement path. In a study of Atari 2600 games, the augmented agent achieved approximately double the mean and median performance of the baseline agent.
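To make the notion of a value-improvement path concrete, the following is a minimal sketch under simplifying assumptions: a small, randomly generated tabular MDP with known dynamics, solved by exact policy iteration. This is not the deep RL setting evaluated in the paper, and the names `policy_value`, `value_improvement_path`, and `projection_error` are hypothetical, introduced only for illustration. The sketch records the value function of each successive policy and then measures how well each new value function is approximated by a linear combination of the earlier ones.

```python
import numpy as np


def policy_value(P, R, pi, gamma):
    """Exact values of a deterministic policy: solve (I - gamma * P_pi) v = r_pi."""
    n_states = P.shape[1]
    idx = np.arange(n_states)
    P_pi = P[pi, idx]   # (S, S): row s is the transition row under action pi[s]
    r_pi = R[idx, pi]   # (S,): reward for taking pi[s] in state s
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


def value_improvement_path(P, R, gamma, max_iters=50):
    """Run policy iteration and return the value function of every policy visited."""
    n_states = P.shape[1]
    pi = np.zeros(n_states, dtype=int)  # assumed starting policy: always action 0
    path = []
    for _ in range(max_iters):
        v = policy_value(P, R, pi, gamma)
        path.append(v)
        # One-step lookahead action values, shape (S, A).
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        new_pi = q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            break  # the policy is greedy with respect to its own values
        pi = new_pi
    return np.stack(path)  # (number of policies, S)


def projection_error(basis, target):
    """Euclidean residual of the least-squares fit of target in the span of basis rows."""
    coef, *_ = np.linalg.lstsq(basis.T, target, rcond=None)
    return np.linalg.norm(basis.T @ coef - target)


# Hypothetical toy problem: 6 states, 3 actions, random dynamics and rewards.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # (A, S, S)
R = rng.uniform(size=(n_states, n_actions))                       # (S, A)

path = value_improvement_path(P, R, gamma)
for k in range(1, len(path)):
    err = projection_error(path[:k], path[k])
    print(f"step {k}: residual of v_{k} on the span of earlier values = {err:.4f}")
```

The printed residuals give a rough sense of how much of each new value function is already captured by the span of its predecessors in this toy problem. The sketch uses the value functions themselves as linear features, which is a deliberate simplification of the learned representations the abstract refers to.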