Abstract
Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning with data uncorrelated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.