Off-Policy Deep Reinforcement Learning without Exploration

Abstract

Reinforcement learning traditionally considers the task of balancing exploration and exploitation. This work examines batch reinforcement learning, the task of maximally exploiting a given batch of off-policy data without further data collection. We demonstrate that, due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms such as DQN and DDPG are only capable of learning with data correlated to their current policy, making them ineffective for most off-policy applications. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space to force the agent towards behaving on-policy with respect to a subset of the given data. We extend this notion to deep reinforcement learning and, to the best of our knowledge, present the first continuous control deep reinforcement learning algorithm which can learn effectively from uncorrelated off-policy data.
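To make the idea of restricting the action space concrete, the following is a minimal tabular sketch, not the algorithm presented in the paper. It assumes a small discrete problem where the batch is a list of (state, action, reward, next_state, done) tuples; the function names, the learning rate, and the number of sweeps are illustrative assumptions. The only point it demonstrates is that the bootstrap target maximises over actions the batch actually contains for the next state, so values of unseen state-action pairs are never extrapolated.

```python
from collections import defaultdict


def batch_constrained_q_learning(batch, gamma=0.99, lr=0.1, sweeps=200):
    """Tabular Q-learning on a fixed batch (hypothetical helper).

    batch: list of (state, action, reward, next_state, done) tuples.
    Returns the learned Q-table and the per-state sets of observed actions.
    """
    # Record which actions appear in the batch for each state.
    seen_actions = defaultdict(set)
    for s, a, _, _, _ in batch:
        seen_actions[s].add(a)

    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s2, done in batch:
            # Constrain the max to actions observed in s2; pairs that are
            # absent from the batch are never queried.
            candidates = seen_actions.get(s2, ())
            if done or not candidates:
                target = r
            else:
                target = r + gamma * max(q[(s2, a2)] for a2 in candidates)
            q[(s, a)] += lr * (target - q[(s, a)])
    return q, seen_actions


def greedy_action(q, seen_actions, state):
    """Pick the best action among those the batch contains for `state`."""
    return max(seen_actions[state], key=lambda a: q[(state, a)])
```

Under these assumptions the resulting policy can only select actions that were observed in the data for each state, which is one simple way to read "behaving on-policy with respect to a subset of the given data". The deep, continuous-control setting described in the abstract needs additional machinery that this sketch does not attempt to reproduce.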
