Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

Abstract

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation -- a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.
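The two mechanisms named in the abstract, a KL penalty toward a pre-trained prior and a dropout-based lower bound on target Q-values, can be illustrated with a toy sketch. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the function names, the `kl_weight` parameter, the per-sample log-ratio form of the penalty, the use of a minimum over dropout passes as the lower bound, and the linear Q-estimate are all hypothetical and are not taken from the paper.

```python
import math
import random


def kl_penalized_reward(reward, log_prior, log_policy, kl_weight=1.0):
    """Subtract a per-sample KL term, log pi(a|s) - log p(a|s), from the reward.

    Assumed form: the penalty is zero when the policy agrees with the prior
    and grows as the policy puts more mass on an action than the prior does.
    """
    return reward - kl_weight * (log_policy - log_prior)


def noisy_q(features, weights, drop_prob, rng):
    """One stochastic forward pass of a toy linear Q-estimate with input dropout."""
    keep = 1.0 - drop_prob
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:          # keep this input with probability `keep`
            total += (x / keep) * w      # inverted-dropout rescaling
    return total


def pessimistic_target(reward, gamma, next_q_samples):
    """Bootstrap from the smallest of several dropout samples of the next-state Q.

    Assumed form of the lower bound: a minimum over Monte Carlo dropout passes.
    """
    return reward + gamma * min(next_q_samples)


if __name__ == "__main__":
    rng = random.Random(0)
    # Several dropout passes over the same (hypothetical) next-state features.
    samples = [noisy_q([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], 0.5, rng) for _ in range(10)]
    shaped = kl_penalized_reward(1.0, math.log(0.5), math.log(0.9))
    print("shaped reward:", shaped)
    print("pessimistic target:", pessimistic_target(shaped, 0.99, samples))
```

Under these assumptions, a single network with dropout supplies the pessimistic estimate, which is one way to read the abstract's claim of efficiency relative to Double Q-Learning, where a second set of Q-values is maintained.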
