Automatic Reward Shaping from Confounded Offline Data

Abstract

A key task in Artificial Intelligence is learning effective policies for controlling agents in unknown environments to optimize performance measures. Off-policy learning methods, like Q-learning, allow learners to make optimal decisions based on past experiences. This paper studies off-policy learning from biased data in complex and high-dimensional domains where unobserved confounding cannot be ruled out a priori. Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed inputs to the behavioral and target policies mismatch and unobserved confounders exist.
