Active Inverse Reward Design

Abstract

Reward design, the problem of selecting an appropriate reward function for an AI system, is both critically important, as it encodes the task the system should perform, and challenging, as it requires reasoning about and understanding the agent's environment in detail. As a result, system designers often iterate on the reward function in a trial-and-error process to get their desired behavior. We propose structuring this process as a series of reward design queries, where we actively select the set of reward functions available to the designer. We query with two types of sets: discrete queries, where the system designer chooses from a small set of reward functions, and feature queries, where the system queries the designer for weights on a small subset of features. After each query, we use inverse reward design (IRD) (Hadfield-Menell et al., 2017) to update the distribution over the true reward function from the observed proxy reward function chosen by the designer. Compared to vanilla IRD, we find that our approach not only decreases the uncertainty about the true reward, but also greatly improves performance in unseen environments while only querying for reward functions in a single training environment.
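
To make the update step for a discrete query concrete, the following is a minimal sketch and not the authors' implementation. It assumes linear rewards over trajectory feature counts, a finite set of candidate true reward weights, and a softmax designer model normalized over the options in the query; the names `ird_update`, `true_candidates`, `query_feature_counts`, and `beta` are hypothetical and introduced only for illustration.

```python
import numpy as np

def ird_update(prior, true_candidates, query_feature_counts, chosen, beta=1.0):
    """Posterior over candidate true rewards after one discrete query.

    prior: (K,) probabilities over K candidate true reward weight vectors.
    true_candidates: (K, D) candidate true reward weights (assumed finite set).
    query_feature_counts: (M, D) feature counts of the trajectory obtained by
        optimizing each of the M proxy rewards in the training environment.
    chosen: index of the proxy reward the designer picked.
    beta: assumed designer rationality; larger means closer to optimal choice.
    """
    # True return of each option's trajectory under each candidate: shape (K, M).
    scores = beta * true_candidates @ query_feature_counts.T
    # Subtract the row-wise maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=1, keepdims=True)
    likelihood = np.exp(scores)
    # Assumption: the designer's choice is normalized over the query options.
    likelihood = likelihood / likelihood.sum(axis=1, keepdims=True)
    posterior = prior * likelihood[:, chosen]
    return posterior / posterior.sum()

# Toy usage: two candidate true rewards, two proxy options in the query.
prior = np.array([0.5, 0.5])
true_candidates = np.array([[1.0, 0.0], [0.0, 1.0]])
query_feature_counts = np.array([[1.0, 0.0], [0.0, 1.0]])
posterior = ird_update(prior, true_candidates, query_feature_counts, chosen=0)
```

In this toy case the designer picks the option whose trajectory scores well under the first candidate, so the posterior shifts toward that candidate. Actively selecting which query to ask, for instance by expected information gain, is left out of the sketch.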
