Abstract
Designers of AI agents often iterate on the reward function in a trial-and-error process until they get the desired behavior, but this only guarantees good behavior in the training environment. We propose structuring this process as a series of queries asking the user to compare between different reward functions. Thus we can actively select queries for maximum informativeness about the true reward. In contrast to approaches asking the designer for optimal behavior, this allows us to gather additional information by eliciting preferences between suboptimal behaviors. After each query, we need to update the posterior over the true reward function from observing the proxy reward function chosen by the designer. The recently proposed Inverse Reward Design (IRD) enables this. Our approach substantially outperforms IRD in test environments. In particular, it can query the designer about interpretable, linear reward functions and still infer non-linear ones.