Abstract
Designers of AI agents often iterate on the reward function in atrialanderror process until they get the desired behavior, but this onlyguarantees good behavior in the training environment. We propose structuringthis process as a series of queries asking the user to compare betweendifferent reward functions. Thus we can actively select queries for maximuminformativeness about the true reward. In contrast to approaches asking thedesigner for optimal behavior, this allows us to gather additional informationby eliciting preferences between suboptimal behaviors. After each query, weneed to update the posterior over the true reward function from observing theproxy reward function chosen by the designer. The recently proposed InverseReward Design (IRD) enables this. Our approach substantially outperforms IRD intest environments. In particular, it can query the designer aboutinterpretable, linear reward functions and still infer nonlinear ones.
Quick Read (beta)
We thank the reviewers for their comments, and respond below.
Response to R1’s questions: 1. Intuitively, $\mathcal{R}$ is a set of possible reward functions, over which we can have a probability distribution. In all of our experiments, $\mathcal{R}\subset {\mathbb{R}}^{F}$ is a finite set of vectors, where $F$ is the number of features. (For continuous queries, the true reward space is still discretized into a finite set of possible reward functions.)
2. A reward function has type $\mathrm{\Xi}\to \mathbb{R}$, that is, for every trajectory $\xi $ it outputs a real number (the reward). ${r}^{*}$ is the true reward function, whereas ${r}^{*}(\xi )$ is the reward assigned by the true reward function to the trajectory $\xi $.
3. In theory, Bayesian inference can be performed over any space with a measure. In practice, Bayesian inference is restricted by computational limitations, but there is no simple property that determines when it can be done.
4. Just before Section 2.1, we specify that in this work we assume that proxy reward functions are of the form $\stackrel{~}{r}(\xi ;w)={w}^{T}\varphi (\xi )$, and so we can identify any reward function $r$ by its weights $w$. So every $w$ induces a reward $r$.
5. See our answer to 1. Since $\mathcal{R}$ is a finite set, it automatically has a measure.
Response to R2: R2 is primarily concerned that the method will not scale to realistic environments. However, we think that many of their concerns have already been addressed in the paper. In the context of our experiment with a misspecified linear space when the true reward is quadratic, they worry about interactions between features. However, a quadratic reward function already incorporates interactions between features. If we start with features ${f}_{1}$ and ${f}_{2}$, a quadratic reward can place a weight on ${f}_{1}{f}_{2}$, which can capture some interaction effects between ${f}_{1}$ and ${f}_{2}$.
This experiment shows that queries can be in a different space (linear rewards) than the true reward space that we perform inference over (quadratic rewards). This points the way towards pixelbased reward functions. Specifically, given a new environment, we can either handdesign or learn some highlevel, interpretable features, and ask the user to choose between reward functions defined on these features. However, given their response, we can maintain a belief over reward functions defined on pixels. Since queries on linear rewards are enough to determine a true quadratic reward, we might hope that queries on highlevel featurebased rewards are enough to determine a reward over pixels. Even if not, this can provide a highly informative prior over reward functions over pixels that can then be used with other methods, such as learning from human comparisons. Looking over the paper again, we didn’t explain these points clearly enough, and will add this discussion to the paper.
The questions of user interface and computational cost are important, but we think this is best left for future work, since we already packed a lot of conceptual ideas into our paper. To our knowledge, active IRD is the first preference inference method that can produce a reward function that can generalize to new settings and could plausibly learn reward functions defined on pixels. (While existing trajectorybased approaches can scale up to complex environments, the learned reward functions are very taskspecific and do not generalize.)
We apologize for the lack of details about the environment; we previously had more details but removed them to make room for experimental results. Clearly we removed too much; we’ll add the details back in.
Additional comments, which we will clarify in the paper:

•
Line 80: Inversion means that we compute $P({r}^{*}\mid w,\stackrel{~}{\mathcal{R}})$ from $P(w\mid {r}^{*},\stackrel{~}{\mathcal{R}})$ from using Bayes’ rule.

•
Line 86: The normalization constant $Z$ is standard notation for the denominator in Bayes’ theorem.

•
Line 150: The function being optimized is the mutual information (see Figure 2b / eq. 2)

•
We use samples to represent the distribution, which we called a "particle representation".
Response to R3: We agree with R3 that a decisiontheoretic approach is appropriate for finding a good policy and active learning is appropriate for generalization. Consistent with this, our experiments explicitly evaluate generalization by looking at regret on unseen test environments, so we are not sure what R3’s point is.
We do think that cognitive scaling is important, and we like the suggestion of bound queries. We have already packed this paper with many experiments, but one great avenue for future work would be to run user studies to investigate the cognitive scalability of various kinds of queries for real humans. For feature queries, while computational complexity is exponential in $K$, we expect that $K\le 3$ will work for most purposes.
R3’s references are related but it’s not clear to us what they would add. Our contribution is not to active preference learning / elicitation in general but more specifically to its application in RL and planning, which requires computing policies for the whole reward function space, with massively increased computational costs (see citation [11]). We emphasize that using mutual information is not a key part of our contribution, but rather to explore a new class of reward learning algorithms.
Thanks for the comments on notation, we will incorporate these into the paper.