Abstract
In preferencebased reinforcement learning (RL), an agent interacts with theenvironment while receiving preferences instead of absolute feedback. Whilethere is increasing research activity in preferencebased RL, the design offormal frameworks that admit tractable theoretical analysis remains an openchallenge. Building upon ideas from preferencebased bandit learning andposterior sampling in RL, we present Dueling Posterior Sampling (DPS), whichemploys preferencebased posterior sampling to learn both the system dynamicsand the underlying utility function that governs the user's preferences.Because preference feedback is provided on trajectories rather than individualstate/action pairs, we develop a Bayesian approach to solving the creditassignment problem, translating user preferences to a posterior distributionover state/action reward models. We prove an asymptotic noregret rate for DPSwith a Bayesian logistic regression credit assignment model; to our knowledge,this is the first regret guarantee for preferencebased RL. We also discusspossible avenues for extending this proof methodology to analyze other creditassignment models. Finally, we evaluate the approach empirically, showingcompetitive performance against existing baselines.
Quick Read (beta)
Dueling Posterior Sampling for PreferenceBased Reinforcement Learning
Abstract
In preferencebased reinforcement learning (RL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in preferencebased RL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. Building upon ideas from preferencebased bandit learning and posterior sampling in RL, we present Dueling Posterior Sampling (DPS), which employs preferencebased posterior sampling to learn both the system dynamics and the underlying utility function that governs the user’s preferences. Because preference feedback is provided on trajectories rather than individual state/action pairs, we develop a Bayesian approach to solving the credit assignment problem, translating user preferences to a posterior distribution over state/action reward models. We prove an asymptotic noregret rate for DPS with a Bayesian logistic regression credit assignment model; to our knowledge, this is the first regret guarantee for preferencebased RL. We also discuss possible avenues for extending this proof methodology to analyze other credit assignment models. Finally, we evaluate the approach empirically, showing competitive performance against existing baselines.
Dueling Posterior Sampling for PreferenceBased Reinforcement Learning
Ellen R. Novoseller^{1}, Yanan Sui^{2}, Yisong Yue^{1}, Joel W. Burdick^{1} ^{1} Department of Computing and Mathematical Sciences, California Institute of Technology {[email protected], [email protected], [email protected]} ^{2} Department of Computer Science, Stanford University, [email protected]
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
In many domains, ranging from clinical trials [40] to autonomous driving [36] and humanrobot interaction [26], it can be unclear how to define a reward signal for reinforcement learning (RL). In such situations, the RL agent seeks to interact optimally with a human user; thus, rewards should reflect the extent to which the algorithm achieves the user’s goals. Yet, for many systems, for instance in autonomous driving [8] and robotics [7, 3], users have difficulty with both specifying numerical reward functions and providing demonstrations of desired behavior. Furthermore, a misspecified reward function can result in “reward hacking” [6], which occurs when the agent learns an undesirable behavior that through some loophole, achieves a high reward. In such cases, the user’s preferences form a more reliable measure of desired system behavior, and the preference data may be leveraged in place of a standard numerical reward signal.
We thus study the problem of preferencebased reinforcement learning (PBRL), where the RL agent executes a pair of trajectories, and the user provides (noisy) preference feedback regarding which trajectory has higher utility. While the study of PBRL has seen increased interest in recent years [18, 13, 48], it remains an open challenge to design formal frameworks that admit tractable theoretical analysis. Compared to the preferencebased bandit setting, which has seen significant theoretical progress (e.g., [53, 56, 2, 44, 15, 55, 32, 52, 41, 42]), one major challenge is how to address credit assignment when only receiving feedback at the trajectory level compared to the state/action level.
In this paper, we present Dueling Posterior Sampling (DPS), which uses preferencebased posterior sampling to tackle the PBRL problem in the Bayesian regime. Posterior sampling (also known as Thompson sampling) [45, 29, 20, 1, 30] is a Bayesian modelbased approach to balancing exploration and exploitation, thereby enabling the algorithm to efficiently learn models of both the environment’s state transition dynamics and the reward function. Previous work on posterior sampling in RL [29, 20, 1, 30] all focused on learning from absolute rewards, while we show how to extend posterior sampling to both elicit and learn from trajectorylevel preference feedback.
To elicit preference feedback, at every episode of learning, DPS draws two independent samples from the posterior to generate two trajectories. This approach is inspired by the SelfSparring algorithm proposed for the bandit setting [41]. Our theoretical analysis is quite different from that in [41], due to the need to incorporate trajectorylevel preference learning and state transition dynamics.
To learn from preference feedback, DPS internally maintains a Bayesian state/action reward model that explains the preferences. In other words, this reward model is a solution to the temporal credit assignment problem [3, 56, 44, 13, 51, 48] and determines which of the encountered states and actions are responsible for the trajectorylevel preference feedback. Learning from trajectorylevel preferences is in general a very challenging problem, as information about the rewards is sparse (often just one bit), is only relative to the pair of trajectories being compared, and does not explicitly include information about actions within trajectories. We thus develop our approach while restricting to standard Bayesian realizability assumptions inherent to most posterior sampling approaches.
We developed DPS concurrently with an analysis framework for characterizing regret convergence in the episodic learning setting. To justify our overall approach, we show how to mathematically integrate Bayesian credit assignment and draw dueling samples within the conventional posterior sampling framework. We evaluate several possible Bayesian credit assignment models, and prove an asymptotic noregret rate for DPS using Bayesian logistic regression [5, 28] as the credit assignment model. To our knowledge, this is the first PBRL approach with theoretical guarantees. In addition, we also demonstrate that DPS delivers competitive performance in simulation.
2 Related work
Posterior sampling. Balancing exploration and exploitation is a key problem in reinforcement learning (RL) and bandits. In the episodic learning setting, the agent typically aims to balance exploration and exploitation to minimize its regret, i.e., the gap between the expected total rewards of the agent and the optimal policy. Posterior sampling, first proposed in [45], is a Bayesian approach toward achieving this goal, and iterates between (1) updating the posterior of a Bayesian environment model and (2) sampling from this posterior to inform the subsequent policy. In both the bandit and RL settings, posterior sampling has been demonstrated to perform competitively in experiments and enjoy favorable theoretical properties in terms of its regret [30, 29, 1, 12].
Our approach builds upon two prior posterior sampling algorithms: SelfSparring [41] for preferencebased bandit learning (also known as dueling bandits [53]) and posterior sampling RL [29]. SelfSparring [41] is a posterior sampling approach, and draws multiple samples to “duel” or “spar” via preference elicitation. The algorithm iteratively: a) draws multiple samples from the posterior model of each action’s reward; b) for each sampled model, executes the action with the highest sampled reward; c) queries for preference feedback between the executed actions; and d) updates the posterior according to the acquired preference data. In [41], the authors prove an asymptotic noregret guarantee for SelfSparring with independent BetaBernoulli reward models for each action.
Within RL, posterior sampling has been applied to the finitehorizon setting with absolute rewards [29]. Posterior sampling RL iterates over four steps: a) draw a sample from the Bayesian posterior of the dynamics and rewards; b) compute the optimal policy for the sampled system; c) execute the policy to get a rollout trajectory; and d) update the posteriors with the new observations from the rollout. In [29], the authors show the expected regret is $O(hS\sqrt{AT\text{log}(SAT)})$, for number of timesteps $T$, finite time horizon $h$, and discrete state and action spaces of sizes $S$ and $A$, respectively.
A third line of relevant work is posterior sampling for Bayesian logistic regression [12, 16, 37], which is used as our Bayesian credit assignment model. One difficulty with Bayesian logistic regression [5] is the lack of a closedform posterior. To handle this, we adopt the approach of [12] and use a Laplace approximation. Other approaches include using Gibbs sampling algorithm [16]. One relevant related application is [35], who apply Bayesian logistic regression to the multiobjective multiarmed bandit problem; to determine the utilities that a human assigns to different objectives, the algorithm queries for pairwise preferences between expected reward vectors corresponding to different actions.
Preferencebased learning. Previous work on preferencebased RL (PBRL) has shown successful performance in a number of applications, such as playing games [13], learning human preferences for autonomous driving [36], and selecting a robot’s controller parameters [26, 4]. Yet, to our knowledge, the PBRL literature still lacks theoretical guarantees.
Existing approaches for trajectorylevel preferencebased RL may be broadly divided into three categories [47]: a) directly optimizing policy parameters [46, 11, 26]; b) learning a preference model to predict action preferences in each state [18]; and c) learning a utility function to characterize the rewards, returns, or values of state/action pairs [49, 50, 3, 51, 13]. In c), the utility is often modeled as linear in the trajectory features. If those features are defined as visitations to each state/action pair, then maximizing utility directly corresponds to maximizing the total (undiscounted) reward.
One popular paradigm, which we also adopt, is PBRL with underlying utility functions. By inferring state/action rewards from preference feedback, one can derive relativelyinterpretable reward models and also use such methods as value iteration. In addition, utilitybased approaches may be more sample efficient compared to policy search and preference relation methods [47], as they extract more information from each observation. Notably, [46] learn a Bayesian model over policy parameters, and draw samples from its posterior to inform actions. From existing PBRL methods, their algorithm perhaps most resembles ours; however, compared to utilitybased approaches, policy search methods typically require either more samples or expert knowledge to craft the policy parameters [48, 26].
Beyond RL, preferencebased learning has been the subject of much research. The closest to RL is the bandit setting [53, 56, 2, 44, 15, 55, 32, 52, 41, 42], which is essentially a singlestate variant of RL. Other settings include: active learning [36, 23, 17], which is focused exclusively on learning an accurate model rather than maximizing utility of decisionmaking; learning with more structured preference feedback [31, 38, 33, 39], where the learner receives more than one bit of information per preference elicitation; and batch supervised settings such as learning to rank [22, 14, 25, 9, 54, 10, 27].
3 Problem statement
Preliminaries. We consider fixedhorizon Markov Decision Processes (MDPs), in which rewards are replaced by preferences over trajectories. This class of MDPs can be represented as a tuple, $\mathcal{M}=(\mathcal{S},\mathcal{A},\succ ,\varphi ,p,{p}_{0},h)$, where the state space $\mathcal{S}$ and action space $\mathcal{A}$ are finite sets. The agent, using policy $\pi $, episodically interacts with the environment with length$h$ rollout trajectories of the form $\tau =\{{s}_{1},{a}_{1},{s}_{2},{a}_{2},\mathrm{\dots},{s}_{h},{a}_{h},{s}_{h+1}\}$. Since we are eliciting preference feedback, in each episode $i$, the agent executes two rollouts ${\tau}_{i1}$ and ${\tau}_{i2}$, and observes a preference between the two. The initial state is sampled from ${p}_{0}$, while $p$ defines the transition dynamics: ${s}_{t+1}\sim p(\cdot {s}_{t},{a}_{t})$.
We use $\succ $ to denote the stochastic preference relationship between trajectories, and $\varphi (\tau ,{\tau}^{\prime})=\mathbb{P}(\tau >{\tau}^{\prime})\in [0,1]$ to capture the feedback generation mechanism. We assume that $\succ $ is a total ordering over trajectories, and $\tau \succ {\tau}^{\prime}\iff \varphi (\tau ,{\tau}^{\prime})>\frac{1}{2}$. We use $\tau >{\tau}^{\prime}$ to denote the event that trajectory $\tau $ was preferred over ${\tau}^{\prime}$ in a preference elicitation, i.e., $\tau >{\tau}^{\prime}$ is observed with probability $\varphi (\tau ,{\tau}^{\prime})$. We further assume an underlying utility function $\overline{r}(\tau )$ for each trajectory, such that $\tau \succ {\tau}^{\prime}\iff \overline{r}(\tau )>\overline{r}({\tau}^{\prime})$, and define $\varphi $ using $\overline{r}$. For instance, if the preferences are noiseless, then: $\varphi ({\tau}_{i},{\tau}_{j})=\mathbb{I}[\overline{r}({\tau}_{i})>\overline{r}({\tau}_{j})]$. Alternatively, $\varphi $ could be the linear link function [2]: ${\varphi}_{\text{lin}}({\tau}_{i},{\tau}_{j}):=(1+\overline{r}({\tau}_{i})\overline{r}({\tau}_{j}))/2$. We primarily assume a logistic or BradleyTerry link function: ${\varphi}_{\text{log}}({\tau}_{i},{\tau}_{j}):={[1+\mathrm{exp}(c(\overline{r}({\tau}_{i})\overline{r}({\tau}_{j})))]}^{1}$ with “temperature” $c\in (0,\mathrm{\infty})$. Our problem setting resembles the PSDP defined in [50], except that additionally, we incorporate the noise model through which the underlying utilities are stochastically translated to preferences. Finally, we assume that the utilities decompose additively: $\overline{r}(\tau )\equiv {\sum}_{i=1}^{h}\overline{r}({s}_{i},{a}_{i})$ for state/action pairs in $\tau $.
Given a policy $\pi $, we can define the standard RL value function as the expected total utility of being in state $s$ at step $i$, and following policy $\pi $:
$${V}_{\pi ,i}(s)=\mathbb{E}[\sum _{j=i}^{h}\overline{r}({s}_{j},\pi ({s}_{j})){s}_{i}=s],$$  (1) 
and now we can define the optimal policy ${\pi}^{*}$ as the one with maximal value for all input states. Note that ${\mathbb{E}}_{{s}_{1}\sim {p}_{0}}\left[{V}_{\pi ,1}({s}_{1})\right]\equiv {\mathbb{E}}_{\tau \sim \pi ,\mathcal{M}}\left[\overline{r}(\tau )\right]$. Given fully specified dynamics and reward models, $p$ and $\overline{r}$, it is straightforward to apply standard dynamic programming approaches such as value iteration to arrive at the optimal policy under $p$ and $\overline{r}$ [43]. The goal of learning, then, is infer $p$ and $\overline{r}$ to the extent necessary for good decisionmaking.
Learning problem. In each iteration (or episode) $i$, the agent selects two policies, ${\pi}_{i1}$ and ${\pi}_{i2}$. The two policies are rolled out to obtain trajectories ${\tau}_{i1}$ and ${\tau}_{i2}$, and a binary preference ${b}_{i}\in \{0,1\}$ between them is sampled according to the underlying utilities of ${\tau}_{i1}$ and ${\tau}_{i2}$. We quantify the performance of the learning agent using expected cumulative regret relative to the optimal policy:
$$\mathbb{E}[{\mathrm{\text{Reg}}}_{T}]=\mathbb{E}\left\{\sum _{i=1}^{\lceil T/(2h)\rceil}\sum _{s\in \mathcal{S}}{p}_{0}(s)\left[2{V}_{{\pi}^{*},1}(s){V}_{{\pi}_{i1},1}(s){V}_{{\pi}_{i2},1}(s)\right]\right\}.$$  (2) 
To minimize regret, the agent must balance exploration (collecting new data) with exploitation (behaving optimally w.r.t. existing models). Overexploration of bad trajectories will incur large regret, and underexploration can prevent converging to the optimal policy. In contrast to the standard formulation in RL [29], at each iteration/episode we compare the utilities of both selected policies.
4 Algorithm
As outlined in Algorithm 1, Dueling Posterior Sampling (DPS) iterates over three main steps: (a) sample two policies ${\pi}_{1},{\pi}_{2}$ from the Bayesian posteriors of the dynamics and utility models (Advance – Algorithm 2); (b) roll out ${\pi}_{1}$ and ${\pi}_{2}$ to obtain trajectories ${\tau}_{1}$ and ${\tau}_{2}$, and receive preference feedback between them; (c) store the new state transitions and feedback and update the posterior (feedback – Algorithm 3). Compared to conventional posterior sampling with absolute feedback [29], the two key differences are that: two policies are sampled rather than one each iteration, and a credit assignment problem is solved when learning from feedback.
Advance (Algorithm 2) samples from the Bayesian posteriors of the dynamics and utility models. The sampled dynamics and utilities form an MDP, and value iteration is used to derive the optimal policy $\pi $ under the sample. One can also think of $\pi $ as a random function whose randomness depends on the sampling of the dynamics and utility models. In the Bayesian setting, it can be shown that $\pi $ is sampled according to its posterior probability of being the true optimal policy ${\pi}^{*}$ [29, 30]. Intuitively, peaked (i.e., certain) posteriors lead to less variability when sampling $\pi $, which implies less exploration. On the other hand, diffuse (i.e., uncertain) posteriors lead to greater variability when sampling $\pi $, which implies more exploration.
feedback (Algorithm 3) updates the Bayesian posteriors of the dynamics and utility models based on new data. Updating the dynamics posterior is relatively straightforward, as we assume that the dynamics are fullyobserved; for instance, the dynamics prior can be modeled via Dirichlet distributions with multinomial conjugate observation likelihoods [29]. In contrast, performing Bayesian inference over state/action utilities from trajectorylevel feedback is much more challenging. We considered a range of approaches (see Appendix A1), and found Bayesian logistic regression (Section 4.1) to both be wellperforming and admit tractable analysis within our theoretical framework.
4.1 Bayesian logistic regression for utility inference and credit assignment
Credit assignment [47] is the problem of inferring which state/action pairs are responsible for observed trajectorylevel preferences. We detail a Bayesian logistic regression approach to address this task in our setting. Logistic regression is a binary classification method that learns a weight vector $\bm{w}$ for the model $p(y=1\bm{x},\bm{w})=\frac{1}{1+\text{exp}({\bm{w}}^{T}\bm{x})}$. Bayesian logistic regression [5, 28] maintains a posterior over possible weight vectors. Because there is no convenient prior yielding a closedform conjugate posterior, we use the Laplace approximation to the posterior as specified below.
Preliminaries. Let $N$ be the number of trajectories pairs observed so far, and $D=SA$ be the total number of state/action pairs. Let ${\bm{x}}_{ij}\in {\mathbb{R}}^{D},j\in \{1,2\}$ be the visitation vector corresponding to trajectory ${\tau}_{ij}$, with the $k$^{th} element ${\bm{x}}_{ij}^{(k)}$ being the number of times that state/action pair $k$ was visited in ${\tau}_{ij}$. Define ${\bm{x}}_{i}:={\bm{x}}_{i1}{\bm{x}}_{i2}$. The observation matrix $X$ and label vector $\bm{y}$ are defined as:
$X=\left[\begin{array}{c}\hfill {({\bm{x}}_{11}{\bm{x}}_{12})}^{T}\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill {({\bm{x}}_{N1}{\bm{x}}_{N2})}^{T}\hfill \end{array}\right],\bm{y}=\left[\begin{array}{c}\hfill {y}_{1}\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill {y}_{N}\hfill \end{array}\right]=\left[\begin{array}{c}\hfill 2{\mathbb{I}}_{[{\tau}_{11}>{\tau}_{12}]}1\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill 2{\mathbb{I}}_{[{\tau}_{N1}>{\tau}_{N2}]}1\hfill \end{array}\right],$  (3) 
where the expression $2{\mathbb{I}}_{[{\tau}_{i1}>{\tau}_{i2}]}1$ results in labels ${y}_{i}$ with values in $\{1,1\}$.
The observation matrix $X\in {\mathbb{R}}^{N\times D}$ has rank at most $D1$, since the elements of ${\bm{x}}_{i}={\bm{x}}_{i1}{\bm{x}}_{i2}$ must sum to zero for each row. To obtain a fullrowrank observation matrix for Bayesian logistic regression, we transform $X\in {\mathbb{R}}^{N\times D}$ to $Z\in {\mathbb{R}}^{N\times (D1)}$ via the matrix $V=\left[\begin{array}{ccc}\hfill {\bm{v}}_{1}\hfill & \hfill \mathrm{\dots}\hfill & \hfill {\bm{v}}_{D1}\hfill \end{array}\right]\in {\mathbb{R}}^{D\times (D1)}$, where the columns ${\bm{v}}_{i}\in {\mathbb{R}}^{D}$ form an orthonormal basis spanning the $(D1)$dimensional, full possible row space of $X$. To obtain the vector ${\bm{z}}_{i}\in {\mathbb{R}}^{D1}$ that expresses ${\bm{x}}_{i}$ in this basis, apply:
$${\bm{z}}_{i}={[{\bm{x}}_{i}^{T}{\bm{v}}_{1}\mathrm{\dots}{\bm{x}}_{i}^{T}{\bm{v}}_{D1}]}^{T}={V}^{T}{\bm{x}}_{i},$$  (4) 
while converting any vector ${\bm{z}}_{i}\in {\mathbb{R}}^{D1}$ back to the original basis can be accomplished via:
$${\bm{x}}_{\bm{i}}=\sum _{j=1}^{D1}{z}_{ij}{\bm{v}}_{j}=V{\bm{z}}_{i},\text{where}{z}_{ij}\text{is the}j\text{th}\text{element of}{\bm{z}}_{i}.$$  (5) 
Note that this transformation preserves inner products. Equation (5) can be applied to show:
${\bm{x}}_{1}^{T}{\bm{x}}_{2}$  $={\left({\displaystyle \sum _{i=1}^{D1}}{z}_{1i}{\bm{v}}_{i}\right)}^{T}\left({\displaystyle \sum _{j=1}^{D1}}{z}_{2j}{\bm{v}}_{j}\right)={\bm{z}}_{1}^{T}{\bm{z}}_{2}\text{, by orthonormality of}\{{\bm{v}}_{i}\}.$  (6) 
In particular, the transformation preserves orthogonality, so that $X$ and $Z$ have the same rowrank and ${X}^{T}X$ and ${Z}^{T}Z$ have the same rank.
Utility model & posterior inference. We fit a Bayesian logistic regression model to the transformed data $(Z,\bm{y})$. Afterwards, this model predicts the probability that $\tau $ is preferred to ${\tau}^{\prime}$ as a logistic regression function of their visitation vector differences ${\bm{x}}_{\tau}{\bm{x}}_{{\tau}^{\prime}}$. The model parameters correspond exactly to the state/action utilities $\overline{r}$. The model internally computes an elementwise product between ${\bm{x}}_{\tau}{\bm{x}}_{{\tau}^{\prime}}$ and estimated reward vector $\bm{r}$, within the $(D1)$dimensional space given by (4). Given the inner product equivalence (6), this is exactly the trajectory utility, and taking the expectation over trajectories generated by a policy is exactly the value function (1). We show in our experiments that Bayesian logistic regression can robustly learn even with preference modeling mismatch.
We are chiefly interested in sampling from the posterior of parameter/utility vector $\overline{\bm{r}}\in {\mathbb{R}}^{D}$, which can be combined with the sampled dynamics to perform value iteration and obtain a policy. As shown below, via the Laplace approximation, the posterior is Gaussian distributed, and thus can be easily sampled. The internal utility representation lies in ${\stackrel{\mathbf{~}}{\bm{r}}}^{\prime}\in {\mathbb{R}}^{D1}$, and we convert to $\stackrel{\mathbf{~}}{\bm{r}}\in {\mathbb{R}}^{D}$ via (5).
We now describe the Bayesian logistic regression step itself. A Gaussian prior is defined over the utilities ${\bm{r}}^{\mathbf{\prime}}\in {\mathbb{R}}^{D1}$: $p({\bm{r}}^{\mathbf{\prime}})\sim \mathcal{N}({\bm{r}}^{\mathbf{\prime}}\bm{r}^{\mathbf{\prime}}{}_{0},{V}_{0}^{\prime})$. The logistic regression likelihood is:
$p(Z,\bm{y}{\bm{r}}^{\mathbf{\prime}})={\displaystyle \prod _{i=1}^{N}}p({\bm{z}}_{i},{y}_{i}{\bm{r}}^{\mathbf{\prime}})={\displaystyle \prod _{i=1}^{N}}{\displaystyle \frac{1}{1+\text{exp}({y}_{i}{\bm{z}}_{i}^{T}{\bm{r}}^{\mathbf{\prime}})}}.$  (7) 
We approximate the posterior as Gaussian via the Laplace approximation:
$p({\bm{r}}^{\mathbf{\prime}}Z,\bm{y})$  $\approx \mathcal{N}({\bm{r}}^{\mathbf{\prime}}{\widehat{\bm{r}}}^{\mathbf{\prime}},{H}^{1})\text{, where:}$  (8)  
${\widehat{\bm{r}}}^{\mathbf{\prime}}$  $=\underset{{\bm{r}}^{\mathbf{\prime}}}{\text{argmin}}f({\bm{r}}^{\mathbf{\prime}}),f({\bm{r}}^{\mathbf{\prime}}):=\text{log}p(Z,\bm{y},{\bm{r}}^{\mathbf{\prime}})=\text{log}p({\bm{r}}^{\mathbf{\prime}})\text{log}p(Z,\bm{y}{\bm{r}}^{\mathbf{\prime}}),$  (9)  
$H$  $={{\nabla}_{{\bm{r}}^{\mathbf{\prime}}}^{2}f({\bm{r}}^{\mathbf{\prime}})}_{{\widehat{\bm{r}}}^{\mathbf{\prime}}},\text{and where the optimization problem in (}\text{9}\text{) is convex.}$  (10) 
To show a regret convergence using this approximate posterior, we leverage asymptotic normality of the maximum likelihood estimator of logistic regression in our proofs.
5 Theoretical results
We now sketch our asymptotic noregret analysis for Dueling Posterior Sampling (DPS) with Bayesian logistic regression. The full proof is in Appendix A2, while Appendix A2.1 discusses possible avenues for extending this proof methodology toward additional credit assignment models. The proof has two main parts: first proving that DPS with logistic credit assignment is asymptotically consistent (Theorem 1), and then proving that DPS has a sublinear regret rate (Theorem 2). Both parts leverage results on the asymptotic behavior of logistic regression [21]. As before, we consider data $Z\in {\mathbb{R}}^{N\times (D1)}$ and labels $\bm{y}\in {\mathbb{R}}^{N}$, with ${[Z]}_{ij}={z}_{ij}$. To show that DPS is asymptotically consistent in learning the reward function, we first provide some definitions and necessary conditions.
Definition 1 (Derivative of sigmoid).
$f:\mathbb{R}\u27f6\mathbb{R}$, where $f\mathit{}\mathrm{(}x\mathrm{)}\mathrm{=}\frac{{e}^{\mathrm{}x}}{{\mathrm{(}\mathrm{1}\mathrm{+}{e}^{\mathrm{}x}\mathrm{)}}^{\mathrm{2}}}$. Note $f\mathit{}\mathrm{(}x\mathrm{)}\mathrm{=}f\mathit{}\mathrm{(}\mathrm{}x\mathrm{)}$.
Definition 2.
Let $\overline{\mathbf{r}}\mathrm{\in}{\mathrm{R}}^{D}$ be the vector of true state/action utilities (assumed to exist) and ${\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{\in}{\mathrm{R}}^{D\mathrm{}\mathrm{1}}$ be its transformation via (4). Define $\stackrel{\mathrm{~}}{\mathbf{r}}^{\mathrm{\prime}}{}_{k}\mathrm{\in}{\mathrm{R}}^{D\mathrm{}\mathrm{1}}$ as the state/action rewards sampled from the Bayesian logistic regression model posterior in episode $k$, $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{k}\mathrm{\in}{\mathrm{R}}^{D\mathrm{}\mathrm{1}}$ as the model’s maximum a posteriori (MAP) estimate at episode $k$, and $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{M\mathit{}L\mathrm{,}k}\mathrm{\in}{\mathrm{R}}^{D\mathrm{}\mathrm{1}}$ as its maximum likelihood estimate at $k$. Lastly, ${\stackrel{\mathrm{~}}{\mathbf{r}}}_{k}\mathrm{\in}{\mathrm{R}}^{D}$, ${\widehat{\mathbf{r}}}_{k}\mathrm{\in}{\mathrm{R}}^{D}$, and ${\widehat{\mathbf{r}}}_{M\mathit{}L\mathrm{,}k}\mathrm{\in}{\mathrm{R}}^{D}$ are their respective equivalents given by (5).
Condition 1.
$$ such that $\mathrm{}{z}_{i\mathit{}j}\mathrm{}\mathrm{\le}{m}_{\mathrm{0}}$ for all $i\mathrm{\in}\mathrm{\{}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}N\mathrm{\}}\mathrm{,}j\mathrm{\in}\mathrm{\{}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}D\mathrm{}\mathrm{1}\mathrm{\}}$.
Condition 2.
Let ${\lambda}_{\mathrm{1}}^{\mathrm{(}k\mathrm{)}}$ and ${\lambda}_{D\mathrm{}\mathrm{1}}^{\mathrm{(}k\mathrm{)}}$ be the largest and smallest eigenvalues, respectively, of ${\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}$. Then, $$ such that $$, for all k.
Proposition 1 (Asymptotic consistency of logistic regression [21]).
If Conditions 1 and 2 are satisfied, then the maximum likelihood estimator $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{M\mathit{}L\mathrm{,}k}$ of ${\overline{\mathbf{r}}}^{\mathrm{\prime}}$ exists almost surely as $k\mathrm{\u27f6}\mathrm{\infty}$, and $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{M\mathit{}L\mathrm{,}k}$ converges almost surely to the true values ${\overline{\mathbf{r}}}^{\mathrm{\prime}}$ if and only if $\underset{k\mathrm{\u27f6}\mathrm{\infty}}{\mathrm{lim}}\mathit{}{\lambda}_{D\mathrm{}\mathrm{1}}^{\mathrm{(}k\mathrm{)}}\mathrm{=}\mathrm{\infty}$.
We first show that Proposition 1’s final condition is satisfied with known transition dynamics, and afterwards consider the convergence of the dynamics model posterior.
Lemma 1.
Under known transition dynamics, all eigenvalues of the matrix ${\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}$ approach infinity as $k\mathrm{\u27f6}\mathrm{\infty}$.
Lemma 2 (Convergence of dynamics model).
Given Lemma 1, DPS’s dynamics model converges to the true dynamics, and as it converges, all eigenvalues of ${\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}$ approach infinity.
Combining these results, we obtain:
Theorem 1 (Asymptotic consistency of DPS).
If there exists a reward function such that a logistic regression model explains the preference feedback, then DPS with a Bayesian logistic regression credit assignment model will learn an asymptotically consistent reward model.
We turn next to characterizing the regret rate of DPS. We apply two prior results, one from [21] regarding the asymptotic distribution of the logistic regression maximum likelihood estimate (Prop. 2), and the other from [29] regarding a regret bound for posterior sampling RL (Prop. 3).
Proposition 2 (Asymptotic normality of logistic regression maximum likelihood estimator [21]).
If Conditions 1 and 2 are satisfied, and if $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{M\mathit{}L\mathrm{,}k}$ converges almost surely to the true value ${\overline{\mathbf{r}}}^{\mathrm{\prime}}$, then:
$${\left[\sum _{i=1}^{k}f({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{ML,k}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{\frac{1}{2}}(\widehat{\bm{r}}^{\mathbf{\prime}}{}_{ML,k}{\overline{\bm{r}}}^{\mathbf{\prime}})\stackrel{\mathit{D}}{\u27f6}\mathcal{N}(\mathrm{\U0001d7ce},\mathbb{I})\mathit{\text{as}}k\u27f6\mathrm{\infty},$$  (11) 
where $\stackrel{\mathit{D}}{\mathrm{\u27f6}}$ implies convergence in distribution and ${Q}^{\frac{\mathrm{1}}{\mathrm{2}}}$ is the positive definite matrix associated with positive definite matrix $Q$ such that ${\mathrm{[}{Q}^{\frac{\mathrm{1}}{\mathrm{2}}}\mathrm{]}}^{\mathrm{2}}\mathrm{=}Q$.
Proposition 3 (Expected regret of posterior sampling RL [29]).
Posterior sampling RL has expected $T$step regret $O\mathit{}\mathrm{(}h\mathit{}S\mathit{}\sqrt{A\mathit{}T\mathit{}\text{\mathit{l}\mathit{o}\mathit{g}}\mathit{}\mathrm{(}S\mathit{}A\mathit{}T\mathrm{)}}\mathrm{)}\mathrm{,}$ with horizon $h$ and numbers of states and actions $S$ and $A$.
Leveraging these results, we show that under preference feedback, the regret can be decomposed into two terms: one that reflects the converging dynamics model, and one that reflects the converging reward model (inferred from trajectorylevel preference feedback).
Lemma 3 (Regret decomposition).
The expected regret of DPS can be decomposed into two terms. One term can be bounded by the regret bound of [29], stated in Proposition (3). The other is bounded by: $h\mathit{}{\mathrm{\sum}}_{k\mathrm{=}\mathrm{1}}^{\mathrm{\lceil}T\mathrm{/}h\mathrm{\rceil}}\mathrm{E}\mathit{}\mathrm{[}{\mathrm{}\overline{\mathbf{r}}\mathrm{}{\stackrel{\mathrm{~}}{\mathbf{r}}}_{k}\mathrm{}}_{\mathrm{\infty}}\mathrm{]}\mathrm{\le}h\mathit{}{\mathrm{\sum}}_{k\mathrm{=}\mathrm{1}}^{\mathrm{\lceil}T\mathrm{/}h\mathrm{\rceil}}\mathrm{E}\mathit{}\mathrm{[}{\mathrm{}{\widehat{\mathbf{r}}}_{k}\mathrm{}\overline{\mathbf{r}}\mathrm{}}_{\mathrm{\infty}}\mathrm{]}\mathrm{+}h\mathit{}{\mathrm{\sum}}_{k\mathrm{=}\mathrm{1}}^{\mathrm{\lceil}T\mathrm{/}h\mathrm{\rceil}}\mathrm{E}\mathit{}\mathrm{[}{\mathrm{}{\widehat{\mathbf{r}}}_{\mathbf{k}}\mathrm{}{\stackrel{\mathrm{~}}{\mathbf{r}}}_{k}\mathrm{}}_{\mathrm{\infty}}\mathrm{]}\mathrm{.}$
The final result is obtained by analyzing convergence of the samples ${\stackrel{\mathbf{~}}{\bm{r}}}_{k}$ to ${\widehat{\bm{r}}}_{k}$ and of the credit assignment model ${\widehat{\bm{r}}}_{k}$ to $\overline{\bm{r}}$:
Theorem 2 (Asymptotic regret rate of DPS).
If there exists a reward function such that a logistic regression model explains the preferences, then DPS has an asymptotic noregret rate of $O\mathit{}\mathrm{\left(}h\mathit{}S\mathit{}\sqrt{A\mathit{}T\mathit{}\text{\mathit{l}\mathit{o}\mathit{g}}\mathit{}\mathrm{(}S\mathit{}A\mathit{}T\mathrm{)}}\mathrm{+}h\mathit{}\sqrt{\frac{S\mathit{}A}{{c}_{\mathrm{0}}}\mathit{}T\mathit{}\text{\mathit{l}\mathit{o}\mathit{g}}\mathit{}\mathrm{(}T\mathrm{)}}\mathrm{\right)}$, where ${c}_{\mathrm{0}}$ is a minimum rate at which eigenvalues of ${\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}$ increase linearly with collection of samples ${\mathbf{z}}_{i}$ that impact those eigenvalues.
6 Experiments
We validate the empirical performance of Dueling Posterior Sampling (DPS) on two simulated domains with varying levels of preference noise, as well as using alternative credit assignment models. We find that DPS generally performs well and outperforms standard PBRL baselines [49].
Experimental setup. We evaluate on two simulated environments: RiverSwim and random MDPs. The RiverSwim environment [29] has six states and two actions (actions 0 and 1); the optimal policy is to always choose action 1, which maximizes the probability of reaching a particular goal state/action pair. Meanwhile, a suboptimal policy—yielding a much smaller reward compared to the goal—is quickly and easily discovered and incentivizes the agent to always select action 0. The algorithm must demonstrate sufficient exploration to have hope of discovering the optimal policy quickly.
In the second simulated environment, we generate random MDPs as in [29] with 10 states and 5 actions. The transition dynamics and rewards are generated from Dirichlet (all parameters set to 0.1) and exponential (rate parameter set to 5) distributions, respectively. The parameter for these distributions were chosen to generate MDPs with sparse dynamics and rewards. The sampled reward vectors were shifted and normalized so that the minimum reward is zero and their mean is one.
In both of these environments, preferences between pairs of trajectories were generated by (noisily) comparing the total rewards that they accumulated; this reward information was hidden from the learning algorithm, which observed only the trajectory preferences and state transitions. Preference noise is generated according to a logistic model: for trajectories ${\tau}_{i}$ and ${\tau}_{j}$, $P({\tau}_{i}>{\tau}_{j})={\{1+\text{exp}[c(\overline{r}({\tau}_{i})\overline{r}({\tau}_{j}))]\}}^{1}$, where $\overline{r}({\tau}_{i})$ and $\overline{r}({\tau}_{j})$ are the total rewards accrued by the two trajectories, respectively, while the hyperparameter $c$ controls the degree of noisiness.
Methods compared. We evaluate DPS under four different noise levels ($c\in \{0.1,0.5,1,1000\}$) and three credit assignment models: 1) Bayesian logistic regression, 2) Bayesian linear regression, and 3) Gaussian process regression, where the latter two methods are described in Appendix A1. In addition, we evaluate the EveryVisit Preference Monte Carlo (EPMC) algorithm with probabilistic credit assignment [50, 47] as a baseline. Lastly, we compare against the posterior sampling RL algorithm [29], which learns using the true numerical rewards at each step, and thus yields an upperbound on the performance that a preferencebased algorithm could achieve.
Results. Figure 4 shows the performance comparison for $c=1$ in both environments, as well as $c=1,000$ in RiverSwim (additional results are in Appendix A3, including hyperparameter details). DPS performs well in all simulations, and significantly outperforms the EPMC baseline. This may be because the EPMC algorithm uses a uniform exploration strategy, while DPS prioritizes exploration by sampling high rewards in more uncertain regions of the state/action space. Notice that $c=$ 1,000 results in nearlynoiseless preferences; this can decrease performance in RiverSwim in some cases, since preference noise can help the agent to escape the local minimum. We also see that DPS is competitive with PSRL, which has access to the full cardinal rewards at each state/action. Finally, we see that the performance of DPS is robust to the choice of credit assignment model, and in fact using Gaussian process regression (for which we do not have an endtoend regret analysis) often leads to the best empirical performance. These results suggest that DPS is a practically promising approach that can robustly incorporate many modeling approaches as subroutines.
7 Conclusion
We investigate the preferencebased reinforcement learning problem, which receives comparative preferences instead of absolute realvalued rewards as feedback. We develop the Dueling Posterior Sampling (DPS) algorithm, which optimizes policies in an highly efficient and flexible way. To our knowledge, DPS is the first preferencebased RL algorithm with a regret guarantee. DPS also performs well in our simulations, and seems practically promising. That makes it both a theoreticallyjustified and practically promising algorithm.
There are many directions for future work. The Bayesian logistic regression model could be improved with more accurate posterior estimates. Assumptions governing the user’s preferences, such as requiring an underlying utility model, could be relaxed. One can also incorporate kernelized methods to further improve sample efficiency. It is also important to extend to other credit assignment models, such as the Gaussian process regression and Bayesian linear regression methods, for which the same concept of the regret decomposition still applies. We expect that DPS would perform well with any asymptotically consistent reward model that sufficiently captures users’ preference behavior.
Acknowledgments
This work was supported by NIH grant EB007615 and an Amazon graduate fellowship.
References
References
 [1] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: Worstcase regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
 [2] Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856–864, 2014.
 [3] Riad Akrour, Marc Schoenauer, and Michèle Sebag. APRIL: Active preference learningbased reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 116–131. Springer, 2012.
 [4] Riad Akrour, Marc Schoenauer, Michèle Sebag, and JeanChristophe Souplet. Programming by feedback. In International Conference on Machine Learning, volume 32, pages 1503–1511. JMLR. org, 2014.
 [5] James H Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American statistical Association, 88(422):669–679, 1993.
 [6] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
 [7] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
 [8] Chandrayee Basu, Qian Yang, David Hungerman, Mukesh Sinahal, and Anca D Dragan. Do you want your autonomous car to drive like you? In 2017 12th ACM/IEEE International Conference on HumanRobot Interaction (HRI, pages 417–425. IEEE, 2017.
 [9] Christopher Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine learning (ICML05), pages 89–96, 2005.
 [10] Christopher J Burges, Robert Ragno, and Quoc V Le. Learning to rank with nonsmooth cost functions. In Advances in neural information processing systems, pages 193–200, 2007.
 [11] Róbert BusaFekete, Balázs Szörényi, Paul Weng, Weiwei Cheng, and Eyke Hüllermeier. Preferencebased evolutionary direct policy search. In ICRA Workshop on Autonomous Learning, 2013.
 [12] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, 2011.
 [13] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
 [14] Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In Proceedings of the 22nd international conference on Machine learning, pages 137–144. ACM, 2005.
 [15] Miroslav Dudík, Katja Hofmann, Robert E Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Conference on Learning Theory (COLT), 2015.
 [16] Bianca Dumitrascu, Karen Feng, and Barbara Engelhardt. PGTS: Improved Thompson sampling for logistic contextual bandits. In Advances in Neural Information Processing Systems, pages 4624–4633, 2018.
 [17] Brochu Eric, Nando D Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances in neural information processing systems, pages 409–416, 2008.
 [18] Johannes Fürnkranz, Eyke Hüllermeier, Weiwei Cheng, and SangHyeun Park. Preferencebased reinforcement learning: A formal framework and a policy iteration algorithm. Machine learning, 89(12):123–156, 2012.
 [19] Subhashis Ghosal et al. Asymptotic normality of posterior distributions in highdimensional linear models. Bernoulli, 5(2):315–331, 1999.
 [20] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, pages 861–898, 2015.
 [21] Christian Gourieroux and Alain Monfort. Asymptotic properties of the maximum likelihood estimator in dichotomous logit models. Journal of Econometrics, 17(1):83–97, 1981.
 [22] R Herbrich, T Graepel, and K Obermayer. Support vector learning for ordinal regression. In 1999 Ninth International Conference on Artificial Neural Networks ICANN 99.(Conf. Publ. No. 470), volume 1, pages 97–102. IET, 1999.
 [23] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
 [24] Thomas Jaksch, Ronald Ortner, and Peter Auer. Nearoptimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
 [25] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd international conference on Machine learning, pages 377–384. ACM, 2005.
 [26] Andras Kupcsik, David Hsu, and Wee Sun Lee. Learning dynamic robottohuman object handover from human feedback. In Robotics research, pages 161–176. Springer, 2018.
 [27] TieYan Liu et al. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval, 3(3):225–331, 2009.
 [28] Kevin P Murphy. Machine learning: A probabilistic perspective. The MIT Press, 2012.
 [29] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
 [30] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 2701–2710. JMLR. org, 2017.
 [31] Filip Radlinski and Thorsten Joachims. Query chains: Learning to rank from implicit feedback. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 239–248. ACM, 2005.
 [32] Siddartha Y Ramamohan, Arun Rajkumar, and Shivani Agarwal. Dueling bandits: Beyond condorcet winners to general tournament solutions. In Advances in Neural Information Processing Systems, pages 1253–1261, 2016.
 [33] Karthik Raman, Thorsten Joachims, Pannaga Shivaswamy, and Tobias Schnabel. Stable coactive learning via perturbation. In International Conference on Machine Learning, pages 837–845, 2013.
 [34] Carl Edward Rasmussen and Christopher K Williams. Gaussian processes for machine learning. The MIT Press, 2(3):4, 2006.
 [35] Diederik M Roijers, Luisa M Zintgraf, and Ann Nowé. Interactive Thompson sampling for multiobjective multiarmed bandits. In International Conference on Algorithmic Decision Theory, pages 18–34. Springer, 2017.
 [36] Dorsa Sadigh, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia. Active preferencebased learning of reward functions. In Robotics: Science and Systems (RSS), 2017.
 [37] Steven L Scott. A modern Bayesian look at the multiarmed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
 [38] Pannaga Shivaswamy and Thorsten Joachims. Online structured prediction via coactive learning. In Proceedings of the 29th International Conference on Machine Learning, pages 59–66. Omnipress, 2012.
 [39] Pannaga Shivaswamy and Thorsten Joachims. Coactive learning. Journal of Artificial Intelligence Research, 53:1–40, 2015.
 [40] Yanan Sui, Vincent Zhuang, Joel Burdick, and Yisong Yue. Stagewise safe Bayesian optimization with Gaussian processes. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4781–4789. PMLR, 10–15 Jul 2018.
 [41] Yanan Sui, Vincent Zhuang, Joel W Burdick, and Yisong Yue. Multidueling bandits with dependent arms. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2017.
 [42] Yanan Sui, Masrour Zoghi, Katja Hofmann, and Yisong Yue. Advancements in dueling bandits. In IJCAI, pages 5502–5510, 2018.
 [43] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [44] Balázs Szörényi, Róbert BusaFekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for PlackettLuce: A dueling bandits approach. In Advances in Neural Information Processing Systems, pages 604–612, 2015.
 [45] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 [46] Aaron Wilson, Alan Fern, and Prasad Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In Advances in neural information processing systems, pages 1133–1141, 2012.
 [47] Christian Wirth. Efficient Preferencebased Reinforcement Learning. PhD thesis, Technische Universität, 2017.
 [48] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preferencebased reinforcement learning methods. The Journal of Machine Learning Research, 18(1):4945–4990, 2017.
 [49] Christian Wirth and Johannes Fürnkranz. EPMC: Every visit preference Monte Carlo for reinforcement learning. In Asian Conference on Machine Learning, pages 483–497, 2013.
 [50] Christian Wirth and Johannes Fürnkranz. A policy iteration algorithm for learning from preferencebased feedback. In International Symposium on Intelligent Data Analysis, pages 427–437. Springer, 2013.
 [51] Christian Wirth, Johannes Fürnkranz, and Gerhard Neumann. Modelfree preferencebased reinforcement learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [52] Huasen Wu and Xin Liu. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.
 [53] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The karmed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
 [54] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 271–278. ACM, 2007.
 [55] Masrour Zoghi, Zohar S Karnin, Shimon Whiteson, and Maarten De Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307–315, 2015.
 [56] Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten De Rijke. Relative upper confidence bound for the karmed dueling bandit problem. In International Conference on Machine Learning (ICML), 2014.
A1 Bayesian state/action credit assignment: two additional approaches
Credit assignment [48] is the problem of inferring which states or state/action pairs are responsible for observed user preferences. This paper derives theoretical guarantees for a Bayesian logistic regression credit assignment model. In this Appendix, we detail two additional credit assignment models—employing Bayesian linear regression and Gaussian process regression, respectively—for inferring a posterior over state/action utilities using trajectory preferences. (Note that these methods could similarly model utilities over states, rather than state/action pairs.)
In what follows, $\stackrel{~}{s}$ denotes a state/action pair, with $D=SA$ representing the number of possible values of $\stackrel{~}{s}$. For each trajectory $\tau =\{{\stackrel{~}{s}}_{1},{\stackrel{~}{s}}_{2},{\stackrel{~}{s}}_{3},\mathrm{\dots},{\stackrel{~}{s}}_{h}\}$, we observe the user’s preference, yielding a dataset $\{{\tau}_{i}i\in 1,\mathrm{\dots},N\}$ of $N$ labeled trajectories. Define $X\in {\mathbb{R}}^{N\times D}$ such that ${x}_{ij}:={[X]}_{ij}$ is the number of times that trajectory $i$ visits state/action ${\stackrel{~}{s}}_{j}$. Finally, we denote the label vector as $\bm{y}\in {\{0,1\}}^{N}$, where the $i$^{th} element ${y}_{i}$ is the preference label corresponding to ${\tau}_{i}$; for instance, if ${\tau}_{1}$ is preferred to ${\tau}_{2}$, then we would have ${y}_{1}=1$ and ${y}_{2}=0$.
As before, $\overline{r}(\stackrel{~}{s})$ represents the true state/action utilities, such that $\overline{r}(\tau )={\sum}_{i=1}^{h}\overline{r}({\stackrel{~}{s}}_{i})$, with $\overline{r}(\tau )$ being trajectory $\tau $’s total (unobserved) utility. To apply regression methods to this data using preference labels, we approximate ${y}_{i}\approx \overline{r}({\tau}_{i})$ to infer $\overline{r}(\stackrel{~}{s})$. In the following, we denote $\widehat{r}(\stackrel{~}{s})$ as our model of the true utilities $\overline{r}(\stackrel{~}{s})$. Also, define $\widehat{\bm{r}}\in {\mathbb{R}}^{D}$ as a vector in which the $i$^{th} element is $\widehat{r}({\stackrel{~}{s}}_{i})$.
In the following, we discuss how to perform Bayesian inference on $\overline{r}(\stackrel{~}{s})$ using $\overline{r}({\tau}_{i})$, which is approximated in practice via preferences. Note that the approximation ${y}_{i}\approx \overline{r}({\tau}_{i})$ performs well empirically, though future work could perhaps apply Bayesian methods such as those in [14] to infer trajectories’ total utilities from the preference labels.
A1.1 Bayesian linear regression
One can estimate state/action utilities from preferences via linear regression: $\bm{y}=X\widehat{\bm{r}}+\bm{\epsilon}$, where $\bm{\epsilon}$ is a vector of residuals and the other quantities are defined above. Bayesian linear regression infers a distribution over likely values of $\widehat{\bm{r}}$. We define conjugate Gaussian prior and likelihood distributions over the state/action utilities and preference labels, respectively, to obtain a Gaussian posterior distribution over $\widehat{\bm{r}}$. The prior, likelihood, and posterior take the following form, where $\mathrm{\Lambda}\in {\mathbb{R}}^{D\times D}$ and $\sigma \in \mathbb{R}$ are prior parameters, $\mathrm{\Lambda}$ is positive definite, and we set $\mathrm{\Lambda}=\lambda \mathbb{I}$ for some $\lambda >0$:
$\text{Prior:}\widehat{\bm{r}}\sim \mathcal{N}(0,{\mathrm{\Lambda}}^{1}),\mathrm{\Lambda}=\lambda \mathbb{I};\text{Likelihood:}$ 
$p(\mathbf{y}X,\widehat{\bm{r}};{\sigma}^{2})={\displaystyle \frac{1}{{(2\pi {\sigma}^{2})}^{\frac{N}{2}}}}\mathrm{exp}\left({\displaystyle \frac{1}{2{\sigma}^{2}}}{\mathbf{y}X\widehat{\bm{r}}}^{2}\right)$ 
Posterior:  $\widehat{\bm{r}}X,\bm{y},{\sigma}^{2},\mathrm{\Lambda}\sim \mathcal{N}(\bm{\mu},\mathrm{\Sigma})\text{, where:}$  (12)  
$\bm{\mu}={({X}^{T}X+{\sigma}^{2}\lambda \mathbb{I})}^{1}{X}^{T}\bm{y}$  (13)  
$\mathrm{\Sigma}={\sigma}^{2}{({X}^{T}X+{\sigma}^{2}\lambda \mathbb{I})}^{1}$  (14) 
A1.2 Gaussian process regression
Gaussian process regression [34] extends Bayesian linear regression credit assignment to larger state and action spaces by generalizing across nearby states and actions. We will model the state/action utilities $\widehat{r}(\stackrel{~}{s})$ as a Gaussian process [34] over $\stackrel{~}{s}$, and then use the observed preferences to perform Gaussian process inference on $\widehat{\bm{r}}$.
We model $\widehat{\bm{r}}$ as a Gaussian process: $\widehat{\bm{r}}\sim \mathcal{G}\mathcal{P}({\bm{\mu}}_{\bm{r}},{K}_{r})$, where ${\bm{\mu}}_{\bm{r}}$ is the prior mean vector and ${K}_{r}$ is the prior covariance matrix, such that ${[{K}_{r}]}_{ij}$ gives the prior covariance between $\widehat{r}({\stackrel{~}{s}}_{i})$ and $\widehat{r}({\stackrel{~}{s}}_{j})$. We model $\overline{r}({\tau}_{i})$, the total utility for trajectory ${\tau}_{i}$, as a sum over the latent state/action utilities:
$$\overline{r}({\tau}_{i})=\sum _{j=1}^{D}{x}_{ij}\widehat{r}({\stackrel{~}{s}}_{j})+{\epsilon}_{i},$$  (15) 
with i.i.d. residuals ${\epsilon}_{i}\sim \mathcal{N}(0,{\sigma}_{\epsilon}^{2})$. Because any linear combination of jointly Gaussian variables is Gaussian, $\overline{r}({\tau}_{i})$ is a Gaussian process over the ${x}_{ij}$’s . Let $\bm{R}\in {\mathbb{R}}^{N}$ be the vector with $i$^{th} element equal to ${R}_{i}=\overline{r}({\tau}_{i})$. Calculating the relevant expectations and covariances (see A1.3), one can show that $\widehat{\bm{r}}$ and $\bm{R}$ are jointly normally distributed as follows:
$$\left[\begin{array}{c}\hfill \widehat{\bm{r}}\hfill \\ \hfill \bm{R}\hfill \end{array}\right]\sim \mathcal{N}(\left[\begin{array}{c}\hfill {\bm{\mu}}_{\bm{r}}\hfill \\ \hfill X{\bm{\mu}}_{\bm{r}}\hfill \end{array}\right],\left[\begin{array}{cc}\hfill {K}_{r}\hfill & \hfill {K}_{r}{X}^{T}\hfill \\ \hfill X{K}_{r}^{T}\hfill & \hfill X{K}_{r}{X}^{T}+{\sigma}_{\epsilon}^{2}\mathbb{I}\hfill \end{array}\right]).$$  (16) 
The standard approach for obtaining a conditional distribution from a joint Gaussian distribution [34] yields $\widehat{\bm{r}}\bm{R}\sim \mathcal{N}(\bm{\mu},\mathrm{\Sigma})$, where:
$$\bm{\mu}={\bm{\mu}}_{\bm{r}}+{K}_{r}{X}^{T}{[X{K}_{r}{X}^{T}+{\sigma}_{\epsilon}^{2}\mathbb{I}]}^{1}(\bm{R}X{\bm{\mu}}_{\bm{r}})$$  (17) 
$$\mathrm{\Sigma}={K}_{r}{K}_{r}{X}^{T}{[X{K}_{r}{X}^{T}+{\sigma}_{\epsilon}^{2}\mathbb{I}]}^{1}X{K}_{r}^{T}.$$  (18) 
In practice, $\bm{R}$ is substituted with $\bm{y}$ to perform the inference.
A1.3 Using Gaussian processes to infer state/action utilities from preferences
This section derives the posterior inference equations (17) and (18), used in Gaussian process credit assignment. In this derivation, we act as though we observe the trajectories’ total utilities $\bm{R}$, while remembering that in practice, $\bm{R}$ is approximated via the user’s preferences. Recall that $\widehat{\bm{r}}\in {\mathbb{R}}^{D}$ has $i$^{th} element $\widehat{r}({\stackrel{~}{s}}_{i})$, which models the utility of state/action $i$, while $\bm{R}\in {\mathbb{R}}^{N}$ has $i$^{th} element ${R}_{i}=\overline{r}({\tau}_{i})$. Let ${\bm{x}}_{\bm{i}}$ be the transpose of the $i$^{th} row of $X$.
Our goal is to infer the values of $\widehat{\bm{r}}$ given observations $\bm{R}$ of the trajectories’ total utilities. This can be accomplished via the following four steps:

1.
Model the state/action utilities $\widehat{r}(\stackrel{~}{s})$ as a Gaussian process over $\stackrel{~}{s}$.

2.
Model the trajectory utilities $\bm{R}$ as a Gaussian process that can be defined as a sum of the state/action utilities $\widehat{r}(\stackrel{~}{s})$.

3.
Using the two Gaussian processes defined in 1) and 2), obtain the covariance matrix between the values of $\{\widehat{r}(\stackrel{~}{s})\stackrel{~}{s}\in 1,\mathrm{\dots},D\}$ and $\{{R}_{i}i\in 1,\mathrm{\dots},N\}$.

4.
Write the joint Gaussian distribution between the values of $\{\widehat{r}(\stackrel{~}{s})\stackrel{~}{s}\in 1,\mathrm{\dots},D\}$ and $\{{R}_{i}i\in 1,\mathrm{\dots},N\}$, and obtain the posterior distribution of $\widehat{\bm{r}}$ at all state/actions given $\bm{R}$.
The four subsequent subsections correspond to these four steps, respectively.
A1.3.1 The state/action utility Gaussian process
We model the state/action utilities as a Gaussian process over $\stackrel{~}{s}$, with mean $\mathbb{E}[\widehat{r}(\stackrel{~}{s})]={\mu}_{r}(\stackrel{~}{s})$ and covariance kernel $\mathrm{Cov}(\widehat{r}(\stackrel{~}{s}),\widehat{r}({\stackrel{~}{s}}^{\prime}))={k}_{r}(\stackrel{~}{s},{\stackrel{~}{s}}^{\prime})$, for all state/action pairs $\stackrel{~}{s},{\stackrel{~}{s}}^{\prime}$. Thus,
$$\widehat{r}(\stackrel{~}{s})\sim \mathcal{G}\mathcal{P}({\mu}_{r}(\stackrel{~}{s}),{k}_{r}(\stackrel{~}{s},{\stackrel{~}{s}}^{\prime})).$$  (19) 
Define ${\bm{\mu}}_{\bm{r}}\in {\mathbb{R}}^{D}$ such that the $i$^{th} element is ${[{\bm{\mu}}_{\bm{r}}]}_{i}={\mu}_{r}({\stackrel{~}{s}}_{i})$, the prior mean for the utility of state/action ${\stackrel{~}{s}}_{i}$. Let ${K}_{r}\in {\mathbb{R}}^{D\times D}$ be the covariance matrix over state/action utilities, such that ${[{K}_{r}]}_{ij}={k}_{r}({\stackrel{~}{s}}_{i},{\stackrel{~}{s}}_{j})$. Therefore, $\widehat{\bm{r}}$, the vector for which ${[\widehat{\bm{r}}]}_{i}=\widehat{r}({\stackrel{~}{s}}_{i})$, is also a Gaussian process:
$$\widehat{\bm{r}}\sim \mathcal{G}\mathcal{P}({\bm{\mu}}_{\bm{r}},{K}_{r}).$$  (20) 
A1.3.2 The trajectory utility Gaussian process
By assumption, the trajectory utilities $\bm{R}\in {\mathbb{R}}^{N}$ are sums of the latent state/action utilities via the following relationship between $\bm{R}$ and $\widehat{\bm{r}}$:
$$R({\bm{x}}_{\bm{i}}):={R}_{i}=\sum _{j=1}^{D}{x}_{ij}\widehat{r}({\stackrel{~}{s}}_{j})+{\epsilon}_{i},$$  (21) 
where ${\epsilon}_{i}$ are i.i.d. Gaussian noise variables with distribution $\mathcal{N}(0,{\sigma}_{\epsilon}^{2})$.
Note that $R({\bm{x}}_{i})$ is a Gaussian process over ${\bm{x}}_{i}\in {\mathbb{R}}^{D}$ because $\{\widehat{r}(\stackrel{~}{s}),\forall \stackrel{~}{s}\}$ are jointly normally distributed by definition of a Gaussian process, and any linear combination of jointly Gaussian variables has a univariate normal distribution.
Next, we calculate the expectation and covariance of $\bm{R}$ over the observations. The expectation of its $i$^{th} element ${R}_{i}=R({\bm{x}}_{i})$ can be expressed as follows:
$\mathbb{E}[{R}_{i}]=\mathbb{E}\left[{\displaystyle \sum _{j=1}^{D}}{x}_{ij}\widehat{r}({\stackrel{~}{s}}_{j})+{\epsilon}_{i}\right]={\displaystyle \sum _{j=1}^{D}}{x}_{ij}\mathbb{E}[\widehat{r}({\stackrel{~}{s}}_{j})]={\displaystyle \sum _{j=1}^{D}}{x}_{ij}{\mu}_{r}({\stackrel{~}{s}}_{j}).$  (22) 
The expectation over $\bm{R}(X)$ can thus be written as:
$$\mathbb{E}[\bm{R}(X)]=X{\bm{\mu}}_{\bm{r}}.$$  (23) 
Next, we model the covariance matrix of $\bm{R}(X)$. The $ij$^{th} element of this matrix is the covariance of $R({\bm{x}}_{i})$ and $R({\bm{x}}_{j})$:
$\mathrm{Cov}(R({\bm{x}}_{i}),R({\bm{x}}_{j}))$  $=\mathbb{E}[R({\bm{x}}_{i})R({\bm{x}}_{j})]\mathbb{E}[R({\bm{x}}_{i})]\mathbb{E}[R({\bm{x}}_{j})]$  (24)  
$=\mathbb{E}[R({\bm{x}}_{i})R({\bm{x}}_{j})]\left({\displaystyle \sum _{k=1}^{D}}{x}_{ik}{\mu}_{r}({\stackrel{~}{s}}_{k})\right)\left({\displaystyle \sum _{m=1}^{D}}{x}_{jm}{\mu}_{r}({\stackrel{~}{s}}_{m})\right)$  (25)  
$=\mathbb{E}\left[\left({\displaystyle \sum _{k=1}^{D}}{x}_{ik}\widehat{r}({\stackrel{~}{s}}_{k})+{\epsilon}_{i}\right)\left({\displaystyle \sum _{m=1}^{D}}{x}_{jm}\widehat{r}({\stackrel{~}{s}}_{m})+{\epsilon}_{j}\right)\right]$  (26)  
$\left({\displaystyle \sum _{k=1}^{D}}{x}_{ik}{\mu}_{r}({\stackrel{~}{s}}_{k})\right)\left({\displaystyle \sum _{m=1}^{D}}{x}_{jm}{\mu}_{r}({\stackrel{~}{s}}_{m})\right)$  (27)  
$={\displaystyle \sum _{k=1}^{D}}{\displaystyle \sum _{m=1}^{D}}{x}_{ik}{x}_{jm}\mathbb{E}[\widehat{r}({\stackrel{~}{s}}_{k})\widehat{r}({\stackrel{~}{s}}_{m})]+\mathbb{E}[{\epsilon}_{i}{\epsilon}_{j}]{\displaystyle \sum _{k=1}^{D}}{\displaystyle \sum _{m=1}^{D}}{x}_{ik}{x}_{jm}{\mu}_{r}({\stackrel{~}{s}}_{k}){\mu}_{r}({\stackrel{~}{s}}_{m})$  (28)  
$={\displaystyle \sum _{k=1}^{D}}{\displaystyle \sum _{m=1}^{D}}{x}_{ik}{x}_{jm}[\mathrm{Cov}(\widehat{r}({\stackrel{~}{s}}_{k}),\widehat{r}({\stackrel{~}{s}}_{m}))+{\mu}_{r}({\stackrel{~}{s}}_{k}){\mu}_{r}({\stackrel{~}{s}}_{m})]$  (29)  
${\displaystyle \sum _{k=1}^{D}}{\displaystyle \sum _{m=1}^{D}}{x}_{ik}{x}_{jm}{\mu}_{r}({\stackrel{~}{s}}_{k}){\mu}_{r}({\stackrel{~}{s}}_{m})+{\sigma}_{\epsilon}^{2}{\mathbb{I}}_{[i=j]}$  (30)  
$={\displaystyle \sum _{k=1}^{D}}{\displaystyle \sum _{m=1}^{D}}{x}_{ik}{x}_{jm}\mathrm{Cov}(\widehat{r}({\stackrel{~}{s}}_{k}),\widehat{r}({\stackrel{~}{s}}_{m}))+{\sigma}_{\epsilon}^{2}{\mathbb{I}}_{[i=j]}$  (31)  
$={\displaystyle \sum _{k=1}^{D}}{\displaystyle \sum _{m=1}^{D}}{x}_{ik}{x}_{jm}{k}_{r}({\stackrel{~}{s}}_{k},{\stackrel{~}{s}}_{m})+{\sigma}_{\epsilon}^{2}{\mathbb{I}}_{[i=j]}={\bm{x}}_{i}^{T}{K}_{r}{\bm{x}}_{j}+{\sigma}_{\epsilon}^{2}{\mathbb{I}}_{[i=j]}.$  (32) 
We can then write the covariance matrix of $\bm{R}$ as ${K}_{R}(X)$, where:
$${[{K}_{R}(X)]}_{ij}=\mathrm{Cov}(R({\bm{x}}_{i}),R({\bm{x}}_{j}))={\bm{x}}_{i}^{T}{K}_{r}{\bm{x}}_{j}+{\sigma}_{\epsilon}^{2}{\mathbb{I}}_{[i=j]}.$$  (33) 
From here, it can be seen that ${K}_{R}(X)=X{K}_{r}{X}^{T}+{\sigma}_{\epsilon}^{2}\mathbb{I}$:
$X{K}_{r}{X}^{T}$  $=\left[\begin{array}{c}\hfill {\bm{x}}_{1}^{T}\hfill \\ \hfill {\bm{x}}_{2}^{T}\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill {\bm{x}}_{N}^{T}\hfill \end{array}\right]{K}_{r}\left[\begin{array}{cccc}\hfill {\bm{x}}_{1}\hfill & \hfill {\bm{x}}_{2}\hfill & \hfill \mathrm{\dots}\hfill & \hfill {\bm{x}}_{N}\hfill \end{array}\right]=\left[\begin{array}{c}\hfill {\bm{x}}_{1}^{T}\hfill \\ \hfill {\bm{x}}_{2}^{T}\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill {\bm{x}}_{N}^{T}\hfill \end{array}\right]\left[\begin{array}{cccc}\hfill {K}_{r}{\bm{x}}_{1}\hfill & \hfill {K}_{r}{\bm{x}}_{2}\hfill & \hfill \mathrm{\dots}\hfill & \hfill {K}_{r}{\bm{x}}_{N}\hfill \end{array}\right]$  (34)  
$=\left[\begin{array}{ccc}\hfill {\bm{x}}_{1}^{T}{K}_{r}{\bm{x}}_{1}\hfill & \hfill \mathrm{\dots}\hfill & \hfill {\bm{x}}_{1}^{T}{K}_{r}{\bm{x}}_{N}\hfill \\ \hfill \mathrm{\vdots}\hfill & \hfill \mathrm{\ddots}\hfill & \hfill \mathrm{\vdots}\hfill \\ \hfill {\bm{x}}_{N}^{T}{K}_{r}{\bm{x}}_{1}\hfill & \hfill \mathrm{\dots}\hfill & \hfill {\bm{x}}_{N}^{T}{K}_{r}{\bm{x}}_{N}\hfill \end{array}\right]={K}_{R}(X){\sigma}_{\epsilon}^{2}\mathbb{I}.$  (35) 
A1.3.3 Covariance between state/action and trajectory utilities
In this subsection, we consider the covariance between $\widehat{\bm{r}}$ and $\bm{R}$, denoted ${K}_{\widehat{r},R}$:
$${[{K}_{\widehat{r},R}]}_{ij}=\mathrm{Cov}({[\widehat{\bm{r}}]}_{i},{[\bm{R}]}_{j})=\mathrm{Cov}(\widehat{r}({\stackrel{~}{s}}_{i}),R({\bm{x}}_{j})).$$  (36) 
This covariance matrix can be expressed in terms of $X,{K}_{r}$, and ${\bm{\mu}}_{\bm{r}}$:
${[{K}_{\widehat{r},R}]}_{ij}$  $=\mathrm{Cov}(\widehat{r}({\stackrel{~}{s}}_{i}),R({\bm{x}}_{j}))=\mathrm{Cov}(\widehat{r}({\stackrel{~}{s}}_{i}),{\displaystyle \sum _{k=1}^{D}}{x}_{jk}\widehat{r}({\stackrel{~}{s}}_{k})+{\epsilon}_{j})$  (37)  
$=\mathbb{E}\left[\widehat{r}({\stackrel{~}{s}}_{i}){\displaystyle \sum _{k=1}^{D}}{x}_{jk}\widehat{r}({\stackrel{~}{s}}_{k})+{\epsilon}_{j}\widehat{r}({\stackrel{~}{s}}_{i})\right]\mathbb{E}[\widehat{r}({\stackrel{~}{s}}_{i})]\mathbb{E}\left[{\displaystyle \sum _{k=1}^{D}}{x}_{jk}\widehat{r}({\stackrel{~}{s}}_{k})+{\epsilon}_{j}\right]$  (38)  
$={\displaystyle \sum _{k=1}^{D}}{x}_{jk}\mathbb{E}[\widehat{r}({\stackrel{~}{s}}_{i})\widehat{r}({\stackrel{~}{s}}_{k})][{\mu}_{r}({\stackrel{~}{s}}_{i})][{\bm{x}}_{j}^{T}{\bm{\mu}}_{\bm{r}}]$  (39)  
$={\displaystyle \sum _{k=1}^{D}}{x}_{jk}\{\mathrm{Cov}(\widehat{r}({\stackrel{~}{s}}_{i}),\widehat{r}({\stackrel{~}{s}}_{k}))+\mathbb{E}[\widehat{r}({\stackrel{~}{s}}_{i})]\mathbb{E}[\widehat{r}({\stackrel{~}{s}}_{k})]\}{\mu}_{r}({\stackrel{~}{s}}_{i}){\bm{x}}_{j}^{T}{\bm{\mu}}_{\bm{r}}$  (40)  
$={\displaystyle \sum _{k=1}^{D}}{x}_{jk}[{k}_{r}({\stackrel{~}{s}}_{i},{\stackrel{~}{s}}_{k})+{\mu}_{r}({\stackrel{~}{s}}_{i}){\mu}_{r}({\stackrel{~}{s}}_{k})]{\mu}_{r}({\stackrel{~}{s}}_{i}){\bm{x}}_{j}^{T}{\bm{\mu}}_{\bm{r}}$  (41)  
$={\displaystyle \sum _{k=1}^{D}}{x}_{jk}{k}_{r}({\stackrel{~}{s}}_{i},{\stackrel{~}{s}}_{k})+{\mu}_{r}({\stackrel{~}{s}}_{i}){\bm{x}}_{j}^{T}{\bm{\mu}}_{\bm{r}}{\mu}_{r}({\stackrel{~}{s}}_{i}){\bm{x}}_{j}^{T}{\bm{\mu}}_{\bm{r}}={\displaystyle \sum _{k=1}^{D}}{x}_{jk}{k}_{r}({\stackrel{~}{s}}_{i},{\stackrel{~}{s}}_{k})={\bm{x}}_{j}^{T}{[{K}_{r}]}_{i,:},$  (42) 
where ${[{K}_{r}]}_{i,:}$ is the column vector obtained by transposing the $i$^{th} row of ${K}_{r}$. From here, one can see that ${K}_{\widehat{r},R}={K}_{r}{X}^{T}$:
${K}_{r}{X}^{T}$  $=\left[\begin{array}{c}\hfill {[{K}_{r}]}_{1,:}^{T}\hfill \\ \hfill {{}_{2,:}}^{T}\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill {{}_{D,:}}^{T}\hfill \end{array}\right]*\left[\begin{array}{cccc}\hfill {\bm{x}}_{1}\hfill & \hfill {\bm{x}}_{2}\hfill & \hfill \mathrm{\dots}\hfill & \hfill {\bm{x}}_{N}\hfill \end{array}\right]={K}_{\widehat{r},R}.$  (43) 
A1.3.4 Posterior inference over state/action utilities
The results of the previous three subsections can be combined to obtain the following joint probability density between $\widehat{\bm{r}}$ and $\bm{R}$:
$$\left[\begin{array}{c}\hfill \widehat{\bm{r}}\hfill \\ \hfill \bm{R}\hfill \end{array}\right]\sim \mathcal{N}(\left[\begin{array}{c}\hfill {\bm{\mu}}_{\bm{r}}\hfill \\ \hfill X{\bm{\mu}}_{\bm{r}}\hfill \end{array}\right],\left[\begin{array}{cc}\hfill {K}_{r}\hfill & \hfill {K}_{r}{X}^{T}\hfill \\ \hfill X{K}_{r}^{T}\hfill & \hfill X{K}_{r}{X}^{T}+{\sigma}_{\epsilon}^{2}\mathbb{I}\hfill \end{array}\right]).$$  (44) 
This relationship expresses all components of the joint Gaussian density in terms of $X,{K}_{r}$, and ${\bm{\mu}}_{\bm{r}}$, or in other words, in terms of the state/action visit counts in the observed trajectories (captured by $X$) and the Gaussian process prior on $\widehat{\bm{r}}$.
Using the standard approach for obtaining a conditional distribution from a jointly Gaussian distribution (see Appendix A.2 in [34]), we arrive at:
$$\widehat{\bm{r}}\bm{R}\sim \mathcal{N}(\bm{\mu},\mathrm{\Sigma})\text{, where:}$$  (45) 
$\bm{\mu}$  $={\bm{\mu}}_{\bm{r}}+{K}_{r}{X}^{T}{[X{K}_{r}{X}^{T}+{\sigma}_{\epsilon}^{2}\mathbb{I}]}^{1}(\bm{R}X{\bm{\mu}}_{\bm{r}})$  (46)  
$\mathrm{\Sigma}$  $={K}_{r}{K}_{r}{X}^{T}{[X{K}_{r}{X}^{T}+{\sigma}_{\epsilon}^{2}\mathbb{I}]}^{1}X{K}_{r}^{T}.$  (47) 
Thus, we have expressed the conditional posterior density of $\widehat{\bm{r}}$ in terms of $X,{K}_{r}$, ${\bm{\mu}}_{\bm{r}}$, and the observations $\bm{R}\approx \bm{y}$.
A2 Proofs
This section proves the theoretical results stated in Section 5 of the paper.
We begin by restating a result from Gourieroux and Monfort [21] establishing conditions in which logistic regression is asymptotically consistent, after defining two necessary conditions.
Definition 1 (Derivative of sigmoid).
$f:\mathbb{R}\u27f6\mathbb{R}$, where $f\mathrm{=}\frac{{e}^{\mathrm{}x}}{{\mathrm{(}\mathrm{1}\mathrm{+}{e}^{\mathrm{}x}\mathrm{)}}^{\mathrm{2}}}$. Note that $f\mathit{}\mathrm{(}x\mathrm{)}\mathrm{=}f\mathit{}\mathrm{(}\mathrm{}x\mathrm{)}$. This is the derivative of the sigmoid function, $\sigma \mathit{}\mathrm{(}x\mathrm{)}\mathrm{=}\frac{\mathrm{1}}{\mathrm{1}\mathrm{+}{e}^{\mathrm{}x}}$.
Definition 2.
Let $\overline{\mathbf{r}}\mathrm{\in}{\mathrm{R}}^{D}$ be the vector of true state/action utilities (assumed to exist) and ${\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{\in}{\mathrm{R}}^{D\mathrm{}\mathrm{1}}$ be its transformation via (4). Define $\stackrel{\mathrm{~}}{\mathbf{r}}^{\mathrm{\prime}}{}_{k}\mathrm{\in}{\mathrm{R}}^{D\mathrm{}\mathrm{1}}$ as the state/action rewards sampled from the Bayesian logistic regression model posterior in episode $k$, $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{k}\mathrm{\in}{\mathrm{R}}^{D\mathrm{}\mathrm{1}}$ as the model’s maximum a posteriori (MAP) estimate at episode $k$, and $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{M\mathit{}L\mathrm{,}k}\mathrm{\in}{\mathrm{R}}^{D\mathrm{}\mathrm{1}}$ as its maximum likelihood estimate at $k$. Lastly, ${\stackrel{\mathrm{~}}{\mathbf{r}}}_{k}\mathrm{\in}{\mathrm{R}}^{D}$, ${\widehat{\mathbf{r}}}_{k}\mathrm{\in}{\mathrm{R}}^{D}$, and ${\widehat{\mathbf{r}}}_{M\mathit{}L\mathrm{,}k}\mathrm{\in}{\mathrm{R}}^{D}$ are their respective equivalents given by (5).
Condition 1.
$$ such that $\mathrm{}{z}_{i\mathit{}j}\mathrm{}\mathrm{\le}{m}_{\mathrm{0}}$ for all $i\mathrm{\in}\mathrm{\{}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}N\mathrm{\}}\mathrm{,}j\mathrm{\in}\mathrm{\{}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}D\mathrm{}\mathrm{1}\mathrm{\}}$.
Condition 1 is always satisfied because ${z}_{ij}={\bm{x}}_{i}^{T}{\bm{v}}_{j}$ by (4), where ${\bm{v}}_{j}$ is a unit vector. Additionally, ${\bm{x}}_{i}={\bm{x}}_{1i}{\bm{x}}_{i2}$ is difference of two vectors that both count how many times each state/action pair is visited in an episode, and thus both have only positive elements that sum to the episode horizon, $h$. So, ${z}_{ij}\le {{\bm{x}}_{i}}_{2}{{\bm{v}}_{j}}_{2}={{\bm{x}}_{i}}_{2}\le {{\bm{x}}_{1i}{\bm{x}}_{i2}}_{1}={\bm{x}}_{1i}+{{\bm{x}}_{i2}}_{1}=2h$, where the inequalities are respectively the CauchySchwarz inequality, $$, and the triangle inequality. So, Condition 1 holds for ${m}_{0}=2h$.
Condition 2.
Let ${\lambda}_{\mathrm{1}}^{\mathrm{(}k\mathrm{)}}$ and ${\lambda}_{D\mathrm{}\mathrm{1}}^{\mathrm{(}k\mathrm{)}}$ be the largest and smallest eigenvalues, respectively, of ${\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}$. Then, $$ such that $$, for all k.
Intuitively, Condition 2 requires that an arbitrarilyhigh fraction of observations cannot lie in a proper subspace of their possible span. This condition is necessary, as otherwise data outside this subspace would become increasingly ignored as more data points are obtained. Condition 2 states that the inverse Hessian of the negative log reward posterior—evaluated at the true rewards ${\overline{\bm{r}}}^{\mathbf{\prime}}$—has an upperbounded ratio between its largest and smallest eigenvalues. This can be explicitly enforced by bounding this maximumtominimum eigenvalue ratio for the matrix ${\sum}_{i=1}^{k}f({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}){\bm{z}}_{i}{\bm{z}}_{i}^{T}$, and only updating the reward posterior when the eigenvalue ratio is belowthreshold. As shown in Lemma 1 below, satisfying this restriction bounds the eigenvalue ratio for ${\sum}_{i=1}^{k}f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}){\bm{z}}_{i}{\bm{z}}_{i}^{T}$, as desired.
Condition 2 may also be guaranteed in certain situations, e.g., if randomness in the dynamics is sufficient to ensure a fullrowrank observation matrix $Z$. In this case, the eigenvalue ratio will be bounded regardless of the specific policies executed. If properties of the environment are known to guarantee Condition 2, then it need not be explicitly enforced. Meanwhile, weakening the requirements of Condition 2—for instance, by strengthening the Bayesian logistic regression convergence analysis or considering other credit assignment models—would be an interesting avenue for future work.
Lemma 1 (Enforcing Condition 2).
The eigenvalue ratio $\frac{{\lambda}_{\text{\mathit{m}\mathit{a}\mathit{x}}}\mathit{}\mathrm{\left(}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}\mathrm{\right)}}{{\lambda}_{\text{\mathit{m}\mathit{i}\mathit{n}}}\mathit{}\mathrm{\left(}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}\mathrm{\right)}}$ has a constant upper bound that holds for all $k$ if and only if the eigenvalue ratio $\frac{{\lambda}_{\text{\mathit{m}\mathit{a}\mathit{x}}}\mathit{}\mathrm{\left(}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{k}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}\mathrm{\right)}}{{\lambda}_{\text{\mathit{m}\mathit{i}\mathit{n}}}\mathit{}\mathrm{\left(}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{k}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}\mathrm{\right)}}$ has a constant upper bound that holds for all $k$. Therefore, one can ensure that the former condition holds by enforcing the latter.
Proof.
We apply the result in Lemma 1.1 below. To verify that the conditions for this result hold, we must show that both $f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}})$ and $f({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k})$ have upper and lower bounds for all $i,k$, where the lower bound exceeds zero. The upper bound exists because $f(x)\in (0,1]\forall x$.
It remains to show that both $f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}})$ and $f({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k})$ are lowerbounded above zero. Because $f(x)$ monotonically decreases as $x$ increases, this is true as long as ${\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}$ and ${\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}$ are upperbounded. In the former case, $$. The quantity ${{\bm{z}}_{i}}_{2}$ is upperbounded by Condition 1. The rewards ${\overline{\bm{r}}}^{\mathbf{\prime}}$ produce the same policy when scaled by any positive quantity, and so their magnitude can be viewed as fixed. The same logic holds in the latter case.
∎
Lemma 1.1 (Bounded eigenvalue ratios).
Let ${A}_{k}\mathrm{,}{B}_{k}\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}n}$ be two matrices of the form ${A}_{k}\mathrm{=}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}{\alpha}_{i}\mathit{}{\mathbf{v}}_{i}\mathit{}{\mathbf{v}}_{i}^{T}$, ${B}_{k}\mathrm{=}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}{\beta}_{i}\mathit{}{\mathbf{v}}_{i}\mathit{}{\mathbf{v}}_{i}^{T}$, where ${\alpha}_{i}\mathrm{\in}\mathrm{[}{\alpha}_{\text{\mathit{m}\mathit{i}\mathit{n}}}\mathrm{,}{\alpha}_{\text{\mathit{m}\mathit{a}\mathit{x}}}\mathrm{]}\mathrm{,}{\beta}_{i}\mathrm{\in}\mathrm{[}{\beta}_{\text{\mathit{m}\mathit{i}\mathit{n}}}\mathrm{,}{\beta}_{\text{\mathit{m}\mathit{a}\mathit{x}}}\mathrm{]}$, ${\alpha}_{\text{\mathit{m}\mathit{i}\mathit{n}}}\mathrm{>}\mathrm{0}\mathrm{,}{\beta}_{\text{\mathit{m}\mathit{i}\mathit{n}}}\mathrm{>}\mathrm{0}$, and ${\mathbf{v}}_{i}\mathrm{\in}{\mathrm{R}}^{n}$ for $i\mathrm{\in}\mathrm{\{}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}k\mathrm{\}}$. Let ${\lambda}_{\text{\mathit{m}\mathit{a}\mathit{x}}}\mathit{}\mathrm{(}M\mathrm{)}$ and ${\lambda}_{\text{\mathit{m}\mathit{i}\mathit{n}}}\mathit{}\mathrm{(}M\mathrm{)}$ respectively be the largest eigenvalues of a matrix $M$. Then, $\frac{{\lambda}_{\text{\mathit{m}\mathit{a}\mathit{x}}}\mathit{}\mathrm{(}{A}_{k}\mathrm{)}}{{\lambda}_{\text{\mathit{m}\mathit{i}\mathit{n}}}\mathit{}\mathrm{(}{A}_{k}\mathrm{)}}$ is upperbounded for all $T$ if and only if $\frac{{\lambda}_{\text{\mathit{m}\mathit{a}\mathit{x}}}\mathit{}\mathrm{(}{B}_{k}\mathrm{)}}{{\lambda}_{\text{\mathit{m}\mathit{i}\mathit{n}}}\mathit{}\mathrm{(}{B}_{k}\mathrm{)}}$ is upperbounded for all $T$.
Proof.
Without loss of generality, we assume that $$ for some ${m}_{1}$ and for all $k$, and will show that $\frac{{\lambda}_{\text{max}}({B}_{k})}{{\lambda}_{\text{min}}({B}_{k})}$ has an upper bound for all $k$. Note that $a\bm{u}{\bm{u}}^{T}\u2ab00$ (i.e., is positive semidefinite) for any $a>0$ and vector $\bm{u}$, and that sums of positive semidefinite matrices are also positive semidefinite. In addition, we will use the following facts about arbitrary positive semidefinite matrices $A,B\u2ab00$ (which can be shown from the CourantFischerWeyl minmax principle):
${\lambda}_{\text{max}}(A)$  $\le {\lambda}_{\text{max}}(A+B)$  (48)  
${\lambda}_{\text{min}}(A)$  $\le {\lambda}_{\text{min}}(A)+{\lambda}_{\text{min}}(B)\le {\lambda}_{\text{min}}(A+B)$  (49) 
The desired result is an outcome of the following four relations:
${\lambda}_{\text{max}}({A}_{k})$  $={\lambda}_{\text{max}}\left({\displaystyle \sum _{i=1}^{k}}{\alpha}_{i}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)={\lambda}_{\text{max}}\left({\displaystyle \sum _{i=1}^{k}}({\alpha}_{i}{\alpha}_{\text{min}}){\bm{v}}_{i}{\bm{v}}_{i}^{T}+{\displaystyle \sum _{i=1}^{k}}{\alpha}_{\text{min}}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)$  (50)  
$\ge {\lambda}_{\text{max}}\left({\displaystyle \sum _{i=1}^{k}}{\alpha}_{\text{min}}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)={\alpha}_{\text{min}}{\lambda}_{\text{max}}\left({\displaystyle \sum _{i=1}^{k}}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)\text{, by (}\text{48}\text{)}$  (51)  
${\lambda}_{\text{min}}({A}_{k})$  $={\lambda}_{\text{min}}\left({\displaystyle \sum _{i=1}^{k}}{\alpha}_{i}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)\le {\lambda}_{\text{min}}\left({\displaystyle \sum _{i=1}^{k}}{\alpha}_{i}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)+{\lambda}_{\text{min}}\left({\displaystyle \sum _{i=1}^{k}}({\alpha}_{\text{max}}{\alpha}_{i}){\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)$  (52)  
$\le {\lambda}_{\text{min}}\left({\displaystyle \sum _{i=1}^{k}}{\alpha}_{\text{max}}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)={\alpha}_{\text{max}}{\lambda}_{\text{min}}\left({\displaystyle \sum _{i=1}^{k}}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)\text{, by (}\text{49}\text{)}$  (53)  
${\lambda}_{\text{max}}({B}_{k})$  $={\lambda}_{\text{max}}\left({\displaystyle \sum _{i=1}^{k}}{\beta}_{i}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)\le {\lambda}_{\text{max}}\left({\displaystyle \sum _{i=1}^{k}}{\beta}_{i}{\bm{v}}_{i}{\bm{v}}_{i}^{T}+{\displaystyle \sum _{i=1}^{k}}({\beta}_{\text{max}}{\beta}_{i}){\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)\text{, by (}\text{48}\text{)}$  (54)  
$={\lambda}_{\text{max}}\left({\displaystyle \sum _{i=1}^{k}}{\beta}_{\text{max}}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)={\beta}_{\text{max}}{\lambda}_{\text{max}}\left({\displaystyle \sum _{i=1}^{k}}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)$  (55)  
${\lambda}_{\text{min}}({B}_{k})$  $={\lambda}_{\text{min}}\left({\displaystyle \sum _{i=1}^{k}}{\beta}_{i}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)={\lambda}_{\text{min}}\left({\displaystyle \sum _{i=1}^{k}}({\beta}_{i}{\beta}_{\text{min}}){\bm{v}}_{i}{\bm{v}}_{i}^{T}+{\displaystyle \sum _{i=1}^{k}}{\beta}_{\text{min}}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)$  (56)  
$\ge {\lambda}_{\text{min}}\left({\displaystyle \sum _{i=1}^{k}}{\beta}_{\text{min}}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)={\beta}_{\text{min}}{\lambda}_{\text{min}}\left({\displaystyle \sum _{i=1}^{k}}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)\text{, by (}\text{49}\text{)}$  (57) 
Now, we can upper bound the eigenvalue ratio for ${B}_{k}$:
$\frac{{\lambda}_{\text{max}}({B}_{k})}{{\lambda}_{\text{min}}({B}_{k})}$  $\le {\displaystyle \frac{{\beta}_{\text{max}}{\lambda}_{\text{max}}\left({\sum}_{i=1}^{k}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)}{{\beta}_{\text{min}}{\lambda}_{\text{min}}\left({\sum}_{i=1}^{k}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)}}\text{, by (}\text{55}\text{) and (}\text{57}\text{)}$  (58)  
$\le {\displaystyle \frac{{\beta}_{\text{max}}}{{\beta}_{\text{min}}}}{\displaystyle \frac{{\lambda}_{\text{max}}({A}_{k})}{{\alpha}_{\text{min}}}}{\left[{\displaystyle \frac{{\lambda}_{\text{min}}({A}_{k})}{{\alpha}_{\text{max}}}}\right]}^{1}\text{, by (}\text{51}\text{) and (}\text{53}\text{)}$  (59)  
$={\displaystyle \frac{{\beta}_{\text{max}}{\alpha}_{\text{max}}}{{\beta}_{\text{min}}{\alpha}_{\text{min}}}}{\displaystyle \frac{{\lambda}_{\text{max}}({A}_{k})}{{\lambda}_{\text{min}}({A}_{k})}}\le {\displaystyle \frac{{\beta}_{\text{max}}{\alpha}_{\text{max}}}{{\beta}_{\text{min}}{\alpha}_{\text{min}}}}*{m}_{1}.$  (60) 
∎
We will apply the following result from Gourieroux and Monfort [21] concerning asymptotic consistency of the maximum likelihood estimator:
Proposition 1 (Asymptotic consistency of logistic regression [21]).
If Conditions 1 and 2 are satisfied, then the maximum likelihood estimator $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{M\mathit{}L\mathrm{,}k}$ of ${\overline{\mathbf{r}}}^{\mathrm{\prime}}$ exists almost surely as $k\mathrm{\u27f6}\mathrm{\infty}$, and $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{M\mathit{}L\mathrm{,}k}$ converges almost surely to the true value ${\overline{\mathbf{r}}}^{\mathrm{\prime}}$ if and only if $\underset{k\mathrm{\u27f6}\mathrm{\infty}}{\text{\mathit{l}\mathit{i}\mathit{m}}}\mathit{}{\lambda}_{D\mathrm{}\mathrm{1}}^{\mathrm{(}k\mathrm{)}}\mathrm{=}\mathrm{\infty}$.
Proof.
This result is a restatement of Proposition 3 in Gourieroux and Monfort [21]. ∎
Remark: The proof of Proposition 1 in [21] can be adapted such that the same result holds when the maximum likelihood estimator $\widehat{\bm{r}}^{\mathbf{\prime}}{}_{ML,k}$ is replaced with the MAP estimator $\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}$; thus, it applies to our setting. We show that Proposition 1’s final condition for convergence of the MAP estimator holds:
Lemma 2.
Under known transition dynamics, all eigenvalues of the matrix ${\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}$ approach infinity as $k\mathrm{\u27f6}\mathrm{\infty}$.
Proof.
Let ${\alpha}_{i}=f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}})$ and ${\widehat{\alpha}}_{i}^{(k)}=f({y}_{i}{\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k})$. The values ${\alpha}_{i}$ are both upperbounded and nondecaying: $f(x)\in (0,1)$ for all $x$, and ${\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}$ is bounded in magnitude since we assume the true rewards ${\overline{\bm{r}}}^{\mathbf{\prime}}$ are bounded. We can write:
$$\sum _{i=1}^{N}f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}){\bm{z}}_{i}{\bm{z}}_{i}^{T}=\sum _{i=1}^{N}{\alpha}_{i}{\bm{z}}_{i}{\bm{z}}_{i}^{T}=\left[\begin{array}{cccc}\hfill \sqrt{{\alpha}_{1}}{\bm{z}}_{1}\hfill & \hfill \sqrt{{\alpha}_{2}}{\bm{z}}_{2}\hfill & \hfill \mathrm{\dots}\hfill & \hfill \sqrt{{\alpha}_{N}}{\bm{z}}_{N}\hfill \end{array}\right]\left[\begin{array}{c}\hfill \sqrt{{\alpha}_{1}}{\bm{z}}_{1}^{T}\hfill \\ \hfill \sqrt{{\alpha}_{2}}{\bm{z}}_{2}^{T}\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill \sqrt{{\alpha}_{N}}{\bm{z}}_{N}^{T}\hfill \end{array}\right].$$  (61) 
Define $M\in {\mathbb{R}}^{N\times (D1)}$, ${M}_{1}\in {\mathbb{R}}^{m\times (D1)}$, and ${M}_{2}\in {\mathbb{R}}^{(Nm)\times (D1)}$ as:
$$M=\left[\begin{array}{c}\hfill \sqrt{{\alpha}_{1}}{\bm{z}}_{1}^{T}\hfill \\ \hfill \sqrt{{\alpha}_{2}}{\bm{z}}_{2}^{T}\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill \sqrt{{\alpha}_{N}}{\bm{z}}_{N}^{T}\hfill \end{array}\right],{M}_{1}=\left[\begin{array}{c}\hfill \sqrt{{\alpha}_{1}}{\bm{z}}_{1}^{T}\hfill \\ \hfill \sqrt{{\alpha}_{2}}{\bm{z}}_{2}^{T}\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill \sqrt{{\alpha}_{m}}{\bm{z}}_{m}^{T}\hfill \end{array}\right],{M}_{2}=\left[\begin{array}{c}\hfill \sqrt{{\alpha}_{m+1}}{\bm{z}}_{m+1}^{T}\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill \sqrt{{\alpha}_{N}}{\bm{z}}_{N}^{T}\hfill \end{array}\right].$$  (62) 
Then,
$M=\left[\begin{array}{c}\hfill {M}_{1}\hfill \\ \hfill {M}_{2}\hfill \end{array}\right],\text{and}{M}^{T}M={\displaystyle \sum _{i=1}^{N}}{\alpha}_{i}{\bm{z}}_{i}{\bm{z}}_{i}^{T}={M}_{1}^{T}{M}_{1}+{M}_{2}^{T}{M}_{2}.$  (63) 
Also, define $\widehat{M}\in {\mathbb{R}}^{N\times (D1)}$, ${\widehat{M}}_{1}\in {\mathbb{R}}^{m\times (D1)}$, and ${\widehat{M}}_{2}\in {\mathbb{R}}^{(Nm)\times (D1)}$ analogously, but replacing ${\alpha}_{i}$ with ${\widehat{\alpha}}_{i}^{(N)}$. Note that ${M}^{T}M$ and ${\widehat{M}}^{T}\widehat{M}$ have the same rank, since $M$ and $\widehat{M}$ have the same rowrank; similarly ${M}^{T}{M}_{i}$ and ${\widehat{M}}_{i}^{T}{\widehat{M}}_{i}$ have the same rank for $i\in \{1,2\}$.
Assume that there exists $m\in \mathbb{N}$ such that ${\widehat{M}}_{2}^{T}{\widehat{M}}_{2}$ has rank less than $(D1)$; let $r\ge 1$ be the number of zero eigenvalues of ${\widehat{M}}_{2}^{T}{\widehat{M}}_{2}$. We will show that as $N$ increases, the probability of drawing a sample ${\bm{z}}_{i}$ that is linearly independent of the rows already in ${\widehat{M}}_{2}$ (that is, a sample that increases the rank of ${M}_{2}^{T}{M}_{2}$) does not decay to zero. This means that there will be infinitelymany nonoverlapping sets of indices $\{j,\mathrm{\dots},k\}$ such that ${\sum}_{i=j}^{k}{\alpha}_{i}{\bm{z}}_{i}{\bm{z}}_{i}^{T}$ has rank $(D1)$. From here, we can show that $(D1)$ of the eigenvalues of ${M}^{T}M$ approach infinity (recall that the ${\alpha}_{i}$ values are lowerbounded).
Under the Laplace approximation, the posterior covariance after $N$ episodes takes the form ${\mathrm{\Sigma}}_{N}={\left({\sum}_{i=1}^{N}f({y}_{i}{\bm{z}}_{i}^{T}{\widehat{\bm{r}}}_{N}){\bm{z}}_{i}{\bm{z}}_{i}^{T}+{V}_{0}^{1}\right)}^{1}={\left({\widehat{M}}^{T}\widehat{M}+{V}_{0}^{1}\right)}^{1}$. Under our assumption that ${\widehat{M}}_{2}^{T}{\widehat{M}}_{2}$ has $r\ge 1$ eigenvalues equal to zero, the rowrank of ${\widehat{M}}_{2}$ is $(D1r)$.
We show that ${\mathrm{\Sigma}}_{N}$ has $r$ eigenvalues that are bounded away from zero. We write the eigenvalues of any square matrix ${M}_{0}\in {\mathbb{R}}^{n\times n}$ as ${\lambda}_{1}({M}_{0})\ge {\lambda}_{2}({M}_{0})\ge \mathrm{\dots}\ge {\lambda}_{n}({M}_{0})$. Then, we apply the following fact (which can be shown from the CourantFischerWeyl minmax principle): for positive semidefinite matrices $A,B\in {\mathbb{R}}^{n\times n}$,
${\lambda}_{n}(A)+{\lambda}_{k}(B)\le {\lambda}_{k}(A+B)\le {\lambda}_{1}(A)+{\lambda}_{k}(B).$  (64) 
Clearly, ${\sum}_{i=1}^{N}{\widehat{\alpha}}_{i}^{(N)}{\bm{z}}_{i}{\bm{z}}_{i}^{T}$ and ${V}_{0}^{1}$ are both positive semidefinite (${V}_{0}$ is positive definite by its definition), and so:
$0\le {\lambda}_{k}\left({\displaystyle \sum _{i=1}^{N}}{\widehat{\alpha}}_{i}^{(N)}{\bm{z}}_{i}{\bm{z}}_{i}^{T}+{V}_{0}^{1}\right)\le {\lambda}_{1}\left({V}_{0}^{1}\right)+{\lambda}_{k}\left({\displaystyle \sum _{i=1}^{N}}{\widehat{\alpha}}_{i}^{(N)}{\bm{z}}_{i}{\bm{z}}_{i}^{T}\right),$  (65) 
where the first inequality holds because the sum of two positive semidefinite matrices is also positive semidefinite, and the second inequality is an application of the inequality in (64). The first term in (65) is a constant, while the second is zero for $k>D1r$. Thus, ${\mathrm{\Sigma}}_{N}^{1}={\sum}_{i=1}^{N}{\widehat{\alpha}}_{i}^{(N)}{\bm{z}}_{i}{\bm{z}}_{i}^{T}+{V}_{0}^{1}$ has $r$ eigenvalues that are upperbounded, where the upper bound depends only on ${V}_{0}$. Therefore, $\mathrm{\Sigma}$ has $r$ eigenvalues that are lowerbounded by ${\left[{\lambda}_{1}\left({V}_{0}^{1}\right)\right]}^{1}={\left[\frac{1}{{\lambda}_{D1}({V}_{0})}\right]}^{1}={\lambda}_{D1}({V}_{0})>0$; the other $D1r$ eigenvalues of $\mathrm{\Sigma}$ all decay to zero.
To complete the proof, we argue that the following statements hold; the first two of these statements are proven in the two subsequent sublemmas; the third statement completes the proof.

1.
Note that ${M}^{T}M$ and ${\widehat{M}}^{T}\widehat{M}$ have the same rank, due to $M$ and $\widehat{M}$ having the same rowrank. Assume that for some $m$, ${\widehat{M}}_{2}^{T}{\widehat{M}}_{2}$ has $r\ge 1$ zero eigenvalues. The reward vector ${\stackrel{\mathbf{~}}{\bm{r}}}^{\prime}$ sampled from the logistic regression posterior can be expressed in terms of the eigenbasis of ${\widehat{M}}_{2}^{T}{\widehat{M}}_{2}$, $\{{\bm{\nu}}_{i}i=1,\mathrm{\dots},D1\}$: ${\stackrel{\mathbf{~}}{\bm{r}}}^{\prime}={\sum}_{i=1}^{D1}{\beta}_{i}{\bm{\nu}}_{i}$ for some coefficients ${\beta}_{i}$. Then, as $N\u27f6\mathrm{\infty}$, for $j$ such that ${\bm{v}}_{j}$ is in the nullspace of ${\widehat{M}}_{2}^{T}{\widehat{M}}_{2}$, the probability of sampling ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}$ such that $\frac{{\beta}_{j}}{{\sum}_{i=1}^{D1}{\beta}_{i}}\ge b$ for any $b\in (0,1)$ does not decay to zero.

2.
There exists $b\in (0,1)$ such that if $\frac{{\beta}_{j}}{{\sum}_{i=1}^{D1}{\beta}_{i}}\ge b$, then with probability above zero, the policy will sample a trajectory such that the next ${\bm{z}}_{i}$ sample has a nonzero component in ${\bm{\nu}}_{j}$.

3.
If a trajectory as described in 2) is sampled, a zeroeigenvalue of ${M}_{2}^{T}{M}_{2}$ increases by a lowerbounded amount. This comes from the fact that there are only finitelymany possible ${\bm{z}}_{i}$ vectors, so if ${\bm{z}}_{i}$ has nonzero inner product with an eigenvector, this inner product cannot be arbitrarily close to zero. This contradicts that $\text{Rank}({M}_{2}^{T}{M}_{2})$ remains below $D1$ as $N\u27f6\mathrm{\infty}$.
∎
Lemma 2.1 (Proof of Statement 1 in Lemma 2).
Proof.
In episode $N$, the sampled reward vector ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}$ is drawn from the logistic regression posterior, $\mathcal{N}({\widehat{\bm{r}}}^{\mathbf{\prime}},{\mathrm{\Sigma}}_{N})$, with covariance matrix ${\mathrm{\Sigma}}_{N}={\left({\sum}_{i=1}^{N}{\widehat{\alpha}}_{i}^{(N)}{\bm{z}}_{i}{\bm{z}}_{i}^{T}+{V}_{0}^{1}\right)}^{1}$. Consider a sample ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}$ with a multivariate normal distribution in ${\mathbb{R}}^{D1}$:
$$\mathcal{N}({\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N},{\mathrm{\Sigma}}_{N})\propto \mathrm{exp}\left(\frac{1}{2}{({\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N})}^{T}{\mathrm{\Sigma}}_{N}^{1}(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{N}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N})\right).$$  (66) 
The intuition for this proof is as follows. There is a decreasing probability of sampling a vector ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}$ such that $({\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N})$ has components in eigenvectors of ${\mathrm{\Sigma}}_{N}$ whose eigenvalues decay toward zero. Rather, the probability density concentrates toward the span of eigenvectors of ${\mathrm{\Sigma}}_{N}$ whose eigenvalues are bounded away from zero.
The probability density of ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}$ can be viewed as a function of ${({\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N})}^{T}{\mathrm{\Sigma}}_{N}^{1}({\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N})$. This fact can be used to define ${R}_{\alpha ,N}\subset {\mathbb{R}}^{D1}$:
$${R}_{\alpha ,N}=\{{\stackrel{\mathbf{~}}{\bm{r}}}^{\prime}\in {\mathbb{R}}^{D1}{({\stackrel{\mathbf{~}}{\bm{r}}}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N})}^{T}{\mathrm{\Sigma}}^{1}({\stackrel{\mathbf{~}}{\bm{r}}}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N})\le \alpha \}.$$  (67) 
For $\epsilon \in (0,1)$, define ${\alpha}_{0}(\epsilon )$ as the maximum possible value of $\alpha $ such that $P({\stackrel{\mathbf{~}}{\bm{r}}}^{\prime}\in {R}_{\alpha})\ge 1\epsilon $:
$${\alpha}_{0}(\epsilon )=\underset{\text{s.t.}P({\stackrel{\mathbf{~}}{\bm{r}}}^{\prime}\in {R}_{\alpha})\ge 1\epsilon}{\text{max}}\alpha .$$  (68) 
Thus, ${R}_{{\alpha}_{0}(\epsilon )}$ contains at least $1\epsilon $ of the probability density of ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}$.
Recall that ${\mathrm{\Sigma}}_{N}^{1}$ has $r$ eigenvalues that are upperbounded, while the other $D1r$ eigenvalues of ${\mathrm{\Sigma}}_{N}^{1}$ all approach infinity as $N\u27f6\mathrm{\infty}$. Let $({\lambda}_{i},{\bm{\mu}}_{i})$ represent the eigenvalues and eigenvectors of ${\mathrm{\Sigma}}_{N}^{1}$, respectively. Then, ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}$ can be expressed in terms of the eigenbasis: ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}={\sum}_{i=1}^{D1}{\eta}_{i}{\bm{\mu}}_{i}$ for some coefficients ${\eta}_{i}$, and:
$${({\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N})}^{T}{\mathrm{\Sigma}}^{1}({\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N})=\sum _{i=1}^{D1}{\eta}_{i}^{2}{\lambda}_{i}.$$  (69) 
Then, ${R}_{{\alpha}_{0}(\epsilon )}$ can be written as,
$${R}_{{\alpha}_{0}(\epsilon )}=\left\{{\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\in {\mathbb{R}}^{L1}\right\sum _{i=1}^{D1}{\eta}_{i}^{2}{\lambda}_{i}\le {\alpha}_{0}(\epsilon )\}.$$  (70) 
We can divide ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}$ for ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\in {R}_{{\alpha}_{0}(\epsilon )}$ into two terms:
$${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}=\sum _{i=1}^{D1r}{\eta}_{i}{\bm{\mu}}_{i}+\sum _{i=Dr}^{D1}{\eta}_{i}{\bm{\mu}}_{i},$$  (71) 
where ${\lambda}_{i}\u27f6\mathrm{\infty}$ for $i\le D1r$. The first term contains eigenvectors ${\bm{\mu}}_{i}$ corresponding to eigenvalues of ${\mathrm{\Sigma}}_{N}^{1}$ that grow to infinity, while the second term contains eigenvectors corresponding to upperbounded eigenvalues of ${\mathrm{\Sigma}}_{N}^{1}$. For ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}$ belonging to ${R}_{{\alpha}_{0}(\epsilon )}$, as ${\lambda}_{i}\u27f6\mathrm{\infty}$, the ${\eta}_{i}$ coefficients in the first term will decay; otherwise, the sum ${\sum}_{i=1}^{D1}{\eta}_{i}^{2}{\lambda}_{i}$ grows unboundedly, while ${\sum}_{i=1}^{D1}{\eta}_{i}^{2}{\lambda}_{i}$ remains comparativelysmaller for increasinglymany vectors that place weight only on ${\bm{\mu}}_{i}$ for $i>D1r$. In other words, as $N\u27f6\mathrm{\infty}$, the probability density of ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}$ increasingly concentrates toward vectors that can be expressed as linear combinations of eigenvectors of ${\mathrm{\Sigma}}_{N}^{1}$ corresponding to upperbounded ${\lambda}_{i}$. Thus, as long as ${\lambda}_{j},j>D1r$, are upperbounded, the probability of sampling ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}$ such that $\frac{{\eta}_{j}}{{\sum}_{i=1}^{D1}{\eta}_{i}}\ge b$ for any $j>D1r$ and any $b\in (0,1)$ does not decay to zero.
Next, we argue that the eigenvectors of ${\mathrm{\Sigma}}_{N}^{1}$ approach a set of vectors spanning the null space of ${\widehat{M}}_{2}^{T}{\widehat{M}}_{2}$ as $N\u27f6\mathrm{\infty}$. For $j\le D1r$, $({\lambda}_{j},{\bm{\mu}}_{j})$ is an eigenpair of ${\mathrm{\Sigma}}_{N}^{1}$ for which ${\lambda}_{j}\u27f6\mathrm{\infty}$ as $N\u27f6\mathrm{\infty}$. The eigenpair relationship gives:
$\left({\displaystyle \sum _{i=1}^{N}}{\widehat{\alpha}}_{i}^{(N)}{\bm{z}}_{i}{\bm{z}}_{i}^{T}+{V}_{0}^{1}\right){\bm{\mu}}_{j}={\lambda}_{j}{\bm{\mu}}_{j}.$  (72) 
Dividing by ${\lambda}_{j}$,
$\left({\displaystyle \frac{1}{{\lambda}_{j}}}{\displaystyle \sum _{i=1}^{N}}{\widehat{\alpha}}_{i}^{(N)}{\bm{z}}_{i}{\bm{z}}_{i}^{T}+{\displaystyle \frac{1}{{\lambda}_{j}}}{V}_{0}^{1}\right){\bm{\mu}}_{j}={\bm{\mu}}_{j}.$  (73) 
The term $\frac{1}{{\lambda}_{j}}{V}_{0}^{1}$ becomes insignificant as ${\lambda}_{j}\u27f6\mathrm{\infty}$, since ${V}_{0}$ is constant, while scaling the first term by $\frac{1}{{\lambda}_{j}}$ does not affect its eigenbasis. Thus, for $j\le D1r$, ${\bm{\mu}}_{j}$ approaches an eigenvector of ${\sum}_{i=1}^{N}{\widehat{\alpha}}_{i}^{(N)}{\bm{z}}_{i}{\bm{z}}_{i}^{T}$ that has an unbounded eigenvector. Meanwhile, the span of eigenvectors ${\bm{\mu}}_{j}$ for $j>D1r$ approaches the span of eigenvectors of ${\sum}_{i=1}^{N}{\widehat{\alpha}}_{i}^{(N)}{\bm{z}}_{i}{\bm{z}}_{i}^{T}$ with upperbounded eigenvalues. Equivalently, the span of eigenvectors ${\bm{\mu}}_{j}$ for $j>D1r$ approaches the null space of ${\widehat{M}}_{2}^{T}{\widehat{M}}_{2}={\sum}_{i=m+1}^{N}{\widehat{\alpha}}_{i}^{(N)}{\bm{z}}_{i}{\bm{z}}_{i}^{T}$.
Recall from Statement 1 that the eigenvectors of ${\widehat{M}}_{2}^{T}{\widehat{M}}_{2}$ are $\{{\bm{\nu}}_{i}i=1,\mathrm{\dots},D1\}$. The $r$ smallest of these eigenvectors correspond to $\text{Null}({\widehat{M}}_{2}^{T}{\widehat{M}}_{2})$; so, for $i>D1r$, ${\lambda}_{i}$ are upperbounded and $\text{Span}({\bm{\mu}}_{i},i>Dr1)$ approaches $\text{Span}({\bm{\nu}}_{i},i>Dr1)$. The probability of sampling ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}={\sum}_{i=1}^{D1}{\beta}_{i}{\bm{\nu}}_{i}$—with ${\beta}_{i}>\epsilon $ for $$ and fixed $\epsilon >0$—decays as $N\u27f6\mathrm{\infty}$. Thus ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}$ increasingly coincides with $\text{Null}({\widehat{M}}_{2}^{T}{\widehat{M}}_{2})$. Meanwhile, for $j$ such that ${\bm{v}}_{j}\in \text{Null}({\widehat{M}}_{2}^{T}{\widehat{M}}_{2})$, the probability of sampling ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}$ such that $\frac{{\beta}_{j}}{{\sum}_{i=1}^{D1}{\beta}_{i}}\ge b$ for any $b\in (0,1)$ does not decay to zero.
We have shown that ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}$ increasingly coincides with $\text{Null}({\widehat{M}}_{2}^{T}{\widehat{M}}_{2})$. To show that DPS is guaranteed to draw samples ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}$ that coincide with this null space, we must consider the effect of $\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}$. It suffices to show that its magnitude does not increase enormously with $N$, since the probability of sampling ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}\in \text{Null}({\widehat{M}}_{2}^{T}{\widehat{M}}_{2})$ with an arbitrarilylarge magnitude does not decay to zero. Note that the magnitude of $\widehat{\bm{r}}^{\mathbf{\prime}}{}_{N}$ can be considered upperbounded, since a reward vector can always be scaled arbitrarily without affecting the resultant policy learned by value iteration. ∎
Lemma 2.2 (Proof of Statement 2 in Lemma 2).
Proof.
As in Statement 1, express the reward vector ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}$ sampled from the logistic regression posterior in terms of the eigenbasis of ${\widehat{M}}_{2}^{T}{\widehat{M}}_{2}$, $\{{\bm{\nu}}_{i}i=1,\mathrm{\dots},D1\}$: ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}={\sum}_{i=1}^{D1}{\beta}_{i}{\bm{\nu}}_{i}$ for some coefficients ${\beta}_{i}$. Then, as $N\u27f6\mathrm{\infty}$, consider ${\bm{v}}_{j}$ in the nullspace of ${\widehat{M}}_{2}^{T}{\widehat{M}}_{2}$. Assume that there exists $b\in (0,1)$ such that $\frac{{\beta}_{j}}{{\sum}_{i=1}^{D1}{\beta}_{i}}\ge b$. We show that for $b$ high enough, with probability above zero, the policy will sample a trajectory such that the next sampled observation ${\bm{w}}_{i}$ has a nonzero component in ${\bm{\nu}}_{j}$.
First, note that if ${\bm{z}}_{i}={\sum}_{k=1}^{D1}{\gamma}_{k}{\bm{\nu}}_{k}$, then ${\bm{\nu}}_{j}^{T}{\bm{z}}_{i}={\sum}_{k=1}^{D1}{\gamma}_{k}{\bm{\nu}}_{j}^{T}{\bm{\nu}}_{k}={\sum}_{k=1}^{D1}{\gamma}_{k}{\delta}_{jk}={\gamma}_{j}$. Thus, because the eigenvectors $\{{\bm{\nu}}_{i}\}$ form an orthonormal basis, for ${\bm{w}}_{i}$ to have a nonzero component ${\gamma}_{j}$ in ${\bm{\nu}}_{j}$ is equivalent to ${\bm{z}}_{i}^{T}{\bm{\nu}}_{j}\ne 0$.
By assumption—that is, our ability to choose $b$ arbitrarily close to $1$—we can assume that ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}$ is arbitrarily aligned with ${\bm{\nu}}_{j}$: for any ${\epsilon}_{0}>0$, we can assume that $$. Define $\bm{z}:=\alpha {\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}+{\bm{\nu}}_{j}$. Then, $$ and ${\bm{\nu}}_{j}=\alpha {\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}+\bm{z}$.
We aim to show that ${\bm{z}}_{i}^{T}{\bm{\nu}}_{j}\ne 0$. We can write, ${\bm{z}}_{i}^{T}{\bm{\nu}}_{j}={\bm{z}}_{i}^{T}(\alpha {\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}+\bm{z})=\alpha {\bm{z}}_{i}^{T}{\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}+{\bm{z}}_{i}^{T}\bm{z}$. The second term has upperbounded magnitude:
${\bm{z}}_{i}^{T}\bm{z}$  $\le {{\bm{z}}_{i}}_{2}{\bm{z}}_{2}\text{, by the CauchySchwarz inequality}$  (74)  
$\le {{\bm{z}}_{i}}_{2}{\epsilon}_{0}={\epsilon}_{0}\sqrt{{\bm{z}}_{i}^{T}{\bm{z}}_{i}}={\epsilon}_{0}\sqrt{{\bm{x}}_{i}^{T}{\bm{x}}_{i}}\text{, because Equation (}\text{4}\text{) preserves inner products}$  (75)  
$={\epsilon}_{0}{{\bm{x}}_{i}}_{2}\le {\epsilon}_{0}{{\bm{x}}_{i}}_{1}\text{, since}{\bm{y}}_{2}\le {\bm{y}}_{1}\text{holds for arbitrary vector}\bm{y}\in {\mathbb{R}}^{n}$  (76)  
$\le 2{\epsilon}_{0}h\text{, where}h\text{is the episode horizon},$  (77) 
where the final inequality holds because ${\bm{x}}_{i}={\bm{x}}_{i1}{\bm{x}}_{i2}$, and the vectors ${\bm{x}}_{i1}$ and ${\bm{x}}_{i2}$ contain positive integer elements that sum to $h$.
Set $\epsilon >0$. Since ${\bm{z}}_{i}^{T}\bm{z}\le 2{\epsilon}_{0}h$, we can set ${\epsilon}_{0}\le \frac{\alpha \epsilon}{2h}$ to ensure that if $$, then $$. To show that ${\bm{z}}_{i}^{T}{\bm{\nu}}_{j}=\alpha {\bm{z}}_{i}^{T}{\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}+{\bm{z}}_{i}^{T}\bm{z}\ne 0$, it thus suffices to show that ${\bm{z}}_{i}^{T}{\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}>\epsilon $ with nonzero probability, for some $\epsilon >0$ that we specify (note that selecting the value of $\epsilon $ determines the possible values of $b$).
Observe that ${\bm{z}}_{i}^{T}{\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}={\bm{x}}_{i}^{T}{\stackrel{\mathbf{~}}{\bm{r}}}_{N}={({\bm{x}}_{i1}{\bm{x}}_{i2})}^{T}{\stackrel{\mathbf{~}}{\bm{r}}}_{N}$, again because the linear transformation ${\bm{z}}_{i}={V}^{T}{\bm{x}}_{i}$ (4) preserves inner products. This is the difference in the total rewards of the two trajectories. Recall that the probability of at least a certain component of ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}$ belonging to $\text{Null}({\widehat{M}}_{2}^{T}{\widehat{M}}_{2})$ is nondecaying. For ${\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}\in \text{Null}({\widehat{M}}_{2}^{T}{\widehat{M}}_{2})$, the total reward is zero for any ${\bm{z}}_{i}$ in the rowspace of ${\widehat{M}}_{2}$. Thus, $\alpha {\stackrel{\mathbf{~}}{\bm{r}}}_{N}^{\prime}={\bm{\nu}}_{j}$ maximally encourages the induced policy to explore necessary parts of the state/action space that lead to increasing the rowrank of ${\widehat{M}}_{2}$.
We have $$. The optimal policy is determined by comparing value functions of competing policies; value functions are continuous in rewards, so ${\epsilon}_{0}$ can be set small enough that this necessary exploration occurs with positive probability. ∎
Lemma 3 (Convergence of dynamics model).
Given Lemma 2, DPS’s dynamics model converges to the true dynamics, and all eigenvalues of ${\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}$ approach infinity as $k\mathrm{\u27f6}\mathrm{\infty}$.
Proof.
DPS updates and samples from the dynamics model via the same procedure as the posterior sampling RL algorithm of Osband et al. [29], which conducts posterior sampling RL with numerical rewards; the difference between the two algorithms lies in their handling of rewards, but not of dynamics. In [29], the theoretical result uses the fact that as a particular state/action pair is observed infinitelyoften, samples of the dynamics parameters $p({s}^{\prime}s,a)$ concentrate to their true values. This fact comes from Lemma 17 in [24], which proves that the deviation between the empirical mean and true value of $p({s}^{\prime}s,a)$ decays as the number of visits to state/action pair $(s,a)$ increases. This lemma is proven via the following concentration inequality:
$\mathbb{P}\{\widehat{\bm{p}}(\cdot )\bm{p}(\cdot ){}_{1}\ge \epsilon \}\le ({2}^{S}2)\text{exp}({\displaystyle \frac{n{\epsilon}^{2}}{2}}),$  (78) 
where $S$ is the size of the state space, $n$ is the number of times that $(s,a)$ has been visited, $\bm{p}(s,a)$ is the true vector of transition probabilities ${[p({s}_{1}s,a),p({s}_{2}s,a),\mathrm{\dots},p({s}_{S}s,a)]}^{T}$, and $\widehat{\bm{p}}(s,a)$ is the empirical mean of the observations of $\bm{p}(s,a)$.
We apply the following two results to prove Lemma 3:

1.
For any state/action pair that is sampled infinitelyoften, the posterior over that state/action’s transition probabilities (i.e., the distribution over possible next states) converges to its true values. Thus, if every state/action pair is sampled infinitelyoften, then the dynamics model as a whole converges to the true dynamics.
 2.
The proof of Lemma 3 consists of proving the two statements below:

1.
If the sampled dynamics $\stackrel{\mathbf{~}}{\bm{p}}(s,a)$ are sufficientlyclose to the true dynamics (in ${l}_{1}$norm ${\stackrel{\mathbf{~}}{\bm{p}}(s,a)\bm{p}(s,a)}_{1}$) with high probability and ${M}_{2}^{T}{M}_{2}$ is not fullrank, then there exists $c\in (0,1)$ such that $P(\text{Sample}{\bm{z}}_{i}\notin \text{Range}({M}_{2}^{T}{M}_{2}))\ge c0$.

2.
Every state/action pair will be sampled infinitelyoften.
Proof of Statement A:
Let $\bm{v}\in \text{Null}({M}_{2}^{T}{M}_{2})$. Treating sampled dynamics model $\stackrel{\mathbf{~}}{\bm{p}}$ as if it were the true dynamics, Lemma 2 establishes that $P(\text{Sample}{\bm{z}}_{i}\text{s.t.}{\bm{z}}_{i}^{T}\bm{v}\ne 0\stackrel{\mathbf{~}}{\bm{p}})\ge f(\stackrel{\mathbf{~}}{\bm{p}})0$, where the lower bound $f(\stackrel{\mathbf{~}}{\bm{p}})$ depends on the sampled dynamics $\stackrel{\mathbf{~}}{\bm{p}}$, and the probability $P(\cdot )$ is with respect to sampling the reward model (which determines the policy to execute) and the dynamics transitions during the policy rollout.
We argue that there exists such a lower bound $f$ that is continuous in the dynamics $\stackrel{\mathbf{~}}{\bm{p}}$. To do so, we define $g(\stackrel{\mathbf{~}}{\bm{p}}):=P(\text{Sample}{\bm{z}}_{i}\text{s.t.}{\bm{z}}_{i}^{T}\bm{v}\ne 0\stackrel{\mathbf{~}}{\bm{p}})$, and show that $g(\stackrel{\mathbf{~}}{\bm{p}})$ has at most finitelymany discontinuities. This is sufficient to show that there must exist a continuous function $f$ such that for any dynamics $\stackrel{\mathbf{~}}{\bm{p}}$, $g(\stackrel{\mathbf{~}}{\bm{p}})\ge f(\stackrel{\mathbf{~}}{\bm{p}})>0$.
The rewards are sampled independently of the dynamics, and the probability of sampling rewards that are aligned with $\bm{v}$ to at least a specified degree does not decay: $P\left(\right\frac{{({\stackrel{\mathbf{~}}{\bm{r}}}^{\mathbf{\prime}}\mathbf{)}}^{T}}{{{\stackrel{\mathbf{~}}{\bm{r}}}^{\mathbf{\prime}}}_{2}}\cdot \frac{\bm{v}}{{\bm{v}}_{2}}>b)\mathrm{\uff0f}\u27f60$, for any $b\in (0,1)$. Statement A considers only a specific iteration of the algorithm, rather than a sequence of episodes, and so the reward distribution can be treated as fixed.
Given dynamics $\stackrel{\mathbf{~}}{\bm{p}}$, the probability of sampling any trajectory $\tau =\{{s}_{1},{a}_{1},{s}_{2},{a}_{2},\mathrm{\dots},{s}_{h+1}\}$ under policy $\pi $ is continuous in $\stackrel{\mathbf{~}}{\bm{p}}$:
$$P(\tau )={\stackrel{~}{p}}_{0}({s}_{1})\prod _{i=1}^{h}\pi ({a}_{i}{s}_{i})\stackrel{~}{p}({s}_{i+1}{s}_{i},{a}_{i}).$$  (79) 
Letting $L$ be the set of trajectories such that ${\bm{z}}_{i}^{T}\bm{v}\ne 0$, the probability of sampling a trajectory $\tau \in L$ is:
$$P(\tau \in L)=\sum _{{\tau}^{\prime}\in L}P({\tau}^{\prime}),$$  (80) 
where the probability $P({\tau}^{\prime})$ is given by (79). The function $g(\stackrel{\mathbf{~}}{\bm{p}})=P(\tau \in L\stackrel{\mathbf{~}}{\bm{p}})$ is therefore continuous in the parameters of $\stackrel{\mathbf{~}}{\bm{p}}$. The policy $\pi $, meanwhile, depends upon the dynamics $\stackrel{\mathbf{~}}{\bm{p}}$ and sampled rewards $\stackrel{\mathbf{~}}{\bm{r}}$, and is selected via value iteration to optimize the expected total reward:
$${\pi}^{*}=\underset{\mathit{\pi}}{\text{argmax}}{V}_{\pi ,1}^{\stackrel{~}{M}}(s)=\underset{\mathit{\pi}}{\text{argmax}}{\mathbb{E}}_{M,\pi}[\sum _{j=1}^{h}\stackrel{~}{r}({s}_{j},{a}_{j}){s}_{1}=s],$$  (81) 
where $\stackrel{~}{M}$ is the sampled MDP (i.e., consisting of $\stackrel{\mathbf{~}}{\bm{p}}$ and $\stackrel{\mathbf{~}}{\bm{r}}$).
Value iteration selects the deterministic policy that maximizes the expected total trajectory reward in (81). Given finite state and action spaces and a finite time horizon, there are finitelymany possible deterministic policies. If for given dynamics $\stackrel{\mathbf{~}}{\bm{p}}$, only a single policy $\pi (\stackrel{\mathbf{~}}{\bm{p}})$ maximizes the value function, then under a sufficientlysmall change in the dynamics, that same policy $\pi (\stackrel{\mathbf{~}}{\bm{p}})$ still yields the optimal value function. Under a fixed policy, the probability of generating any trajectory is continuous in the dynamics, and so $g$ is continuous at $\stackrel{\mathbf{~}}{\bm{p}}$.
Thus, the function $g(\stackrel{\mathbf{~}}{\bm{p}})$ can only be discontinuous at points in the dynamics space at which multiple policies maximize (81). We argue that $g(\stackrel{\mathbf{~}}{\bm{p}})$ has finitelymany such discontinuities. If this were not the case, there would exist a region of the dynamics space in which the policy maximizing (81) changes infinitelymany times. Since the set of dynamics models is bounded (each dynamics parameter is in $[0,1]$), this would imply that for any $\epsilon >0$, there exists a region of the dynamics parameter space with area below $\epsilon $, in which the policy optimizing (81) changes infinitelymany times. To see that this is impossible, notice that any dynamics $\stackrel{\mathbf{~}}{\bm{p}}$ and policy $\pi $ induce a distribution $\stackrel{~}{n}(s,a)$ over the expected number of times a trajectory visits each state/action pair. The expected total reward, given $\stackrel{~}{n}(s,a)$ and sampled rewards $\stackrel{\mathbf{~}}{\bm{r}}$, can be expressed as,
$${V}_{\pi ,1}^{\stackrel{~}{M}}(s)=\sum _{(s,a)\in \mathcal{S}\times \mathcal{A}}\stackrel{~}{n}(s,a)\stackrel{~}{r}(s,a).$$  (82) 
Since $\stackrel{~}{n}(s,a)$ depends on the policy and dynamics according to (79) for each possible trajectory, its slope with respect to the dynamics is upperbounded, and it cannot oscillate infinitely with respect to the dynamics.
Given the continuous lower bound $f$, for $\epsilon >0$, there exists $\delta $ such that if $$, then $$. If we set $$, then $P(\text{Sample}{\bm{z}}_{i}\text{s.t.}{\bm{z}}_{i}^{T}\bm{v}\ne 0\bm{p})\ge f(\bm{p})\frac{f(\stackrel{\mathbf{~}}{\bm{p}})}{2}0$.
Proof of Statement B:
Statement B asserts that every state/action pair will be sampled infinitelyoften. To show this, we first define known and unknown state/action pairs:
Definition 3.
($\epsilon \mathrm{,}\delta \mathrm{)}$known state/action pair: Fix $\epsilon \mathrm{,}\delta \mathrm{>}\mathrm{0}$, and let $\stackrel{\mathrm{~}}{\mathbf{p}}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}$ be a sample from the dynamics model’s posterior. An $\mathrm{(}\epsilon \mathrm{,}\delta \mathrm{)}$known state/action pair is one for which $$ with probability $\mathrm{1}\mathrm{}\delta $.
Definition 4.
Unknown state/action pair: An $\mathrm{(}\epsilon \mathrm{,}\delta \mathrm{)}$unknown state/action pair is one that does not satisfy Definition 3.
Previously, we noted that for any state/action pair that is sampled infinitelyoften, the model of that state/action’s transition probabilities (that is, the distribution over possible next states) converges to the true values. For any fixed $\epsilon $ and $\delta $, after enough steps in the MDP are taken, at least one state/action pair is guaranteed to eventually be sampled enough that it becomes $(\epsilon ,\delta )$known.
To prove Statement B, it suffices to show that if part of the dynamics are known and part are unknown, then the MDP is guaranteed to eventually leave its known portion. We will prove this by contradiction: we will assume that all sampled policies stay only in the known portion of the MDP, and then contradict this assumption by showing that the MDP is guaranteed to exit the known region.
For $\epsilon ,\delta >0$, an $(\epsilon ,\delta )$known state/action pair is one for which $$ with probability $1\delta $. For unknown state/action pairs $(s,a)$, due to the assumption that only known state/actions are visited, the distribution from which $\stackrel{\mathbf{~}}{\bm{p}}(s,a)$ is sampled does not change. Without loss of generality, let ${\stackrel{~}{s}}_{1}$ be an unknown state/action pair. This means that ${\stackrel{~}{s}}_{1}$ is not visited past some iteration $m$.
Without visiting ${\stackrel{~}{s}}_{1}$, ${M}_{2}^{T}{M}_{2}$ cannot be fullrank. As a result, there exists $\bm{v}\in \text{Null}({M}_{2}^{T}{M}_{2})$ which directly corresponds to the lack of ${\stackrel{~}{s}}_{1}$ observations. As in the proof of Statement A, there exists a function $f$ such that $P(\text{Sample}{\bm{z}}_{i}\text{s.t.}{\bm{z}}_{i}^{T}\bm{v}\ne 0\stackrel{\mathbf{~}}{\bm{p}})\ge f(\stackrel{\mathbf{~}}{\bm{p}})0$. While $\stackrel{\mathbf{~}}{\bm{p}}$ itself changes in each episode, we can show that $\mathbb{E}[f(\stackrel{\mathbf{~}}{\bm{p}})]$ does not decay to zero. To do so, denote the samples of the known and unknown portions of the dynamics parameters as ${\stackrel{\mathbf{~}}{\bm{p}}}_{\text{known}}$ and ${\stackrel{\mathbf{~}}{\bm{p}}}_{\text{unknown}}$, respectively. Also, let ${S}_{\epsilon}$ be the set of known dynamics parameters ${\stackrel{\mathbf{~}}{\bm{p}}}_{\text{known}}$ that lie inside the ${\epsilon}_{1}$ball $$ for known $(s,a)$. Then:
$\mathbb{E}[f(\stackrel{\mathbf{~}}{\bm{p}})]$  $={\displaystyle \int \text{Pr}(\stackrel{\mathbf{~}}{\bm{p}})f(\stackrel{\mathbf{~}}{\bm{p}})d(\stackrel{\mathbf{~}}{\bm{p}})}$  (83)  
$\ge \underset{{\stackrel{\mathbf{~}}{\bm{p}}}_{\text{known}}\in {S}_{\epsilon}}{\text{min}}{\displaystyle \int}\text{Pr}({\stackrel{\mathbf{~}}{\bm{p}}}_{\text{unknown}})f(\stackrel{\mathbf{~}}{\bm{p}})d({\stackrel{\mathbf{~}}{\bm{p}}}_{\text{unknown}})\mathit{\hspace{1em}\u2006}\text{(with prob.}\ge 1\delta ).$  (84) 
By minimizing the known dynamics over the area in which they lie with high probability, we obtain an integral in which all dynamics parameters are either fixed (i.e., the known dynamics) or drawn from an unchanging distribution (i.e., the unknown dynamics). Therefore, since $f(\stackrel{\mathbf{~}}{\bm{p}})$ is strictly positive everywhere, and we are integrating it over an unchanging probability distribution, the integral must evaluate to a strictly positive, unchanging quantity.
Hence, as long as ${\stackrel{~}{s}}_{1}$ is unknown and not visited, ${\mathbb{E}}_{\stackrel{\mathbf{~}}{\bm{p}}}[P(\text{Sample}{\bm{z}}_{i}\text{s.t.}{\bm{z}}_{i}^{T}\bm{v}\ne 0)]\ge c0$ with probability $\ge 1\delta $, for some $c>0$. In general, when sampling a sequence of random variables ${X}_{i}$ with bounded support and $\mathbb{E}[{X}_{i}]\ge c$ for all $i$, we observe ${X}_{i}\ge c$ infinitelyoften. Since $f(\cdot )$ has support in $[0,1]$, we are guaranteed to infinitelyoften sample dynamics $\stackrel{~}{p}$ such that $P(\text{Sample}{\bm{z}}_{i}\text{s.t.}{\bm{z}}_{i}^{T}\bm{v}\ne 0)\ge c0$, in which event state/action ${\stackrel{~}{s}}_{1}$ has nonzero probability of being sampled. This completes the proof by contradicting the hypothesis that the MDP will only visit known state/action pairs. ∎
Theorem 1 (Asymptotic consistency of DPS).
If there exists a reward function such that a logistic regression model explains the user’s preferences, then DPS with a Bayesian logistic regression credit assignment model will learn an asymptotically consistent reward model.
Proof.
To prove this result, we apply Proposition 1. Given that Lemma 2, Lemma 3, and Condition 1 hold, it remains only to argue that Condition 2 must hold. To see this, refer back to the proof of Lemma 2. As long as a particular eigenvalue of ${Z}^{T}Z$ does not increase, there is a nondecaying probability of sampling an observation that increases it by at least some minimum amount. The probabilities of sampling such vectors depend on the ratios of the covariance matrix ${\mathrm{\Sigma}}_{k}$’s eigenvalues, and so the ratio of ${\lambda}_{1}$ and ${\lambda}_{D1}$ must always have some upper bound. ∎
We turn next to characterizing the regret rate of DPS. We will apply two prior results, one from Gourieroux and Monfort [21] regarding the asymptotic distribution of the logistic regression maximum likelihood estimate, while the other (Osband et al. [29]) presents a regret bound for posterior sampling RL.
Proposition 2 (Asymptotic normality of logistic regression maximum likelihood estimator [21]).
If Conditions 1 and 2 are satisfied, and if $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{M\mathit{}L\mathrm{,}k}$ converges almost surely to the true value ${\overline{\mathbf{r}}}^{\mathrm{\prime}}$, then:
$${\left[\sum _{i=1}^{k}f({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{ML,k}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{\frac{1}{2}}(\widehat{\bm{r}}^{\mathbf{\prime}}{}_{ML,k}{\overline{\bm{r}}}^{\mathbf{\prime}})\stackrel{\mathit{D}}{\u27f6}\mathcal{N}(\mathrm{\U0001d7ce},\mathbb{I})\mathit{\text{as}}k\u27f6\mathrm{\infty},$$  (85) 
where $\stackrel{\mathit{D}}{\mathrm{\u27f6}}$ implies convergence in distribution and ${Q}^{\frac{\mathrm{1}}{\mathrm{2}}}$ is the positive definite matrix associated with positive definite matrix $Q$ such that ${\mathrm{[}{Q}^{\frac{\mathrm{1}}{\mathrm{2}}}\mathrm{]}}^{\mathrm{2}}\mathrm{=}Q$.
Proof.
See Proposition 4 in Gourieroux and Monfort [21]. ∎
Remark: Just as with Proposition 1, the proof of Proposition 2 in [21] can be adapted such that the result holds when the maximum likelihood estimator $\widehat{\bm{r}}^{\mathbf{\prime}}{}_{ML,k}$ is replaced with the MAP estimator $\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}$.
Proposition 3 (Expected regret of posterior sampling RL).
In posterior sampling RL with episode horizon $h$ and numbers of states and actions $S$ and $A$, the expected $T$step regret is bounded as:
$$\mathbb{E}[\text{\mathit{R}\mathit{e}\mathit{g}\mathit{r}\mathit{e}\mathit{t}}(T,{\pi}_{h}^{\text{\mathit{P}\mathit{S}}})]=O\left(hS\sqrt{AT\text{\mathit{l}\mathit{o}\mathit{g}}(SAT)}\right).$$  (86) 
Proof.
See Theorem 1 in Osband et al. [29]. ∎
Next, we show that under preference feedback, the regret can be decomposed into two terms: one that reflects the converging dynamics model, and one that reflects the converging reward model.
Lemma 4 (Regret decomposition).
The expected regret of DPS can be decomposed into two terms. One of these terms can be upper bounded by the same regret bound as in Osband et al. [29], stated in Proposition (3). The second term may be upperbounded by
$$h\sum _{k=1}^{\lceil T/h\rceil}\mathbb{E}[{\overline{\bm{r}}{\stackrel{\mathbf{~}}{\bm{r}}}_{k}}_{\mathrm{\infty}}]\le h\sum _{k=1}^{\lceil T/h\rceil}\mathbb{E}[{{\widehat{\bm{r}}}_{k}\overline{\bm{r}}}_{\mathrm{\infty}}]+h\sum _{k=1}^{\lceil T/h\rceil}\mathbb{E}[{{\widehat{\bm{r}}}_{\bm{k}}{\stackrel{\mathbf{~}}{\bm{r}}}_{k}}_{\mathrm{\infty}}].$$  (87) 
Proof.
Recall that the value of state $s$ at timestep $i$ and under policy $\mu $ and MDP $M$ is:
$${V}_{\mu ,i}^{M}(s):={\mathbb{E}}_{M,\mu}[\sum _{j=i}^{h}{\overline{r}}^{M}({s}_{j},{a}_{j}){s}_{i}=s],$$  (88) 
where ${\overline{r}}^{M}(s,a)$ is the average utility for taking action $a$ in state $s$ and MDP $M$. Then, the regret from episode $k$ is defined as:
$${\mathrm{\Delta}}_{k}:=\sum _{s\in \mathcal{S}}{p}_{0}(s)({V}_{{\mu}^{*},1}^{{M}^{*}}{V}_{{\mu}_{k},1}^{{M}^{*}}),$$  (89) 
where ${M}^{*}$ is the true (unknown) MDP, ${\mu}^{*}$ is the optimal policy in ${M}^{*}$, and ${\mu}_{k}$ is the policy that the algorithm follows in episode $k$. The total regret over $T$ timesteps is then:
$$\text{Regret}(T,\pi ):=\sum _{k=1}^{\lceil T/h\rceil}{\mathrm{\Delta}}_{k}.$$  (90) 
For a sampled MDP ${M}_{k}$, drawn from the posterior model of the environment, Osband et al. [29] define:
$${\stackrel{~}{\mathrm{\Delta}}}_{k}:=\sum _{s\in \mathcal{S}}{p}_{0}(s)({V}_{{\mu}_{k},1}^{{M}_{k}}{V}_{{\mu}_{k},1}^{{M}^{*}}).$$  (91) 
In Osband et al. [29], the authors prove the following regret equivalence result:
$$\mathbb{E}\left[\sum _{k=1}^{m}{\mathrm{\Delta}}_{k}\right]=\mathbb{E}\left[\sum _{k=1}^{m}{\stackrel{~}{\mathrm{\Delta}}}_{k}\right],$$  (92) 
which they use to derive the posterior sampling RL regret bound restated in Proposition 3.
The true (unknown) MDP ${M}^{*}$ consists of the true dynamics ${P}^{*}$ and average rewards $\overline{r}({s}_{j},{a}_{j})$, while ${M}_{k}$ has a sampled dynamics model and sampled rewards $\stackrel{~}{r}({s}_{j},{a}_{j})$. In addition, we introduce ${M}_{k}^{*}$, which pairs the true dynamics model ${P}^{*}$ with the sampled rewards $\stackrel{~}{r}({s}_{j},{a}_{j})$. By adding and subtracting terms based upon ${M}_{k}^{*}$, the terms of ${\stackrel{~}{\mathrm{\Delta}}}_{k}$ can be decomposed:
${V}_{{\mu}_{k},1}^{{M}_{k}}(s){V}_{{\mu}_{k},1}^{{M}^{*}}(s)$  $={\mathbb{E}}_{{M}_{k},{\mu}_{k}}[{\displaystyle \sum _{j=1}^{h}}\stackrel{~}{r}({s}_{j},{a}_{j}){s}_{1}=s]{\mathbb{E}}_{{M}^{*},{\mu}_{k}}[{\displaystyle \sum _{j=1}^{h}}\overline{r}({s}_{j},{a}_{j}){s}_{1}=s]$  (93)  
$={\mathbb{E}}_{{M}_{k},{\mu}_{k}}[{\displaystyle \sum _{j=1}^{h}}\stackrel{~}{r}({s}_{j},{a}_{j}){s}_{1}=s]{\mathbb{E}}_{{M}_{k}^{*},{\mu}_{k}}[{\displaystyle \sum _{j=1}^{h}}\stackrel{~}{r}({s}_{j},{a}_{j}){s}_{1}=s]$  (94)  
$+{\mathbb{E}}_{{M}_{k}^{*},{\mu}_{k}}[{\displaystyle \sum _{j=1}^{h}}\stackrel{~}{r}({s}_{j},{a}_{j}){s}_{1}=s]{\mathbb{E}}_{{M}^{*},{\mu}_{k}}[{\displaystyle \sum _{j=1}^{h}}\overline{r}({s}_{j},{a}_{j}){s}_{1}=s]$  (95)  
$={\mathbb{E}}_{{M}_{k},{\mu}_{k}}[{\displaystyle \sum _{j=1}^{h}}\stackrel{~}{r}({s}_{j},{a}_{j}){s}_{1}=s]{\mathbb{E}}_{{M}_{k}^{*},{\mu}_{k}}[{\displaystyle \sum _{j=1}^{h}}\stackrel{~}{r}({s}_{j},{a}_{j}){s}_{1}=s]$  (96)  
$+{\mathbb{E}}_{{P}^{*},{\mu}_{k}}[{\displaystyle \sum _{j=1}^{h}}(\stackrel{~}{r}({s}_{j},{a}_{j})\overline{r}({s}_{j},{a}_{j})){s}_{1}=s],$  (97) 
where in the final term, recall that ${P}^{*}$ is the true transition model.
The regret arising from the first two terms of this expression can be upperbounded by the same regret bound as in [29], stated in (86) above. This holds because the regret bound of [29] arises from both dynamics and reward models that converge based upon direct observations of each model parameter, i.e., direct observations of both state/action transitions and rewards. In the two terms in (96), however, the rewards are identical among the two terms, while the dynamics are modeled identically to the setup of [29].
It remains to consider the third term of the regret decomposition (97), which is the error due to credit assignment. Let ${\mathrm{\text{Reg}}}_{T}^{R}$ be the total component of the $T$step regret arising from this credit assignment error term. Our goal is to upperbound its expectation:
$$\mathbb{E}[{\mathrm{\text{Reg}}}_{T}^{R}]={\mathbb{E}}_{{p}_{0},\stackrel{~}{r}}\left\{\sum _{k=1}^{\lceil T/h\rceil}{\mathbb{E}}_{{P}^{*},{\mu}_{k}}\left[\sum _{j=1}^{h}(\stackrel{~}{r}({s}_{j},{a}_{j})\overline{r}({s}_{j},{a}_{j}))\right{s}_{1}=s]\right\}.$$  (98) 
Note that because ${M}^{*}$ and ${M}_{k}^{*}$ have the same transition dynamics and initial state distributions ${p}_{0}$, and that because the expectation over both summations in (98) is taken with respect to the same policy ${\mu}_{k}$, the two MDPs ${M}^{*}$ and ${M}_{k}^{*}$ have the same probabilities of reaching any state/action pair $(s,a)$ at each step $j$.
Recall that $\overline{\bm{r}}\in {\mathbb{R}}^{D}$ is the vector of true state/action rewards, ${\stackrel{\mathbf{~}}{\bm{r}}}_{k}\in {\mathbb{R}}^{D}$ is the vector of sampled state/action rewards (from the model posterior) in episode $k$, and ${\widehat{\bm{r}}}_{k}\in {\mathbb{R}}^{D}$ is the expected value of the reward model’s posterior distribution at episode $k$.
Given the observation that ${M}^{*}$ and ${M}_{k}^{*}$ have the same probabilities of reaching any state/action pair $(s,a)$ at each step $j$, we can write:
$$\mathbb{E}[{\mathrm{\text{Reg}}}_{T}^{R}]\le h\sum _{k=1}^{\lceil T/h\rceil}\mathbb{E}[{\overline{\bm{r}}{\stackrel{\mathbf{~}}{\bm{r}}}_{k}}_{\mathrm{\infty}}].$$  (99) 
Applying the triangle inequality yields:
$$\mathbb{E}[{\mathrm{\text{Reg}}}_{T}^{R}]\le h\sum _{k=1}^{\lceil T/h\rceil}\mathbb{E}[{\overline{\bm{r}}{\stackrel{\mathbf{~}}{\bm{r}}}_{k}+{\widehat{\bm{r}}}_{k}{\widehat{\bm{r}}}_{k}}_{\mathrm{\infty}}]\le h\sum _{k=1}^{\lceil T/h\rceil}\mathbb{E}[{{\widehat{\bm{r}}}_{k}\overline{\bm{r}}}_{\mathrm{\infty}}]+h\sum _{k=1}^{\lceil T/h\rceil}\mathbb{E}[{{\widehat{\bm{r}}}_{\bm{k}}{\stackrel{\mathbf{~}}{\bm{r}}}_{k}}_{\mathrm{\infty}}].$$  (100) 
Thus, to bound the total credit assignment error in (98), we must consider the rates at which ${\widehat{\bm{r}}}_{k}$ converges to $\overline{\bm{r}}$ and at which ${\stackrel{\mathbf{~}}{\bm{r}}}_{k}$ converges to ${\widehat{\bm{r}}}_{k}$.
∎
The final two lemmas characterize the convergence of ${\stackrel{\mathbf{~}}{\bm{r}}}_{k}$ to ${\widehat{\bm{r}}}_{k}$, and of ${\widehat{\bm{r}}}_{k}$ to $\overline{\bm{r}}$, respectively.
Lemma 5.
Sampling $\stackrel{\mathrm{~}}{\mathbf{r}}^{\mathrm{\prime}}{}_{k}$ via the Laplace approximation to the Bayesian logistic regression posterior gives $\stackrel{\mathrm{~}}{\mathbf{r}}^{\mathrm{\prime}}{}_{k}\mathrm{\sim}\mathrm{N}\mathit{}\mathrm{(}\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{k}\mathrm{,}{\mathrm{\Sigma}}_{k}\mathrm{)}$, with $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{k}$ and ${\mathrm{\Sigma}}_{k}$ defined in (9) and (10). All eigenvalues of ${\mathrm{\Sigma}}_{k}^{\mathrm{}\mathrm{1}}$ approach infinity as $k\mathrm{\u27f6}\mathrm{\infty}$. Moreover, given Condition 2, there exists ${c}_{\mathrm{0}}$ such that ${\lambda}_{D\mathrm{}\mathrm{1}}^{\mathrm{(}k\mathrm{)}}\mathrm{\ge}k\mathit{}{c}_{\mathrm{0}}$, where recall that ${\lambda}_{D\mathrm{}\mathrm{1}}^{\mathrm{(}k\mathrm{)}}$ is the smallest eigenvalue of ${\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}$. Thus, asymptotically ${\lambda}_{D\mathrm{}\mathrm{1}}^{\mathrm{(}k\mathrm{)}}\mathit{}\mathrm{(}{\mathrm{\Sigma}}_{k}^{\mathrm{}\mathrm{1}}\mathrm{)}\mathrm{\ge}k\mathit{}{c}_{\mathrm{0}}$, where ${\lambda}_{D\mathrm{}\mathrm{1}}^{\mathrm{(}k\mathrm{)}}\mathit{}\mathrm{(}{\mathrm{\Sigma}}_{k}^{\mathrm{}\mathrm{1}}\mathrm{)}$ is the smallest eigenvalue of ${\mathrm{\Sigma}}_{k}^{\mathrm{}\mathrm{1}}$.
Proof.
From Theorem 1 (see subsequent remark), $\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}\u27f6{\overline{\bm{r}}}^{\mathbf{\prime}}$ almost surely. Due to this convergence,
$${\mathrm{\Sigma}}_{k}^{1}=\sum _{i=1}^{k}\frac{\text{exp}({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k})}{{(1+\text{exp}({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}))}^{2}}{\bm{z}}_{i}{\bm{z}}_{i}^{T}\stackrel{k\u27f6\mathrm{\infty}}{\u27f6}\sum _{i=1}^{k}\frac{\text{exp}({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}})}{{(1+\text{exp}({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}))}^{2}}{\bm{z}}_{i}{\bm{z}}_{i}^{T},$$  (101) 
and all eigenvalues of ${\mathrm{\Sigma}}_{k}^{1}$ approach infinity as $k\u27f6\mathrm{\infty}$ by Lemma 3. Also, Condition 2 asymptotically holds for ${\mathrm{\Sigma}}_{k}^{1}$ as $k$ increases, so that if ${\lambda}_{1}^{(k)}({\mathrm{\Sigma}}_{k}^{1})$ and ${\lambda}_{D1}^{(k)}({\mathrm{\Sigma}}_{k}^{1})$ are the largest and smallest eigenvalues of ${\mathrm{\Sigma}}_{k}^{1}$, respectively, then $\exists {M}_{1}$ such that $\frac{{\lambda}_{1}^{(k)}({\mathrm{\Sigma}}_{k}^{1})}{{\lambda}_{D1}^{(k)}({\mathrm{\Sigma}}_{k}^{1})}\le {m}_{1}$ for large enough $k$.
To complete the proof, we argue that $\text{Tr}({\mathrm{\Sigma}}_{k}^{1})$ increases by a lowerbounded amount on average with each time step, where Tr denotes the trace, or sum of a matrix’s eigenvalues. Combining that 1) $\frac{{\lambda}_{1}^{(k)}({\mathrm{\Sigma}}_{k}^{1})}{{\lambda}_{D1}^{(k)}({\mathrm{\Sigma}}_{k}^{1})}\le {m}_{1}$ for large enough $k$, and 2) $\text{Tr}({\mathrm{\Sigma}}_{k}^{1})={\sum}_{i=1}^{D1}{\lambda}_{i}({\mathrm{\Sigma}}_{k}^{1})$ increases (on average) by a lowerbounded amount with each new iteration, one can see that there exists ${c}_{0}$ such that asymptotically ${\lambda}_{D1}^{(k)}({\mathrm{\Sigma}}_{k}^{1})\ge k{c}_{0}$.
The quantity $\text{Tr}({\mathrm{\Sigma}}_{k}^{1})$ can only increase:
$\text{Tr}\left[{\mathrm{\Sigma}}_{k}^{1}\right]$  $={\displaystyle \sum _{i=1}^{k}}{\displaystyle \frac{\text{exp}({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k})}{{(1+\text{exp}({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}))}^{2}}}\text{Tr}({\bm{z}}_{i}{\bm{z}}_{i}^{T})\text{, by linearity of trace}$  (102)  
$={\displaystyle \sum _{i=1}^{k}}{\displaystyle \frac{\text{exp}({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k})}{{(1+\text{exp}({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}))}^{2}}}{\bm{z}}_{i}^{T}{\bm{z}}_{i}\text{, by the cyclic property of trace}$  (103)  
$={\displaystyle \sum _{i=1}^{k}}{\displaystyle \frac{\text{exp}({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k})}{{(1+\text{exp}({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}))}^{2}}}{\bm{x}}_{i}^{T}{\bm{x}}_{i}\text{, by (}\text{6}\text{)}.$  (104) 
The quantity $\frac{\text{exp}({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k})}{{(1+\text{exp}({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}))}^{2}}$ is bounded away from zero because it approaches $\frac{\text{exp}({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}})}{{(1+\text{exp}({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}))}^{2}}$, which is bounded away from zero as justified previously. For any observation ${\bm{x}}_{i}$ such that ${\bm{x}}_{i}={\bm{x}}_{i1}{\bm{x}}_{i2}\ne 0$, ${\bm{x}}_{i}^{T}{\bm{x}}_{i}\ge 1$ always, since ${\bm{x}}_{i1}$ and ${\bm{x}}_{i2}$ differ by at least one. It remains to argue that the event $\{{\bm{x}}_{i1}{\bm{x}}_{i2}=0\}$ cannot occur increasinglyoften: there must be some belowone upperbound to the probability that this event takes place. This is indeed the case; if it were not, this would imply that the model converges toward always sampling a single trajectory $\bm{x}$; however, as long as a particular eigenvalue of ${Z}^{T}Z$ does not increase, there is a nondecaying probability of sampling an observation ${\bm{x}}^{\prime}$ that increases it by at least some minimum amount (Lemma 1). The probability of sampling a trajectory ${\bm{x}}^{\prime}\ne \bm{x}$ depends on the ratios of the covariance matrix’s eigenvalues, which are bounded according to Condition 2.
∎
Lemma 6.
Under the conditions for asymptotic consistency of logistic regression (Theorem 1), the vector $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{k}\mathrm{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}$ behaves asymptotically as $\mathrm{N}\mathit{}\mathrm{(}\mathrm{0}\mathrm{,}{\mathrm{\left[}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}\mathrm{\right]}}^{\mathrm{}\mathrm{1}}\mathrm{)}$. This is the same distribution as that characterizing $\stackrel{\mathrm{~}}{\mathbf{r}}^{\mathrm{\prime}}{}_{k}\mathrm{}\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{k}$ as $k\mathrm{\u27f6}\mathrm{\infty}$, as discussed in Lemma 5. Thus, similarly, $\widehat{\mathbf{r}}^{\mathrm{\prime}}{}_{k}\mathrm{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}$ asymptotically has covariance ${\mathrm{\Sigma}}_{k}$.
Proof.
By Proposition 2 (see subsequent remark),
$${\left[\sum _{i=1}^{k}f({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{\frac{1}{2}}(\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}{\overline{\bm{r}}}^{\mathbf{\prime}})\stackrel{\mathit{D}}{\u27f6}\mathcal{N}(\mathrm{\U0001d7ce},\mathbb{I})\text{as}k\u27f6\mathrm{\infty}.$$  (105) 
Multiplying both sides by ${\left[{\sum}_{i=1}^{k}f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{1/2}$ gives that:
${\left[{\displaystyle \sum _{i=1}^{k}}f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{1/2}{\left[{\displaystyle \sum _{i=1}^{k}}f({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{1/2}(\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}{\overline{\bm{r}}}^{\mathbf{\prime}})$  (106)  
$\text{behaves asymptotically as}\mathcal{N}(\mathrm{\U0001d7ce},{\left[{\displaystyle \sum _{i=1}^{k}}f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{1}).$  (107) 
Note that ${\left[{\sum}_{i=1}^{k}f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{1/2}{\left[{\sum}_{i=1}^{k}f({\bm{z}}_{i}^{T}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{1/2}\u27f6\mathbb{I}$ because $\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}\u27f6{\overline{\bm{r}}}^{\mathbf{\prime}}$, so that $\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}{\overline{\bm{r}}}^{\mathbf{\prime}}$ behaves asymptotically as $\mathcal{N}(\mathrm{\U0001d7ce},{\left[{\sum}_{i=1}^{k}f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{1})$. ∎
Theorem 2 (Asymptotic regret rate of DPS).
If there exists a reward function such that a logistic regression model explains the preferences, then DPS has an asymptotic noregret rate of $O\mathit{}\mathrm{\left(}h\mathit{}S\mathit{}\sqrt{A\mathit{}T\mathit{}\text{\mathit{l}\mathit{o}\mathit{g}}\mathit{}\mathrm{(}S\mathit{}A\mathit{}T\mathrm{)}}\mathrm{+}h\mathit{}\sqrt{\frac{S\mathit{}A}{{c}_{\mathrm{0}}}\mathit{}T\mathit{}\text{\mathit{l}\mathit{o}\mathit{g}}\mathit{}\mathrm{(}T\mathrm{)}}\mathrm{\right)}$, where ${c}_{\mathrm{0}}$ is a minimum rate at which eigenvalues of ${\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}f\mathit{}\mathrm{(}{\mathbf{z}}_{i}^{T}\mathit{}{\overline{\mathbf{r}}}^{\mathrm{\prime}}\mathrm{)}\mathit{}{\mathbf{z}}_{i}\mathit{}{\mathbf{z}}_{i}^{T}$ increase linearly with collection of samples ${\mathbf{z}}_{i}$ that impact those eigenvalues.
Proof.
From Lemma 3, the regret is upperbounded by a sum of two terms, where the first of those terms is bounded by $O\left(hS\sqrt{AT\text{log}(SAT)}\right)$ as proven in [29]. Here, we analyze the second term, given in (98), and upperbounded by the expression in (99), restated here:
$$\mathbb{E}[{\mathrm{\text{Reg}}}_{T}^{R}]\le h\sum _{k=1}^{\lceil T/h\rceil}\mathbb{E}[{\overline{\bm{r}}{\stackrel{\mathbf{~}}{\bm{r}}}_{k}}_{\mathrm{\infty}}].$$ 
We use that ${\bm{x}}_{\mathrm{\infty}}\le {\bm{x}}_{2}$ for any $\bm{x}\in {\mathbb{R}}^{n}$, and that the linear transformation in (4) preserves inner products, as established in Equation (6). In particular, the linear transformation preserves ${l}_{2}$norms of vectors, since for any vector $\bm{x}\in {\mathbb{R}}^{n}$, ${\bm{x}}_{2}=\sqrt{{\bm{x}}^{T}\bm{x}}$. Applying both of these facts gives:
$$\mathbb{E}[{\mathrm{\text{Reg}}}_{T}^{R}]\le h\sum _{k=1}^{\lceil T/h\rceil}\mathbb{E}[{\overline{\bm{r}}{\stackrel{\mathbf{~}}{\bm{r}}}_{k}}_{\mathrm{\infty}}]\le h\sum _{k=1}^{\lceil T/h\rceil}\mathbb{E}[{{\overline{\bm{r}}}^{\mathbf{\prime}}\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}}_{2}].$$  (108) 
Consider the expected ${l}_{2}$norm of a Gaussian vector $\bm{x}\in {\mathbb{R}}^{n},x\sim \mathcal{N}(\mathrm{\U0001d7ce},\mathrm{\Sigma})$. It can be upperbounded in terms of $n$ and the eigenvalues of $\mathrm{\Sigma}$:
$\mathbb{E}[{\bm{x}}_{2}]$  $\le \sqrt{\mathbb{E}[{\bm{x}}_{2}^{2}]},\text{by Jensen\u2019s inequality}$  (109)  
$=\sqrt{\mathbb{E}\left[{\displaystyle \sum _{i=1}^{n}}{x}_{i}^{2}\right]}=\sqrt{{\displaystyle \sum _{i=1}^{n}}\mathbb{E}\left[{x}_{i}^{2}\right]}=\sqrt{{\displaystyle \sum _{i=1}^{n}}\text{Var}\left[{x}_{i}\right]}$  (110)  
$=\sqrt{\text{Tr}(\mathrm{\Sigma})}=\sqrt{{\displaystyle \sum _{i=1}^{n}}{\lambda}_{i}(\mathrm{\Sigma})}\le \sqrt{n{\lambda}_{\text{max}}(\mathrm{\Sigma})}.$  (111) 
Next, consider the asymptotic probability distribution of $\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\overline{\bm{r}}}^{\mathbf{\prime}}=(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k})+(\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}{\overline{\bm{r}}}^{\mathbf{\prime}})$. Given ${\overline{\bm{r}}}^{\mathbf{\prime}}$, which is constant, the two quantities in parentheses are independent due to the nature of Laplace sampling, in which $\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}$ is drawn from a normal distribution centered at $\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}$. By Lemmas 5 and 6, both $\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}$ and $\widehat{\bm{r}}^{\mathbf{\prime}}{}_{k}{\overline{\bm{r}}}^{\mathbf{\prime}}$ behave asymptotically according to $\mathcal{N}(\mathrm{\U0001d7ce},{\mathrm{\Sigma}}_{k})$, where ${\mathrm{\Sigma}}_{k}:={\left[{\sum}_{i=1}^{k}f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{1}$. Therefore, $\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\overline{\bm{r}}}^{\mathbf{\prime}}$ behaves asymptotically according to $\mathcal{N}(\mathrm{\U0001d7ce},2{\mathrm{\Sigma}}_{k})$.
Combining (108) and (111) gives:
$$\mathbb{E}[{\mathrm{\text{Reg}}}_{T}^{R}]\le h\sum _{k=1}^{\lceil T/h\rceil}\sqrt{(D1){\lambda}_{\text{max}}(2{\mathrm{\Sigma}}_{k})}\le h\sqrt{2SA}\sum _{k=1}^{\lceil T/h\rceil}{\lambda}_{\text{max}}({\mathrm{\Sigma}}_{k}).$$  (112) 
Although ${\lambda}_{\text{max}}({\mathrm{\Sigma}}_{k})$ decays to zero over time by Lemma 2, this result is too loose: there could be many consecutive iterations in which ${\lambda}_{\text{max}}({\mathrm{\Sigma}}_{k})$ remains constant. For instance, if Condition 2 is enforced explicitly as discussed earlier, then as the model posterior converges, one would expect to find increasinglymany consecutive iterations in which the reward posterior is not updated; in these iterations, ${\lambda}_{\text{max}}({\mathrm{\Sigma}}_{k})$ is fixed.
The key intuition, formalized below, is that when a nonoptimal episode is sampled, the covariance matrix ${\mathrm{\Sigma}}_{k}$ shrinks with respect to the directions (i.e., its eigenvectors) responsible for estimating the rewards suboptimally. Meanwhile, if particular eigenvectors of ${\mathrm{\Sigma}}_{k}$ do not result in a suboptimal policy during episode $k$, then for that episode, the corresponding eigenvalues can be removed from the righthand sum in (112). Thus, eigenvalues do not appear in the regret’s upper bound except for when their corresponding eigenvectors affect an episode’s regret. In this event, however, the sampled observations cause ${\mathrm{\Sigma}}_{k}$ to shrink along the dimensions of the guilty eigenvectors.
To see this, first note that in the context of preferencebased learning, the true reward function $\overline{\bm{r}}$ is unobserved. One can define an equivalence class of reward functions that, when coupled with the true transition dynamics, induce the optimal policy with respect to the user’s preferences. In the preferencebased setting, an optimal policy is defined as any policy ${\pi}^{*}$ such that when compared to any other policy ${\pi}_{i}$:
$${\mathbb{E}}_{{\tau}^{*}\sim {\pi}^{*},{\tau}_{i}\sim {\pi}_{i}}\left[P({\tau}^{*}>{\tau}_{i})\right]\ge \frac{1}{2},$$  (113) 
where ${\tau}^{*}$ is a trajectory sampled from policy ${\pi}^{*}$, and ${\tau}_{i}$ is a trajectory sampled from ${\pi}_{i}$.
Let ${S}_{\text{eq}}\subset {\mathbb{R}}^{D1}$ be the equivalence class of reward functions that induce policies satisfying (113) when coupled with the true dynamics. To see that there are indeed multiple such functions, notice that for any reward vector ${\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}$, $a{\bm{r}}^{\mathbf{\prime}}+b\mathrm{1}\in {S}_{\text{eq}}$ for any $a>0$ and $b\in \mathbb{R}$, where $\mathrm{1}\in {\mathbb{R}}^{D1}$ is a vector of ones. Therefore, in (108), we can minimize each episode’s regret over reward vectors belonging to ${S}_{\text{eq}}$:
$$\mathbb{E}[{\mathrm{\text{Reg}}}_{T}^{R}]\le h\sum _{k=1}^{\lceil T/h\rceil}\underset{{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}}{\text{min}}\left\{\mathbb{E}\left[{\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}}}_{2}\right]\right\}.$$  (114) 
Even given specified scaling and normalization of the rewards (e.g. specified values of the smallest and largest state/action rewards), there could be multiple reward functions that give rise to the optimal policy. For instance, a small perturbation to the rewards might not change the policy. Given specified scaling and normalization, however, there is only one true latent reward function governing the preferences; this is the specific ${\overline{\bm{r}}}^{\mathbf{\prime}}\in {S}_{\text{eq}}$ to which the logistic regression posterior converges.
Define ${\mathrm{\Sigma}}_{k}:={\left[{\sum}_{i=1}^{k}f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}}){\bm{z}}_{i}{\bm{z}}_{i}^{T}\right]}^{1}$, and let $\{{\bm{v}}_{k1},\mathrm{\dots},{\bm{v}}_{k(D1)}\}$ be an orthonormal basis of eigenvectors of ${\mathrm{\Sigma}}_{k}$. For ${\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}$, the quantity $\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}}$ can be expressed in this eigenbasis:
$$\mathbb{E}[{\mathrm{\text{Reg}}}_{T}^{R}]\le h\sum _{k=1}^{\lceil T/h\rceil}\underset{{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}}{\text{min}}\left\{\mathbb{E}\left[{\sum _{i=1}^{D1}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}\right]\right\}.$$  (115) 
Consider the inner sum, ${\sum}_{i=1}^{D1}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}$. We will show that if any terms in this sum are known not contribute to the regret (i.e., to sampling a nonoptimal policy in episode $k$), then they can be eliminated from the sum. In particular, if an episode’s sampled reward belongs to ${S}_{\text{eq}}$, then its reward regret ${\mathrm{\text{Reg}}}_{T}^{R}$ is zero. Otherwise, the components of the reward function that make the policy nonoptimal cause the corresponding eigenvalues of ${\mathrm{\Sigma}}_{k}$ to shrink.
Assume that in episode $k$, the reward sample’s projection onto ${\bm{v}}_{i}$ does not affect the regret. In other words, there exists ${\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}$ such that $({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}=\mathrm{\U0001d7ce}$. Let ${\mathcal{I}}_{k}\subset \{1,\mathrm{\dots},D1\}$ be a set of indices that jointly do not affect the regret, such that there exists ${\bm{r}}_{k}^{*\prime}\in {S}_{\text{eq}}$ satisfying:
$$\sum _{i\in {\mathcal{I}}_{k}}({\bm{v}}_{i}^{T}(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{*}\prime})){\bm{v}}_{i}=\mathrm{\U0001d7ce}.$$  (116) 
Let ${\mathcal{J}}_{k}$ be the set of indices not contained in ${\mathcal{I}}_{k}$: ${\mathcal{J}}_{k}=\{1,\mathrm{\dots},D1\}\setminus {\mathcal{I}}_{k}$; we make no particular assumptions on how vectors in ${\mathcal{J}}_{k}$ affect the regret. The vector ${\bm{r}}_{k}^{*\prime}$ can then be written as:
$${\bm{r}}_{k}^{*\prime}=\sum _{i\in {\mathcal{I}}_{k}}({\bm{v}}_{i}^{T}\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}){\bm{v}}_{i}+\sum _{j\in {\mathcal{J}}_{k}}{\alpha}_{j}^{*}{\bm{v}}_{j},$$  (117) 
for some constants ${\alpha}_{j}^{*}$, $j\in {\mathcal{J}}_{k}$.
Due to orthogonality of ${\bm{v}}_{i}$ and ${\bm{v}}_{j}$, (116) can be modified by adding any linear combination of the vectors ${\bm{v}}_{j},j\in {\mathcal{J}}_{k}$, to ${\bm{r}}^{\mathbf{*}\prime}$:
$$\sum _{i\in {\mathcal{I}}_{k}}\left({\bm{v}}_{i}^{T}\left(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{*}\prime}\sum _{j\in {\mathcal{J}}_{k}}{\alpha}_{j}{\bm{v}}_{j}\right)\right){\bm{v}}_{i}=\mathrm{\U0001d7ce},\text{for any values of}{\alpha}_{j}.$$  (118) 
Conditioning on knowledge of the set ${\mathcal{I}}_{k}$, we can upperbound the minimization in (115) as follows:
$\underset{{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}}{\text{min}}\mathbb{E}\left[{{\displaystyle \sum _{i=1}^{D1}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}{\mathcal{I}}_{k}\right]$  (119)  
$=\underset{{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}}{\text{min}}\mathbb{E}\left[{{\displaystyle \sum _{i\in {\mathcal{I}}_{k}}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}+{\displaystyle \sum _{j\in {\mathcal{J}}_{k}}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}{\mathcal{I}}_{k}\right]$  (120)  
$\le \underset{{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}}{\text{min}}\mathbb{E}\left[{{\displaystyle \sum _{i\in {\mathcal{I}}_{k}}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}+{{\displaystyle \sum _{j\in {\mathcal{J}}_{k}}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}{\mathcal{I}}_{k}\right],$  (121) 
where the last step comes from the triangle inequality. The expression can be further upperbounded by restricting the set over which the minimization takes place:
$\underset{{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}}{\text{min}}\mathbb{E}\left[{{\displaystyle \sum _{i=1}^{D1}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}{\mathcal{I}}_{k}\right]$  
$\le \underset{\begin{array}{c}{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}\\ {\bm{r}}^{\mathbf{\prime}}={\bm{r}}^{\mathbf{*}\prime}+{\sum}_{j\in {\mathcal{J}}_{k}}{\alpha}_{j}{\bm{v}}_{j},\\ {\alpha}_{j}\in \mathbb{R}\end{array}}{\text{min}}\mathbb{E}\left[{{\displaystyle \sum _{i\in {\mathcal{I}}_{k}}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}+{{\displaystyle \sum _{j\in {\mathcal{J}}_{k}}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}{\mathcal{I}}_{k}\right].$ 
Note that the constraint set of the minimization is nonempty, since ${\bm{r}}^{\mathbf{*}\prime}\in {S}_{\text{eq}}$ and satisfies ${\bm{r}}^{\mathbf{\prime}}={\bm{r}}^{\mathbf{*}\prime}+{\sum}_{j\in {\mathcal{J}}_{k}}{\alpha}_{j}{\bm{v}}_{j}$ when ${\alpha}_{j}={\alpha}_{j}^{*}\forall j\in {\mathcal{J}}_{k}$. Next, for any ${\bm{r}}^{\mathbf{\prime}}$ satisfying the minimization constraints, the first ${l}_{2}$norm expression is equal to zero by (118). The second term, meanwhile, is not affected by the projections of ${\bm{r}}^{\mathbf{\prime}}$ onto ${\bm{v}}_{i}$ for any $i\in {\mathcal{I}}_{k}$, and so the constraint ${\bm{r}}^{\mathbf{\prime}}={\bm{r}}^{\mathbf{*}\prime}+{\sum}_{j\in {\mathcal{J}}_{k}}{\alpha}_{j}{\bm{v}}_{j}$ is irrelevant to the minimization. This gives:
$\underset{{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}}{\text{min}}\mathbb{E}\left[{{\displaystyle \sum _{i=1}^{D1}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}{\mathcal{I}}_{k}\right]\le \underset{{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}}{\text{min}}\mathbb{E}\left[{{\displaystyle \sum _{j\in {\mathcal{J}}_{k}}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}{\mathcal{I}}_{k}\right].$ 
This can again be upperbounded by setting ${\bm{r}}^{\mathbf{\prime}}$ to any specific value in ${S}_{\text{eq}}$. In particular, we can set ${\bm{r}}^{\mathbf{\prime}}={\overline{\bm{r}}}^{\mathbf{\prime}}$, the true latent reward function governing the preferences. Recall that while multiple reward functions may give rise to the optimal policy, the true rewards ${\overline{\bm{r}}}^{\mathbf{\prime}}$ are unique under fixed scaling and normalization. Therefore,
$\underset{{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}}{\text{min}}\mathbb{E}\left[{{\displaystyle \sum _{i=1}^{D1}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}{\mathcal{I}}_{k}\right]\le \mathbb{E}\left[{{\displaystyle \sum _{j\in {\mathcal{J}}_{k}}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\overline{\bm{r}}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}{\mathcal{I}}_{k}\right].$  (122) 
Recall from above that $\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\overline{\bm{r}}}^{\mathbf{\prime}}$ behaves asymptotically as $\mathcal{N}(\mathrm{\U0001d7ce},2{\mathrm{\Sigma}}_{k})$. The righthandside summation in (122) is the projection of $\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\overline{\bm{r}}}^{\mathbf{\prime}}$ onto the subspace of eigenvectors of ${\mathrm{\Sigma}}_{k}$ corresponding to ${\mathcal{J}}_{k}$. Applying the result in (111) to (122), we finally obtain:
$\underset{{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}}{\text{min}}\mathbb{E}\left[{{\displaystyle \sum _{i=1}^{D1}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}{\mathcal{I}}_{k}\right]$  $\le \sqrt{{\displaystyle \sum _{j\in {\mathcal{J}}_{k}}}{\lambda}_{j}(2{\mathrm{\Sigma}}_{k})}$  (123)  
$\le \sqrt{2SA\underset{j\in {\mathcal{J}}_{k}}{\text{max}}{\lambda}_{j}({\mathrm{\Sigma}}_{k})}.$  (124) 
This result demonstrates that if an eigenvector of ${\mathrm{\Sigma}}_{k}$ does not affect the regret, then its corresponding eigenvalue can be removed from the regret’s upper bound in (111). Therefore, an eigenvector of ${\mathrm{\Sigma}}_{k}$ does not affect the regret unless its contribution to the sampled reward ${\stackrel{\mathbf{~}}{\bm{r}}}_{k}$ makes the policy nonoptimal.
A nonoptimal policy’s distribution over sampled observations differs from that of the optimal policy. When a nonoptimal policy is sampled, the covariance matrix is more likely to shrink along directions of worse observations, decreasing the future probability of sampling a nonoptimal policy. This shrinkage occurs linearly with respect to the episodes in which nonoptimal policies are sampled because ${({\mathrm{\Sigma}}_{k})}^{1}$ grows linearly in directions favored by the nonoptimal policies, with respect to the number of times that they are executed. Indeed, for an arbitrary vector $\bm{v}\in {\mathbb{R}}^{D1}$:
${\bm{v}}^{T}{({\mathrm{\Sigma}}_{k})}^{1}\bm{v}={\displaystyle \sum _{i=1}^{k}}{\alpha}_{i}{\bm{v}}^{T}({\bm{z}}_{i}{\bm{z}}_{i}^{T})\bm{v}={\displaystyle \sum _{i=1}^{k}}{\alpha}_{i}{({\bm{z}}_{i}^{T}\bm{v})}^{2},$  (125) 
where ${\alpha}_{i}=f({\bm{z}}_{i}^{T}{\overline{\bm{r}}}^{\mathbf{\prime}})$ lives in a bounded range as discussed in Lemma 1; this quantity increases linearly on average with respect to observations ${\bm{z}}_{i}$ that are similarlyaligned with $\bm{v}$. Also,
${\bm{v}}^{T}{({\mathrm{\Sigma}}_{k})}^{1}\bm{v}={\bm{v}}^{T}\left({\displaystyle \sum _{i=1}^{D1}}{[{\lambda}_{i}({\mathrm{\Sigma}}_{k})]}^{1}{\bm{v}}_{i}{\bm{v}}_{i}^{T}\right)\bm{v}={\displaystyle \sum _{i=1}^{D1}}{[{\lambda}_{i}({\mathrm{\Sigma}}_{k})]}^{1}{({\bm{v}}_{i}^{T}\bm{v})}^{2}.$  (126) 
Comparing (125) and (126), one can see that the eigenvalues of ${({\mathrm{\Sigma}}_{k})}^{1}$ associated with nonoptimal policies should increase linearly with the collection of observations ${\bm{z}}_{i}$ associated with those nonoptimal policies.
The regret for an episode is zero if the optimal policy is selected. Otherwise, examine the bound from (123): $\sqrt{2SA\underset{j\in {\mathcal{J}}_{k}}{\text{max}}{\lambda}_{j}({\mathrm{\Sigma}}_{k})}=\sqrt{\frac{2SA}{\underset{j\in {\mathcal{J}}_{k}}{\text{min}}{\lambda}_{j}({\mathrm{\Sigma}}_{k}^{1})}}$. The denominator increases linearly with respect to episodes with nonoptimal policies, so that asymptotically:
$\underset{{\bm{r}}^{\mathbf{\prime}}\in {S}_{\text{eq}}}{\text{min}}\mathbb{E}\left[{{\displaystyle \sum _{i=1}^{D1}}({(\stackrel{\mathbf{~}}{\bm{r}}^{\mathbf{\prime}}{}_{k}{\bm{r}}^{\mathbf{\prime}})}^{T}{\bm{v}}_{ki}){\bm{v}}_{ki}}_{2}{\mathcal{I}}_{k}\right]\le \sqrt{{\displaystyle \frac{2SA}{{c}_{0}k}}},$  (127) 
where ${c}_{0}$ is a constant related to the rate at which eigenvalues of ${({\mathrm{\Sigma}}_{k})}^{1}$ increase with respect to trajectories sampled from nonoptimal policies. Summing over all the episodes gives:
$$\mathbb{E}[{\mathrm{\text{Reg}}}_{T}^{R}]\le \frac{h\sqrt{2SA}}{{c}_{0}}\sum _{k=1}^{T}\frac{1}{\sqrt{k}}\le \frac{h\sqrt{2SA}}{{c}_{0}}\sqrt{T\text{log}(T)}\text{for all}T\ge 17.$$  (128) 
Combining (128) with the regret decomposition result of Lemma 4 and the posterior sampling regret bound from Osband et al. [29] (Proposition 3) yields the stated result.
∎
A2.1 Extending proof techniques to other credit assignment models
Currently, this proof methodology extends only to the Bayesian logistic regression credit assignment model. Therefore, extending it to other credit assignment models, such as the Gaussian process regression and Bayesian linear regression methods detailed in Appendix A1, is an important direction for future work. The concept of regret decomposition, as introduced in Lemma 3, is not dependent on a specific credit assignment model. Thus, the concept of decomposing the total regret into two terms, dependent on the convergence of the dynamics and reward models, respectively, holds under any credit assignment model. For the dynamicsdependent term, the regret result from Osband et al. [29] can be applied. The second, rewarddependent term can be bounded if: (1) posterior samples from the reward model concentrate to the posterior distribution’s MAP estimate, (2) the reward model’s posterior concentrates to the true underlying utilities, and (3) the events in (1) and (2) occur at a sufficientlyfast rate.
We hypothesize that existing results on asymptotic consistency of Bayesian linear regression [19] and Gaussian process regression [34] could be leveraged toward extending notions of consistency and regret toward these credit assignment models. Unlike with classificationbased credit assignment, it may be necessary to consider the residuals between preference labels and utilities. The concept of approximate linearity [41] has facilitated bridging the gap between the preference and absolutereward domains in the bandit setting, and could potentially also apply here. In practice, we expect that DPS would perform well with any asymptotically consistent model for rewards that sufficiently captures the users’ preference behavior.
A3 Additional experimental details
As described in Section 6, experiments were conducted with the RiverSwim and random MDP environments. We use an episode horizon time of 50 in both cases. Figures 9 and 14 display performance in both environments for four values of the hyperparameter $c$ ($c\in \{0.1,0.5,1,1000\}$), which governs the degree of preference noise. Experiments were run on an Ubuntu 16.04.3 machine with 16 GB of RAM and an 8core Intel i7 processor.
We detail the ranges of hyperparameter values tested, as well as those displayed on the plots, for the different algorithms. For DPS, hyperparameters were tuned manually. For Gaussian process regression credit assignment, we used RBF kernels for the Gaussian process model, considering kernel variances from 0.001 to 0.5, kernel lengthscales from 0 to 0.1, and noise variances from 0.001 to 0.1. The plots depict values of (0.03, 0, 0.05), respectively, for these parameter values in the RiverSwim case, while depicting (1, 0, 0.03) in the random MDP case.
For Bayesian linear regression, we considered ranges of hyperparameter values between 0.1 and 3 for both hyperparameters ($\sigma $ and $\lambda $). The RiverSwim plots display hyperparameter values of 0.5 and 0.1 for $\sigma $ and $\lambda $, respectively, while for the random MDP environment, both are set to 0.1. For Bayesian logistic regression, we set the prior mean to zero for all states and actions. All prior covariance matrices considered were of the form $b\mathbb{I}\in {\mathbb{R}}^{SA\times SA}$, where $\mathbb{I}$ is the identity matrix, and tested values of $b$ ranged from 0.1 to 30; the plots all show results for a prior covariance matrix of $20\mathbb{I}$.
The EMPC algorithm [50] has two hyperparameter values, $\alpha $ and $\eta $. We optimize both of these jointly via a grid search over values of $(0.1,0.2,\mathrm{\dots},0.9)$, with 100 runs of each pair of values. The bestperforming hyperparameter values (i.e. those achieving the highest total reward) are displayed in Table 1; these are the hyperparameter values depicted in Figures 9 and 14.
Noise:  0.1  0.5  1  1000 

RiverSwim  0.1/0.5  0.1/0.7  0.2/0.2  0.1/0.6 
Random MDP  0.3/0.8  0.9/0.7  0.4/0.2  0.7/0.4 