Dueling Posterior Sampling for Preference-Based Reinforcement Learning

  • 2019-08-04 07:51:36
  • Ellen R. Novoseller, Yanan Sui, Yisong Yue, Joel W. Burdick
  • 1

Abstract

In preference-based reinforcement learning (RL), an agent interacts with theenvironment while receiving preferences instead of absolute feedback. Whilethere is increasing research activity in preference-based RL, the design offormal frameworks that admit tractable theoretical analysis remains an openchallenge. Building upon ideas from preference-based bandit learning andposterior sampling in RL, we present Dueling Posterior Sampling (DPS), whichemploys preference-based posterior sampling to learn both the system dynamicsand the underlying utility function that governs the user's preferences.Because preference feedback is provided on trajectories rather than individualstate/action pairs, we develop a Bayesian approach to solving the creditassignment problem, translating user preferences to a posterior distributionover state/action reward models. We prove an asymptotic no-regret rate for DPSwith a Bayesian logistic regression credit assignment model; to our knowledge,this is the first regret guarantee for preference-based RL. We also discusspossible avenues for extending this proof methodology to analyze other creditassignment models. Finally, we evaluate the approach empirically, showingcompetitive performance against existing baselines.

 

Quick Read (beta)

Dueling Posterior Sampling for Preference-Based Reinforcement Learning

Ellen R. Novoseller1, Yanan Sui2, Yisong Yue1, Joel W. Burdick1
1 Department of Computing and Mathematical Sciences, California Institute of Technology
{[email protected], [email protected], [email protected]}
2 Department of Computer Science, Stanford University, [email protected]
Abstract

In preference-based reinforcement learning (RL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in preference-based RL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. Building upon ideas from preference-based bandit learning and posterior sampling in RL, we present Dueling Posterior Sampling (DPS), which employs preference-based posterior sampling to learn both the system dynamics and the underlying utility function that governs the user’s preferences. Because preference feedback is provided on trajectories rather than individual state/action pairs, we develop a Bayesian approach to solving the credit assignment problem, translating user preferences to a posterior distribution over state/action reward models. We prove an asymptotic no-regret rate for DPS with a Bayesian logistic regression credit assignment model; to our knowledge, this is the first regret guarantee for preference-based RL. We also discuss possible avenues for extending this proof methodology to analyze other credit assignment models. Finally, we evaluate the approach empirically, showing competitive performance against existing baselines.

 

Dueling Posterior Sampling for Preference-Based Reinforcement Learning


  Ellen R. Novoseller1, Yanan Sui2, Yisong Yue1, Joel W. Burdick1 1 Department of Computing and Mathematical Sciences, California Institute of Technology {[email protected], [email protected], [email protected]} 2 Department of Computer Science, Stanford University, [email protected]

\@float

noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

In many domains, ranging from clinical trials [40] to autonomous driving [36] and human-robot interaction [26], it can be unclear how to define a reward signal for reinforcement learning (RL). In such situations, the RL agent seeks to interact optimally with a human user; thus, rewards should reflect the extent to which the algorithm achieves the user’s goals. Yet, for many systems, for instance in autonomous driving [8] and robotics [7, 3], users have difficulty with both specifying numerical reward functions and providing demonstrations of desired behavior. Furthermore, a misspecified reward function can result in “reward hacking” [6], which occurs when the agent learns an undesirable behavior that through some loophole, achieves a high reward. In such cases, the user’s preferences form a more reliable measure of desired system behavior, and the preference data may be leveraged in place of a standard numerical reward signal.

We thus study the problem of preference-based reinforcement learning (PBRL), where the RL agent executes a pair of trajectories, and the user provides (noisy) preference feedback regarding which trajectory has higher utility. While the study of PBRL has seen increased interest in recent years [18, 13, 48], it remains an open challenge to design formal frameworks that admit tractable theoretical analysis. Compared to the preference-based bandit setting, which has seen significant theoretical progress (e.g., [53, 56, 2, 44, 15, 55, 32, 52, 41, 42]), one major challenge is how to address credit assignment when only receiving feedback at the trajectory level compared to the state/action level.

In this paper, we present Dueling Posterior Sampling (DPS), which uses preference-based posterior sampling to tackle the PBRL problem in the Bayesian regime. Posterior sampling (also known as Thompson sampling) [45, 29, 20, 1, 30] is a Bayesian model-based approach to balancing exploration and exploitation, thereby enabling the algorithm to efficiently learn models of both the environment’s state transition dynamics and the reward function. Previous work on posterior sampling in RL [29, 20, 1, 30] all focused on learning from absolute rewards, while we show how to extend posterior sampling to both elicit and learn from trajectory-level preference feedback.

To elicit preference feedback, at every episode of learning, DPS draws two independent samples from the posterior to generate two trajectories. This approach is inspired by the Self-Sparring algorithm proposed for the bandit setting [41]. Our theoretical analysis is quite different from that in [41], due to the need to incorporate trajectory-level preference learning and state transition dynamics.

To learn from preference feedback, DPS internally maintains a Bayesian state/action reward model that explains the preferences. In other words, this reward model is a solution to the temporal credit assignment problem [3, 56, 44, 13, 51, 48] and determines which of the encountered states and actions are responsible for the trajectory-level preference feedback. Learning from trajectory-level preferences is in general a very challenging problem, as information about the rewards is sparse (often just one bit), is only relative to the pair of trajectories being compared, and does not explicitly include information about actions within trajectories. We thus develop our approach while restricting to standard Bayesian realizability assumptions inherent to most posterior sampling approaches.

We developed DPS concurrently with an analysis framework for characterizing regret convergence in the episodic learning setting. To justify our overall approach, we show how to mathematically integrate Bayesian credit assignment and draw dueling samples within the conventional posterior sampling framework. We evaluate several possible Bayesian credit assignment models, and prove an asymptotic no-regret rate for DPS using Bayesian logistic regression [5, 28] as the credit assignment model. To our knowledge, this is the first PBRL approach with theoretical guarantees. In addition, we also demonstrate that DPS delivers competitive performance in simulation.

2 Related work

Posterior sampling. Balancing exploration and exploitation is a key problem in reinforcement learning (RL) and bandits. In the episodic learning setting, the agent typically aims to balance exploration and exploitation to minimize its regret, i.e., the gap between the expected total rewards of the agent and the optimal policy. Posterior sampling, first proposed in [45], is a Bayesian approach toward achieving this goal, and iterates between (1) updating the posterior of a Bayesian environment model and (2) sampling from this posterior to inform the subsequent policy. In both the bandit and RL settings, posterior sampling has been demonstrated to perform competitively in experiments and enjoy favorable theoretical properties in terms of its regret [30, 29, 1, 12].

Our approach builds upon two prior posterior sampling algorithms: Self-Sparring [41] for preference-based bandit learning (also known as dueling bandits [53]) and posterior sampling RL [29]. Self-Sparring [41] is a posterior sampling approach, and draws multiple samples to “duel” or “spar” via preference elicitation. The algorithm iteratively: a) draws multiple samples from the posterior model of each action’s reward; b) for each sampled model, executes the action with the highest sampled reward; c) queries for preference feedback between the executed actions; and d) updates the posterior according to the acquired preference data. In [41], the authors prove an asymptotic no-regret guarantee for Self-Sparring with independent Beta-Bernoulli reward models for each action.

Within RL, posterior sampling has been applied to the finite-horizon setting with absolute rewards [29]. Posterior sampling RL iterates over four steps: a) draw a sample from the Bayesian posterior of the dynamics and rewards; b) compute the optimal policy for the sampled system; c) execute the policy to get a roll-out trajectory; and d) update the posteriors with the new observations from the roll-out. In [29], the authors show the expected regret is O(hSATlog(SAT)), for number of time-steps T, finite time horizon h, and discrete state and action spaces of sizes S and A, respectively.

A third line of relevant work is posterior sampling for Bayesian logistic regression [12, 16, 37], which is used as our Bayesian credit assignment model. One difficulty with Bayesian logistic regression [5] is the lack of a closed-form posterior. To handle this, we adopt the approach of [12] and use a Laplace approximation. Other approaches include using Gibbs sampling algorithm [16]. One relevant related application is [35], who apply Bayesian logistic regression to the multi-objective multi-armed bandit problem; to determine the utilities that a human assigns to different objectives, the algorithm queries for pairwise preferences between expected reward vectors corresponding to different actions.

Preference-based learning. Previous work on preference-based RL (PBRL) has shown successful performance in a number of applications, such as playing games [13], learning human preferences for autonomous driving [36], and selecting a robot’s controller parameters [26, 4]. Yet, to our knowledge, the PBRL literature still lacks theoretical guarantees.

Existing approaches for trajectory-level preference-based RL may be broadly divided into three categories [47]: a) directly optimizing policy parameters [46, 11, 26]; b) learning a preference model to predict action preferences in each state [18]; and c) learning a utility function to characterize the rewards, returns, or values of state/action pairs [49, 50, 3, 51, 13]. In c), the utility is often modeled as linear in the trajectory features. If those features are defined as visitations to each state/action pair, then maximizing utility directly corresponds to maximizing the total (undiscounted) reward.

One popular paradigm, which we also adopt, is PBRL with underlying utility functions. By inferring state/action rewards from preference feedback, one can derive relatively-interpretable reward models and also use such methods as value iteration. In addition, utility-based approaches may be more sample efficient compared to policy search and preference relation methods [47], as they extract more information from each observation. Notably, [46] learn a Bayesian model over policy parameters, and draw samples from its posterior to inform actions. From existing PBRL methods, their algorithm perhaps most resembles ours; however, compared to utility-based approaches, policy search methods typically require either more samples or expert knowledge to craft the policy parameters [48, 26].

Beyond RL, preference-based learning has been the subject of much research. The closest to RL is the bandit setting [53, 56, 2, 44, 15, 55, 32, 52, 41, 42], which is essentially a single-state variant of RL. Other settings include: active learning [36, 23, 17], which is focused exclusively on learning an accurate model rather than maximizing utility of decision-making; learning with more structured preference feedback [31, 38, 33, 39], where the learner receives more than one bit of information per preference elicitation; and batch supervised settings such as learning to rank [22, 14, 25, 9, 54, 10, 27].

3 Problem statement

Preliminaries. We consider fixed-horizon Markov Decision Processes (MDPs), in which rewards are replaced by preferences over trajectories. This class of MDPs can be represented as a tuple, =(𝒮,𝒜,,ϕ,p,p0,h), where the state space 𝒮 and action space 𝒜 are finite sets. The agent, using policy π, episodically interacts with the environment with length-h roll-out trajectories of the form τ={s1,a1,s2,a2,,sh,ah,sh+1}. Since we are eliciting preference feedback, in each episode i, the agent executes two roll-outs τi1 and τi2, and observes a preference between the two. The initial state is sampled from p0, while p defines the transition dynamics: st+1p(|st,at).

We use to denote the stochastic preference relationship between trajectories, and ϕ(τ,τ)=(τ>τ)[0,1] to capture the feedback generation mechanism. We assume that is a total ordering over trajectories, and ττϕ(τ,τ)>12. We use τ>τ to denote the event that trajectory τ was preferred over τ in a preference elicitation, i.e., τ>τ is observed with probability ϕ(τ,τ). We further assume an underlying utility function r¯(τ) for each trajectory, such that ττr¯(τ)>r¯(τ), and define ϕ using r¯. For instance, if the preferences are noiseless, then: ϕ(τi,τj)=𝕀[r¯(τi)>r¯(τj)]. Alternatively, ϕ could be the linear link function [2]: ϕlin(τi,τj):=(1+r¯(τi)-r¯(τj))/2. We primarily assume a logistic or Bradley-Terry link function: ϕlog(τi,τj):=[1+exp(-c(r¯(τi)-r¯(τj)))]-1 with “temperature” c(0,). Our problem setting resembles the PSDP defined in [50], except that additionally, we incorporate the noise model through which the underlying utilities are stochastically translated to preferences. Finally, we assume that the utilities decompose additively: r¯(τ)i=1hr¯(si,ai) for state/action pairs in τ.

Given a policy π, we can define the standard RL value function as the expected total utility of being in state s at step i, and following policy π:

Vπ,i(s)=𝔼[j=ihr¯(sj,π(sj))|si=s], (1)

and now we can define the optimal policy π* as the one with maximal value for all input states. Note that 𝔼s1p0[Vπ,1(s1)]𝔼τπ,[r¯(τ)]. Given fully specified dynamics and reward models, p and r¯, it is straightforward to apply standard dynamic programming approaches such as value iteration to arrive at the optimal policy under p and r¯ [43]. The goal of learning, then, is infer p and r¯ to the extent necessary for good decision-making.

Learning problem. In each iteration (or episode) i, the agent selects two policies, πi1 and πi2. The two policies are rolled out to obtain trajectories τi1 and τi2, and a binary preference bi{0,1} between them is sampled according to the underlying utilities of τi1 and τi2. We quantify the performance of the learning agent using expected cumulative regret relative to the optimal policy:

𝔼[RegT]=𝔼{i=1T/(2h)s𝒮p0(s)[2Vπ*,1(s)-Vπi1,1(s)-Vπi2,1(s)]}. (2)

To minimize regret, the agent must balance exploration (collecting new data) with exploitation (behaving optimally w.r.t. existing models). Over-exploration of bad trajectories will incur large regret, and under-exploration can prevent converging to the optimal policy. In contrast to the standard formulation in RL [29], at each iteration/episode we compare the utilities of both selected policies.

Algorithm 1 Dueling Posterior Sampling (DPS)
  H= {Initialize history}
  T= {Initialize list of preference data}
  Initialize prior for f {Initialize state transition model}
  Initialize prior for g {Initialize utility model}
  while True do
     π1 Advance(H, T, f, g)
     π2 Advance(H, T, f, g)
     Sample trajectories τ1 and τ2 from π1 and π2
     Observe feedback b=𝕀(τ2>τ1)
     H=H(s1τ1,a1τ1,s2τ1)(shτ2,ahτ2,sh+1τ2)
     T=T(τ1,τ2,b)
     feedback(H, T, f, g)
  end while

4 Algorithm

As outlined in Algorithm 1, Dueling Posterior Sampling (DPS) iterates over three main steps: (a) sample two policies π1,π2 from the Bayesian posteriors of the dynamics and utility models (Advance – Algorithm 2); (b) roll out π1 and π2 to obtain trajectories τ1 and τ2, and receive preference feedback between them; (c) store the new state transitions and feedback and update the posterior (feedback – Algorithm 3). Compared to conventional posterior sampling with absolute feedback [29], the two key differences are that: two policies are sampled rather than one each iteration, and a credit assignment problem is solved when learning from feedback.

Advance (Algorithm 2) samples from the Bayesian posteriors of the dynamics and utility models. The sampled dynamics and utilities form an MDP, and value iteration is used to derive the optimal policy π under the sample. One can also think of π as a random function whose randomness depends on the sampling of the dynamics and utility models. In the Bayesian setting, it can be shown that π is sampled according to its posterior probability of being the true optimal policy π* [29, 30]. Intuitively, peaked (i.e., certain) posteriors lead to less variability when sampling π, which implies less exploration. On the other hand, diffuse (i.e., uncertain) posteriors lead to greater variability when sampling π, which implies more exploration.

feedback (Algorithm 3) updates the Bayesian posteriors of the dynamics and utility models based on new data. Updating the dynamics posterior is relatively straightforward, as we assume that the dynamics are fully-observed; for instance, the dynamics prior can be modeled via Dirichlet distributions with multinomial conjugate observation likelihoods [29]. In contrast, performing Bayesian inference over state/action utilities from trajectory-level feedback is much more challenging. We considered a range of approaches (see Appendix A1), and found Bayesian logistic regression (Section 4.1) to both be well-performing and admit tractable analysis within our theoretical framework.

Algorithm 2 Advance: Sample policy from dynamics and utility models
  Input: H,T,f,g
  Sample Mf(|H) {Sample MDP transition dynamics from posterior}
  Sample Rg(|T) {Sample utilities from posterior}
  Compute π=argmaxπV(M,R) {Value iteration yields sampled MDP’s optimal policy}
  Return π
Algorithm 3 feedback: Update dynamics and utility models based on new user feedback
  Input: H,T,f,g
  Apply Bayesian update to f, given H {Update dynamics model given history}
  Apply Bayesian update to g, given T {Update utility model given preferences}
  Return f, g

4.1 Bayesian logistic regression for utility inference and credit assignment

Credit assignment [47] is the problem of inferring which state/action pairs are responsible for observed trajectory-level preferences. We detail a Bayesian logistic regression approach to address this task in our setting. Logistic regression is a binary classification method that learns a weight vector 𝒘 for the model p(y=1|𝒙,𝒘)=11+exp(-𝒘T𝒙). Bayesian logistic regression [5, 28] maintains a posterior over possible weight vectors. Because there is no convenient prior yielding a closed-form conjugate posterior, we use the Laplace approximation to the posterior as specified below.

Preliminaries. Let N be the number of trajectories pairs observed so far, and D=SA be the total number of state/action pairs. Let 𝒙ijD,j{1,2} be the visitation vector corresponding to trajectory τij, with the kth element 𝒙ij(k) being the number of times that state/action pair k was visited in τij. Define 𝒙i:=𝒙i1-𝒙i2. The observation matrix X and label vector 𝒚 are defined as:

X=[(𝒙11-𝒙12)T(𝒙N1-𝒙N2)T],𝒚=[y1yN]=[2𝕀[τ11>τ12]-12𝕀[τN1>τN2]-1], (3)

where the expression 2𝕀[τi1>τi2]-1 results in labels yi with values in {-1,1}.

The observation matrix XN×D has rank at most D-1, since the elements of 𝒙i=𝒙i1-𝒙i2 must sum to zero for each row. To obtain a full-row-rank observation matrix for Bayesian logistic regression, we transform XN×D to ZN×(D-1) via the matrix V=[𝒗1𝒗D-1]D×(D-1), where the columns 𝒗iD form an orthonormal basis spanning the (D-1)-dimensional, full possible row space of X. To obtain the vector 𝒛iD-1 that expresses 𝒙i in this basis, apply:

𝒛i=[𝒙iT𝒗1𝒙iT𝒗D-1]T=VT𝒙i, (4)

while converting any vector 𝒛iD-1 back to the original basis can be accomplished via:

𝒙𝒊=j=1D-1zij𝒗j=V𝒛i, where zij is the jth element of 𝒛i. (5)

Note that this transformation preserves inner products. Equation (5) can be applied to show:

𝒙1T𝒙2 =(i=1D-1z1i𝒗i)T(j=1D-1z2j𝒗j)=𝒛1T𝒛2, by orthonormality of {𝒗i}. (6)

In particular, the transformation preserves orthogonality, so that X and Z have the same row-rank and XTX and ZTZ have the same rank.

Utility model & posterior inference. We fit a Bayesian logistic regression model to the transformed data (Z,𝒚). Afterwards, this model predicts the probability that τ is preferred to τ as a logistic regression function of their visitation vector differences 𝒙τ-𝒙τ. The model parameters correspond exactly to the state/action utilities r¯. The model internally computes an element-wise product between 𝒙τ-𝒙τ and estimated reward vector 𝒓, within the (D-1)-dimensional space given by (4). Given the inner product equivalence (6), this is exactly the trajectory utility, and taking the expectation over trajectories generated by a policy is exactly the value function (1). We show in our experiments that Bayesian logistic regression can robustly learn even with preference modeling mismatch.

We are chiefly interested in sampling from the posterior of parameter/utility vector 𝒓¯D, which can be combined with the sampled dynamics to perform value iteration and obtain a policy. As shown below, via the Laplace approximation, the posterior is Gaussian distributed, and thus can be easily sampled. The internal utility representation lies in 𝒓~D-1, and we convert to 𝒓~D via (5).

We now describe the Bayesian logistic regression step itself. A Gaussian prior is defined over the utilities 𝒓D-1: p(𝒓)𝒩(𝒓|𝒓0,V0). The logistic regression likelihood is:

p(Z,𝒚|𝒓)=i=1Np(𝒛i,yi|𝒓)=i=1N11+exp(-yi𝒛iT𝒓). (7)

We approximate the posterior as Gaussian via the Laplace approximation:

p(𝒓|Z,𝒚) 𝒩(𝒓|𝒓^,H-1), where: (8)
𝒓^ =argmin𝒓f(𝒓),f(𝒓):=-logp(Z,𝒚,𝒓)=-logp(𝒓)-logp(Z,𝒚|𝒓), (9)
H =𝒓2f(𝒓)|𝒓^, and where the optimization problem in (9) is convex. (10)

To show a regret convergence using this approximate posterior, we leverage asymptotic normality of the maximum likelihood estimator of logistic regression in our proofs.

5 Theoretical results

We now sketch our asymptotic no-regret analysis for Dueling Posterior Sampling (DPS) with Bayesian logistic regression. The full proof is in Appendix A2, while Appendix A2.1 discusses possible avenues for extending this proof methodology toward additional credit assignment models. The proof has two main parts: first proving that DPS with logistic credit assignment is asymptotically consistent (Theorem 1), and then proving that DPS has a sublinear regret rate (Theorem 2). Both parts leverage results on the asymptotic behavior of logistic regression [21]. As before, we consider data ZN×(D-1) and labels 𝒚N, with [Z]ij=zij. To show that DPS  is asymptotically consistent in learning the reward function, we first provide some definitions and necessary conditions.

Definition 1 (Derivative of sigmoid).

f:, where f(x)=e-x(1+e-x)2. Note f(x)=f(-x).

Definition 2.

Let 𝐫¯RD be the vector of true state/action utilities (assumed to exist) and 𝐫¯RD-1 be its transformation via (4). Define 𝐫~kRD-1 as the state/action rewards sampled from the Bayesian logistic regression model posterior in episode k, 𝐫^kRD-1 as the model’s maximum a posteriori (MAP) estimate at episode k, and 𝐫^ML,kRD-1 as its maximum likelihood estimate at k. Lastly, 𝐫~kRD, 𝐫^kRD, and 𝐫^ML,kRD are their respective equivalents given by (5).

Condition 1.

m0< such that |zij|m0 for all i{1,,N},j{1,,D-1}.

Condition 2.

Let λ1(k) and λD-1(k) be the largest and smallest eigenvalues, respectively, of i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT. Then, m1< such that λ1(k)λD-1(k)<m1, for all k.

Proposition 1 (Asymptotic consistency of logistic regression [21]).

If Conditions 1 and 2 are satisfied, then the maximum likelihood estimator 𝐫^ML,k of 𝐫¯ exists almost surely as k, and 𝐫^ML,k converges almost surely to the true values 𝐫¯ if and only if limkλD-1(k)=.

We first show that Proposition 1’s final condition is satisfied with known transition dynamics, and afterwards consider the convergence of the dynamics model posterior.

Lemma 1.

Under known transition dynamics, all eigenvalues of the matrix i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT approach infinity as k.

Lemma 2 (Convergence of dynamics model).

Given Lemma 1, DPS’s dynamics model converges to the true dynamics, and as it converges, all eigenvalues of i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT approach infinity.

Combining these results, we obtain:

Theorem 1 (Asymptotic consistency of DPS).

If there exists a reward function such that a logistic regression model explains the preference feedback, then DPS with a Bayesian logistic regression credit assignment model will learn an asymptotically consistent reward model.

We turn next to characterizing the regret rate of DPS. We apply two prior results, one from [21] regarding the asymptotic distribution of the logistic regression maximum likelihood estimate (Prop. 2), and the other from [29] regarding a regret bound for posterior sampling RL (Prop. 3).

Proposition 2 (Asymptotic normality of logistic regression maximum likelihood estimator [21]).

If Conditions 1 and 2 are satisfied, and if 𝐫^ML,k converges almost surely to the true value 𝐫¯, then:

[i=1kf(𝒛iT𝒓^ML,k)𝒛i𝒛iT]12(𝒓^ML,k-𝒓¯)𝐷𝒩(𝟎,𝕀) as k, (11)

where 𝐷 implies convergence in distribution and Q12 is the positive definite matrix associated with positive definite matrix Q such that [Q12]2=Q.

Proposition 3 (Expected regret of posterior sampling RL [29]).

Posterior sampling RL has expected T-step regret O(hSAT𝑙𝑜𝑔(SAT)), with horizon h and numbers of states and actions S and A.

Leveraging these results, we show that under preference feedback, the regret can be decomposed into two terms: one that reflects the converging dynamics model, and one that reflects the converging reward model (inferred from trajectory-level preference feedback).

Lemma 3 (Regret decomposition).

The expected regret of DPS can be decomposed into two terms. One term can be bounded by the regret bound of [29], stated in Proposition (3). The other is bounded by: hk=1T/hE[||𝐫¯-𝐫~k||]hk=1T/hE[||𝐫^k-𝐫¯||]+hk=1T/hE[||𝐫^𝐤-𝐫~k||].

The final result is obtained by analyzing convergence of the samples 𝒓~k to 𝒓^k and of the credit assignment model 𝒓^k to 𝒓¯:

Theorem 2 (Asymptotic regret rate of DPS).

If there exists a reward function such that a logistic regression model explains the preferences, then DPS has an asymptotic no-regret rate of O(hSAT𝑙𝑜𝑔(SAT)+hSAc0T𝑙𝑜𝑔(T)), where c0 is a minimum rate at which eigenvalues of i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT increase linearly with collection of samples 𝐳i that impact those eigenvalues.

6 Experiments

We validate the empirical performance of Dueling Posterior Sampling (DPS) on two simulated domains with varying levels of preference noise, as well as using alternative credit assignment models. We find that DPS generally performs well and outperforms standard PBRL baselines [49].

Experimental setup. We evaluate on two simulated environments: RiverSwim and random MDPs. The RiverSwim environment [29] has six states and two actions (actions 0 and 1); the optimal policy is to always choose action 1, which maximizes the probability of reaching a particular goal state/action pair. Meanwhile, a suboptimal policy—yielding a much smaller reward compared to the goal—is quickly and easily discovered and incentivizes the agent to always select action 0. The algorithm must demonstrate sufficient exploration to have hope of discovering the optimal policy quickly.

In the second simulated environment, we generate random MDPs as in [29] with 10 states and 5 actions. The transition dynamics and rewards are generated from Dirichlet (all parameters set to 0.1) and exponential (rate parameter set to 5) distributions, respectively. The parameter for these distributions were chosen to generate MDPs with sparse dynamics and rewards. The sampled reward vectors were shifted and normalized so that the minimum reward is zero and their mean is one.

In both of these environments, preferences between pairs of trajectories were generated by (noisily) comparing the total rewards that they accumulated; this reward information was hidden from the learning algorithm, which observed only the trajectory preferences and state transitions. Preference noise is generated according to a logistic model: for trajectories τi and τj, P(τi>τj)={1+exp[-c(r¯(τi)-r¯(τj))]}-1, where r¯(τi) and r¯(τj) are the total rewards accrued by the two trajectories, respectively, while the hyperparameter c controls the degree of noisiness.

Methods compared. We evaluate DPS  under four different noise levels (c{0.1,0.5,1,1000}) and three credit assignment models: 1) Bayesian logistic regression, 2) Bayesian linear regression, and 3) Gaussian process regression, where the latter two methods are described in Appendix A1. In addition, we evaluate the Every-Visit Preference Monte Carlo (EPMC) algorithm with probabilistic credit assignment [50, 47] as a baseline. Lastly, we compare against the posterior sampling RL algorithm [29], which learns using the true numerical rewards at each step, and thus yields an upper-bound on the performance that a preference-based algorithm could achieve.

(a)
(b)
(c)
Figure 4: Empirical performance of DPS. a) and b) show RiverSwim with noise hyperparameters c= 1,000, 1. c) displays random MDPs with c=1. Posterior sampling RL (PSRL) [29] is an upper bound that receives numerical rewards; Gaussian process regression (GPR), Bayesian linear regression, and Bayesian logistic regression are all instances of DPS. EPMC is a baseline from [50] as discussed. Plots display mean +/- one std over 100 runs of each algorithm tested. Additional results (more values of c) are in Appendix A3. Normalization is with respect to the total reward achieved by the optimal policy. Overall, we see that DPS performs well and is robust to the choice of credit assignment model.

Results. Figure 4 shows the performance comparison for c=1 in both environments, as well as c=1,000 in RiverSwim (additional results are in Appendix A3, including hyperparameter details). DPS performs well in all simulations, and significantly outperforms the EPMC baseline. This may be because the EPMC algorithm uses a uniform exploration strategy, while DPS prioritizes exploration by sampling high rewards in more uncertain regions of the state/action space. Notice that c= 1,000 results in nearly-noiseless preferences; this can decrease performance in RiverSwim in some cases, since preference noise can help the agent to escape the local minimum. We also see that DPS is competitive with PSRL, which has access to the full cardinal rewards at each state/action. Finally, we see that the performance of DPS is robust to the choice of credit assignment model, and in fact using Gaussian process regression (for which we do not have an end-to-end regret analysis) often leads to the best empirical performance. These results suggest that DPS is a practically promising approach that can robustly incorporate many modeling approaches as subroutines.

7 Conclusion

We investigate the preference-based reinforcement learning problem, which receives comparative preferences instead of absolute real-valued rewards as feedback. We develop the Dueling Posterior Sampling (DPS) algorithm, which optimizes policies in an highly efficient and flexible way. To our knowledge, DPS is the first preference-based RL algorithm with a regret guarantee. DPS also performs well in our simulations, and seems practically promising. That makes it both a theoretically-justified and practically promising algorithm.

There are many directions for future work. The Bayesian logistic regression model could be improved with more accurate posterior estimates. Assumptions governing the user’s preferences, such as requiring an underlying utility model, could be relaxed. One can also incorporate kernelized methods to further improve sample efficiency. It is also important to extend to other credit assignment models, such as the Gaussian process regression and Bayesian linear regression methods, for which the same concept of the regret decomposition still applies. We expect that DPS would perform well with any asymptotically consistent reward model that sufficiently captures users’ preference behavior.

Acknowledgments

This work was supported by NIH grant EB007615 and an Amazon graduate fellowship.

References

References

  • [1] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
  • [2] Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856–864, 2014.
  • [3] Riad Akrour, Marc Schoenauer, and Michèle Sebag. APRIL: Active preference learning-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 116–131. Springer, 2012.
  • [4] Riad Akrour, Marc Schoenauer, Michèle Sebag, and Jean-Christophe Souplet. Programming by feedback. In International Conference on Machine Learning, volume 32, pages 1503–1511. JMLR. org, 2014.
  • [5] James H Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American statistical Association, 88(422):669–679, 1993.
  • [6] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
  • [7] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
  • [8] Chandrayee Basu, Qian Yang, David Hungerman, Mukesh Sinahal, and Anca D Dragan. Do you want your autonomous car to drive like you? In 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI, pages 417–425. IEEE, 2017.
  • [9] Christopher Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine learning (ICML-05), pages 89–96, 2005.
  • [10] Christopher J Burges, Robert Ragno, and Quoc V Le. Learning to rank with nonsmooth cost functions. In Advances in neural information processing systems, pages 193–200, 2007.
  • [11] Róbert Busa-Fekete, Balázs Szörényi, Paul Weng, Weiwei Cheng, and Eyke Hüllermeier. Preference-based evolutionary direct policy search. In ICRA Workshop on Autonomous Learning, 2013.
  • [12] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, 2011.
  • [13] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
  • [14] Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In Proceedings of the 22nd international conference on Machine learning, pages 137–144. ACM, 2005.
  • [15] Miroslav Dudík, Katja Hofmann, Robert E Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Conference on Learning Theory (COLT), 2015.
  • [16] Bianca Dumitrascu, Karen Feng, and Barbara Engelhardt. PG-TS: Improved Thompson sampling for logistic contextual bandits. In Advances in Neural Information Processing Systems, pages 4624–4633, 2018.
  • [17] Brochu Eric, Nando D Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances in neural information processing systems, pages 409–416, 2008.
  • [18] Johannes Fürnkranz, Eyke Hüllermeier, Weiwei Cheng, and Sang-Hyeun Park. Preference-based reinforcement learning: A formal framework and a policy iteration algorithm. Machine learning, 89(1-2):123–156, 2012.
  • [19] Subhashis Ghosal et al. Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli, 5(2):315–331, 1999.
  • [20] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, pages 861–898, 2015.
  • [21] Christian Gourieroux and Alain Monfort. Asymptotic properties of the maximum likelihood estimator in dichotomous logit models. Journal of Econometrics, 17(1):83–97, 1981.
  • [22] R Herbrich, T Graepel, and K Obermayer. Support vector learning for ordinal regression. In 1999 Ninth International Conference on Artificial Neural Networks ICANN 99.(Conf. Publ. No. 470), volume 1, pages 97–102. IET, 1999.
  • [23] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
  • [24] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • [25] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd international conference on Machine learning, pages 377–384. ACM, 2005.
  • [26] Andras Kupcsik, David Hsu, and Wee Sun Lee. Learning dynamic robot-to-human object handover from human feedback. In Robotics research, pages 161–176. Springer, 2018.
  • [27] Tie-Yan Liu et al. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval, 3(3):225–331, 2009.
  • [28] Kevin P Murphy. Machine learning: A probabilistic perspective. The MIT Press, 2012.
  • [29] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
  • [30] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2701–2710. JMLR. org, 2017.
  • [31] Filip Radlinski and Thorsten Joachims. Query chains: Learning to rank from implicit feedback. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 239–248. ACM, 2005.
  • [32] Siddartha Y Ramamohan, Arun Rajkumar, and Shivani Agarwal. Dueling bandits: Beyond condorcet winners to general tournament solutions. In Advances in Neural Information Processing Systems, pages 1253–1261, 2016.
  • [33] Karthik Raman, Thorsten Joachims, Pannaga Shivaswamy, and Tobias Schnabel. Stable coactive learning via perturbation. In International Conference on Machine Learning, pages 837–845, 2013.
  • [34] Carl Edward Rasmussen and Christopher K Williams. Gaussian processes for machine learning. The MIT Press, 2(3):4, 2006.
  • [35] Diederik M Roijers, Luisa M Zintgraf, and Ann Nowé. Interactive Thompson sampling for multi-objective multi-armed bandits. In International Conference on Algorithmic Decision Theory, pages 18–34. Springer, 2017.
  • [36] Dorsa Sadigh, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017.
  • [37] Steven L Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
  • [38] Pannaga Shivaswamy and Thorsten Joachims. Online structured prediction via coactive learning. In Proceedings of the 29th International Conference on Machine Learning, pages 59–66. Omnipress, 2012.
  • [39] Pannaga Shivaswamy and Thorsten Joachims. Coactive learning. Journal of Artificial Intelligence Research, 53:1–40, 2015.
  • [40] Yanan Sui, Vincent Zhuang, Joel Burdick, and Yisong Yue. Stagewise safe Bayesian optimization with Gaussian processes. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4781–4789. PMLR, 10–15 Jul 2018.
  • [41] Yanan Sui, Vincent Zhuang, Joel W Burdick, and Yisong Yue. Multi-dueling bandits with dependent arms. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2017.
  • [42] Yanan Sui, Masrour Zoghi, Katja Hofmann, and Yisong Yue. Advancements in dueling bandits. In IJCAI, pages 5502–5510, 2018.
  • [43] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [44] Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, pages 604–612, 2015.
  • [45] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • [46] Aaron Wilson, Alan Fern, and Prasad Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In Advances in neural information processing systems, pages 1133–1141, 2012.
  • [47] Christian Wirth. Efficient Preference-based Reinforcement Learning. PhD thesis, Technische Universität, 2017.
  • [48] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research, 18(1):4945–4990, 2017.
  • [49] Christian Wirth and Johannes Fürnkranz. EPMC: Every visit preference Monte Carlo for reinforcement learning. In Asian Conference on Machine Learning, pages 483–497, 2013.
  • [50] Christian Wirth and Johannes Fürnkranz. A policy iteration algorithm for learning from preference-based feedback. In International Symposium on Intelligent Data Analysis, pages 427–437. Springer, 2013.
  • [51] Christian Wirth, Johannes Fürnkranz, and Gerhard Neumann. Model-free preference-based reinforcement learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [52] Huasen Wu and Xin Liu. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.
  • [53] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
  • [54] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 271–278. ACM, 2007.
  • [55] Masrour Zoghi, Zohar S Karnin, Shimon Whiteson, and Maarten De Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307–315, 2015.
  • [56] Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten De Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. In International Conference on Machine Learning (ICML), 2014.

A1 Bayesian state/action credit assignment: two additional approaches

Credit assignment [48] is the problem of inferring which states or state/action pairs are responsible for observed user preferences. This paper derives theoretical guarantees for a Bayesian logistic regression credit assignment model. In this Appendix, we detail two additional credit assignment models—employing Bayesian linear regression and Gaussian process regression, respectively—for inferring a posterior over state/action utilities using trajectory preferences. (Note that these methods could similarly model utilities over states, rather than state/action pairs.)

In what follows, s~ denotes a state/action pair, with D=SA representing the number of possible values of s~. For each trajectory τ={s~1,s~2,s~3,,s~h}, we observe the user’s preference, yielding a dataset {τi|i1,,N} of N labeled trajectories. Define XN×D such that xij:=[X]ij is the number of times that trajectory i visits state/action s~j. Finally, we denote the label vector as 𝒚{0,1}N, where the ith element yi is the preference label corresponding to τi; for instance, if τ1 is preferred to τ2, then we would have y1=1 and y2=0.

As before, r¯(s~) represents the true state/action utilities, such that r¯(τ)=i=1hr¯(s~i), with r¯(τ) being trajectory τ’s total (unobserved) utility. To apply regression methods to this data using preference labels, we approximate yir¯(τi) to infer r¯(s~). In the following, we denote r^(s~) as our model of the true utilities r¯(s~). Also, define 𝒓^D as a vector in which the ith element is r^(s~i).

In the following, we discuss how to perform Bayesian inference on r¯(s~) using r¯(τi), which is approximated in practice via preferences. Note that the approximation yir¯(τi) performs well empirically, though future work could perhaps apply Bayesian methods such as those in [14] to infer trajectories’ total utilities from the preference labels.

A1.1 Bayesian linear regression

One can estimate state/action utilities from preferences via linear regression: 𝒚=X𝒓^+𝜺, where 𝜺 is a vector of residuals and the other quantities are defined above. Bayesian linear regression infers a distribution over likely values of 𝒓^. We define conjugate Gaussian prior and likelihood distributions over the state/action utilities and preference labels, respectively, to obtain a Gaussian posterior distribution over 𝒓^. The prior, likelihood, and posterior take the following form, where ΛD×D and σ are prior parameters, Λ is positive definite, and we set Λ=λ𝕀 for some λ>0:

Prior: 𝒓^𝒩(0,Λ-1),Λ=λ𝕀;Likelihood:
p(𝐲|X,𝒓^;σ2)=1(2πσ2)N2exp(-12σ2||𝐲-X𝒓^||2)
Posterior: 𝒓^|X,𝒚,σ2,Λ𝒩(𝝁,Σ), where: (12)
𝝁=(XTX+σ2λ𝕀)-1XT𝒚 (13)
Σ=σ2(XTX+σ2λ𝕀)-1 (14)

A1.2 Gaussian process regression

Gaussian process regression [34] extends Bayesian linear regression credit assignment to larger state and action spaces by generalizing across nearby states and actions. We will model the state/action utilities r^(s~) as a Gaussian process [34] over s~, and then use the observed preferences to perform Gaussian process inference on 𝒓^.

We model 𝒓^ as a Gaussian process: 𝒓^𝒢𝒫(𝝁𝒓,Kr), where 𝝁𝒓 is the prior mean vector and Kr is the prior covariance matrix, such that [Kr]ij gives the prior covariance between r^(s~i) and r^(s~j). We model r¯(τi), the total utility for trajectory τi, as a sum over the latent state/action utilities:

r¯(τi)=j=1Dxijr^(s~j)+εi, (15)

with i.i.d. residuals εi𝒩(0,σε2). Because any linear combination of jointly Gaussian variables is Gaussian, r¯(τi) is a Gaussian process over the xij’s . Let 𝑹N be the vector with ith element equal to Ri=r¯(τi). Calculating the relevant expectations and covariances (see A1.3), one can show that 𝒓^ and 𝑹 are jointly normally distributed as follows:

[𝒓^𝑹]𝒩([𝝁𝒓X𝝁𝒓],[KrKrXTXKrTXKrXT+σε2𝕀]). (16)

The standard approach for obtaining a conditional distribution from a joint Gaussian distribution [34] yields 𝒓^|𝑹𝒩(𝝁,Σ), where:

𝝁=𝝁𝒓+KrXT[XKrXT+σε2𝕀]-1(𝑹-X𝝁𝒓) (17)
Σ=Kr-KrXT[XKrXT+σε2𝕀]-1XKrT. (18)

In practice, 𝑹 is substituted with 𝒚 to perform the inference.

A1.3 Using Gaussian processes to infer state/action utilities from preferences

This section derives the posterior inference equations (17) and (18), used in Gaussian process credit assignment. In this derivation, we act as though we observe the trajectories’ total utilities 𝑹, while remembering that in practice, 𝑹 is approximated via the user’s preferences. Recall that 𝒓^D has ith element r^(s~i), which models the utility of state/action i, while 𝑹N has ith element Ri=r¯(τi). Let 𝒙𝒊 be the transpose of the ith row of X.

Our goal is to infer the values of 𝒓^ given observations 𝑹 of the trajectories’ total utilities. This can be accomplished via the following four steps:

  1. 1.

    Model the state/action utilities r^(s~) as a Gaussian process over s~.

  2. 2.

    Model the trajectory utilities 𝑹 as a Gaussian process that can be defined as a sum of the state/action utilities r^(s~).

  3. 3.

    Using the two Gaussian processes defined in 1) and 2), obtain the covariance matrix between the values of {r^(s~)|s~1,,D} and {Ri|i1,,N}.

  4. 4.

    Write the joint Gaussian distribution between the values of {r^(s~)|s~1,,D} and {Ri|i1,,N}, and obtain the posterior distribution of 𝒓^ at all state/actions given 𝑹.

The four subsequent subsections correspond to these four steps, respectively.

A1.3.1 The state/action utility Gaussian process

We model the state/action utilities as a Gaussian process over s~, with mean 𝔼[r^(s~)]=μr(s~) and covariance kernel Cov(r^(s~),r^(s~))=kr(s~,s~), for all state/action pairs s~,s~. Thus,

r^(s~)𝒢𝒫(μr(s~),kr(s~,s~)). (19)

Define 𝝁𝒓D such that the ith element is [𝝁𝒓]i=μr(s~i), the prior mean for the utility of state/action s~i. Let KrD×D be the covariance matrix over state/action utilities, such that [Kr]ij=kr(s~i,s~j). Therefore, 𝒓^, the vector for which [𝒓^]i=r^(s~i), is also a Gaussian process:

𝒓^𝒢𝒫(𝝁𝒓,Kr). (20)

A1.3.2 The trajectory utility Gaussian process

By assumption, the trajectory utilities 𝑹N are sums of the latent state/action utilities via the following relationship between 𝑹 and 𝒓^:

R(𝒙𝒊):=Ri=j=1Dxijr^(s~j)+εi, (21)

where εi are i.i.d. Gaussian noise variables with distribution 𝒩(0,σε2).

Note that R(𝒙i) is a Gaussian process over 𝒙iD because {r^(s~),s~} are jointly normally distributed by definition of a Gaussian process, and any linear combination of jointly Gaussian variables has a univariate normal distribution.

Next, we calculate the expectation and covariance of 𝑹 over the observations. The expectation of its ith element Ri=R(𝒙i) can be expressed as follows:

𝔼[Ri]=𝔼[j=1Dxijr^(s~j)+εi]=j=1Dxij𝔼[r^(s~j)]=j=1Dxijμr(s~j). (22)

The expectation over 𝑹(X) can thus be written as:

𝔼[𝑹(X)]=X𝝁𝒓. (23)

Next, we model the covariance matrix of 𝑹(X). The ijth element of this matrix is the covariance of R(𝒙i) and R(𝒙j):

Cov(R(𝒙i),R(𝒙j)) =𝔼[R(𝒙i)R(𝒙j)]-𝔼[R(𝒙i)]𝔼[R(𝒙j)] (24)
=𝔼[R(𝒙i)R(𝒙j)]-(k=1Dxikμr(s~k))(m=1Dxjmμr(s~m)) (25)
=𝔼[(k=1Dxikr^(s~k)+εi)(m=1Dxjmr^(s~m)+εj)] (26)
-(k=1Dxikμr(s~k))(m=1Dxjmμr(s~m)) (27)
=k=1Dm=1Dxikxjm𝔼[r^(s~k)r^(s~m)]+𝔼[εiεj]-k=1Dm=1Dxikxjmμr(s~k)μr(s~m) (28)
=k=1Dm=1Dxikxjm[Cov(r^(s~k),r^(s~m))+μr(s~k)μr(s~m)] (29)
-k=1Dm=1Dxikxjmμr(s~k)μr(s~m)+σε2𝕀[i=j] (30)
=k=1Dm=1DxikxjmCov(r^(s~k),r^(s~m))+σε2𝕀[i=j] (31)
=k=1Dm=1Dxikxjmkr(s~k,s~m)+σε2𝕀[i=j]=𝒙iTKr𝒙j+σε2𝕀[i=j]. (32)

We can then write the covariance matrix of 𝑹 as KR(X), where:

[KR(X)]ij=Cov(R(𝒙i),R(𝒙j))=𝒙iTKr𝒙j+σε2𝕀[i=j]. (33)

From here, it can be seen that KR(X)=XKrXT+σε2𝕀:

XKrXT =[𝒙1T𝒙2T𝒙NT]Kr[𝒙1𝒙2𝒙N]=[𝒙1T𝒙2T𝒙NT][Kr𝒙1Kr𝒙2Kr𝒙N] (34)
=[𝒙1TKr𝒙1𝒙1TKr𝒙N𝒙NTKr𝒙1𝒙NTKr𝒙N]=KR(X)-σε2𝕀. (35)

A1.3.3 Covariance between state/action and trajectory utilities

In this subsection, we consider the covariance between 𝒓^ and 𝑹, denoted Kr^,R:

[Kr^,R]ij=Cov([𝒓^]i,[𝑹]j)=Cov(r^(s~i),R(𝒙j)). (36)

This covariance matrix can be expressed in terms of X,Kr, and 𝝁𝒓:

[Kr^,R]ij =Cov(r^(s~i),R(𝒙j))=Cov(r^(s~i),k=1Dxjkr^(s~k)+εj) (37)
=𝔼[r^(s~i)k=1Dxjkr^(s~k)+εjr^(s~i)]-𝔼[r^(s~i)]𝔼[k=1Dxjkr^(s~k)+εj] (38)
=k=1Dxjk𝔼[r^(s~i)r^(s~k)]-[μr(s~i)][𝒙jT𝝁𝒓] (39)
=k=1Dxjk{Cov(r^(s~i),r^(s~k))+𝔼[r^(s~i)]𝔼[r^(s~k)]}-μr(s~i)𝒙jT𝝁𝒓 (40)
=k=1Dxjk[kr(s~i,s~k)+μr(s~i)μr(s~k)]-μr(s~i)𝒙jT𝝁𝒓 (41)
=k=1Dxjkkr(s~i,s~k)+μr(s~i)𝒙jT𝝁𝒓-μr(s~i)𝒙jT𝝁𝒓=k=1Dxjkkr(s~i,s~k)=𝒙jT[Kr]i,:, (42)

where [Kr]i,: is the column vector obtained by transposing the ith row of Kr. From here, one can see that Kr^,R=KrXT:

KrXT =[[Kr]1,:T2,:TD,:T]*[𝒙1𝒙2𝒙N]=Kr^,R. (43)

A1.3.4 Posterior inference over state/action utilities

The results of the previous three subsections can be combined to obtain the following joint probability density between 𝒓^ and 𝑹:

[𝒓^𝑹]𝒩([𝝁𝒓X𝝁𝒓],[KrKrXTXKrTXKrXT+σε2𝕀]). (44)

This relationship expresses all components of the joint Gaussian density in terms of X,Kr, and 𝝁𝒓, or in other words, in terms of the state/action visit counts in the observed trajectories (captured by X) and the Gaussian process prior on 𝒓^.

Using the standard approach for obtaining a conditional distribution from a jointly Gaussian distribution (see Appendix A.2 in [34]), we arrive at:

𝒓^|𝑹𝒩(𝝁,Σ), where: (45)
𝝁 =𝝁𝒓+KrXT[XKrXT+σε2𝕀]-1(𝑹-X𝝁𝒓) (46)
Σ =Kr-KrXT[XKrXT+σε2𝕀]-1XKrT. (47)

Thus, we have expressed the conditional posterior density of 𝒓^ in terms of X,Kr, 𝝁𝒓, and the observations 𝑹𝒚.

A2 Proofs

This section proves the theoretical results stated in Section 5 of the paper.

We begin by restating a result from Gourieroux and Monfort [21] establishing conditions in which logistic regression is asymptotically consistent, after defining two necessary conditions.

Definition 1 (Derivative of sigmoid).

f:, where f=e-x(1+e-x)2. Note that f(x)=f(-x). This is the derivative of the sigmoid function, σ(x)=11+e-x.

Definition 2.

Let 𝐫¯RD be the vector of true state/action utilities (assumed to exist) and 𝐫¯RD-1 be its transformation via (4). Define 𝐫~kRD-1 as the state/action rewards sampled from the Bayesian logistic regression model posterior in episode k, 𝐫^kRD-1 as the model’s maximum a posteriori (MAP) estimate at episode k, and 𝐫^ML,kRD-1 as its maximum likelihood estimate at k. Lastly, 𝐫~kRD, 𝐫^kRD, and 𝐫^ML,kRD are their respective equivalents given by (5).

Condition 1.

m0< such that |zij|m0 for all i{1,,N},j{1,,D-1}.

Condition 1 is always satisfied because zij=𝒙iT𝒗j by (4), where 𝒗j is a unit vector. Additionally, 𝒙i=𝒙1i-𝒙i2 is difference of two vectors that both count how many times each state/action pair is visited in an episode, and thus both have only positive elements that sum to the episode horizon, h. So, |zij|||𝒙i||2||𝒗j||2=||𝒙i||2||𝒙1i-𝒙i2||1=||𝒙1i||+||𝒙i2||1=2h, where the inequalities are respectively the Cauchy-Schwarz inequality, ||𝒙||p<||𝒙||q for p>q>0, and the triangle inequality. So, Condition 1 holds for m0=2h.

Condition 2.

Let λ1(k) and λD-1(k) be the largest and smallest eigenvalues, respectively, of i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT. Then, m1< such that λ1(k)λD-1(k)<m1, for all k.

Intuitively, Condition 2 requires that an arbitrarily-high fraction of observations cannot lie in a proper subspace of their possible span. This condition is necessary, as otherwise data outside this subspace would become increasingly ignored as more data points are obtained. Condition 2 states that the inverse Hessian of the negative log reward posterior—evaluated at the true rewards 𝒓¯—has an upper-bounded ratio between its largest and smallest eigenvalues. This can be explicitly enforced by bounding this maximum-to-minimum eigenvalue ratio for the matrix i=1kf(𝒛iT𝒓^k)𝒛i𝒛iT, and only updating the reward posterior when the eigenvalue ratio is below-threshold. As shown in Lemma 1 below, satisfying this restriction bounds the eigenvalue ratio for i=1kf(𝒛iT𝒓¯)𝒛i𝒛iT, as desired.

Condition 2 may also be guaranteed in certain situations, e.g., if randomness in the dynamics is sufficient to ensure a full-row-rank observation matrix Z. In this case, the eigenvalue ratio will be bounded regardless of the specific policies executed. If properties of the environment are known to guarantee Condition 2, then it need not be explicitly enforced. Meanwhile, weakening the requirements of Condition 2—for instance, by strengthening the Bayesian logistic regression convergence analysis or considering other credit assignment models—would be an interesting avenue for future work.

Lemma 1 (Enforcing Condition 2).

The eigenvalue ratio λ𝑚𝑎𝑥(i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT)λ𝑚𝑖𝑛(i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT) has a constant upper bound that holds for all k if and only if the eigenvalue ratio λ𝑚𝑎𝑥(i=1kf(𝐳iT𝐫^k)𝐳i𝐳iT)λ𝑚𝑖𝑛(i=1kf(𝐳iT𝐫^k)𝐳i𝐳iT) has a constant upper bound that holds for all k. Therefore, one can ensure that the former condition holds by enforcing the latter.

Proof.

We apply the result in Lemma 1.1 below. To verify that the conditions for this result hold, we must show that both f(𝒛iT𝒓¯) and f(𝒛iT𝒓^k) have upper and lower bounds for all i,k, where the lower bound exceeds zero. The upper bound exists because f(x)(0,1]x.

It remains to show that both f(𝒛iT𝒓¯) and f(𝒛iT𝒓^k) are lower-bounded above zero. Because f(x) monotonically decreases as |x| increases, this is true as long as |𝒛iT𝒓¯| and |𝒛iT𝒓^k| are upper-bounded. In the former case, |𝒛iT𝒓¯|<||𝒛i||2||𝒓¯||2. The quantity ||𝒛i||2 is upper-bounded by Condition 1. The rewards 𝒓¯ produce the same policy when scaled by any positive quantity, and so their magnitude can be viewed as fixed. The same logic holds in the latter case.

Lemma 1.1 (Bounded eigenvalue ratios).

Let Ak,BkRn×n be two matrices of the form Ak=i=1kαi𝐯i𝐯iT, Bk=i=1kβi𝐯i𝐯iT, where αi[α𝑚𝑖𝑛,α𝑚𝑎𝑥],βi[β𝑚𝑖𝑛,β𝑚𝑎𝑥], α𝑚𝑖𝑛>0,β𝑚𝑖𝑛>0, and 𝐯iRn for i{1,,k}. Let λ𝑚𝑎𝑥(M) and λ𝑚𝑖𝑛(M) respectively be the largest eigenvalues of a matrix M. Then, λ𝑚𝑎𝑥(Ak)λ𝑚𝑖𝑛(Ak) is upper-bounded for all T if and only if λ𝑚𝑎𝑥(Bk)λ𝑚𝑖𝑛(Bk) is upper-bounded for all T.

Proof.

Without loss of generality, we assume that λmax(Ak)λmin(Ak)<m1< for some m1 and for all k, and will show that λmax(Bk)λmin(Bk) has an upper bound for all k. Note that a𝒖𝒖T0 (i.e., is positive semidefinite) for any a>0 and vector 𝒖, and that sums of positive semidefinite matrices are also positive semidefinite. In addition, we will use the following facts about arbitrary positive semidefinite matrices A,B0 (which can be shown from the Courant-Fischer-Weyl min-max principle):

λmax(A) λmax(A+B) (48)
λmin(A) λmin(A)+λmin(B)λmin(A+B) (49)

The desired result is an outcome of the following four relations:

λmax(Ak) =λmax(i=1kαi𝒗i𝒗iT)=λmax(i=1k(αi-αmin)𝒗i𝒗iT+i=1kαmin𝒗i𝒗iT) (50)
λmax(i=1kαmin𝒗i𝒗iT)=αminλmax(i=1k𝒗i𝒗iT), by (48) (51)
λmin(Ak) =λmin(i=1kαi𝒗i𝒗iT)λmin(i=1kαi𝒗i𝒗iT)+λmin(i=1k(αmax-αi)𝒗i𝒗iT) (52)
λmin(i=1kαmax𝒗i𝒗iT)=αmaxλmin(i=1k𝒗i𝒗iT), by (49) (53)
λmax(Bk) =λmax(i=1kβi𝒗i𝒗iT)λmax(i=1kβi𝒗i𝒗iT+i=1k(βmax-βi)𝒗i𝒗iT), by (48) (54)
=λmax(i=1kβmax𝒗i𝒗iT)=βmaxλmax(i=1k𝒗i𝒗iT) (55)
λmin(Bk) =λmin(i=1kβi𝒗i𝒗iT)=λmin(i=1k(βi-βmin)𝒗i𝒗iT+i=1kβmin𝒗i𝒗iT) (56)
λmin(i=1kβmin𝒗i𝒗iT)=βminλmin(i=1k𝒗i𝒗iT), by (49) (57)

Now, we can upper bound the eigenvalue ratio for Bk:

λmax(Bk)λmin(Bk) βmaxλmax(i=1k𝒗i𝒗iT)βminλmin(i=1k𝒗i𝒗iT), by (55) and (57) (58)
βmaxβminλmax(Ak)αmin[λmin(Ak)αmax]-1, by (51) and (53) (59)
=βmaxαmaxβminαminλmax(Ak)λmin(Ak)βmaxαmaxβminαmin*m1. (60)

We will apply the following result from Gourieroux and Monfort [21] concerning asymptotic consistency of the maximum likelihood estimator:

Proposition 1 (Asymptotic consistency of logistic regression [21]).

If Conditions 1 and 2 are satisfied, then the maximum likelihood estimator 𝐫^ML,k of 𝐫¯ exists almost surely as k, and 𝐫^ML,k converges almost surely to the true value 𝐫¯ if and only if 𝑙𝑖𝑚kλD-1(k)=.

Proof.

This result is a restatement of Proposition 3 in Gourieroux and Monfort [21]. ∎

Remark: The proof of Proposition 1 in [21] can be adapted such that the same result holds when the maximum likelihood estimator 𝒓^ML,k is replaced with the MAP estimator 𝒓^k; thus, it applies to our setting. We show that Proposition 1’s final condition for convergence of the MAP estimator holds:

Lemma 2.

Under known transition dynamics, all eigenvalues of the matrix i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT approach infinity as k.

Proof.

Let αi=f(𝒛iT𝒓¯) and α^i(k)=f(yi𝒛iT𝒓^k). The values αi are both upper-bounded and non-decaying: f(x)(0,1) for all x, and 𝒛iT𝒓¯ is bounded in magnitude since we assume the true rewards 𝒓¯ are bounded. We can write:

i=1Nf(𝒛iT𝒓¯)𝒛i𝒛iT=i=1Nαi𝒛i𝒛iT=[α1𝒛1α2𝒛2αN𝒛N][α1𝒛1Tα2𝒛2TαN𝒛NT]. (61)

Define MN×(D-1), M1m×(D-1), and M2(N-m)×(D-1) as:

M=[α1𝒛1Tα2𝒛2TαN𝒛NT],M1=[α1𝒛1Tα2𝒛2Tαm𝒛mT],M2=[αm+1𝒛m+1TαN𝒛NT]. (62)

Then,

M=[M1M2], and MTM=i=1Nαi𝒛i𝒛iT=M1TM1+M2TM2. (63)

Also, define M^N×(D-1), M^1m×(D-1), and M^2(N-m)×(D-1) analogously, but replacing αi with α^i(N). Note that MTM and M^TM^ have the same rank, since M and M^ have the same row-rank; similarly MTMi and M^iTM^i have the same rank for i{1,2}.

Assume that there exists m such that M^2TM^2 has rank less than (D-1); let r1 be the number of zero eigenvalues of M^2TM^2. We will show that as N increases, the probability of drawing a sample 𝒛i that is linearly independent of the rows already in M^2 (that is, a sample that increases the rank of M2TM2) does not decay to zero. This means that there will be infinitely-many non-overlapping sets of indices {j,,k} such that i=jkαi𝒛i𝒛iT has rank (D-1). From here, we can show that (D-1) of the eigenvalues of MTM approach infinity (recall that the αi values are lower-bounded).

Under the Laplace approximation, the posterior covariance after N episodes takes the form ΣN=(i=1Nf(yi𝒛iT𝒓^N)𝒛i𝒛iT+V0-1)-1=(M^TM^+V0-1)-1. Under our assumption that M^2TM^2 has r1 eigenvalues equal to zero, the row-rank of M^2 is (D-1-r).

We show that ΣN has r eigenvalues that are bounded away from zero. We write the eigenvalues of any square matrix M0n×n as λ1(M0)λ2(M0)λn(M0). Then, we apply the following fact (which can be shown from the Courant-Fischer-Weyl min-max principle): for positive semidefinite matrices A,Bn×n,

λn(A)+λk(B)λk(A+B)λ1(A)+λk(B). (64)

Clearly, i=1Nα^i(N)𝒛i𝒛iT and V0-1 are both positive semidefinite (V0 is positive definite by its definition), and so:

0λk(i=1Nα^i(N)𝒛i𝒛iT+V0-1)λ1(V0-1)+λk(i=1Nα^i(N)𝒛i𝒛iT), (65)

where the first inequality holds because the sum of two positive semidefinite matrices is also positive semidefinite, and the second inequality is an application of the inequality in (64). The first term in (65) is a constant, while the second is zero for k>D-1-r. Thus, ΣN-1=i=1Nα^i(N)𝒛i𝒛iT+V0-1 has r eigenvalues that are upper-bounded, where the upper bound depends only on V0. Therefore, Σ has r eigenvalues that are lower-bounded by [λ1(V0-1)]-1=[1λD-1(V0)]-1=λD-1(V0)>0; the other D-1-r eigenvalues of Σ all decay to zero.

To complete the proof, we argue that the following statements hold; the first two of these statements are proven in the two subsequent sub-lemmas; the third statement completes the proof.

  1. 1.

    Note that MTM and M^TM^ have the same rank, due to M and M^ having the same row-rank. Assume that for some m, M^2TM^2 has r1 zero eigenvalues. The reward vector 𝒓~ sampled from the logistic regression posterior can be expressed in terms of the eigenbasis of M^2TM^2, {𝝂i|i=1,,D-1}: 𝒓~=i=1D-1βi𝝂i for some coefficients βi. Then, as N, for j such that 𝒗j is in the null-space of M^2TM^2, the probability of sampling 𝒓~N such that |βj|i=1D-1|βi|b for any b(0,1) does not decay to zero.

  2. 2.

    There exists b(0,1) such that if βji=1D-1βib, then with probability above zero, the policy will sample a trajectory such that the next 𝒛i sample has a nonzero component in 𝝂j.

  3. 3.

    If a trajectory as described in 2) is sampled, a zero-eigenvalue of M2TM2 increases by a lower-bounded amount. This comes from the fact that there are only finitely-many possible 𝒛i vectors, so if 𝒛i has nonzero inner product with an eigenvector, this inner product cannot be arbitrarily close to zero. This contradicts that Rank(M2TM2) remains below D-1 as N.

Lemma 2.1 (Proof of Statement 1 in Lemma 2).
Proof.

In episode N, the sampled reward vector 𝒓~N is drawn from the logistic regression posterior, 𝒩(𝒓^,ΣN), with covariance matrix ΣN=(i=1Nα^i(N)𝒛i𝒛iT+V0-1)-1. Consider a sample 𝒓~N with a multivariate normal distribution in D-1:

𝒩(𝒓~N|𝒓^N,ΣN)exp(-12(𝒓~N-𝒓^N)TΣN-1(𝒓~N-𝒓^N)). (66)

The intuition for this proof is as follows. There is a decreasing probability of sampling a vector 𝒓~N such that (𝒓~N-𝒓^N) has components in eigenvectors of ΣN whose eigenvalues decay toward zero. Rather, the probability density concentrates toward the span of eigenvectors of ΣN whose eigenvalues are bounded away from zero.

The probability density of 𝒓~N can be viewed as a function of (𝒓~N-𝒓^N)TΣN-1(𝒓~N-𝒓^N). This fact can be used to define Rα,ND-1:

Rα,N={𝒓~D-1|(𝒓~-𝒓^N)TΣ-1(𝒓~-𝒓^N)α}. (67)

For ε(0,1), define α0(ε) as the maximum possible value of α such that P(𝒓~Rα)1-ε:

α0(ε)=maxs.t. P(𝒓~Rα)1-εα. (68)

Thus, Rα0(ε) contains at least 1-ε of the probability density of 𝒓~N.

Recall that ΣN-1 has r eigenvalues that are upper-bounded, while the other D-1-r eigenvalues of ΣN-1 all approach infinity as N. Let (λi,𝝁i) represent the eigenvalues and eigenvectors of ΣN-1, respectively. Then, 𝒓~N-𝒓^N can be expressed in terms of the eigenbasis: 𝒓~N-𝒓^N=i=1D-1ηi𝝁i for some coefficients ηi, and:

(𝒓~N-𝒓^N)TΣ-1(𝒓~N-𝒓^N)=i=1D-1ηi2λi. (69)

Then, Rα0(ε) can be written as,

Rα0(ε)={𝒓~NL-1|i=1D-1ηi2λiα0(ε)}. (70)

We can divide 𝒓~N-𝒓^N for 𝒓~NRα0(ε) into two terms:

𝒓~N-𝒓^N=i=1D-1-rηi𝝁i+i=D-rD-1ηi𝝁i, (71)

where λi for iD-1-r. The first term contains eigenvectors 𝝁i corresponding to eigenvalues of ΣN-1 that grow to infinity, while the second term contains eigenvectors corresponding to upper-bounded eigenvalues of ΣN-1. For 𝒓~N belonging to Rα0(ε), as λi, the ηi coefficients in the first term will decay; otherwise, the sum i=1D-1ηi2λi grows unboundedly, while i=1D-1ηi2λi remains comparatively-smaller for increasingly-many vectors that place weight only on 𝝁i for i>D-1-r. In other words, as N, the probability density of 𝒓~N-𝒓^N increasingly concentrates toward vectors that can be expressed as linear combinations of eigenvectors of ΣN-1 corresponding to upper-bounded λi. Thus, as long as λj,j>D-1-r, are upper-bounded, the probability of sampling 𝒓~N-𝒓^N such that ηji=1D-1ηib for any j>D-1-r and any b(0,1) does not decay to zero.

Next, we argue that the eigenvectors of ΣN-1 approach a set of vectors spanning the null space of M^2TM^2 as N. For jD-1-r, (λj,𝝁j) is an eigenpair of ΣN-1 for which λj as N. The eigenpair relationship gives:

(i=1Nα^i(N)𝒛i𝒛iT+V0-1)𝝁j=λj𝝁j. (72)

Dividing by λj,

(1λji=1Nα^i(N)𝒛i𝒛iT+1λjV0-1)𝝁j=𝝁j. (73)

The term 1λjV0-1 becomes insignificant as λj, since V0 is constant, while scaling the first term by 1λj does not affect its eigenbasis. Thus, for jD-1-r, 𝝁j approaches an eigenvector of i=1Nα^i(N)𝒛i𝒛iT that has an unbounded eigenvector. Meanwhile, the span of eigenvectors 𝝁j for j>D-1-r approaches the span of eigenvectors of i=1Nα^i(N)𝒛i𝒛iT with upper-bounded eigenvalues. Equivalently, the span of eigenvectors 𝝁j for j>D-1-r approaches the null space of M^2TM^2=i=m+1Nα^i(N)𝒛i𝒛iT.

Recall from Statement 1 that the eigenvectors of M^2TM^2 are {𝝂i|i=1,,D-1}. The r smallest of these eigenvectors correspond to Null(M^2TM^2); so, for i>D-1-r, λi are upper-bounded and Span(𝝁i,i>D-r-1) approaches Span(𝝂i,i>D-r-1). The probability of sampling 𝒓~N-𝒓^N=i=1D-1βi𝝂i—with βi>ε for i<D-r-1 and fixed ε>0—decays as N. Thus 𝒓~N-𝒓^N increasingly coincides with Null(M^2TM^2). Meanwhile, for j such that 𝒗jNull(M^2TM^2), the probability of sampling 𝒓~N-𝒓^N such that |βj|i=1D-1|βi|b for any b(0,1) does not decay to zero.

We have shown that 𝒓~N-𝒓^N increasingly coincides with Null(M^2TM^2). To show that DPS  is guaranteed to draw samples 𝒓~N that coincide with this null space, we must consider the effect of 𝒓^N. It suffices to show that its magnitude does not increase enormously with N, since the probability of sampling 𝒓~N-𝒓^NNull(M^2TM^2) with an arbitrarily-large magnitude does not decay to zero. Note that the magnitude of 𝒓^N can be considered upper-bounded, since a reward vector can always be scaled arbitrarily without affecting the resultant policy learned by value iteration. ∎

Lemma 2.2 (Proof of Statement 2 in Lemma 2).
Proof.

As in Statement 1, express the reward vector 𝒓~N sampled from the logistic regression posterior in terms of the eigenbasis of M^2TM^2, {𝝂i|i=1,,D-1}: 𝒓~N=i=1D-1βi𝝂i for some coefficients βi. Then, as N, consider 𝒗j in the null-space of M^2TM^2. Assume that there exists b(0,1) such that |βj|i=1D-1|βi|b. We show that for b high enough, with probability above zero, the policy will sample a trajectory such that the next sampled observation 𝒘i has a nonzero component in 𝝂j.

First, note that if 𝒛i=k=1D-1γk𝝂k, then 𝝂jT𝒛i=k=1D-1γk𝝂jT𝝂k=k=1D-1γkδjk=γj. Thus, because the eigenvectors {𝝂i} form an orthonormal basis, for 𝒘i to have a nonzero component γj in 𝝂j is equivalent to 𝒛iT𝝂j0.

By assumption—that is, our ability to choose b arbitrarily close to 1—we can assume that 𝒓~N is arbitrarily aligned with 𝝂j: for any ε0>0, we can assume that ||α𝒓~N-𝝂j||2<ε0. Define 𝒛:=-α𝒓~N+𝝂j. Then, ||𝒛||2<ε0 and 𝝂j=α𝒓~N+𝒛.

We aim to show that 𝒛iT𝝂j0. We can write, 𝒛iT𝝂j=𝒛iT(α𝒓~N+𝒛)=α𝒛iT𝒓~N+𝒛iT𝒛. The second term has upper-bounded magnitude:

|𝒛iT𝒛| ||𝒛i||2||𝒛||2, by the Cauchy-Schwarz inequality (74)
||𝒛i||2ε0=ε0𝒛iT𝒛i=ε0𝒙iT𝒙i, because Equation (4) preserves inner products (75)
=ε0||𝒙i||2ε0||𝒙i||1, since ||𝒚||2||𝒚||1 holds for arbitrary vector 𝒚n (76)
2ε0h, where h is the episode horizon, (77)

where the final inequality holds because 𝒙i=𝒙i1-𝒙i2, and the vectors 𝒙i1 and 𝒙i2 contain positive integer elements that sum to h.

Set ε>0. Since |𝒛iT𝒛|2ε0h, we can set ε0αε2h to ensure that if ||𝒛||2=||α𝒓~N-𝝂j||2<ε0, then |𝒛iT𝒛|<αε. To show that 𝒛iT𝝂j=α𝒛iT𝒓~N+𝒛iT𝒛0, it thus suffices to show that |𝒛iT𝒓~N|>ε with nonzero probability, for some ε>0 that we specify (note that selecting the value of ε determines the possible values of b).

Observe that |𝒛iT𝒓~N|=|𝒙iT𝒓~N|=|(𝒙i1-𝒙i2)T𝒓~N|, again because the linear transformation 𝒛i=VT𝒙i (4) preserves inner products. This is the difference in the total rewards of the two trajectories. Recall that the probability of at least a certain component of 𝒓~N belonging to Null(M^2TM^2) is non-decaying. For 𝒓~NNull(M^2TM^2), the total reward is zero for any 𝒛i in the row-space of M^2. Thus, α𝒓~N=𝝂j maximally encourages the induced policy to explore necessary parts of the state/action space that lead to increasing the row-rank of M^2.

We have ||α𝒓~N-𝝂j||2<ε0. The optimal policy is determined by comparing value functions of competing policies; value functions are continuous in rewards, so ε0 can be set small enough that this necessary exploration occurs with positive probability. ∎

Lemma 3 (Convergence of dynamics model).

Given Lemma 2, DPS’s dynamics model converges to the true dynamics, and all eigenvalues of i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT approach infinity as k.

Proof.

DPS  updates and samples from the dynamics model via the same procedure as the posterior sampling RL algorithm of Osband et al. [29], which conducts posterior sampling RL with numerical rewards; the difference between the two algorithms lies in their handling of rewards, but not of dynamics. In [29], the theoretical result uses the fact that as a particular state/action pair is observed infinitely-often, samples of the dynamics parameters p(s|s,a) concentrate to their true values. This fact comes from Lemma 17 in [24], which proves that the deviation between the empirical mean and true value of p(s|s,a) decays as the number of visits to state/action pair (s,a) increases. This lemma is proven via the following concentration inequality:

{||𝒑^()-𝒑()||1ε}(2S-2)exp(-nε22), (78)

where S is the size of the state space, n is the number of times that (s,a) has been visited, 𝒑(s,a) is the true vector of transition probabilities [p(s1|s,a),p(s2|s,a),,p(sS|s,a)]T, and 𝒑^(s,a) is the empirical mean of the observations of 𝒑(s,a).

We apply the following two results to prove Lemma 3:

  1. 1.

    For any state/action pair that is sampled infinitely-often, the posterior over that state/action’s transition probabilities (i.e., the distribution over possible next states) converges to its true values. Thus, if every state/action pair is sampled infinitely-often, then the dynamics model as a whole converges to the true dynamics.

  2. 2.

    Assuming a given dynamics model, the utility model is such that all eigenvalues of MTM approach infinity as infinitely-many samples are collected, with M defined in (62)-(63); Lemma 2 demonstrates that Bayesian logistic regression satisfies this requirement.

The proof of Lemma 3 consists of proving the two statements below:

  1. 1.

    If the sampled dynamics 𝒑~(s,a) are sufficiently-close to the true dynamics (in l1-norm ||𝒑~(s,a)-𝒑(s,a)||1) with high probability and M2TM2 is not full-rank, then there exists c(0,1) such that P(Sample 𝒛iRange(M2TM2))c>0.

  2. 2.

    Every state/action pair will be sampled infinitely-often.

Proof of Statement A:

Let 𝒗Null(M2TM2). Treating sampled dynamics model 𝒑~ as if it were the true dynamics, Lemma 2 establishes that P(Sample 𝒛i s.t. 𝒛iT𝒗0|𝒑~)f(𝒑~)>0, where the lower bound f(𝒑~) depends on the sampled dynamics 𝒑~, and the probability P() is with respect to sampling the reward model (which determines the policy to execute) and the dynamics transitions during the policy roll-out.

We argue that there exists such a lower bound f that is continuous in the dynamics 𝒑~. To do so, we define g(𝒑~):=P(Sample 𝒛i s.t. 𝒛iT𝒗0|𝒑~), and show that g(𝒑~) has at most finitely-many discontinuities. This is sufficient to show that there must exist a continuous function f such that for any dynamics 𝒑~, g(𝒑~)f(𝒑~)>0.

The rewards are sampled independently of the dynamics, and the probability of sampling rewards that are aligned with 𝒗 to at least a specified degree does not decay: P(|(𝒓~)T||𝒓~||2𝒗||𝒗||2|>b)0, for any b(0,1). Statement A considers only a specific iteration of the algorithm, rather than a sequence of episodes, and so the reward distribution can be treated as fixed.

Given dynamics 𝒑~, the probability of sampling any trajectory τ={s1,a1,s2,a2,,sh+1} under policy π is continuous in 𝒑~:

P(τ)=p~0(s1)i=1hπ(ai|si)p~(si+1|si,ai). (79)

Letting L be the set of trajectories such that 𝒛iT𝒗0, the probability of sampling a trajectory τL is:

P(τL)=τLP(τ), (80)

where the probability P(τ) is given by (79). The function g(𝒑~)=P(τL|𝒑~) is therefore continuous in the parameters of 𝒑~. The policy π, meanwhile, depends upon the dynamics 𝒑~ and sampled rewards 𝒓~, and is selected via value iteration to optimize the expected total reward:

π*=argmax𝜋Vπ,1M~(s)=argmax𝜋𝔼M,π[j=1hr~(sj,aj)|s1=s], (81)

where M~ is the sampled MDP (i.e., consisting of 𝒑~ and 𝒓~).

Value iteration selects the deterministic policy that maximizes the expected total trajectory reward in (81). Given finite state and action spaces and a finite time horizon, there are finitely-many possible deterministic policies. If for given dynamics 𝒑~, only a single policy π(𝒑~) maximizes the value function, then under a sufficiently-small change in the dynamics, that same policy π(𝒑~) still yields the optimal value function. Under a fixed policy, the probability of generating any trajectory is continuous in the dynamics, and so g is continuous at 𝒑~.

Thus, the function g(𝒑~) can only be discontinuous at points in the dynamics space at which multiple policies maximize (81). We argue that g(𝒑~) has finitely-many such discontinuities. If this were not the case, there would exist a region of the dynamics space in which the policy maximizing (81) changes infinitely-many times. Since the set of dynamics models is bounded (each dynamics parameter is in [0,1]), this would imply that for any ε>0, there exists a region of the dynamics parameter space with area below ε, in which the policy optimizing (81) changes infinitely-many times. To see that this is impossible, notice that any dynamics 𝒑~ and policy π induce a distribution n~(s,a) over the expected number of times a trajectory visits each state/action pair. The expected total reward, given n~(s,a) and sampled rewards 𝒓~, can be expressed as,

Vπ,1M~(s)=(s,a)𝒮×𝒜n~(s,a)r~(s,a). (82)

Since n~(s,a) depends on the policy and dynamics according to (79) for each possible trajectory, its slope with respect to the dynamics is upper-bounded, and it cannot oscillate infinitely with respect to the dynamics.

Given the continuous lower bound f, for ε>0, there exists δ such that if ||𝒑~-𝒑||1<δ, then |f(𝒑~)-f(𝒑)|<ε. If we set ε<f(𝒑)2, then P(Sample 𝒛i s.t. 𝒛iT𝒗0|𝒑)f(𝒑)>f(𝒑~)2>0.

Proof of Statement B:

Statement B asserts that every state/action pair will be sampled infinitely-often. To show this, we first define known and unknown state/action pairs:

Definition 3.

(ε,δ)-known state/action pair: Fix ε,δ>0, and let 𝐩~(s,a) be a sample from the dynamics model’s posterior. An (ε,δ)-known state/action pair is one for which ||𝐩~(s,a)-𝐩(s,a)||1<ε with probability 1-δ.

Definition 4.

Unknown state/action pair: An (ε,δ)-unknown state/action pair is one that does not satisfy Definition 3.

Previously, we noted that for any state/action pair that is sampled infinitely-often, the model of that state/action’s transition probabilities (that is, the distribution over possible next states) converges to the true values. For any fixed ε and δ, after enough steps in the MDP are taken, at least one state/action pair is guaranteed to eventually be sampled enough that it becomes (ε,δ)-known.

To prove Statement B, it suffices to show that if part of the dynamics are known and part are unknown, then the MDP is guaranteed to eventually leave its known portion. We will prove this by contradiction: we will assume that all sampled policies stay only in the known portion of the MDP, and then contradict this assumption by showing that the MDP is guaranteed to exit the known region.

For ε,δ>0, an (ε,δ)-known state/action pair is one for which ||𝒑~(s,a)-𝒑(s,a)||1<ε with probability 1-δ. For unknown state/action pairs (s,a), due to the assumption that only known state/actions are visited, the distribution from which 𝒑~(s,a) is sampled does not change. Without loss of generality, let s~1 be an unknown state/action pair. This means that s~1 is not visited past some iteration m.

Without visiting s~1, M2TM2 cannot be full-rank. As a result, there exists 𝒗Null(M2TM2) which directly corresponds to the lack of s~1 observations. As in the proof of Statement A, there exists a function f such that P(Sample 𝒛i s.t. 𝒛iT𝒗0|𝒑~)f(𝒑~)>0. While 𝒑~ itself changes in each episode, we can show that 𝔼[f(𝒑~)] does not decay to zero. To do so, denote the samples of the known and unknown portions of the dynamics parameters as 𝒑~known and 𝒑~unknown, respectively. Also, let Sε be the set of known dynamics parameters 𝒑~known that lie inside the ε1-ball ||𝒑~(s,a)-𝒑(s,a)||1<ε for known (s,a). Then:

𝔼[f(𝒑~)] =Pr(𝒑~)f(𝒑~)d(𝒑~) (83)
min𝒑~knownSεPr(𝒑~unknown)f(𝒑~)d(𝒑~unknown)  (with prob. 1-δ). (84)

By minimizing the known dynamics over the area in which they lie with high probability, we obtain an integral in which all dynamics parameters are either fixed (i.e., the known dynamics) or drawn from an unchanging distribution (i.e., the unknown dynamics). Therefore, since f(𝒑~) is strictly positive everywhere, and we are integrating it over an unchanging probability distribution, the integral must evaluate to a strictly positive, unchanging quantity.

Hence, as long as s~1 is unknown and not visited, 𝔼𝒑~[P(Sample 𝒛i s.t. 𝒛iT𝒗0)]c>0 with probability 1-δ, for some c>0. In general, when sampling a sequence of random variables Xi with bounded support and 𝔼[Xi]c for all i, we observe Xic infinitely-often. Since f() has support in [0,1], we are guaranteed to infinitely-often sample dynamics p~ such that P(Sample 𝒛i s.t. 𝒛iT𝒗0)c>0, in which event state/action s~1 has nonzero probability of being sampled. This completes the proof by contradicting the hypothesis that the MDP will only visit known state/action pairs. ∎

Theorem 1 (Asymptotic consistency of DPS).

If there exists a reward function such that a logistic regression model explains the user’s preferences, then DPS  with a Bayesian logistic regression credit assignment model will learn an asymptotically consistent reward model.

Proof.

To prove this result, we apply Proposition 1. Given that Lemma 2, Lemma 3, and Condition 1 hold, it remains only to argue that Condition 2 must hold. To see this, refer back to the proof of Lemma 2. As long as a particular eigenvalue of ZTZ does not increase, there is a non-decaying probability of sampling an observation that increases it by at least some minimum amount. The probabilities of sampling such vectors depend on the ratios of the covariance matrix Σk’s eigenvalues, and so the ratio of λ1 and λD-1 must always have some upper bound. ∎

We turn next to characterizing the regret rate of DPS. We will apply two prior results, one from Gourieroux and Monfort [21] regarding the asymptotic distribution of the logistic regression maximum likelihood estimate, while the other (Osband et al. [29]) presents a regret bound for posterior sampling RL.

Proposition 2 (Asymptotic normality of logistic regression maximum likelihood estimator [21]).

If Conditions 1 and 2 are satisfied, and if 𝐫^ML,k converges almost surely to the true value 𝐫¯, then:

[i=1kf(𝒛iT𝒓^ML,k)𝒛i𝒛iT]12(𝒓^ML,k-𝒓¯)𝐷𝒩(𝟎,𝕀) as k, (85)

where 𝐷 implies convergence in distribution and Q12 is the positive definite matrix associated with positive definite matrix Q such that [Q12]2=Q.

Proof.

See Proposition 4 in Gourieroux and Monfort [21]. ∎

Remark: Just as with Proposition 1, the proof of Proposition 2 in [21] can be adapted such that the result holds when the maximum likelihood estimator 𝒓^ML,k is replaced with the MAP estimator 𝒓^k.

Proposition 3 (Expected regret of posterior sampling RL).

In posterior sampling RL with episode horizon h and numbers of states and actions S and A, the expected T-step regret is bounded as:

𝔼[𝑅𝑒𝑔𝑟𝑒𝑡(T,πh𝑃𝑆)]=O(hSAT𝑙𝑜𝑔(SAT)). (86)
Proof.

See Theorem 1 in Osband et al. [29]. ∎

Next, we show that under preference feedback, the regret can be decomposed into two terms: one that reflects the converging dynamics model, and one that reflects the converging reward model.

Lemma 4 (Regret decomposition).

The expected regret of DPS  can be decomposed into two terms. One of these terms can be upper bounded by the same regret bound as in Osband et al. [29], stated in Proposition (3). The second term may be upper-bounded by

hk=1T/h𝔼[||𝒓¯-𝒓~k||]hk=1T/h𝔼[||𝒓^k-𝒓¯||]+hk=1T/h𝔼[||𝒓^𝒌-𝒓~k||]. (87)
Proof.

Recall that the value of state s at time-step i and under policy μ and MDP M is:

Vμ,iM(s):=𝔼M,μ[j=ihr¯M(sj,aj)|si=s], (88)

where r¯M(s,a) is the average utility for taking action a in state s and MDP M. Then, the regret from episode k is defined as:

Δk:=s𝒮p0(s)(Vμ*,1M*-Vμk,1M*), (89)

where M* is the true (unknown) MDP, μ* is the optimal policy in M*, and μk is the policy that the algorithm follows in episode k. The total regret over T time-steps is then:

Regret(T,π):=k=1T/hΔk. (90)

For a sampled MDP Mk, drawn from the posterior model of the environment, Osband et al. [29] define:

Δ~k:=s𝒮p0(s)(Vμk,1Mk-Vμk,1M*). (91)

In Osband et al. [29], the authors prove the following regret equivalence result:

𝔼[k=1mΔk]=𝔼[k=1mΔ~k], (92)

which they use to derive the posterior sampling RL regret bound restated in Proposition 3.

The true (unknown) MDP M* consists of the true dynamics P* and average rewards r¯(sj,aj), while Mk has a sampled dynamics model and sampled rewards r~(sj,aj). In addition, we introduce Mk*, which pairs the true dynamics model P* with the sampled rewards r~(sj,aj). By adding and subtracting terms based upon Mk*, the terms of Δ~k can be decomposed:

Vμk,1Mk(s)-Vμk,1M*(s) =𝔼Mk,μk[j=1hr~(sj,aj)|s1=s]-𝔼M*,μk[j=1hr¯(sj,aj)|s1=s] (93)
=𝔼Mk,μk[j=1hr~(sj,aj)|s1=s]-𝔼Mk*,μk[j=1hr~(sj,aj)|s1=s] (94)
+𝔼Mk*,μk[j=1hr~(sj,aj)|s1=s]-𝔼M*,μk[j=1hr¯(sj,aj)|s1=s] (95)
=𝔼Mk,μk[j=1hr~(sj,aj)|s1=s]-𝔼Mk*,μk[j=1hr~(sj,aj)|s1=s] (96)
+𝔼P*,μk[j=1h(r~(sj,aj)-r¯(sj,aj))|s1=s], (97)

where in the final term, recall that P* is the true transition model.

The regret arising from the first two terms of this expression can be upper-bounded by the same regret bound as in [29], stated in (86) above. This holds because the regret bound of [29] arises from both dynamics and reward models that converge based upon direct observations of each model parameter, i.e., direct observations of both state/action transitions and rewards. In the two terms in (96), however, the rewards are identical among the two terms, while the dynamics are modeled identically to the setup of [29].

It remains to consider the third term of the regret decomposition (97), which is the error due to credit assignment. Let RegTR be the total component of the T-step regret arising from this credit assignment error term. Our goal is to upper-bound its expectation:

𝔼[RegTR]=𝔼p0,r~{k=1T/h𝔼P*,μk[j=1h(r~(sj,aj)-r¯(sj,aj))|s1=s]}. (98)

Note that because M* and Mk* have the same transition dynamics and initial state distributions p0, and that because the expectation over both summations in (98) is taken with respect to the same policy μk, the two MDPs M* and Mk* have the same probabilities of reaching any state/action pair (s,a) at each step j.

Recall that 𝒓¯D is the vector of true state/action rewards, 𝒓~kD is the vector of sampled state/action rewards (from the model posterior) in episode k, and 𝒓^kD is the expected value of the reward model’s posterior distribution at episode k.

Given the observation that M* and Mk* have the same probabilities of reaching any state/action pair (s,a) at each step j, we can write:

𝔼[RegTR]hk=1T/h𝔼[||𝒓¯-𝒓~k||]. (99)

Applying the triangle inequality yields:

𝔼[RegTR]hk=1T/h𝔼[||𝒓¯-𝒓~k+𝒓^k-𝒓^k||]hk=1T/h𝔼[||𝒓^k-𝒓¯||]+hk=1T/h𝔼[||𝒓^𝒌-𝒓~k||]. (100)

Thus, to bound the total credit assignment error in (98), we must consider the rates at which 𝒓^k converges to 𝒓¯ and at which 𝒓~k converges to 𝒓^k.

The final two lemmas characterize the convergence of 𝒓~k to 𝒓^k, and of 𝒓^k to 𝒓¯, respectively.

Lemma 5.

Sampling 𝐫~k via the Laplace approximation to the Bayesian logistic regression posterior gives 𝐫~kN(𝐫^k,Σk), with 𝐫^k and Σk defined in (9) and (10). All eigenvalues of Σk-1 approach infinity as k. Moreover, given Condition 2, there exists c0 such that λD-1(k)kc0, where recall that λD-1(k) is the smallest eigenvalue of i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT. Thus, asymptotically λD-1(k)(Σk-1)kc0, where λD-1(k)(Σk-1) is the smallest eigenvalue of Σk-1.

Proof.

From Theorem 1 (see subsequent remark), 𝒓^k𝒓¯ almost surely. Due to this convergence,

Σk-1=i=1kexp(𝒛iT𝒓^k)(1+exp(𝒛iT𝒓^k))2𝒛i𝒛iTki=1kexp(𝒛iT𝒓¯)(1+exp(𝒛iT𝒓¯))2𝒛i𝒛iT, (101)

and all eigenvalues of Σk-1 approach infinity as k by Lemma 3. Also, Condition 2 asymptotically holds for Σk-1 as k increases, so that if λ1(k)(Σk-1) and λD-1(k)(Σk-1) are the largest and smallest eigenvalues of Σk-1, respectively, then M1 such that λ1(k)(Σk-1)λD-1(k)(Σk-1)m1 for large enough k.

To complete the proof, we argue that Tr(Σk-1) increases by a lower-bounded amount on average with each time step, where Tr denotes the trace, or sum of a matrix’s eigenvalues. Combining that 1) λ1(k)(Σk-1)λD-1(k)(Σk-1)m1 for large enough k, and 2) Tr(Σk-1)=i=1D-1λi(Σk-1) increases (on average) by a lower-bounded amount with each new iteration, one can see that there exists c0 such that asymptotically λD-1(k)(Σk-1)kc0.

The quantity Tr(Σk-1) can only increase:

Tr[Σk-1] =i=1kexp(𝒛iT𝒓^k)(1+exp(𝒛iT𝒓^k))2Tr(𝒛i𝒛iT), by linearity of trace (102)
=i=1kexp(𝒛iT𝒓^k)(1+exp(𝒛iT𝒓^k))2𝒛iT𝒛i, by the cyclic property of trace (103)
=i=1kexp(𝒛iT𝒓^k)(1+exp(𝒛iT𝒓^k))2𝒙iT𝒙i, by (6). (104)

The quantity exp(𝒛iT𝒓^k)(1+exp(𝒛iT𝒓^k))2 is bounded away from zero because it approaches exp(𝒛iT𝒓¯)(1+exp(𝒛iT𝒓¯))2, which is bounded away from zero as justified previously. For any observation 𝒙i such that 𝒙i=𝒙i1-𝒙i20, 𝒙iT𝒙i1 always, since 𝒙i1 and 𝒙i2 differ by at least one. It remains to argue that the event {𝒙i1-𝒙i2=0} cannot occur increasingly-often: there must be some below-one upper-bound to the probability that this event takes place. This is indeed the case; if it were not, this would imply that the model converges toward always sampling a single trajectory 𝒙; however, as long as a particular eigenvalue of ZTZ does not increase, there is a non-decaying probability of sampling an observation 𝒙 that increases it by at least some minimum amount (Lemma 1). The probability of sampling a trajectory 𝒙𝒙 depends on the ratios of the covariance matrix’s eigenvalues, which are bounded according to Condition 2.

Lemma 6.

Under the conditions for asymptotic consistency of logistic regression (Theorem 1), the vector 𝐫^k-𝐫¯ behaves asymptotically as N(0,[i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT]-1). This is the same distribution as that characterizing 𝐫~k-𝐫^k as k, as discussed in Lemma 5. Thus, similarly, 𝐫^k-𝐫¯ asymptotically has covariance Σk.

Proof.

By Proposition 2 (see subsequent remark),

[i=1kf(𝒛iT𝒓^k)𝒛i𝒛iT]12(𝒓^k-𝒓¯)𝐷𝒩(𝟎,𝕀) as k. (105)

Multiplying both sides by [i=1kf(𝒛iT𝒓¯)𝒛i𝒛iT]-1/2 gives that:

[i=1kf(𝒛iT𝒓¯)𝒛i𝒛iT]-1/2[i=1kf(𝒛iT𝒓^k)𝒛i𝒛iT]1/2(𝒓^k-𝒓¯) (106)
behaves asymptotically as 𝒩(𝟎,[i=1kf(𝒛iT𝒓¯)𝒛i𝒛iT]-1). (107)

Note that [i=1kf(𝒛iT𝒓¯)𝒛i𝒛iT]-1/2[i=1kf(𝒛iT𝒓^k)𝒛i𝒛iT]1/2𝕀 because 𝒓^k𝒓¯, so that 𝒓^k-𝒓¯ behaves asymptotically as 𝒩(𝟎,[i=1kf(𝒛iT𝒓¯)𝒛i𝒛iT]-1). ∎

Theorem 2 (Asymptotic regret rate of DPS).

If there exists a reward function such that a logistic regression model explains the preferences, then DPS has an asymptotic no-regret rate of O(hSAT𝑙𝑜𝑔(SAT)+hSAc0T𝑙𝑜𝑔(T)), where c0 is a minimum rate at which eigenvalues of i=1kf(𝐳iT𝐫¯)𝐳i𝐳iT increase linearly with collection of samples 𝐳i that impact those eigenvalues.

Proof.

From Lemma 3, the regret is upper-bounded by a sum of two terms, where the first of those terms is bounded by O(hSATlog(SAT)) as proven in [29]. Here, we analyze the second term, given in (98), and upper-bounded by the expression in (99), restated here:

𝔼[RegTR]hk=1T/h𝔼[||𝒓¯-𝒓~k||].

We use that ||𝒙||||𝒙||2 for any 𝒙n, and that the linear transformation in (4) preserves inner products, as established in Equation (6). In particular, the linear transformation preserves l2-norms of vectors, since for any vector 𝒙n, ||𝒙||2=𝒙T𝒙. Applying both of these facts gives:

𝔼[RegTR]hk=1T/h𝔼[||𝒓¯-𝒓~k||]hk=1T/h𝔼[||𝒓¯-𝒓~k||2]. (108)

Consider the expected l2-norm of a Gaussian vector 𝒙n,x𝒩(𝟎,Σ). It can be upper-bounded in terms of n and the eigenvalues of Σ:

𝔼[||𝒙||2] 𝔼[||𝒙||22], by Jensen’s inequality (109)
=𝔼[i=1nxi2]=i=1n𝔼[xi2]=i=1nVar[xi] (110)
=Tr(Σ)=i=1nλi(Σ)nλmax(Σ). (111)

Next, consider the asymptotic probability distribution of 𝒓~k-𝒓¯=(𝒓~k-𝒓^k)+(𝒓^k-𝒓¯). Given 𝒓¯, which is constant, the two quantities in parentheses are independent due to the nature of Laplace sampling, in which 𝒓~k is drawn from a normal distribution centered at 𝒓^k. By Lemmas 5 and 6, both 𝒓~k-𝒓^k and 𝒓^k-𝒓¯ behave asymptotically according to 𝒩(𝟎,Σk), where Σk:=[i=1kf(𝒛iT𝒓¯)𝒛i𝒛iT]-1. Therefore, 𝒓~k-𝒓¯ behaves asymptotically according to 𝒩(𝟎,2Σk).

Combining (108) and (111) gives:

𝔼[RegTR]hk=1T/h(D-1)λmax(2Σk)h2SAk=1T/hλmax(Σk). (112)

Although λmax(Σk) decays to zero over time by Lemma 2, this result is too loose: there could be many consecutive iterations in which λmax(Σk) remains constant. For instance, if Condition 2 is enforced explicitly as discussed earlier, then as the model posterior converges, one would expect to find increasingly-many consecutive iterations in which the reward posterior is not updated; in these iterations, λmax(Σk) is fixed.

The key intuition, formalized below, is that when a non-optimal episode is sampled, the covariance matrix Σk shrinks with respect to the directions (i.e., its eigenvectors) responsible for estimating the rewards suboptimally. Meanwhile, if particular eigenvectors of Σk do not result in a suboptimal policy during episode k, then for that episode, the corresponding eigenvalues can be removed from the right-hand sum in (112). Thus, eigenvalues do not appear in the regret’s upper bound except for when their corresponding eigenvectors affect an episode’s regret. In this event, however, the sampled observations cause Σk to shrink along the dimensions of the guilty eigenvectors.

To see this, first note that in the context of preference-based learning, the true reward function 𝒓¯ is unobserved. One can define an equivalence class of reward functions that, when coupled with the true transition dynamics, induce the optimal policy with respect to the user’s preferences. In the preference-based setting, an optimal policy is defined as any policy π* such that when compared to any other policy πi:

𝔼τ*π*,τiπi[P(τ*>τi)]12, (113)

where τ* is a trajectory sampled from policy π*, and τi is a trajectory sampled from πi.

Let SeqD-1 be the equivalence class of reward functions that induce policies satisfying (113) when coupled with the true dynamics. To see that there are indeed multiple such functions, notice that for any reward vector 𝒓Seq, a𝒓+b1Seq for any a>0 and b, where 1D-1 is a vector of ones. Therefore, in (108), we can minimize each episode’s regret over reward vectors belonging to Seq:

𝔼[RegTR]hk=1T/hmin𝒓Seq{𝔼[||𝒓~k-𝒓||2]}. (114)

Even given specified scaling and normalization of the rewards (e.g. specified values of the smallest and largest state/action rewards), there could be multiple reward functions that give rise to the optimal policy. For instance, a small perturbation to the rewards might not change the policy. Given specified scaling and normalization, however, there is only one true latent reward function governing the preferences; this is the specific 𝒓¯Seq to which the logistic regression posterior converges.

Define Σk:=[i=1kf(𝒛iT𝒓¯)𝒛i𝒛iT]-1, and let {𝒗k1,,𝒗k(D-1)} be an orthonormal basis of eigenvectors of Σk. For 𝒓Seq, the quantity 𝒓~k-𝒓 can be expressed in this eigenbasis:

𝔼[RegTR]hk=1T/hmin𝒓Seq{𝔼[||i=1D-1((𝒓~k-𝒓)T𝒗ki)𝒗ki||2]}. (115)

Consider the inner sum, i=1D-1((𝒓~k-𝒓)T𝒗ki)𝒗ki. We will show that if any terms in this sum are known not contribute to the regret (i.e., to sampling a non-optimal policy in episode k), then they can be eliminated from the sum. In particular, if an episode’s sampled reward belongs to Seq, then its reward regret RegTR is zero. Otherwise, the components of the reward function that make the policy non-optimal cause the corresponding eigenvalues of Σk to shrink.

Assume that in episode k, the reward sample’s projection onto 𝒗i does not affect the regret. In other words, there exists 𝒓Seq such that ((𝒓~k-𝒓)T𝒗ki)𝒗ki=𝟎. Let k{1,,D-1} be a set of indices that jointly do not affect the regret, such that there exists 𝒓k*Seq satisfying:

ik(𝒗iT(𝒓~k-𝒓*))𝒗i=𝟎. (116)

Let 𝒥k be the set of indices not contained in k: 𝒥k={1,,D-1}k; we make no particular assumptions on how vectors in 𝒥k affect the regret. The vector 𝒓k* can then be written as:

𝒓k*=ik(𝒗iT𝒓~k)𝒗i+j𝒥kαj*𝒗j, (117)

for some constants αj*, j𝒥k.

Due to orthogonality of 𝒗i and 𝒗j, (116) can be modified by adding any linear combination of the vectors 𝒗j,j𝒥k, to 𝒓*:

ik(𝒗iT(𝒓~k-𝒓*-j𝒥kαj𝒗j))𝒗i=𝟎, for any values of αj. (118)

Conditioning on knowledge of the set k, we can upper-bound the minimization in (115) as follows:

min𝒓Seq𝔼[||i=1D-1((𝒓~k-𝒓)T𝒗ki)𝒗ki||2|k] (119)
=min𝒓Seq𝔼[||ik((𝒓~k-𝒓)T𝒗ki)𝒗ki+j𝒥k((𝒓~k-𝒓)T𝒗ki)𝒗ki||2|k] (120)
min𝒓Seq𝔼[||ik((𝒓~k-𝒓)T𝒗ki)𝒗ki||2+||j𝒥k((𝒓~k-𝒓)T𝒗ki)𝒗ki||2|k], (121)

where the last step comes from the triangle inequality. The expression can be further upper-bounded by restricting the set over which the minimization takes place:

min𝒓Seq𝔼[||i=1D-1((𝒓~k-𝒓)T𝒗ki)𝒗ki||2|k]
min𝒓Seq𝒓=𝒓*+j𝒥kαj𝒗j,αj𝔼[||ik((𝒓~k-𝒓)T𝒗ki)𝒗ki||2+||j𝒥k((𝒓~k-𝒓)T𝒗ki)𝒗ki||2|k].

Note that the constraint set of the minimization is nonempty, since 𝒓*Seq and satisfies 𝒓=𝒓*+j𝒥kαj𝒗j when αj=αj*j𝒥k. Next, for any 𝒓 satisfying the minimization constraints, the first l2-norm expression is equal to zero by (118). The second term, meanwhile, is not affected by the projections of 𝒓 onto 𝒗i for any ik, and so the constraint 𝒓=𝒓*+j𝒥kαj𝒗j is irrelevant to the minimization. This gives:

min𝒓Seq𝔼[||i=1D-1((𝒓~k-𝒓)T𝒗ki)𝒗ki||2|k]min𝒓Seq𝔼[||j𝒥k((𝒓~k-𝒓)T𝒗ki)𝒗ki||2|k].

This can again be upper-bounded by setting 𝒓 to any specific value in Seq. In particular, we can set 𝒓=𝒓¯, the true latent reward function governing the preferences. Recall that while multiple reward functions may give rise to the optimal policy, the true rewards 𝒓¯ are unique under fixed scaling and normalization. Therefore,

min𝒓Seq𝔼[||i=1D-1((𝒓~k-𝒓)T𝒗ki)𝒗ki||2|k]𝔼[||j𝒥k((𝒓~k-𝒓¯)T𝒗ki)𝒗ki||2|k]. (122)

Recall from above that 𝒓~k-𝒓¯ behaves asymptotically as 𝒩(𝟎,2Σk). The right-hand-side summation in (122) is the projection of 𝒓~k-𝒓¯ onto the subspace of eigenvectors of Σk corresponding to 𝒥k. Applying the result in (111) to (122), we finally obtain:

min𝒓Seq𝔼[||i=1D-1((𝒓~k-𝒓)T𝒗ki)𝒗ki||2|k] j𝒥kλj(2Σk) (123)
2SAmaxj𝒥kλj(Σk). (124)

This result demonstrates that if an eigenvector of Σk does not affect the regret, then its corresponding eigenvalue can be removed from the regret’s upper bound in (111). Therefore, an eigenvector of Σk does not affect the regret unless its contribution to the sampled reward 𝒓~k makes the policy non-optimal.

A non-optimal policy’s distribution over sampled observations differs from that of the optimal policy. When a non-optimal policy is sampled, the covariance matrix is more likely to shrink along directions of worse observations, decreasing the future probability of sampling a non-optimal policy. This shrinkage occurs linearly with respect to the episodes in which non-optimal policies are sampled because (Σk)-1 grows linearly in directions favored by the non-optimal policies, with respect to the number of times that they are executed. Indeed, for an arbitrary vector 𝒗D-1:

𝒗T(Σk)-1𝒗=i=1kαi𝒗T(𝒛i𝒛iT)𝒗=i=1kαi(𝒛iT𝒗)2, (125)

where αi=f(𝒛iT𝒓¯) lives in a bounded range as discussed in Lemma 1; this quantity increases linearly on average with respect to observations 𝒛i that are similarly-aligned with 𝒗. Also,

𝒗T(Σk)-1𝒗=𝒗T(i=1D-1[λi(Σk)]-1𝒗i𝒗iT)𝒗=i=1D-1[λi(Σk)]-1(𝒗iT𝒗)2. (126)

Comparing (125) and (126), one can see that the eigenvalues of (Σk)-1 associated with non-optimal policies should increase linearly with the collection of observations 𝒛i associated with those non-optimal policies.

The regret for an episode is zero if the optimal policy is selected. Otherwise, examine the bound from (123): 2SAmaxj𝒥kλj(Σk)=2SAminj𝒥kλj(Σk-1). The denominator increases linearly with respect to episodes with non-optimal policies, so that asymptotically:

min𝒓Seq𝔼[||i=1D-1((𝒓~k-𝒓)T𝒗ki)𝒗ki||2|k]2SAc0k, (127)

where c0 is a constant related to the rate at which eigenvalues of (Σk)-1 increase with respect to trajectories sampled from non-optimal policies. Summing over all the episodes gives:

𝔼[RegTR]h2SAc0k=1T1kh2SAc0Tlog(T) for all T17. (128)

Combining (128) with the regret decomposition result of Lemma 4 and the posterior sampling regret bound from Osband et al. [29] (Proposition 3) yields the stated result.

A2.1 Extending proof techniques to other credit assignment models

Currently, this proof methodology extends only to the Bayesian logistic regression credit assignment model. Therefore, extending it to other credit assignment models, such as the Gaussian process regression and Bayesian linear regression methods detailed in Appendix A1, is an important direction for future work. The concept of regret decomposition, as introduced in Lemma 3, is not dependent on a specific credit assignment model. Thus, the concept of decomposing the total regret into two terms, dependent on the convergence of the dynamics and reward models, respectively, holds under any credit assignment model. For the dynamics-dependent term, the regret result from Osband et al. [29] can be applied. The second, reward-dependent term can be bounded if: (1) posterior samples from the reward model concentrate to the posterior distribution’s MAP estimate, (2) the reward model’s posterior concentrates to the true underlying utilities, and (3) the events in (1) and (2) occur at a sufficiently-fast rate.

We hypothesize that existing results on asymptotic consistency of Bayesian linear regression [19] and Gaussian process regression [34] could be leveraged toward extending notions of consistency and regret toward these credit assignment models. Unlike with classification-based credit assignment, it may be necessary to consider the residuals between preference labels and utilities. The concept of approximate linearity [41] has facilitated bridging the gap between the preference and absolute-reward domains in the bandit setting, and could potentially also apply here. In practice, we expect that DPS would perform well with any asymptotically consistent model for rewards that sufficiently captures the users’ preference behavior.

A3 Additional experimental details

As described in Section 6, experiments were conducted with the RiverSwim and random MDP environments. We use an episode horizon time of 50 in both cases. Figures 9 and 14 display performance in both environments for four values of the hyperparameter c (c{0.1,0.5,1,1000}), which governs the degree of preference noise. Experiments were run on an Ubuntu 16.04.3 machine with 16 GB of RAM and an 8-core Intel i7 processor.

We detail the ranges of hyperparameter values tested, as well as those displayed on the plots, for the different algorithms. For DPS, hyperparameters were tuned manually. For Gaussian process regression credit assignment, we used RBF kernels for the Gaussian process model, considering kernel variances from 0.001 to 0.5, kernel lengthscales from 0 to 0.1, and noise variances from 0.001 to 0.1. The plots depict values of (0.03, 0, 0.05), respectively, for these parameter values in the RiverSwim case, while depicting (1, 0, 0.03) in the random MDP case.

For Bayesian linear regression, we considered ranges of hyperparameter values between 0.1 and 3 for both hyperparameters (σ and λ). The RiverSwim plots display hyperparameter values of 0.5 and 0.1 for σ and λ, respectively, while for the random MDP environment, both are set to 0.1. For Bayesian logistic regression, we set the prior mean to zero for all states and actions. All prior covariance matrices considered were of the form b𝕀SA×SA, where 𝕀 is the identity matrix, and tested values of b ranged from 0.1 to 30; the plots all show results for a prior covariance matrix of 20𝕀.

The EMPC algorithm [50] has two hyperparameter values, α and η. We optimize both of these jointly via a grid search over values of (0.1,0.2,,0.9), with 100 runs of each pair of values. The best-performing hyperparameter values (i.e. those achieving the highest total reward) are displayed in Table 1; these are the hyperparameter values depicted in Figures 9 and 14.

Noise: 0.1 0.5 1 1000
RiverSwim 0.1/0.5 0.1/0.7 0.2/0.2 0.1/0.6
Random MDP 0.3/0.8 0.9/0.7 0.4/0.2 0.7/0.4
Table 1: Each table element shows best-performing α/η values for the corresponding simulation domain and noise parameter.
(a)
(b)
(c)
(d)
Figure 9: Empirical performance of DPS in the RiverSwim environment for varying values of the noise hyperparameter c. For trajectories τi and τj, P(τi>τj)={1+exp[-c(r¯(τi)-r¯(τj))]}-1, where r¯(τi) and r¯(τj) are the total rewards accrued by the two trajectories. Posterior sampling RL (PSRL) [29] is an upper bound that receives numerical rewards; Gaussian process regression (GPR), Bayesian linear regression, and Bayesian logistic regression are all instances of DPS. EPMC is a baseline from [50] as discussed in Section 6. Normalization is with respect to the total reward achieved by the optimal policy. Plots display the mean +/- one std over 100 runs of each algorithm tested. Overall, we see that DPS performs well and is robust to the choice of credit assignment model.
(a)
(b)
(c)
(d)
Figure 14: Empirical performance of DPS in the random MDP environment for varying values of the noise hyperparameter c. For trajectories τi and τj, P(τi>τj)={1+exp[-c(r¯(τi)-r¯(τj))]}-1, where r¯(τi) and r¯(τj) are the total rewards accrued by the two trajectories. Posterior sampling RL (PSRL) [29] is an upper bound that receives numerical rewards; Gaussian process regression (GPR), Bayesian linear regression, and Bayesian logistic regression are all instances of DPS. EPMC is a baseline from [50] as discussed in Section 6. Normalization is with respect to the total reward achieved by the optimal policy. Plots display the mean +/- one std over 100 runs of each algorithm tested. Overall, we see that DPS performs well and is robust to the choice of credit assignment model.