Unreliable Multi-Armed Bandits: A Novel Approach to Recommendation Systems

  • 2019-11-14 16:55:29
  • Aditya Narayan Ravi, Pranav Poduval, Dr. Sharayu Moharir
  • 2

Abstract

We use a novel modification of Multi-Armed Bandits to create a new model forrecommendation systems. We model the recommendation system as a bandit seekingto maximize reward by pulling on arms with unknown rewards. The catch howeveris that this bandit can only access these arms through an unreliableintermediate that has some level of autonomy while choosing its arms. Forexample, in a streaming website the user has a lot of autonomy while choosingcontent they want to watch. The streaming sites can use targeted advertising asa means to bias opinions of these users. Here the streaming site is the banditaiming to maximize reward and the user is the unreliable intermediate. We modelthe intermediate as accessing states via a Markov chain. The bandit is allowedto perturb this Markov chain. We prove fundamental theorems for this settingafter which we show a close-to-optimal Explore-Commit algorithm.

 

Quick Read (beta)

Unreliable Multi-Armed Bandits: A Novel Approach to Recommendation Systems
thanks: *Equal contribution, first author was assigned randomly

Aditya Narayan Ravi* IIT Bombay
Mumbai, India
[email protected]
   Pranav Poduval* IIT Bombay
Mumbai, India
[email protected]
   Sharayu Moharir IIT Bombay
Mumbai, India
[email protected]
Abstract

We use a novel modification of Multi-Armed Bandits to create a new model for recommendation systems. We model the recommendation system as a bandit seeking to maximize reward by pulling on arms with unknown rewards. The catch however is that this bandit can only access these arms through an unreliable intermediate that has some level of autonomy while choosing its arms. For example, in a streaming website the user has a lot of autonomy while choosing content they want to watch. The streaming sites can use targeted advertising as a means to bias opinions of these users. Here the streaming site is the bandit aiming to maximize reward and the user is the unreliable intermediate. We model the intermediate as accessing states via a Markov chain. The bandit is allowed to perturb this Markov chain. We prove fundamental theorems for this setting after which we show a close-to-optimal Explore-Commit algorithm.

Multi-Armed Bandits, Recommendation Systems, Markov Chains
\usetikzlibrary

automata,arrows,positioning,calc \SetCommentStymycommfont \SetKwInputKwInputInput \SetKwInputKwOutputOutput

I Introduction and Problem Formulation

Multi-Armed Bandits and algorithms to characterize the exploration-exploitation trade-off have been studied for a long time in various literature [1, 5]. It’s use in recommendation systems has also been studied in various literature [3, 2].
However most literature analyses recommendation systems from a data-acquisition and use perspective, usually fails to address the question on how whether the cost incurred in collecting and using this data for recommendation systems is worth the reward generated from how they bias autonomous user choices. We attempt to do that by developing the Unreliable Multi-Armed Bandits framework, and prove a few fundamental bottlenecks on the cost incurred by recommendation systems
A K-armed bandit is defined by Random Variables (RV) 𝐑𝐢,𝐣:1iK,1j. The RVs 𝐑𝐢,𝐣 refers to reward produced by arm i at the jth time instant. Here 𝐑𝐢,𝐣 is independent of 𝐑𝐬,𝐭 if is,jt and additionally identically distributed if i=s,jt. These rewards are governed by an unknown underlying distribution (hence with an unknown mean μi). The optimal arm is the arm such that maxμi=μ* is mean reward of that arm.
Let an unreliable intermediate I visit different arms by traversing an irreducible and aperiodic Markov chain with state space S and a transition probability matrix P=[pi,j]|S|×|S|.

{tikzpicture}

[-¿, ¿=stealth’, auto, semithick, node distance=3cm] \tikzstyleevery state=[fill=white,draw=black,thick,text=black,scale=1] \node[state] (1) 1; \node[state] (2)[right of=1] 2; \path(1) edge[loop left] nodep1,1 (1) (1) edge[bend left] nodep1,2 (2) (2) edge[bend left] nodep2,1 (1) (2) edge[loop right] nodep2,2 (2);

{tikzpicture}

[-¿, ¿=stealth’, auto, semithick, node distance=3cm] \tikzstyleevery state=[fill=white,draw=black,thick,text=black,scale=1] \node[state] (1) 1; \node[state] (2)[right of=1] 2; \path(1) edge[loop left] nodep1,1-δ (1) (1) edge[bend left] nodep1,2+δ (2) (2) edge[bend left] nodep2,1-δ (1) (2) edge[loop right] nodep2,2+δ (2);

Fig. 1: 2-state Markov chain a) Without the effect of recommendation system b) With the effect of recommendation system

The state space S is the set of arms and has a cardinality |S|=K and P models the autonomy of the intermediate in pulling the arms. This Markovian structure shows the dependence of future choices of the intermediate based on the current choice. For example a person who watches a “Horror” movie and likes it, is likely to pick the next movie they watch as a “Horror” movie as well. In our analysis we make the assumption that P is known. This is justified because recommendation systems usually possess enough data about the user to make predictions on the user’s behaviour. Thus even if P is initially unknown, it is easily estimated. Figure 1a) shows the structure for a simplified 2-state Markov chain
Let δ be known as the perturbation variable. This variable is used to model the effect of recommendation systems. The streaming site can use methods like “targeted advertising” to bias the user choices (i.e. Suggesting they pick another movie, say a “Documentary” which increases the chance they pick that movie). Therefore δ is used to perturb the probability of transition from a particular state. The following rules regarding δ is followed,

  • This δ is added and the transition probabilities are suitably modified to maintain a sum to one

  • If this modification causes negative probabilities (probabilities greater than 1) to arise, the probabilities are suitably truncated to 0 (1)

  • We make the simplifying assumption that δ is a pre-fixed parameter, future work will deal with more realistic scenarios

Figure 1b) shows the perturbed version of the Markov chain
Let π𝚷, be a perturbation policy which specifies the perturbation of P. We define Regret as Δj=μ*-rj, where rj refers to the real reward obtained at time instant i. Therefore the Cumulative Regret (CT) after T plays / biases is,

CT=j=1TΔi=Tμ*-j=1Trj. (1)

Here π* is the optimal policy, (i.e. the one that has least cumulative regret in expectation). The goal is to minimize the CT over T.
The structure of the paper is as follows: First we state 2 theorems on the regret bound and optimal policy of our setup. Following the proofs of these theorems, we look at a heuristic explore-commit algorithm based inspired by these theorems and show (via simulations) that it is near optimal for enough number of states (cardinality of state space is reasonably large)

II Main Results

Our first theorem characterizes which policy is optimal to maximize reward obtained. We first define a few concepts to make the theorem clear.
Definition 1. The bandit is said to have a “genie” available if μi is known to the bandit for all i[1:K]

Definition 2. (When genie is available) A πδ* policy is a policy that perturbs P to Pδ*, where Pδ* is the modified transition probability matrix that maximizes the probability of transitioning to the state with maximum mean reward (from any state)

Theorem 1. The optimal policy (when genie is available) π*𝚷 is πδ*

The proof for this theorem for the case where the recommendation system happens after the perturbed Markov chain attains it’s stationary distribution. This is easily justified since we can allow the Markov chain to evolve from T=- and start our observations from T=0 The proof generalises to the finite case (where the Markov chain doesn’t attain it’s stationary distribution) quite easily, but we exclude it for brevity. The importance of this theorem is that it provides us a heuristic to construct the explore-commit algorithm discussed in Section IV.
Our next theorem shows that there exists a fundamental bottleneck on the rewards that can be extracted from the setup. This is intuitively due to the fact that even if the bandit knows exactly which state gives what reward, the bandit doesn’t have full control over which arm it can pull at each instant. The following theorem asserts that this bottleneck is in fact linear.

Theorem 2. For any π𝚷, the CT can be lower bounded (in expectation) as,

𝐄(CT)T(μ*-μ~). (2)

where μ~ is the sum of the means of each the underlying distribution of each arm, weighted by the stationary probability distribution of the transition probability matrix after πδ* policy is applied on the Markov Chain
Again for brevity the proof technique in the next section assumes that the perturbed Markov chain attains it’s stationary distribution. The theorem holds true in order (both are lower bounded by functions which are linear in T) even otherwise as well, though some of the constants vary. The importance of this theorem is that it gives us a metric to evaluate our algorithm against. The explore-commit algorithm described in Section IV is observed to be near optimal and robust with respect to this theorem.

III Proofs

Let 𝝂=[νi],i[1:K] be a vector representing the stationary probability distribution of the Markov chain. This is defined as the solution to the equation νP=ν. This equation has a unique solution since the Markov chain is irreducible and aperiodic. When T is infinity this stationary distribution defines the amount of time (in expectation) spent in each state

Proof of Theorem 1: Without loss of generality assume μi is a decreasing function of i
We prove this by induction on the number of states K of the Markov chain
Base case: Let K=2. Then the total reward (R) in T steps is,

R=i=1Tri (3)
𝐄(R)=i=1T(ν1μ1+ν2μ2)=T(ν1μ1+ν2μ2). (4)

Since ν1+ν2=1, We can rewrite (4) as

𝐄(R)=T(ν1μ1+(1-ν1)μ2)=T(ν1(μ1-μ2)+μ2). (5)

Since μ1μ2, maximizing ν1 is our goal
It is easy to see that policy πδ* maximizes ν1 (and hence the expression).
Inductive step: Assume the theorem is true for K=l-1 states. We now prove that the theorem is true for K=l states. Note that,

𝐄(R)=T(j=1l-1νjμj+νlμl). (6)

Now (6) can be rewritten as,

𝐄(R)=T(j=1l-1νj(μj-μl)+μl). (7)

Take the first term in (7). We can normalise this by dividing this term by B=j=1l-1νj=1-νl. Let νiB=νi*. There a Markov chain with l-1 states that has a unique stationary probability distribution 𝝂*.
By induction hypothesis, this new Markov Chain is maximised by a πδ* policy. Note here that B doesn’t depend νi,i[1:l-1] when νl is specified. Since (7) only depends on νi,i[1:l-1], the πδ* policy applied on 𝝂* also maximises (7). This proves the first theorem. A similar analysis also follows for a finite T (though the calculations are a bit cumbersome)

Proof of Theorem 2: We note the following facts

  • If a genie is available, the cumulative regret of the original problem can only go down, in expectation

  • When genie is available, Theorem 1 states that a πδ* policy is optimal

  • A πδ* policy perturbs 𝝂 to 𝝂𝜹

The above observation tells us that calculating 𝐄(CT) when genie is available, for the πδ* policy will give us a lower bound on 𝐄(CT) for any general policy and unknown reward distributions. Let the setting required be 𝐄(CT*).
To this end lets calculate 𝐄(CT*).

𝐄(CT*) =i=1T(j=1Kνjδ(μ*-μj)) (8)
=T(μ*-j=1Kνjδμj) (9)
=T(μ*-μ~). (10)

It is easy to see that μ~μ*. This proves theorem 2.

IV Algorithms

Definition 3: A πl policy is a policy that perturbs P to Pl, where Pl is the modified transition probability matrix when Markov chain is biased towards state l[1:K]
We construct the Perturbation Explore (P2EE), Perturbation Commit algorithm based on insights from Theorem 1. The intuition behind the algorithm is as follows,

  • We explore “enough” so that we get a distinguishable estimate of the means of each of the arms. To do this we perturb the Markov chain towards the arm that is least explored at a given time instant (i.e. most uncertain)

  • Since after exploration we are pretty much the “genie” now, theorem 1 kicks in. We commit to the arm with highest empirical reward by using the πδ* policy

\SetAlgoLinedInputs: K, T, Transition Probability Matrix
Initialize: Pulls = [0,,0], t=0
\WhiletKT Exploration: bias = argmin Pulls
πbias𝚷 \WhileKTtT Exploitation: πδ*𝚷
\algorithmcfname 1 P2EE

Note: We haven’t characterized exactly how to choose K in the above algorithm. Here we just choose K manually based on simulations. Future work will deal with proving a theoretically optimal choice of K.
In the upcoming section we compare our algorithm two other algorithms, Upper Confidence Bound (UCB) [1] and a Greedy algorithm. But slight modifications need to be made to these algorithms to fit our problem statement.
The UCB algorithm remains same, except that in our case we cannot guarantee that we pick the arm with the most confidence in each time-step. Rather we can only bias the Markov chain so that it is more likely to pick said arm. Therefore we compare our algorithm keeping this in mind.
The vanilla ϵ-Greedy algorithm [1] chooses a random arm with probability ϵ. The structure of our Markov chain allows this exploration anyway, since there is always a non-zero chance to pull some sub-optimal arm.

{tikzpicture} CTTδ=0.1 {tikzpicture} Tδ=0.3 {tikzpicture} Tδ=0.5
Fig. 2: CT vs T comparison for P2EE, Our method P2EE has cumulative regret very close to the optimal “genie” and it’s regret decreases as δ increases. UCB and greedy are clearly sub-optimal

V Simulations

The simulations in Figure 2 are done for a 10-state Markov chain with randomly generated transition probability matrix and rewards such that μi is a decreasing function of i[1:10]. As required by the problem statement, the algorithms don’t have an access to the rewards Apriori.
We compare our P2EE algorithm via simulation with the modified UCB and greedy algorithm displayed in the previous section. We plot the cumulative regret plots for three different values of δ.
Notice that a purely greedy strategy is unreliable as compared to P𝟐EE method as we have no guarantee the exploration is sufficient to distinguish the optimal arm from the sub-optimal ones. Which could lead us to be pulling the sub-optimal arms for a long time.
It is apparent that UCB is a sub-optimal strategy for this setting since it is in violation of our Theorem 1 and is often stuck trying to pull sub-optimal arms for too long.
P2EE is theoretically supported by Theorem 1 and is empirically very close to the Oracle “Genie”) making it close to optimal.
An interesting observation that can be made here is that as the perturbation value δ increases the cumulative regret decreases for our algorithm. This is consistent with the hypothesis that increasing the influence of the recommendation system will give improved rewards. This is plotted in Figure 3.

VI Conclusion

Our work has propensity for future work. The first of this would be to generalise the notion of biasing, such that the bias is a) state dependent and b) apply to multiple transition probabilities. The second would be to make the model more realistic in terms of the intermediate’s behavior. For example, it is unrealistic to assume that in the example we use that user would watch the same movies multiple times and derive the same reward from it. We therefore need to bring in a rotting bandit [4] or restless bandit [6] assumption on the rewards. We also haven’t taken into account dynamic updations of the Markov chain (because the way users make choices evolves over time). Moreover, there is propensity to extend the probabilistic behavior of the intermediate to more than a Markovian behavior.

{tikzpicture}

δCT

Fig. 3: Variation of CT as δ increases for P2EE

References

  • [1] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, May 2002.
  • [2] Djallel Bouneffouf, Amel Bouzeghoub, and Alda Lopes Gançarski. A contextual-bandit algorithm for mobile context-aware recommender system. In Tingwen Huang, Zhigang Zeng, Chuandong Li, and Chi Sing Leung, editors, Neural Information Processing, pages 324–331, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
  • [3] Y. Deshpande and A. Montanari. Linear bandits in high dimension and recommendation systems. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1750–1754, Oct 2012.
  • [4] Nir Levine, Koby Crammer, and Shie Mannor. Rotting bandits. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3074–3083. Curran Associates, Inc., 2017.
  • [5] Steven L. Scott. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
  • [6] C. Tekin and M. Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588–5611, Aug 2012.