Online Learning with Diverse User Preferences

Abstract

In this paper, we investigate the impact of diverse user preference onlearning under the stochastic multi-armed bandit (MAB) framework. We aim toshow that when the user preferences are sufficiently diverse and each arm canbe optimal for certain users, the O(log T) regret incurred by exploring thesub-optimal arms under the standard stochastic MAB setting can be reduced to aconstant. Our intuition is that to achieve sub-linear regret, the number oftimes an optimal arm being pulled should scale linearly in time; when all armsare optimal for certain users and pulled frequently, the estimated armstatistics can quickly converge to their true values, thus reducing the need ofexploration dramatically. We cast the problem into a stochastic linear banditsmodel, where both the users preferences and the state of arms are modeled as{independent and identical distributed (i.i.d)} d-dimensional random vectors.After receiving the user preference vector at the beginning of each time slot,the learner pulls an arm and receives a reward as the linear product of thepreference vector and the arm state vector. We also assume that the state ofthe pulled arm is revealed to the learner once its pulled. We propose aWeighted Upper Confidence Bound (W-UCB) algorithm and show that it can achievea constant regret when the user preferences are sufficiently diverse. Theperformance of W-UCB under general setups is also completely characterized andvalidated with synthetic data.

Quick Read (beta)

loading the full paper ...