Abstract
We propose a model for learning with bandit feedback while accounting fordeterministically evolving and unobservable states that we call Bandits withDeterministically Evolving States ($B$-$DES$). The workhorse applications ofour model are learning for recommendation systems and learning for online ads.In both cases, the reward that the algorithm obtains at each round is afunction of the short-term reward of the action chosen and how "healthy" thesystem is (i.e., as measured by its state). For example, in recommendationsystems, the reward that the platform obtains from a user's engagement with aparticular type of content depends not only on the inherent features of thespecific content, but also on how the user's preferences have evolved as aresult of interacting with other types of content on the platform. Our generalmodel accounts for the different rate $\lambda \in [0,1]$ at which the stateevolves (e.g., how fast a user's preferences shift as a result of previouscontent consumption) and encompasses standard multi-armed bandits as a specialcase. The goal of the algorithm is to minimize a notion of regret against thebest-fixed sequence of arms pulled, which is significantly harder to attaincompared to standard benchmark of the best-fixed action in hindsight. Wepresent online learning algorithms for any possible value of the evolution rate$\lambda$ and we show the robustness of our results to various modelmisspecifications.