Variational inference for the multi-armed contextual bandit

Abstract

In many biomedical, science, and engineering problems, one must sequentiallydecide which action to take next so as to maximize rewards. One general classof algorithms for optimizing interactions with the world, while simultaneouslylearning how the world operates, is the multi-armed bandit setting and, inparticular, the contextual bandit case. In this setting, for each executedaction, one observes rewards that are dependent on a given 'context', availableat each interaction with the world. The Thompson sampling algorithm hasrecently been shown to enjoy provable optimality properties for this set ofproblems, and to perform well in real-world settings. It facilitates generativeand interpretable modeling of the problem at hand. Nevertheless, the design andcomplexity of the model limit its application, since one must both sample fromthe distributions modeled and calculate their expected rewards. We here showhow these limitations can be overcome using variational inference toapproximate complex models, applying to the reinforcement learning caseadvances developed for the inference case in the machine learning communityover the past two decades. We consider contextual multi-armed banditapplications where the true reward distribution is unknown and complex, whichwe approximate with a mixture model whose parameters are inferred viavariational inference. We show how the proposed variational Thompson samplingapproach is accurate in approximating the true distribution, and attainsreduced regrets even with complex reward distributions. The proposed algorithmis valuable for practical scenarios where restrictive modeling assumptions areundesirable.

Quick Read (beta)

loading the full paper ...