Bayesian bandits: balancing the exploration-exploitation tradeoff via double sampling

Abstract

Reinforcement learning studies how to balance exploration and exploitation inreal-world systems, optimizing interactions with the world while simultaneouslylearning how the world operates. One general class of algorithms for suchlearning is the multi-armed bandit setting. Randomized probability matching,based upon the Thompson sampling approach introduced in the 1930s, has recentlybeen shown to perform well and to enjoy provable optimality properties. Itpermits generative, interpretable modeling in a Bayesian setting, where priorknowledge is incorporated, and the computed posteriors naturally capture thefull state of knowledge. In this work, we harness the information contained inthe Bayesian posterior and estimate its sufficient statistics via sampling. Inseveral application domains, for example in health and medicine, eachinteraction with the world can be expensive and invasive, whereas drawingsamples from the model is relatively inexpensive. Exploiting this viewpoint, wedevelop a double sampling technique driven by the uncertainty in the learningprocess: it favors exploitation when certain about the properties of each arm,exploring otherwise. The proposed algorithm does not make any distributionalassumption and it is applicable to complex reward distributions, as long asBayesian posterior updates are computable. Utilizing the estimated posteriorsufficient statistics, double sampling autonomously balances theexploration-exploitation tradeoff to make better informed decisions. Weempirically show its reduced cumulative regret when compared tostate-of-the-art alternatives in representative bandit settings.

Quick Read (beta)

loading the full paper ...