Policy Gradients for Contextual Bandits

Abstract

We study a generalized contextual-bandits problem in which a state determines the distribution of the arms' contexts and affects the immediate reward of choosing an arm. The problem applies to a wide range of realistic settings, such as personalized recommender systems and natural language generation. We put forward a class of policies in which the marginal probability of choosing an arm (in expectation over the other arms) in each state has a simple closed form and is differentiable. In particular, the policy gradient for this class takes a succinct form: an expectation, over pairs of states and single contexts, of the action-value multiplied by the gradient of the marginal probability. These findings naturally lead to an algorithm, coined policy gradient for contextual bandits (PGCB). As a further theoretical guarantee, we show that the variance of PGCB is lower than that of the standard policy gradient algorithm. We also derive the off-policy gradients and evaluate PGCB on a toy dataset as well as a music recommender dataset. Experiments show that PGCB outperforms both classic contextual-bandits methods and policy gradient methods.
