Abstract
Personalised discount codes provide a powerful mechanism for managing customer relationships and operational spend in e-commerce. Bandits are well suited for this product area, given the partial-information nature of the problem, as well as the need for adaptation to the changing business environment. Here, we introduce DISCO, an end-to-end contextual bandit framework for personalised discount code allocation at ASOS. DISCO adapts the traditional Thompson Sampling algorithm by integrating it within an integer program, thereby allowing for operational cost control. Because bandit learning is often worse with high-dimensional actions, we focused on building low-dimensional action and context representations that were nonetheless capable of good accuracy. Additionally, we sought to build a model that preserved the relationship between price and sales, in which customers increase their purchasing in response to lower prices ("negative price elasticity"). These aims were achieved by using radial basis functions to represent the continuous (i.e. infinite-armed) action space, in combination with context embeddings extracted from a neural network. These feature representations were used within a Thompson Sampling framework to facilitate exploration, and further integrated with an integer program to allocate discount codes across ASOS's customer base. These modelling decisions result in a reward model that (a) enables pooled learning across similar actions, (b) is highly accurate, including in extrapolation, and (c) preserves the expected negative price elasticity. Through offline analysis, we show that DISCO is able to effectively enact exploration and improves its performance over time, despite the global constraint. Finally, we subjected DISCO to a rigorous online A/B test, and found that it achieves a significant improvement of >1% in average basket value, relative to the legacy systems.
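
To make the pipeline described above concrete, the following is a minimal sketch rather than the DISCO implementation. It assumes a linear reward model over Gaussian radial basis features of the discount depth, a hypothetical independent Gaussian posterior over the weights for each customer (standing in for the context embeddings), and the sum of discount depths as a proxy for cost. A brute-force search stands in for the integer program, so it is only feasible for a handful of customers. All names and numbers are illustrative.

```python
import itertools
import math
import random

# Hypothetical setup: every number here is illustrative, not taken from DISCO.
CENTRES = [0.0, 0.1, 0.2, 0.3]    # RBF centres over discount depth (fraction off)
WIDTH = 0.1                       # shared RBF bandwidth (assumed)
DISCOUNTS = [0.0, 0.1, 0.2, 0.3]  # candidate discount depths per customer


def rbf_features(discount):
    """Map a continuous discount depth to Gaussian radial basis features."""
    return [math.exp(-((discount - c) ** 2) / (2 * WIDTH ** 2)) for c in CENTRES]


def thompson_sample(means, stds, rng):
    """Draw one weight vector from an independent Gaussian posterior."""
    return [rng.gauss(m, s) for m, s in zip(means, stds)]


def predicted_reward(weights, discount):
    """Linear reward model on top of the RBF features."""
    return sum(w * f for w, f in zip(weights, rbf_features(discount)))


def allocate(sampled_weights, budget):
    """Pick one discount per customer to maximise sampled reward under a budget.

    Exhaustive enumeration stands in for an integer-programming solver.
    """
    best_value, best_plan = -math.inf, None
    for plan in itertools.product(DISCOUNTS, repeat=len(sampled_weights)):
        if sum(plan) > budget + 1e-9:  # global cost constraint (assumed proxy)
            continue
        value = sum(predicted_reward(w, d) for w, d in zip(sampled_weights, plan))
        if value > best_value:
            best_value, best_plan = value, plan
    return best_plan, best_value


rng = random.Random(0)
# One hypothetical posterior per customer: (means, standard deviations).
posteriors = [
    ([1.0, 1.2, 1.5, 1.6], [0.2] * 4),
    ([1.0, 1.1, 1.1, 1.2], [0.2] * 4),
    ([0.5, 0.9, 1.4, 1.8], [0.2] * 4),
]
# Exploration comes from sampling weights instead of using posterior means.
sampled = [thompson_sample(m, s, rng) for m, s in posteriors]
plan, value = allocate(sampled, budget=0.4)
```

Because nearby discount depths share overlapping basis functions, an observation at one depth also informs the predicted reward at neighbouring depths, which is the pooled learning referred to in (a). At realistic scale, the enumeration in `allocate` would be replaced by a proper integer-programming solver.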