Abstract
Existing web-scale recommendation systems commonly use supervised learning methods that prioritize immediate user feedback. Although reinforcement learning (RL) offers a solution to optimize longer-term goals, such as in-session engagement, applying it at web scale is challenging due to the extremely large action space and engineering complexity. In this paper, we introduce RecoMind, a simulator-based RL framework designed for the effective optimization of session-based goals at web scale. RecoMind leverages existing recommendation models to establish a simulation environment and to bootstrap the RL policy to optimize immediate user interactions from the outset. This method integrates well with existing industry pipelines, simplifying the training and deployment of RL policies. Additionally, RecoMind introduces a custom exploration strategy to efficiently explore web-scale action spaces with hundreds of millions of items. We evaluated RecoMind through extensive offline simulations and online A/B testing on a video streaming platform. Both methods showed that the RL policy trained using RecoMind significantly outperforms traditional supervised learning recommendation approaches in in-session user satisfaction. In online A/B tests, the RL policy increased videos watched for more than 10 seconds by 15.81% and improved session depth by 4.71% for sessions with at least 10 interactions. As a result, RecoMind presents a systematic and scalable approach for embedding RL into web-scale recommendation systems, showing great promise for optimizing session-based user satisfaction.