MBCAL: Sample Efficient and Variance Reduced Reinforcement Learning for Recommender Systems

  • 2020-06-18 04:45:44
  • Fan Wang, Xiaomin Fang, Lihang Liu, Hao Tian, Zhiming Peng
  • 0


In recommender systems such as news feed stream, it is essential to optimizethe long-term utilities in the continuous user-system interaction processes.Previous works have proved the capability of reinforcement learning in thisproblem. However, there are many practical challenges to implement deepreinforcement learning in online systems, including low sample efficiency,uncontrollable risks, and excessive variances. To address these issues, wepropose a novel reinforcement learning method, namely model-basedcounterfactual advantage learning (MBCAL). The proposed method takes advantageof the characteristics of recommender systems and draws ideas from themodel-based reinforcement learning method for higher sample efficiency. It hastwo components: an environment model that predicts the instant user behaviorone-by-one in an auto-regressive form, and a future advantage model thatpredicts the future utility. To alleviate the impact of excessive variance whenlearning the future advantage model, we employ counterfactual comparisonsderived from the environment model. In consequence, the proposed methodpossesses high sample efficiency and significantly lower variance; Also, it isable to use existing user logs to avoid the risks of starting from scratch. Incontrast to its capability, its implementation cost is relatively low, whichfits well with practical systems. Theoretical analysis and elaborateexperiments are presented. Results show that the proposed method transcends theother supervised learning and RL-based methods in both sample efficiency andasymptotic performances.


Quick Read (beta)

MBCAL: Sample Efficient and Variance Reduced Reinforcement Learning for Recommender Systems

Fan Wang [email protected] Baidu Inc. Xiaomin Fang [email protected] Baidu Inc. Lihang Liu [email protected] Baidu Inc. Hao Tian [email protected] Baidu Research  and  Zhiming Peng [email protected] Baidu Inc.

In recommender systems such as news feed stream, it is essential to optimize the long-term utilities in the continuous user-system interaction processes. Previous works have proved the capability of reinforcement learning in this problem. However, there are many practical challenges to implement deep reinforcement learning in online systems, including low sample efficiency, uncontrollable risks, and excessive variances. To address these issues, we propose a novel reinforcement learning method, namely model-based counterfactual advantage learning (MBCAL). The proposed method takes advantage of the characteristics of recommender systems and draws ideas from the model-based reinforcement learning method for higher sample efficiency. It has two components: an environment model that predicts the instant user behavior one-by-one in an auto-regressive form, and a future advantage model that predicts the future utility. To alleviate the impact of excessive variance when learning the future advantage model, we employ counterfactual comparisons derived from the environment model. In consequence, the proposed method possesses high sample efficiency and significantly lower variance; Also, it is able to use existing user logs to avoid the risks of starting from scratch. In contrast to its capability, its implementation cost is relatively low, which fits well with practical systems. Theoretical analysis and elaborate experiments are presented. Results show that the proposed method transcends the other supervised learning and RL-based methods in both sample efficiency and asymptotic performances.

Recommender Systems, Model-based Reinforcement Learning
journalyear: 2020copyright: acmlicensedconference: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 22–27, 2020; San Diego, CA, USAbooktitle: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’20), August 22–27, 2020, San Diego, CA, USAccs: Information systems Recommender systemsccs: Computing methodologies Sequential decision making

1. Introduction

A recommender system (RS) provides users with personalized contents, which significantly improves the efficiency of information acquisition. Nowadays, a typical RS such as news feed stream needs to address multiple steps of user-system interactions in a session. The recommended content in historical interactions can affect the subsequent user behaviors. For instance, exploration of new topics may stimulate the user’s interest in related topics; Repeating overlapped contents can make the user lose his/her interest quickly. Traditional recommender systems employ collaborative filtering (schafer2007collaborative; koren2009matrix), or neural networks (cheng2016wide; hidasi2017recurrent; he2017neural) to estimate users’ instant behaviors, e.g., instant clicks. However, merely focusing on users’ instant behaviors causes many problems, such as overcrowded recommendations, which damage users’ experience in the long run. Recently there is increased attention on applying deep reinforcement learning (Deep RL) (mnih2015human; lillicrap2015continuous) to recommender systems, which models the user-system interactions as Markov Decision Processes (MDP). A large number of the studies in this area lies in model-free reinforcement learning (MFRL) methods such as Policy Gradient (chen2019top), DDPG (dulac2015deep), DRR (liu2018deep), DeepPage (zhao2018deep) and DQN based recommender systems (zheng2018drn; zhao2018recommendations). However, many challenges remain in this area. One of them is the over-consumption of data in the training process, which is also referred to as low sample efficiency. Another challenge of MFRL is the practical risks in implementation. On the one hand, on-policy RL can hardly utilize off-policy user logs for training, which raises challenges in online infrastructures and performances at the early stage. On the other hand, off-policy RL suffers from the risk of falling into Deadly Triad (Hassselt2018Deep). It refers to the case of non-convergence when combining off-policy RL with function approximation (such as neural networks) and offline training.

As an alternative choice of MFRL, model-based RL (MBRL) possesses higher sample efficiency and lower practical risks. In MBRL, an environment model (also known as world model) is employed to predict the instant feedbacks and state transitions, and a planning module is implemented to search for an optimal trajectory (Peter2011PILCO). However, MBRL needs much computation when coming to inference. To make things worse, planning is completely infeasible in a multi-stage retrieval framework, which is widely used in modern recommender systems. In such systems, an earlier stage generates the candidate set of items for the next one; thus, the candidates can not be predetermined. To avoid those issues, authors have proposed to apply Dyna algorithms in recommender systems (liu2019personalized; zoupseudo). The Dyna algorithm (Sutton1991Dyna) accelerate the convergence through generating virtual interaction by taking advantage of the environment model. However, as a cost of faster convergence, Dyna suffers from losses in asymptotic performance, due to the accumulation of error from the virtual interactions.

Figure 1. (\subreffig:Intro_1) Illustration of the user-system interaction; (\subreffig:Intro_2) Two examples of trajectories of interaction, Ta and Tb, generated by different users; (\subreffig:Intro_3) The original trajectory Ta and couterfactual comparison Tc from the same user

Excessive variance of gradients in optimization is an important challenge of deploying RL as well. The variance may originate from stochastic transition, noisy rewards, and stochastic policy. Longer horizons are found to exacerbate the variance. Excessive variance significantly slows down the convergence and introduces instabilities. Previous work (schulman2016high; Ziyu2016Dueling) has shed light on this issue by showing that using advantage function instead of value function can reduce the variance and thus improve the performance. However, those proposals aim at MFRL, and variance reduction in MBRL has not been studied yet.

Specifically, in recommender systems, the variance may come from the following aspects. First, there are extensive noises in the observed user feedbacks. For example, some users are more prone to give positive/negative feedbacks than the others. Even for a single user, he/she may behave differently at different times of a day (e.g., before sleeping vs. at work). Second, for stochastic policy, re-sampling the trajectory starting from any state can lead to variant long-term returns. Although the influence of variances can be alleviated by inducing from a sufficiently large amount of data, the variances still have negative impacts due to the sparsity of the data for specific user and item.

To clearly explain the influence of variances, we show a demonstration of the interaction process in Figure 1(\subreffig:Intro_1). For each step, the agent selects an item for display, and the user returns his/her feedbacks. An observed trajectory includes multiple steps of interactions, as Ta and Tb shown in Figure 1(\subreffig:Intro_2). In our settings, the candidate user behaviors include ”Like (Thumbs Up)” and ”Unlike (Thumbs Down)”, and the utility is the number of ”Likes” in the trajectory. Here, we consider the document d3 in the 3rd step in trajectory Ta, which yields 2 ”Likes” considering the instant and future utilities. It is hard to judge whether d3 is a superior action here due to the lack of comparison. To give a better evaluation, we can find a comparison such as Trajectory Tb in Figure 1(\subreffig:Intro_2). Tb possesses the same historical interactions as Ta, which implies that the user share many interests with that of Ta. The trajectory starting from d3 in Tb yields 3 following ”Likes”, by which we may judge that d3 is better than d3 . However, it is a quite arbitrary conclusion, because the difference in future utility can possibly be attributed to the user biases, or the quality of follow-up items (d4,d5 vs. d4,d5).

Aiming at further reducing the variances, a key idea of our work is to compare Ta with another trajectory, Tc in Figure 1. Tc shares all the contexts with Ta, including the user, historical interactions, and follow-up items (d4,d5), except for replacing the current document d3 with some other documents. By comparing Ta with Tc, we can come to a more solid conclusion on the advantage of taking d3. Unfortunately, it is impossible to find records such as Tc from user logs, as a user can not possibly go through the same trajectory twice. However, by taking advantage of the environment model, we can do simulations into the future (often referred to as simulated rollout), and we can indeed generate trajectories like Tc.

Following the idea mentioned above, we propose a novel MBRL solution toward RS, namely the Model-based Counterfactual Advantage Learning (MBCAL). First, the overall utility is decomposed into the instant utility (the rewards acquired in the current step) and the future utility (the rewards acquired in the future). The instant utility is naturally predicted with the environment model, and the future utility is approximated through simulated rollout. Second, to further reduce the variance in the future utilities, we try to do two comparative simulated rollouts. Before doing so, we introduce the masking item to the environment model, which allows us to generate simulated rollouts by masking the document in the step that we are interested in (the trajectory Tc). We then calculate the counterfactual future advantage (CFA) as the difference of the future utility with and without masking. At last, we introduce the future advantage model to approximate the CFA.

We conduct simulative experiments by utilizing three real-world datasets. The methods for comparison include supervised learning, MFRL and MBRL. We also put our attention on Batch-RL and Growing Batch-RL settings(lange2012batch), which is more compatible with the practical infrastructures. Extensive results of experiments show the superiority of the proposed method.

2. Preliminaries

2.1. Problem Settings and Notations

We formalize the recommendation as Markov Decision Processes (MDP), denoted with the symbols (𝒮,𝒜,,μ). The details are explained as follows:

  • State st𝒮 represents the unknown user interests and contextual information at step t. As it is impossible to know the user interest exactly, the observed user interaction history is frequently used as the state descriptor (zhao2019toward; liu2018deep), i.e.,

    (1) st=(o1,o2,,ot-1)

    where we use ot to represent the pair of exposed item at𝐨 and corresponding feedback bt𝐨, i.e., ot=(at𝐨,bt𝐨). For simple we also use 𝐨[1:t-1] to represent the trajectory o1,o2,,ot-1.

  • Action at𝒜t denotes the selected item by the agent, with 𝒜t being the candidate set that is passed from the earlier stage of the recommender system.

  • Reward rt is the utility that we want to maximize. Usually, it depends on the observed user behaviors.

  • Conditional Transition Probabilities μ(s|s,a) denotes the transition probability to state s given state and action pair (s,a), which is hidden in most cases.

Table 1. Notations.
Notations Descriptions
Categories of user behaviors, e.g.,
=(”Skip”, ”Click”, ”Click and Thumbs up”)
Rewards corresponding to the
behaviors in , e.g., =(0,1,5)
r r, reward / utility
s state
a action
π(a|s) policy for the recommendation
r(s,a) reward function
μ(s|s,a) transition probabilities
Qπ(s,a) state-action value function of π
Vπ(s) state value function of π
observed interactive trajectory
𝐨=(a1𝐨,b1𝐨,a2𝐨,b2𝐨, …)
𝐨[t1:t2] sub-trajectories from step t1 to t2
at𝐨 action taken at t in trajectory 𝐨
bt𝐨 bt𝐨, user behavior observed at t in 𝐨
𝒟π collection of trajectories with policy π

Additionally, we denote the user feedback (behavior) at step t with bt, where ={1,2,,n} is the set of candidate user behaviors (e.g., ”Click”, ”Skip”, ”Click and Thumbs Up”). Also, notice that we use the superscript 𝐨 to denote that an action at and user feedback bt belongs to a specific trajectory, given at𝐨 and bt𝐨. Without losing generality, we focus on maximizing the overall utility within a fixed T steps of interactions. In addition to the symbols mentioned above, we also adopt other commonly used notations in RL, shown in Table 1.

2.2. RL-based Recommender Systems

Without confusion, we replace state st with the trajectory 𝐨[1:t-1], and denote the value and policy function as Qπ(𝐨[1:t-1],a) and π(a|𝐨[1:t-1]), respectively. It is worth to notice that other user-side features (user-ids, user properties, context features such as time of the day) can be involved in the state representation without any difficulty. However, for simplicity, we omit the notation of those features.

Typical MFRL methods try to learn the function approximator denoted as Qθ, with θ being trainable parameters, which is optimized by minimizing the loss function:

(2) MFRL(θ)=1|𝒟|𝐨𝒟1Tt=1T[Q^target-Qθ(𝐨[1:t-1],at𝐨)]2.

Q^target here represents the value backup (backup: updating the value of current states with that of future states (sutton1998introduction)), which is calculated differently in different algorithms. E.g., Deep Q Network (DQN) (mnih2015human) uses the temporal difference errors, where Q^target= rt+γmaxaQθ(𝐨[1:t],a). θ here is the set of parameters which is periodically copied from θ during the optimization process.

2.3. Growing Batch-RL

As online update is typically not achievable in realistic recommender systems, we are more interested in Batch-RL and Growing Batch-RL (lange2012batch) settings. Batch-RL refers to the policy learning with static user logs. It can usually be used to evaluate the capability of utilizing the offline data. Growing Batch-RL lies somewhere in between the Batch-RL and online learning. The data can be re-collected in a batch manner and used for policy update iteratively. Many recently proposed RL-based systems use a framework that is close to Growing Batch RL settings (zheng2018drn; zhao2018deep; chen2019top). More specifically, Growing Batch RL includes two periodic interleaving stages:

Data Collection. The agent uses the policy πk to interact with the environment (the users). The policy is kept static during this process. The collected interactive trajectories are denoted as 𝒟πk.

Policy Update. The interactive trajectories 𝒟πk are used for training, the agent update the policy from πk to πk+1.

The essential problems of Batch-RL and Growing Batch-RL are to guarantee the Policy Improvement, i.e., how much the performance of πk+1 is improved compared with πk.

3. Methodology

The fundamental idea of MBCAL is explained with Figure 2. We use two models to approximate the instant user behavior and the future advantage separately: the Masked Environment Model (MEM) and the Future Advantage Model (FAM). For the training, we start with optimizing the environment model to predict user behaviors, where we also introduce the masking item into the model. With the MEM, we can calculate the Counterfactual Future Advantage (CFA) by comparing the future utility of masking the action or not. CFA further serves as the label of FAM. For the inference, we use the combination of the two models to pick actions.

In this section, we first formalize the environment model, and then we describe MEM, FAM, and the learning process overall. They are followed by the theoretical analysis of the proposed method.

Figure 2. Overview of MBCAL

3.1. Environment Modeling

As the most common setting, an environment model predicts the transition and the reward separately. Here, we use μ^(st,at,st+1) to approximate μ(st+1|st,at) and r^(st,at) to approximate the reward r(st,at). Specifically, to derive the formulation of environment model in RS, we use Equation (1), and μ(st+1|st,at) can be rewritten by Equation (3).

μ(st+1|st,at) =μ(𝐨[1:t]|𝐨[1:t-1],at𝐨)
(3) =μ(bt𝐨|𝐨[1:t-1],at𝐨)

In other words, the prediction of transition degenerates into the prediction of instant user behaviors. Notice that reward is dependent on the user behavior only as well, thus it is possible to use only one model in place of μ^ and r^. We introduce fϕ(𝐨[1:t-1],at𝐨,bt𝐨) with trainable parameters ϕ to approximate μ(bt𝐨|𝐨[1:t-1],at𝐨), which predicts the probability of the next user behavior being bt𝐨. The transition and the reward are then naturally approximated with

(4) μ^(st,at,st+1)=fϕ(𝐨[1:t-1],at𝐨,bt𝐨).
(5) r^(st,at)=nnfϕ(𝐨[1:t-1],at𝐨,n),

3.2. Masked Environment Model

To eliminate the intractable noises hidden in the feedbacks, we introduce a masking item into the model. The motivation of this is to try to find a counterfactual comparison to the current trajectory, which answers the question: ”If this action was not taken, what would the future behavior be like?” To do so, we introduce a virtual item aM, which is represented by a trainable embedding vector. For convenience, given an observation trajectory 𝐨, we denote the trajectory where the actions at positions y={t1,t2,} are replaced by aM as (𝐨,y,aM). For example, suppose y={2,3}, it gives a masked trajectory of


where actions a2𝐨 and a3𝐨 in 𝐨 are replaced by aM.

The training is straightforward. We sample random positions y𝐨 for each trajectory 𝐨, such that each position has uniform probability of pmask to be replaced. We want the MEM to recover the user behavior as close as possible in case some items are masked. With the collected masked trajectories 𝒟M={(𝐨,y𝐨,aM)|𝐨𝒟}, we maximize the likelihood, or minimize the negative log-likelihood (NLL).

(6) MEM(ϕ)=-1|𝒟M|𝐨𝒟M1Tt=1T[logfϕ(𝐨[:t-1],at𝐨,bt𝐨)].

To model the sequential observations, the architecture of MEM follows that of session-based recurrent RS((Hidasi2015Session; hidasi2017recurrent)). We use Gated Neural Network(Graves2013Speech) to encode the trajectory 𝐨[1:t-1]. As we need to encode 𝐨[1:t-1] and at𝐨 at the same time, we concatenate the input in a staggered way, equivalent to the setting of (santoro2016meta). For each step t, the model takes bt-1𝐨 and at𝐨 as input and output the probability of the next possible behavior. An additional bs is introduced as the start of the observed user behavior (see Figure 3). Concretely, the architecture is formulated as follows.

(7) h0MEM=𝐄𝐦𝐛(u),
(8) xtMEM=𝐂𝐨𝐧𝐜𝐚𝐭(𝐄𝐦𝐛(bt-1),𝐄𝐦𝐛(at)),
(9) htMEM=𝐆𝐑𝐔(ht-1MEM,xtMEM),
(10) fϕ=𝐒𝐨𝐟𝐭𝐦𝐚𝐱(𝐌𝐋𝐏(htMEM)).

Here Emb denotes a representation layer; Concat denotes a concat operation and MLP denotes multilayer perceptron; GRU represents a Gated Recurrent Unit.

3.3. Counterfactual Future Advantage

With the MEM, we can estimate the difference in future utilities between the original trajectory and the counterfactual comparison, namely the Counterfactual Future Advantage (CFA). Specifically, given the trained MEM fϕ, we first define the Simulated Future Reward (SFR, denoted with R^future) of the observed trajectory 𝐨 at time step t[1,T] as

(11) R^ϕfuture(𝐨,t)=τ=t+1Tγτ-trϕ(𝐨[1:τ-1],aτ𝐨).

We then calculate CFA (denoted with A^future) by subtracting the SFR of counterfactual comparison from the original one, see Equation (12).

(12) A^ϕfuture(𝐨,t)=R^ϕfuture(𝐨,t)-R^ϕfuture((𝐨,{t},aM),t).

Finally, we introduce the Future Advantage Model (FAM) denoted with gη(𝐨[1:t-1],at), with trainable parameters η. To train FAM, we minimize the mean square error shown in Equation (13).

FAM(η)= -1|𝒟|(𝐨)𝒟1Tt=1T[gη(𝐨[1:t-1],at𝐨)
(13) -A^ϕfuture(𝐨,t)]2.

FAM takes the equivalent input as the MEM. We use the same neural architecture as MEM except for the last layer, but with different parameters. For the last layer FAM predicts a scalar (the advantage) instead of a distribution, shown as follows:

(14) h0FAM=𝐄𝐦𝐛(u),
(15) xtFAM=𝐂𝐨𝐧𝐜𝐚𝐭(𝐄𝐦𝐛(bt-1),𝐄𝐦𝐛(at)),
(16) htFAM=𝐆𝐑𝐔(ht-1FAM,xtFAM),
(17) gη=𝐌𝐋𝐏(htFAM).

3.4. Summary of MBCAL

An integrate illustration of MBCAL is shown in Figure 3. For the inference, we select the item(action) based on both the MEM and FMA. Formally, given the user information u and the observation trajectory 𝐨[1:t-1], we pick the next action according to Equation (18).

(18) at*=argmaxa[rϕ(𝐨[1:t-1],a)+gη(𝐨[1:t-1],a)].

To avoid the local optimum in policy improvement, we employ ϵ-greedy strategy (sutton2018reinforcement). With probability ϵ, we select a random action, and with the left probability, we select according to Equation (18). It is written as

(19) at={at*,withprobability 1-ϵ;randomactiona𝒜t,withprobabilityϵ.

MBCAL fits well with the Growing Batch-RL settings. The summary of the algorithm is shown in Algorithm 3, with the function PolicyUpdate shown in Algorithm 3. Although we use the notation π to represent the policy, we do not require any explicit formulation of the policy; The common policy gradient type algorithms require the explicit form of policy, which is hard to acquire in many recommender systems.

Figure 3. Illustration of the MBCAL architectures.

Model-Based Counterfactual Advantage Learning(MBCAL) {algorithmic}[1] \StateInitial Policy π0. \Fork = 0,1,2,… until convergence \StateUse policy πk to interact with users for N trajectories, and record the interaction history 𝒟πk. \Statefϕ,gη=PolicyUpdate(𝒟πk). \StateSet πk+1 according to Equation (19). \EndFor\StateReturn πk+1 in the last iteration.


PolicyUpdate(𝒟) {algorithmic}[1] \StateInput: Interaction history 𝒟. \StateRandomly masking 𝒟 to acquire 𝒟M. \StateMinimize MEM(Equation (6)) on 𝒟M to optimize fϕ. \StateFor 𝐨𝒟, calculate A^𝐨,t with t[1,T] by Equation (11) and Equation (12) \StateMinimize FAM(Equation (13)) on 𝒟 to optimize gη. \StateReturn fϕ,gη.

The essence of the variance reduction in MBCAL lies in Equation (12), where the subtraction eliminates the noises from user feedbacks and other sources. We borrow ideas from the advantage function (wang2015dueling), however, CFA is different from the advantage function in that we do not resample the trajectory but we keep the rest of trajectory (that is 𝐨[t+1:T]) unchanged. Although this could bring severe biases in many MDP problems, we argue that the recommender systems embrace weaker correlations between sequential decisions than the other problems (such as robot control and game control). Additionally, as the FAM averages out CFA across different trajectories, the bias turn out to be negligible compared with the benefits of reducing the variances.

4. Experiments

4.1. Datasets

Evaluation of RL-based recommender systems is challenging. The most convincing metric requires running online A/B tests, but it is not only too costly but also too risky to compare all the baselines in an online system. Offline evaluation of long-term utility using user logs is tricky as we can not have the feedback if the agent recommends a different item than what was stored in the log. In order to thoroughly study the performance of the proposed systems, we follow the previous works to build simulators (dulac2015deep; zhao2018deep; liu2019personalized; Eugene2019RecSim). But instead of synthetic simulators, we use real-data-driven simulators. The datasets used include: MovieLens ml-20m11 1 http://files.grouplens.org/datasets/movielens/ml-20m-README.html (harper2016movielens), Netflix Prize22 2 https://www.kaggle.com/netflix-inc/netflix-prize-data and NewsFeed33 3 Data collected from Baidu App News Feed System, as shown in Table 2. Details of the datasets are explained as follows.

  • MovieLens ml-20m: The dataset describes 5-star rating activities from MovieLens. The user behavior =[0,1,2,3,4,5] corresponds to the star ratings, with the reward to be =[0,1,2,3,4,5]. There are 3 kinds of features (movie-id, movie-genre and movie-tag).

  • Netflix Prize: The dataset is a 5-star rating dataset from Netflix. The reward follow the setting of MovieLens. There are only 1 type of features (movie-id).

  • NewsFeed: The dataset is collected from a real online news recommendation system. We focus on predicting the dwelling time on the clicked news. The dwelling time is partitioned into 12 levels (e.g., dwelling time ¡ 30, 30 ¡ dwelling time ¡ 40, …), corresponding to 12 different user behaviors, with the corresponding rewards to be =[1,2,,12]. There are 7 kinds of features (news-id, news-tag, news-title, news-category, news-topics, news-type, news-source).

4.2. Experimental Settings

4.2.1. Simulator Details

In order to fairly evaluate different methods, it is necessary to avoid the agent in the evaluated system to ”hack” the simulator. For this purpose, we add two special settings in the evaluation processes. First, all the agents to be evaluated are allowed to use only a subset of features, while the simulator uses the full feature set. In MovieLens and Netflix, only 1 feature (movie-id) is used in the agents. In NewsFeed, 4 kinds out of 7 are used (news-id, category, news-type, and news-source). Second, we artificially set the model architecture of the simulator to be different from that of the agents. We use the LSTM unit for the simulators, while GRU is used in the agents. To get a view of how close the simulator is to the real environment, we list the micro-F1, weighted-F1, and RMSE with respect to the accuracy of user behavior classification. Properties of datasets and simulators are shown in Table 2.44 4 The source code can be found at: https://github.com/LihangLiu/MBCAL In NewsFeed, we also retrieved over 400 historical A-B test records online. Concerning long-term rewards, including total clicks or dwelling time of a session, the correlation of prediction of our simulators to the real case is 0.90+.

Table 2. Properties of Datasets and Simulators.
Properties MovieLens Netflix NewsFeed
# of Users 130K 480K 920K
# of Items 20K 17K 110K
# of Different Labels 6 6 12
# of Types of Features 3 1 7
Size of Training Set 2.48M 4.53M 9.41M
Size of Validation Set 1.27M 2.27M 4.70M
Simulator Macro-F1 0.545 0.511 0.923
Simulator Weighted-F1 0.532 0.498 0.887
Simulator RMSE 0.770 0.848 1.810

4.2.2. Evaluation Settings

There are two types of iterations in the evaluation: the training round and the test round. For a training round, the agent to be evaluated produces actions by using ϵ-greedy policy (ϵ=0.1 throughout all experiments). It then updates its policy using the collected feedback from the simulator. In the test round, the algorithm produces actions by using a greedy policy, which is evaluated by the simulator. The data generated in the test round are not used in training. For each session in training or test rounds, we consider T=20 steps of interaction between the simulator and the agent. Each training round or test round includes 256,000 sessions.

For each experiment, we report the average reward per session in the test round, defined by 1/|𝒟test|𝐨𝒟testt=1Trt, where 𝒟test is the collection of trajectories in test round. Each experiment is repeated with different random seed for three times, the mean and variance of the score is reported. We also simulate the Batch RL and the Growing Batch-RL evaluation separately (Figure 4). In the Batch RL evaluation, the agent can use only the static user log for training, and it interacts with the simulator for test. In Growing Batch RL evaluation, for both training round and test round, the agent needs to interact with the simulator. The training round repeats up to 40 times.

Figure 4. The evaluation processes.

4.3. Methods for Comparison

We compare different methods ranging from Supervised Learning (GRU4Rec), bandits (GRU4Rec (ϵ-greedy)) to MFRL (MCPE, DQN, DDQN, and DDPG) and MBRL (Dyna-Q). For bandits, LinUCB (Chu2011Contextual) is commonly used as a baseline. However, in our environments, LinUCB performs poorly due to the insufficient representative power of the linear model. Thus, we post the results of ϵ-greedy version of NN models (GRU4Rec (ϵ-greedy)) instead of LinUCB.

The methods for comparison include:

  • GRU4Rec (Hidasi2015Session): It adopts GRU to encode the interactive history to predict the instant user behavior. The model architecture is equivalent to the environment model. We use the entropy losses in GRU4Rec.

  • GRU4Rec (ϵ-greedy): It applies the ϵ-greedy selection of items in GRU4Rec during the training rounds.

  • DQN (zheng2018drn): A classical off-policy learning algorithm (mnih2015human). For state representation, to ensure a fair comparison between different learning algorithms, GRU is used to encode the historical observations, which is equivalent to GRU4Rec and our method.

  • DDQN (zheng2018drn): Double DQN (Hasselt2016DDQN) uses a different action selection for value backup to avoid the value overestimation in off-policy learning. The model architecture is kept equivalent to GRU4Rec.

  • DDPG (zhao2018deep): Deep Deterministic Policy Gradient (DDPG) (lillicrap2015continuous) is an off-policy learning algorithm for continuous action space. The inferred action is used to select the item that lies closest to it for display (nearest neighbor). We use the same neural structure as GRU4Rec for both the actor and critic networks in DDPG.

  • MCPE: Monte Carlo Policy Evaluation (Dimitrakakis2008Roll) is a straight-forward value iteration algorithm. The Monte Carlo evaluation of the whole trajectory is used, i.e., we apply Equation (2) with Q^target=trt. Again, we keep the model architecture equal to the other baselines.

  • Dyna-Q (liu2019personalized; zoupseudo): Dyna-Q is an MBRL method that augments DQN with the imagined rollouts from an environment model. The ratio of imagined rollouts to real trajectories is set to 1:1.

  • MBCAL: The full version of our method.

  • MBCAL (w/o variance reduction): It is an ablative version of MBCAL. We use SFR instead of CFA as the label of FAM, i.e., in Equation (13), R^ϕfuture is used instead of A^ϕfuture.

All the parameters are optimized by adam optimizer with learning rate = 10-3, β1=0.9 and β2=0.999. The decay factor in the long-term reward is set to be γ=0.95. The embedding sizes for the item-id and other id type features are all set to 32. The hidden size for MLP is set to 32. For training MEM in MBCAL, we use pmask=0.20 to generate the masked trajectories DM. In DDPG, we found that high dimensional action space yields really poor performance. Thus we use a 4-dimensional action space. Correspondingly we use an additional layer to map the item representation into a 4-dimensional vector.

4.4. Experimental Results

Table 3. Average reward per session of different algorithms and datasets in Batch-RL evaluation.
Algorithms Average reward per session
Movielens Netflix NewsFeed
GRU4Rec 77.93±0.06 79.63±0.02 11.58±0.14
DDPG 70.99±0.70 72.50±0.35 10.90±0.42
DQN 77.27±0.06 77.75±0.01 12.44±0.33
DDQN 77.23±0.02 77.70±0.04 12.48±0.17
MCPE 77.20±0.10 77.70±0.03 13.21±0.53
Dyna-Q 77.25±0.05 77.81±0.02 13.04±0.33
MBCAL 78.02±0.03 79.71±0.04 16.32±0.24
variance reduction)
77.70±0.04 79.50±0.04 15.61±0.38

4.4.1. Results of Batch-RL Evaluation

The results on Batch-RL evaluation are shown in Tab. 3. We evaluate the reward of a session according to the reward generated by the simulator. To conclude from the result, MFRL can not compete with MBRL in all three environments. As MFRL is sample inefficient, it tends to have poor startup performance. Surprisingly DDPG has the weakest performance in all three environments. By carefully investigating the value functions in DDPG, we found that DDPG overestimates the value function a lot compared with the other MFRL. We thought that the overestimation comes from value backups from continuous actions that may not correspond to realistic items. The overestimation problem in actor-critic methods has also been thoroughly investigated (Fujimoto18Addressing).

As is expected, the MBCAL leads the performance of all the tested systems with substantial margins, demonstrating its sample efficiency. However, for Movielens and Netflix, our method earns a smaller margin over the supervised learning method compared with that of NewsFeed. It is likely that the long term reward plays a more significant role in NewsFeed than the other two environments. Furthermore, as learning to predict long-term utility requires more data than the instant reward, the preponderance of RL has not yet been sufficiently revealed in Batch-RL settings. However, it is essential that the performance of MBCAL at the start stage is already state-of-the-art, which proves that MBCAL has low risks and high sample efficiency.

Figure 5. Average reward per session of different algorithms and datasets in Growing Batch-RL evaluation, with the horizontal axis represent the training rounds.

4.4.2. Results of Growing Batch-RL Evaluation

Figure 5 shows the results of the Online Evaluation in three environments. GRU4Rec(ϵ-greedy) surpasses the purely supervised learning GRU4Rec by a small margin in every environment, showing the benefit of exploration in online systems. Performances of DDPG in all three environments are again surprisingly bad. In Figure 5, we show DDPG curve in NewsFeed environment only because in the other two environments, DDPG lags too much behind all the other methods. We believe that continuous action space for recommendation systems with dynamic discrete item space can not work well enough.

With the assistance of the environment model, Dyna-Q gains some advantages at the beginning, but it gradually subsides as the learning continues. This phenomenon is just in line with the expectations, for the virtual experience quickly loses its benefit with the accumulation of sufficient real user feedback. MBCAL again keeps its performance ahead of the other methods in all the environments. Even for Netflix and Movielens, where the other RL-based system fails to gain any benefits over traditional GRU4Rec, MBCAL wins with a considerable margin. In NewsFeed, where the long term rewards play a more critical role, MBCAL strengthens the leading edge.

MCPE, DQN, DDQN, and Dyna-Q lag entirely behind the other methods, including supervised learning baselines in Movielens and Netflix environment, while this is not true in NewsFeed. We investigate the reason by setting the output of GRU4Rec to the instant reward instead of the user behavior classification, which turned the classification into regression, and the entropy loss to mean square error loss. We found a significant drop in performance in GRU4Rec, which is more consistent with the results in NewsFeed. The results show that classification and entropy loss benefit the system more than regressions. An explanation is that user behavior contains more abundant information than the rewards, which also made MBRL more advantageous than MFRL.

4.4.3. Analysis of the variance

The key point in MBCAL is the variance reduction through counterfactual comparisons. The previous proposals (Geman92Neural) suggest that the mean square error (MSE) in a well-trained model is composed of the model bias and the variance(noise) in the labels. As we use equivalent neural architectures in all the methods for comparison, they share the same model bias. Therefore the mean square error shall be dominated by the noise. To study whether CFA truly reduces the variance, we compare the MSE from Equation (2) and Equation (13). We compare the MSE of MCPE, DQN, Dyna-Q, MBCAL (w/o variance reduction), and MBCAL, based on the interactive logs collected in the test round of Batch-RL evaluation. The average MSE is presented in Table 4.

Table 4. The mean square error (MSE) loss of different algorithms in different environments.
Algorithms MSE loss
Movielens Netflix NewsFeed
DQN 1.50 1.22 4.29
MCPE 17.1 9.21 46.9
Dyna-Q 0.94 1.04 7.87
MBCAL 0.004 0.009 0.07
MBCAL (w/o
variance reduction)
3.45 3.29 3.07

According to the previous theoretical analysis, using value backup of longer horizons suffer from more significant variance. The variance of MCPE is indeed higher than that of DQN and Dyna-Q, as backup of the whole trajectory is used. The MBCAL (w/o variance reduction) has the second-largest variance. It is smaller compared with MCPE because using simulated rollout from the environment model already eliminates part of the noises. The variances of DQN and Dyna-Q are smaller because one-step value backup is employed. Compared with the other methods, MBCAL embraces significantly lower variances, which shows that variance has been reduced as expected.

5. Conclusion

To conclude this work, we are focused on the sequential decision-making problems in the recommender systems. To maximize the long-term utility, we propose a sample efficient and variance reduced reinforcement learning method: MBCAL. It involves a masked environment model to capture the instant user behavior, and a future advantage model to capture the future utility. Through counterfactual comparison, MBCAL significantly reduces the variance in learning. Experiments on real-data-driven simulations show that the proposed method transcends the previous ones in both sample efficiency and asymptotic performances. Possible future extensions to this work may be to theoretically calculating the error bound, and to extend the fixed horizon settings to infinite and dynamic horizon recommender systems.