Abstract
In recommender systems such as news feed stream, it is essential to optimizethe longterm utilities in the continuous usersystem interaction processes.Previous works have proved the capability of reinforcement learning in thisproblem. However, there are many practical challenges to implement deepreinforcement learning in online systems, including low sample efficiency,uncontrollable risks, and excessive variances. To address these issues, wepropose a novel reinforcement learning method, namely modelbasedcounterfactual advantage learning (MBCAL). The proposed method takes advantageof the characteristics of recommender systems and draws ideas from themodelbased reinforcement learning method for higher sample efficiency. It hastwo components: an environment model that predicts the instant user behavioronebyone in an autoregressive form, and a future advantage model thatpredicts the future utility. To alleviate the impact of excessive variance whenlearning the future advantage model, we employ counterfactual comparisonsderived from the environment model. In consequence, the proposed methodpossesses high sample efficiency and significantly lower variance; Also, it isable to use existing user logs to avoid the risks of starting from scratch. Incontrast to its capability, its implementation cost is relatively low, whichfits well with practical systems. Theoretical analysis and elaborateexperiments are presented. Results show that the proposed method transcends theother supervised learning and RLbased methods in both sample efficiency andasymptotic performances.
Quick Read (beta)
MBCAL: Sample Efficient and Variance Reduced Reinforcement Learning for Recommender Systems
Abstract.
In recommender systems such as news feed stream, it is essential to optimize the longterm utilities in the continuous usersystem interaction processes. Previous works have proved the capability of reinforcement learning in this problem. However, there are many practical challenges to implement deep reinforcement learning in online systems, including low sample efficiency, uncontrollable risks, and excessive variances. To address these issues, we propose a novel reinforcement learning method, namely modelbased counterfactual advantage learning (MBCAL). The proposed method takes advantage of the characteristics of recommender systems and draws ideas from the modelbased reinforcement learning method for higher sample efficiency. It has two components: an environment model that predicts the instant user behavior onebyone in an autoregressive form, and a future advantage model that predicts the future utility. To alleviate the impact of excessive variance when learning the future advantage model, we employ counterfactual comparisons derived from the environment model. In consequence, the proposed method possesses high sample efficiency and significantly lower variance; Also, it is able to use existing user logs to avoid the risks of starting from scratch. In contrast to its capability, its implementation cost is relatively low, which fits well with practical systems. Theoretical analysis and elaborate experiments are presented. Results show that the proposed method transcends the other supervised learning and RLbased methods in both sample efficiency and asymptotic performances.
1. Introduction
A recommender system (RS) provides users with personalized contents, which significantly improves the efficiency of information acquisition. Nowadays, a typical RS such as news feed stream needs to address multiple steps of usersystem interactions in a session. The recommended content in historical interactions can affect the subsequent user behaviors. For instance, exploration of new topics may stimulate the user’s interest in related topics; Repeating overlapped contents can make the user lose his/her interest quickly. Traditional recommender systems employ collaborative filtering (schafer2007collaborative; koren2009matrix), or neural networks (cheng2016wide; hidasi2017recurrent; he2017neural) to estimate users’ instant behaviors, e.g., instant clicks. However, merely focusing on users’ instant behaviors causes many problems, such as overcrowded recommendations, which damage users’ experience in the long run. Recently there is increased attention on applying deep reinforcement learning (Deep RL) (mnih2015human; lillicrap2015continuous) to recommender systems, which models the usersystem interactions as Markov Decision Processes (MDP). A large number of the studies in this area lies in modelfree reinforcement learning (MFRL) methods such as Policy Gradient (chen2019top), DDPG (dulac2015deep), DRR (liu2018deep), DeepPage (zhao2018deep) and DQN based recommender systems (zheng2018drn; zhao2018recommendations). However, many challenges remain in this area. One of them is the overconsumption of data in the training process, which is also referred to as low sample efficiency. Another challenge of MFRL is the practical risks in implementation. On the one hand, onpolicy RL can hardly utilize offpolicy user logs for training, which raises challenges in online infrastructures and performances at the early stage. On the other hand, offpolicy RL suffers from the risk of falling into Deadly Triad (Hassselt2018Deep). It refers to the case of nonconvergence when combining offpolicy RL with function approximation (such as neural networks) and offline training.
As an alternative choice of MFRL, modelbased RL (MBRL) possesses higher sample efficiency and lower practical risks. In MBRL, an environment model (also known as world model) is employed to predict the instant feedbacks and state transitions, and a planning module is implemented to search for an optimal trajectory (Peter2011PILCO). However, MBRL needs much computation when coming to inference. To make things worse, planning is completely infeasible in a multistage retrieval framework, which is widely used in modern recommender systems. In such systems, an earlier stage generates the candidate set of items for the next one; thus, the candidates can not be predetermined. To avoid those issues, authors have proposed to apply Dyna algorithms in recommender systems (liu2019personalized; zoupseudo). The Dyna algorithm (Sutton1991Dyna) accelerate the convergence through generating virtual interaction by taking advantage of the environment model. However, as a cost of faster convergence, Dyna suffers from losses in asymptotic performance, due to the accumulation of error from the virtual interactions.
Excessive variance of gradients in optimization is an important challenge of deploying RL as well. The variance may originate from stochastic transition, noisy rewards, and stochastic policy. Longer horizons are found to exacerbate the variance. Excessive variance significantly slows down the convergence and introduces instabilities. Previous work (schulman2016high; Ziyu2016Dueling) has shed light on this issue by showing that using advantage function instead of value function can reduce the variance and thus improve the performance. However, those proposals aim at MFRL, and variance reduction in MBRL has not been studied yet.
Specifically, in recommender systems, the variance may come from the following aspects. First, there are extensive noises in the observed user feedbacks. For example, some users are more prone to give positive/negative feedbacks than the others. Even for a single user, he/she may behave differently at different times of a day (e.g., before sleeping vs. at work). Second, for stochastic policy, resampling the trajectory starting from any state can lead to variant longterm returns. Although the influence of variances can be alleviated by inducing from a sufficiently large amount of data, the variances still have negative impacts due to the sparsity of the data for specific user and item.
To clearly explain the influence of variances, we show a demonstration of the interaction process in Figure 1(\subreffig:Intro_1). For each step, the agent selects an item for display, and the user returns his/her feedbacks. An observed trajectory includes multiple steps of interactions, as ${T}_{a}$ and ${T}_{b}$ shown in Figure 1(\subreffig:Intro_2). In our settings, the candidate user behaviors include ”Like (Thumbs Up)” and ”Unlike (Thumbs Down)”, and the utility is the number of ”Likes” in the trajectory. Here, we consider the document ${d}_{3}$ in the 3rd step in trajectory ${T}_{a}$, which yields 2 ”Likes” considering the instant and future utilities. It is hard to judge whether ${d}_{3}$ is a superior action here due to the lack of comparison. To give a better evaluation, we can find a comparison such as Trajectory ${T}_{b}$ in Figure 1(\subreffig:Intro_2). ${T}_{b}$ possesses the same historical interactions as ${T}_{a}$, which implies that the user share many interests with that of ${T}_{a}$. The trajectory starting from ${d}_{3}^{\prime}$ in ${T}_{b}$ yields 3 following ”Likes”, by which we may judge that ${d}_{3}^{\prime}$ is better than ${d}_{3}$ . However, it is a quite arbitrary conclusion, because the difference in future utility can possibly be attributed to the user biases, or the quality of followup items (${d}_{4},{d}_{5}$ vs. ${d}_{4}^{\prime},{d}_{5}^{\prime}$).
Aiming at further reducing the variances, a key idea of our work is to compare ${T}_{a}$ with another trajectory, ${T}_{c}$ in Figure 1. ${T}_{c}$ shares all the contexts with ${T}_{a}$, including the user, historical interactions, and followup items (${d}_{4},{d}_{5}$), except for replacing the current document ${d}_{3}$ with some other documents. By comparing ${T}_{a}$ with ${T}_{c}$, we can come to a more solid conclusion on the advantage of taking ${d}_{3}$. Unfortunately, it is impossible to find records such as ${T}_{c}$ from user logs, as a user can not possibly go through the same trajectory twice. However, by taking advantage of the environment model, we can do simulations into the future (often referred to as simulated rollout), and we can indeed generate trajectories like ${T}_{c}$.
Following the idea mentioned above, we propose a novel MBRL solution toward RS, namely the Modelbased Counterfactual Advantage Learning (MBCAL). First, the overall utility is decomposed into the instant utility (the rewards acquired in the current step) and the future utility (the rewards acquired in the future). The instant utility is naturally predicted with the environment model, and the future utility is approximated through simulated rollout. Second, to further reduce the variance in the future utilities, we try to do two comparative simulated rollouts. Before doing so, we introduce the masking item to the environment model, which allows us to generate simulated rollouts by masking the document in the step that we are interested in (the trajectory ${T}_{c}$). We then calculate the counterfactual future advantage (CFA) as the difference of the future utility with and without masking. At last, we introduce the future advantage model to approximate the CFA.
We conduct simulative experiments by utilizing three realworld datasets. The methods for comparison include supervised learning, MFRL and MBRL. We also put our attention on BatchRL and Growing BatchRL settings(lange2012batch), which is more compatible with the practical infrastructures. Extensive results of experiments show the superiority of the proposed method.
2. Preliminaries
2.1. Problem Settings and Notations
We formalize the recommendation as Markov Decision Processes (MDP), denoted with the symbols $(\mathcal{S},\mathcal{A},\mathcal{R},\mu )$. The details are explained as follows:

•
State ${s}_{t}\in \mathcal{S}$ represents the unknown user interests and contextual information at step $t$. As it is impossible to know the user interest exactly, the observed user interaction history is frequently used as the state descriptor (zhao2019toward; liu2018deep), i.e.,
(1) $${s}_{t}=({o}_{1},{o}_{2},\mathrm{\dots},{o}_{t1})$$ where we use ${o}_{t}$ to represent the pair of exposed item ${a}_{t}^{\text{\mathbf{o}}}$ and corresponding feedback ${b}_{t}^{\text{\mathbf{o}}}$, i.e., ${o}_{t}=({a}_{t}^{\text{\mathbf{o}}},{b}_{t}^{\text{\mathbf{o}}})$. For simple we also use ${\text{\mathbf{o}}}_{[1:t1]}$ to represent the trajectory ${o}_{1},{o}_{2},\mathrm{\dots},{o}_{t1}$.

•
Action ${a}_{t}\in {\mathcal{A}}_{t}$ denotes the selected item by the agent, with ${\mathcal{A}}_{t}$ being the candidate set that is passed from the earlier stage of the recommender system.

•
Reward ${r}_{t}\in \mathcal{R}$ is the utility that we want to maximize. Usually, it depends on the observed user behaviors.

•
Conditional Transition Probabilities $\mu ({s}^{\prime}s,a)$ denotes the transition probability to state ${s}^{\prime}$ given state and action pair $(s,a)$, which is hidden in most cases.
Notations  Descriptions  

$\mathcal{B}$ 


$\mathcal{R}$ 


$r$  $r\in \mathcal{R}$, reward / utility  
$s$  state  
$a$  action  
$\pi (as)$  policy for the recommendation  
$r(s,a)$  reward function  
$\mu ({s}^{\prime}s,a)$  transition probabilities  
${Q}^{\pi}(s,a)$  stateaction value function of $\pi $  
${V}^{\pi}(s)$  state value function of $\pi $  
$\text{\mathbf{o}}$ 


${\text{\mathbf{o}}}_{[{t}_{1}:{t}_{2}]}$  subtrajectories from step ${t}_{1}$ to ${t}_{2}$  
${a}_{t}^{\text{\mathbf{o}}}$  action taken at $t$ in trajectory $\text{\mathbf{o}}$  
${b}_{t}^{\text{\mathbf{o}}}$  ${b}_{t}^{\text{\mathbf{o}}}\in \mathcal{B}$, user behavior observed at $t$ in $\text{\mathbf{o}}$  
${\mathcal{D}}^{\pi}$  collection of trajectories with policy $\pi $ 
Additionally, we denote the user feedback (behavior) at step $t$ with ${b}_{t}\in \mathcal{B}$, where $\mathcal{B}=\{{\mathcal{B}}_{1},{\mathcal{B}}_{2},\mathrm{\dots},{\mathcal{B}}_{n}\}$ is the set of candidate user behaviors (e.g., ”Click”, ”Skip”, ”Click and Thumbs Up”). Also, notice that we use the superscript $\text{\mathbf{o}}$ to denote that an action ${a}_{t}$ and user feedback ${b}_{t}$ belongs to a specific trajectory, given ${a}_{t}^{\text{\mathbf{o}}}$ and ${b}_{t}^{\text{\mathbf{o}}}$. Without losing generality, we focus on maximizing the overall utility within a fixed $T$ steps of interactions. In addition to the symbols mentioned above, we also adopt other commonly used notations in RL, shown in Table 1.
2.2. RLbased Recommender Systems
Without confusion, we replace state ${s}_{t}$ with the trajectory ${\text{\mathbf{o}}}_{[1:t1]}$, and denote the value and policy function as ${Q}^{\pi}({\text{\mathbf{o}}}_{[1:t1]},a)$ and $\pi (a{\text{\mathbf{o}}}_{[1:t1]})$, respectively. It is worth to notice that other userside features (userids, user properties, context features such as time of the day) can be involved in the state representation without any difficulty. However, for simplicity, we omit the notation of those features.
Typical MFRL methods try to learn the function approximator denoted as ${Q}_{\theta}$, with $\theta $ being trainable parameters, which is optimized by minimizing the loss function:
(2)  $$\underset{\text{MFRL}}{\mathcal{L}}(\theta )=\frac{1}{\mathcal{D}}\sum _{\text{\mathbf{o}}\in \mathcal{D}}\frac{1}{T}\sum _{t=1}^{T}{[{\widehat{Q}}_{\text{target}}{Q}_{\theta}(\underset{[1:t1]}{\text{\mathbf{o}}},{a}_{t}^{\text{\mathbf{o}}})]}^{2}.$$ 
${\widehat{Q}}_{\text{target}}$ here represents the value backup (backup: updating the value of current states with that of future states (sutton1998introduction)), which is calculated differently in different algorithms. E.g., Deep Q Network (DQN) (mnih2015human) uses the temporal difference errors, where ${\widehat{Q}}_{\text{target}}=$ ${r}_{t}+\gamma {\mathrm{max}}_{{a}^{\prime}}{Q}_{{\theta}^{\prime}}({\text{\mathbf{o}}}_{[1:t]},{a}^{\prime})$. ${\theta}^{\prime}$ here is the set of parameters which is periodically copied from $\theta $ during the optimization process.
2.3. Growing BatchRL
As online update is typically not achievable in realistic recommender systems, we are more interested in BatchRL and Growing BatchRL (lange2012batch) settings. BatchRL refers to the policy learning with static user logs. It can usually be used to evaluate the capability of utilizing the offline data. Growing BatchRL lies somewhere in between the BatchRL and online learning. The data can be recollected in a batch manner and used for policy update iteratively. Many recently proposed RLbased systems use a framework that is close to Growing Batch RL settings (zheng2018drn; zhao2018deep; chen2019top). More specifically, Growing Batch RL includes two periodic interleaving stages:
Data Collection. The agent uses the policy ${\pi}_{k}$ to interact with the environment (the users). The policy is kept static during this process. The collected interactive trajectories are denoted as ${\mathcal{D}}^{{\pi}_{k}}$.
Policy Update. The interactive trajectories ${\mathcal{D}}^{{\pi}_{k}}$ are used for training, the agent update the policy from ${\pi}_{k}$ to ${\pi}_{k+1}$.
The essential problems of BatchRL and Growing BatchRL are to guarantee the Policy Improvement, i.e., how much the performance of ${\pi}_{k+1}$ is improved compared with ${\pi}_{k}$.
3. Methodology
The fundamental idea of MBCAL is explained with Figure 2. We use two models to approximate the instant user behavior and the future advantage separately: the Masked Environment Model (MEM) and the Future Advantage Model (FAM). For the training, we start with optimizing the environment model to predict user behaviors, where we also introduce the masking item into the model. With the MEM, we can calculate the Counterfactual Future Advantage (CFA) by comparing the future utility of masking the action or not. CFA further serves as the label of FAM. For the inference, we use the combination of the two models to pick actions.
In this section, we first formalize the environment model, and then we describe MEM, FAM, and the learning process overall. They are followed by the theoretical analysis of the proposed method.
3.1. Environment Modeling
As the most common setting, an environment model predicts the transition and the reward separately. Here, we use $\widehat{\mu}({s}_{t},{a}_{t},{s}_{t+1})$ to approximate $\mu ({s}_{t+1}{s}_{t},{a}_{t})$ and $\widehat{r}({s}_{t},{a}_{t})$ to approximate the reward $r({s}_{t},{a}_{t})$. Specifically, to derive the formulation of environment model in RS, we use Equation (1), and $\mu ({s}_{t+1}{s}_{t},{a}_{t})$ can be rewritten by Equation (3).
$\mu ({s}_{t+1}{s}_{t},{a}_{t})$  $=\mu (\underset{[1:t]}{\text{\mathbf{o}}}\underset{[1:t1]}{\text{\mathbf{o}}},{a}_{t}^{\text{\mathbf{o}}})$  
$=\mu (\underset{[1:t1]}{\text{\mathbf{o}}},{a}_{t}^{\text{\mathbf{o}}},{b}_{t}^{\text{\mathbf{o}}}\underset{[1:t1]}{\text{\mathbf{o}}},{a}_{t}^{\text{\mathbf{o}}})$  
(3)  $=\mu ({b}_{t}^{\text{\mathbf{o}}}\underset{[1:t1]}{\text{\mathbf{o}}},{a}_{t}^{\text{\mathbf{o}}})$ 
In other words, the prediction of transition degenerates into the prediction of instant user behaviors. Notice that reward is dependent on the user behavior only as well, thus it is possible to use only one model in place of $\widehat{\mu}$ and $\widehat{r}$. We introduce ${f}_{\varphi}({\text{\mathbf{o}}}_{[1:t1]},{a}_{t}^{\text{\mathbf{o}}},{b}_{t}^{\text{\mathbf{o}}})$ with trainable parameters $\varphi $ to approximate $\mu ({b}_{t}^{\text{\mathbf{o}}}{\text{\mathbf{o}}}_{[1:t1]},{a}_{t}^{\text{\mathbf{o}}})$, which predicts the probability of the next user behavior being ${b}_{t}^{\text{\mathbf{o}}}$. The transition and the reward are then naturally approximated with
(4)  $\widehat{\mu}({s}_{t},{a}_{t},{s}_{t+1})={f}_{\varphi}(\underset{[1:t1]}{\text{\mathbf{o}}},{a}_{t}^{\text{\mathbf{o}}},{b}_{t}^{\text{\mathbf{o}}}).$  
(5)  $\widehat{r}({s}_{t},{a}_{t})={\displaystyle \sum _{n}}\underset{n}{\mathcal{R}}\cdot {f}_{\varphi}(\underset{[1:t1]}{\text{\mathbf{o}}},{a}_{t}^{\text{\mathbf{o}}},\underset{n}{\mathcal{B}}),$ 
3.2. Masked Environment Model
To eliminate the intractable noises hidden in the feedbacks, we introduce a masking item into the model. The motivation of this is to try to find a counterfactual comparison to the current trajectory, which answers the question: ”If this action was not taken, what would the future behavior be like?” To do so, we introduce a virtual item ${a}_{M}$, which is represented by a trainable embedding vector. For convenience, given an observation trajectory $\text{\mathbf{o}}$, we denote the trajectory where the actions at positions $y=\{{t}_{1},{t}_{2},\mathrm{\dots}\}$ are replaced by ${a}_{M}$ as $\mathcal{M}(\text{\mathbf{o}},y,{a}_{M})$. For example, suppose $y=\{2,3\}$, it gives a masked trajectory of
$$\begin{array}{cc}& \mathcal{M}(\text{\mathbf{o}},\{2,3\},{a}_{M})=({a}_{1}^{\text{\mathbf{o}}},{b}_{1}^{\text{\mathbf{o}}},{a}_{M},{b}_{2}^{\text{\mathbf{o}}},{a}_{M},{b}_{3}^{\text{\mathbf{o}}},{a}_{4}^{\text{\mathbf{o}}},{b}_{4}^{\text{\mathbf{o}}},\mathrm{\dots}),\hfill \end{array}$$ 
where actions ${a}_{2}^{\text{\mathbf{o}}}$ and ${a}_{3}^{\text{\mathbf{o}}}$ in $\text{\mathbf{o}}$ are replaced by ${a}_{M}$.
The training is straightforward. We sample random positions ${y}^{\text{\mathbf{o}}}$ for each trajectory $\text{\mathbf{o}}$, such that each position has uniform probability of ${p}_{\text{mask}}$ to be replaced. We want the MEM to recover the user behavior as close as possible in case some items are masked. With the collected masked trajectories ${\mathcal{D}}_{M}=\{\mathcal{M}(\text{\mathbf{o}},{y}^{\text{\mathbf{o}}},{a}_{M})\text{\mathbf{o}}\in \mathcal{D}\}$, we maximize the likelihood, or minimize the negative loglikelihood (NLL).
(6)  $$\begin{array}{cc}& \underset{\text{MEM}}{\mathcal{L}}(\varphi )=\frac{1}{{\mathcal{D}}_{M}}\sum _{\text{\mathbf{o}}\in {\mathcal{D}}_{M}}\frac{1}{T}\sum _{t=1}^{T}[\mathrm{log}{f}_{\varphi}(\underset{[:t1]}{\text{\mathbf{o}}},{a}_{t}^{\text{\mathbf{o}}},{b}_{t}^{\text{\mathbf{o}}})].\hfill \end{array}$$ 
To model the sequential observations, the architecture of MEM follows that of sessionbased recurrent RS((Hidasi2015Session; hidasi2017recurrent)). We use Gated Neural Network(Graves2013Speech) to encode the trajectory ${\text{\mathbf{o}}}_{[1:t1]}$. As we need to encode ${\text{\mathbf{o}}}_{[1:t1]}$ and ${a}_{t}^{\text{\mathbf{o}}}$ at the same time, we concatenate the input in a staggered way, equivalent to the setting of (santoro2016meta). For each step $t$, the model takes ${b}_{t1}^{\text{\mathbf{o}}}$ and ${a}_{t}^{\text{\mathbf{o}}}$ as input and output the probability of the next possible behavior. An additional ${b}_{s}$ is introduced as the start of the observed user behavior (see Figure 3). Concretely, the architecture is formulated as follows.
(7)  ${h}_{0}^{MEM}=\text{\mathbf{E}\mathbf{m}\mathbf{b}}(u),$  
(8)  ${x}_{t}^{MEM}=\text{\mathbf{C}\mathbf{o}\mathbf{n}\mathbf{c}\mathbf{a}\mathbf{t}}(\text{\mathbf{E}\mathbf{m}\mathbf{b}}({b}_{t1}),\text{\mathbf{E}\mathbf{m}\mathbf{b}}({a}_{t})),$  
(9)  ${h}_{t}^{MEM}=\text{\mathbf{G}\mathbf{R}\mathbf{U}}({h}_{t1}^{MEM},{x}_{t}^{MEM}),$  
(10)  ${f}_{\varphi}=\text{\mathbf{S}\mathbf{o}\mathbf{f}\mathbf{t}\mathbf{m}\mathbf{a}\mathbf{x}}(\text{\mathbf{M}\mathbf{L}\mathbf{P}}({h}_{t}^{MEM})).$ 
Here Emb denotes a representation layer; Concat denotes a concat operation and MLP denotes multilayer perceptron; GRU represents a Gated Recurrent Unit.
3.3. Counterfactual Future Advantage
With the MEM, we can estimate the difference in future utilities between the original trajectory and the counterfactual comparison, namely the Counterfactual Future Advantage (CFA). Specifically, given the trained MEM ${f}_{\varphi}$, we first define the Simulated Future Reward (SFR, denoted with ${\widehat{R}}^{\text{future}}$) of the observed trajectory $\text{\mathbf{o}}$ at time step $t\in [1,T]$ as
(11)  $$\begin{array}{c}\hfill {\widehat{R}}_{\varphi}^{\text{future}}(\text{\mathbf{o}},t)=\sum _{\tau =t+1}^{T}{\gamma}^{\tau t}{r}_{\varphi}(\underset{[1:\tau 1]}{\text{\mathbf{o}}},{a}_{\tau}^{\text{\mathbf{o}}}).\end{array}$$ 
We then calculate CFA (denoted with ${\widehat{A}}^{\text{future}}$) by subtracting the SFR of counterfactual comparison from the original one, see Equation (12).
(12)  $$\begin{array}{cc}& {\widehat{A}}_{\varphi}^{\text{future}}(\text{\mathbf{o}},t)={\widehat{R}}_{\varphi}^{\text{future}}(\text{\mathbf{o}},t){\widehat{R}}_{\varphi}^{\text{future}}(\mathcal{M}(\text{\mathbf{o}},\{t\},{a}_{M}),t).\hfill \end{array}$$ 
Finally, we introduce the Future Advantage Model (FAM) denoted with ${g}_{\eta}({\text{\mathbf{o}}}_{[1:t1]},{a}_{t})$, with trainable parameters $\eta $. To train FAM, we minimize the mean square error shown in Equation (13).
$\underset{\text{FAM}}{\mathcal{L}}(\eta )=$  ${\displaystyle \frac{1}{\mathcal{D}}}{\displaystyle \sum _{(\text{\mathbf{o}})\in \mathcal{D}}}{\displaystyle \frac{1}{T}}{\displaystyle \sum _{t=1}^{T}}[{g}_{\eta}(\underset{[1:t1]}{\text{\mathbf{o}}},{a}_{t}^{\text{\mathbf{o}}})$  
(13)  ${\widehat{A}}_{\varphi}^{\text{future}}(\text{\mathbf{o}},t)]{}^{2}.$ 
FAM takes the equivalent input as the MEM. We use the same neural architecture as MEM except for the last layer, but with different parameters. For the last layer FAM predicts a scalar (the advantage) instead of a distribution, shown as follows:
(14)  ${h}_{0}^{FAM}=\text{\mathbf{E}\mathbf{m}\mathbf{b}}(u),$  
(15)  ${x}_{t}^{FAM}=\text{\mathbf{C}\mathbf{o}\mathbf{n}\mathbf{c}\mathbf{a}\mathbf{t}}(\text{\mathbf{E}\mathbf{m}\mathbf{b}}({b}_{t1}),\text{\mathbf{E}\mathbf{m}\mathbf{b}}({a}_{t})),$  
(16)  ${h}_{t}^{FAM}=\text{\mathbf{G}\mathbf{R}\mathbf{U}}({h}_{t1}^{FAM},{x}_{t}^{FAM}),$  
(17)  ${g}_{\eta}=\text{\mathbf{M}\mathbf{L}\mathbf{P}}({h}_{t}^{FAM}).$ 
3.4. Summary of MBCAL
An integrate illustration of MBCAL is shown in Figure 3. For the inference, we select the item(action) based on both the MEM and FMA. Formally, given the user information $u$ and the observation trajectory ${\text{\mathbf{o}}}_{[1:t1]}$, we pick the next action according to Equation (18).
(18)  $${a}_{t}^{*}=argma{x}_{a}[{r}_{\varphi}(\underset{[1:t1]}{\text{\mathbf{o}}},a)+{g}_{\eta}(\underset{[1:t1]}{\text{\mathbf{o}}},a)].$$ 
To avoid the local optimum in policy improvement, we employ $\u03f5$greedy strategy (sutton2018reinforcement). With probability $\u03f5$, we select a random action, and with the left probability, we select according to Equation (18). It is written as
(19)  $${a}_{t}=\{\begin{array}{ccc}& {a}_{t}^{*},\hfill & \hfill withprobability\mathrm{\hspace{0.25em}1}\u03f5;\\ & randomactiona\in \underset{t}{\mathcal{A}},\hfill & \hfill withprobability\u03f5.\end{array}$$ 
MBCAL fits well with the Growing BatchRL settings. The summary of the algorithm is shown in Algorithm 3, with the function PolicyUpdate shown in Algorithm 3. Although we use the notation $\pi $ to represent the policy, we do not require any explicit formulation of the policy; The common policy gradient type algorithms require the explicit form of policy, which is hard to acquire in many recommender systems.
{algorithmic}[1] \StateInitial Policy ${\pi}_{0}$. \Fork = 0,1,2,… until convergence \StateUse policy ${\pi}_{k}$ to interact with users for $N$ trajectories, and record the interaction history ${\mathcal{D}}^{{\pi}_{k}}$. \State${f}_{\varphi},{g}_{\eta}=\text{PolicyUpdate}({\mathcal{D}}^{{\pi}_{k}})$. \StateSet ${\pi}_{k+1}$ according to Equation (19). \EndFor\StateReturn ${\pi}_{k+1}$ in the last iteration.
{algorithmic}[1] \StateInput: Interaction history $\mathcal{D}$. \StateRandomly masking $\mathcal{D}$ to acquire ${\mathcal{D}}_{M}$. \StateMinimize ${\mathcal{L}}_{\text{MEM}}$(Equation (6)) on ${\mathcal{D}}_{M}$ to optimize ${f}_{\varphi}$. \StateFor $\text{\mathbf{o}}\in \mathcal{D}$, calculate ${\widehat{A}}_{\text{\mathbf{o}},t}$ with $t\in [1,T]$ by Equation (11) and Equation (12) \StateMinimize ${\mathcal{L}}_{\text{FAM}}$(Equation (13)) on $\mathcal{D}$ to optimize ${g}_{\eta}$. \StateReturn ${f}_{\varphi},{g}_{\eta}$.
The essence of the variance reduction in MBCAL lies in Equation (12), where the subtraction eliminates the noises from user feedbacks and other sources. We borrow ideas from the advantage function (wang2015dueling), however, CFA is different from the advantage function in that we do not resample the trajectory but we keep the rest of trajectory (that is ${\text{\mathbf{o}}}_{[t+1:T]}$) unchanged. Although this could bring severe biases in many MDP problems, we argue that the recommender systems embrace weaker correlations between sequential decisions than the other problems (such as robot control and game control). Additionally, as the FAM averages out CFA across different trajectories, the bias turn out to be negligible compared with the benefits of reducing the variances.
4. Experiments
4.1. Datasets
Evaluation of RLbased recommender systems is challenging. The most convincing metric requires running online A/B tests, but it is not only too costly but also too risky to compare all the baselines in an online system. Offline evaluation of longterm utility using user logs is tricky as we can not have the feedback if the agent recommends a different item than what was stored in the log. In order to thoroughly study the performance of the proposed systems, we follow the previous works to build simulators (dulac2015deep; zhao2018deep; liu2019personalized; Eugene2019RecSim). But instead of synthetic simulators, we use realdatadriven simulators. The datasets used include: MovieLens ml20m^{1}^{1} 1 http://files.grouplens.org/datasets/movielens/ml20mREADME.html (harper2016movielens), Netflix Prize^{2}^{2} 2 https://www.kaggle.com/netflixinc/netflixprizedata and NewsFeed^{3}^{3} 3 Data collected from Baidu App News Feed System, as shown in Table 2. Details of the datasets are explained as follows.

•
MovieLens ml20m: The dataset describes 5star rating activities from MovieLens. The user behavior $\mathcal{B}=[0,1,2,3,4,5]$ corresponds to the star ratings, with the reward to be $\mathcal{R}=[0,1,2,3,4,5]$. There are 3 kinds of features (movieid, moviegenre and movietag).

•
Netflix Prize: The dataset is a 5star rating dataset from Netflix. The reward follow the setting of MovieLens. There are only 1 type of features (movieid).

•
NewsFeed: The dataset is collected from a real online news recommendation system. We focus on predicting the dwelling time on the clicked news. The dwelling time is partitioned into 12 levels (e.g., dwelling time ¡ 30, 30 ¡ dwelling time ¡ 40, …), corresponding to 12 different user behaviors, with the corresponding rewards to be $\mathcal{R}=[1,2,\mathrm{\dots},12]$. There are 7 kinds of features (newsid, newstag, newstitle, newscategory, newstopics, newstype, newssource).
4.2. Experimental Settings
4.2.1. Simulator Details
In order to fairly evaluate different methods, it is necessary to avoid the agent in the evaluated system to ”hack” the simulator. For this purpose, we add two special settings in the evaluation processes. First, all the agents to be evaluated are allowed to use only a subset of features, while the simulator uses the full feature set. In MovieLens and Netflix, only 1 feature (movieid) is used in the agents. In NewsFeed, 4 kinds out of 7 are used (newsid, category, newstype, and newssource). Second, we artificially set the model architecture of the simulator to be different from that of the agents. We use the LSTM unit for the simulators, while GRU is used in the agents. To get a view of how close the simulator is to the real environment, we list the microF1, weightedF1, and RMSE with respect to the accuracy of user behavior classification. Properties of datasets and simulators are shown in Table 2.^{4}^{4} 4 The source code can be found at: https://github.com/LihangLiu/MBCAL In NewsFeed, we also retrieved over $400$ historical AB test records online. Concerning longterm rewards, including total clicks or dwelling time of a session, the correlation of prediction of our simulators to the real case is $0.90+$.
Properties  MovieLens  Netflix  NewsFeed 

# of Users  130K  480K  920K 
# of Items  20K  17K  110K 
# of Different Labels  6  6  12 
# of Types of Features  3  1  7 
Size of Training Set  2.48M  4.53M  9.41M 
Size of Validation Set  1.27M  2.27M  4.70M 
Simulator MacroF1  0.545  0.511  0.923 
Simulator WeightedF1  0.532  0.498  0.887 
Simulator RMSE  0.770  0.848  1.810 
4.2.2. Evaluation Settings
There are two types of iterations in the evaluation: the training round and the test round. For a training round, the agent to be evaluated produces actions by using $\u03f5$greedy policy ($\u03f5=0.1$ throughout all experiments). It then updates its policy using the collected feedback from the simulator. In the test round, the algorithm produces actions by using a greedy policy, which is evaluated by the simulator. The data generated in the test round are not used in training. For each session in training or test rounds, we consider $T=20$ steps of interaction between the simulator and the agent. Each training round or test round includes 256,000 sessions.
For each experiment, we report the average reward per session in the test round, defined by $1/{\mathcal{D}}_{test}\cdot {\sum}_{\text{\mathbf{o}}\in {\mathcal{D}}_{test}}{\sum}_{t=1}^{T}{r}_{t}$, where ${\mathcal{D}}_{test}$ is the collection of trajectories in test round. Each experiment is repeated with different random seed for three times, the mean and variance of the score is reported. We also simulate the Batch RL and the Growing BatchRL evaluation separately (Figure 4). In the Batch RL evaluation, the agent can use only the static user log for training, and it interacts with the simulator for test. In Growing Batch RL evaluation, for both training round and test round, the agent needs to interact with the simulator. The training round repeats up to 40 times.
4.3. Methods for Comparison
We compare different methods ranging from Supervised Learning (GRU4Rec), bandits (GRU4Rec ($\u03f5$greedy)) to MFRL (MCPE, DQN, DDQN, and DDPG) and MBRL (DynaQ). For bandits, LinUCB (Chu2011Contextual) is commonly used as a baseline. However, in our environments, LinUCB performs poorly due to the insufficient representative power of the linear model. Thus, we post the results of $\u03f5$greedy version of NN models (GRU4Rec ($\u03f5$greedy)) instead of LinUCB.
The methods for comparison include:

•
GRU4Rec (Hidasi2015Session): It adopts GRU to encode the interactive history to predict the instant user behavior. The model architecture is equivalent to the environment model. We use the entropy losses in GRU4Rec.

•
GRU4Rec ($\u03f5$greedy): It applies the $\u03f5$greedy selection of items in GRU4Rec during the training rounds.

•
DQN (zheng2018drn): A classical offpolicy learning algorithm (mnih2015human). For state representation, to ensure a fair comparison between different learning algorithms, GRU is used to encode the historical observations, which is equivalent to GRU4Rec and our method.

•
DDQN (zheng2018drn): Double DQN (Hasselt2016DDQN) uses a different action selection for value backup to avoid the value overestimation in offpolicy learning. The model architecture is kept equivalent to GRU4Rec.

•
DDPG (zhao2018deep): Deep Deterministic Policy Gradient (DDPG) (lillicrap2015continuous) is an offpolicy learning algorithm for continuous action space. The inferred action is used to select the item that lies closest to it for display (nearest neighbor). We use the same neural structure as GRU4Rec for both the actor and critic networks in DDPG.

•
MCPE: Monte Carlo Policy Evaluation (Dimitrakakis2008Roll) is a straightforward value iteration algorithm. The Monte Carlo evaluation of the whole trajectory is used, i.e., we apply Equation (2) with ${\widehat{Q}}_{\text{target}}={\sum}_{t}{r}_{t}$. Again, we keep the model architecture equal to the other baselines.

•
DynaQ (liu2019personalized; zoupseudo): DynaQ is an MBRL method that augments DQN with the imagined rollouts from an environment model. The ratio of imagined rollouts to real trajectories is set to 1:1.

•
MBCAL: The full version of our method.

•
MBCAL (w/o variance reduction): It is an ablative version of MBCAL. We use SFR instead of CFA as the label of FAM, i.e., in Equation (13), ${\widehat{R}}_{\varphi}^{\text{future}}$ is used instead of ${\widehat{A}}_{\varphi}^{\text{future}}$.
All the parameters are optimized by adam optimizer with learning rate = ${10}^{3}$, ${\beta}_{1}=0.9$ and ${\beta}_{2}=0.999$. The decay factor in the longterm reward is set to be $\gamma =0.95$. The embedding sizes for the itemid and other id type features are all set to 32. The hidden size for MLP is set to 32. For training MEM in MBCAL, we use ${p}_{mask}=0.20$ to generate the masked trajectories ${D}_{M}$. In DDPG, we found that high dimensional action space yields really poor performance. Thus we use a 4dimensional action space. Correspondingly we use an additional layer to map the item representation into a 4dimensional vector.
4.4. Experimental Results
Algorithms  Average reward per session  

Movielens  Netflix  NewsFeed  
GRU4Rec  $77.93\pm 0.06$  $79.63\pm 0.02$  $11.58\pm 0.14$  
DDPG  $70.99\pm 0.70$  $72.50\pm 0.35$  $10.90\pm 0.42$  
DQN  $77.27\pm 0.06$  $77.75\pm 0.01$  $12.44\pm 0.33$  
DDQN  $77.23\pm 0.02$  $77.70\pm 0.04$  $12.48\pm 0.17$  
MCPE  $77.20\pm 0.10$  $77.70\pm 0.03$  $13.21\pm 0.53$  
DynaQ  $77.25\pm 0.05$  $77.81\pm 0.02$  $13.04\pm 0.33$  
MBCAL  $\mathbf{\text{78.02}}\pm 0.03$  $\mathbf{\text{79.71}}\pm 0.04$  $\mathbf{\text{16.32}}\pm 0.24$  

$77.70\pm 0.04$  $79.50\pm 0.04$  $15.61\pm 0.38$ 
4.4.1. Results of BatchRL Evaluation
The results on BatchRL evaluation are shown in Tab. 3. We evaluate the reward of a session according to the reward generated by the simulator. To conclude from the result, MFRL can not compete with MBRL in all three environments. As MFRL is sample inefficient, it tends to have poor startup performance. Surprisingly DDPG has the weakest performance in all three environments. By carefully investigating the value functions in DDPG, we found that DDPG overestimates the value function a lot compared with the other MFRL. We thought that the overestimation comes from value backups from continuous actions that may not correspond to realistic items. The overestimation problem in actorcritic methods has also been thoroughly investigated (Fujimoto18Addressing).
As is expected, the MBCAL leads the performance of all the tested systems with substantial margins, demonstrating its sample efficiency. However, for Movielens and Netflix, our method earns a smaller margin over the supervised learning method compared with that of NewsFeed. It is likely that the long term reward plays a more significant role in NewsFeed than the other two environments. Furthermore, as learning to predict longterm utility requires more data than the instant reward, the preponderance of RL has not yet been sufficiently revealed in BatchRL settings. However, it is essential that the performance of MBCAL at the start stage is already stateoftheart, which proves that MBCAL has low risks and high sample efficiency.
4.4.2. Results of Growing BatchRL Evaluation
Figure 5 shows the results of the Online Evaluation in three environments. GRU4Rec($\u03f5$greedy) surpasses the purely supervised learning GRU4Rec by a small margin in every environment, showing the benefit of exploration in online systems. Performances of DDPG in all three environments are again surprisingly bad. In Figure 5, we show DDPG curve in NewsFeed environment only because in the other two environments, DDPG lags too much behind all the other methods. We believe that continuous action space for recommendation systems with dynamic discrete item space can not work well enough.
With the assistance of the environment model, DynaQ gains some advantages at the beginning, but it gradually subsides as the learning continues. This phenomenon is just in line with the expectations, for the virtual experience quickly loses its benefit with the accumulation of sufficient real user feedback. MBCAL again keeps its performance ahead of the other methods in all the environments. Even for Netflix and Movielens, where the other RLbased system fails to gain any benefits over traditional GRU4Rec, MBCAL wins with a considerable margin. In NewsFeed, where the long term rewards play a more critical role, MBCAL strengthens the leading edge.
MCPE, DQN, DDQN, and DynaQ lag entirely behind the other methods, including supervised learning baselines in Movielens and Netflix environment, while this is not true in NewsFeed. We investigate the reason by setting the output of GRU4Rec to the instant reward instead of the user behavior classification, which turned the classification into regression, and the entropy loss to mean square error loss. We found a significant drop in performance in GRU4Rec, which is more consistent with the results in NewsFeed. The results show that classification and entropy loss benefit the system more than regressions. An explanation is that user behavior contains more abundant information than the rewards, which also made MBRL more advantageous than MFRL.
4.4.3. Analysis of the variance
The key point in MBCAL is the variance reduction through counterfactual comparisons. The previous proposals (Geman92Neural) suggest that the mean square error (MSE) in a welltrained model is composed of the model bias and the variance(noise) in the labels. As we use equivalent neural architectures in all the methods for comparison, they share the same model bias. Therefore the mean square error shall be dominated by the noise. To study whether CFA truly reduces the variance, we compare the MSE from Equation (2) and Equation (13). We compare the MSE of MCPE, DQN, DynaQ, MBCAL (w/o variance reduction), and MBCAL, based on the interactive logs collected in the test round of BatchRL evaluation. The average MSE is presented in Table 4.
Algorithms  MSE loss  

Movielens  Netflix  NewsFeed  
DQN  1.50  1.22  4.29  
MCPE  17.1  9.21  46.9  
DynaQ  0.94  1.04  7.87  
MBCAL  0.004  0.009  0.07  

3.45  3.29  3.07 
According to the previous theoretical analysis, using value backup of longer horizons suffer from more significant variance. The variance of MCPE is indeed higher than that of DQN and DynaQ, as backup of the whole trajectory is used. The MBCAL (w/o variance reduction) has the secondlargest variance. It is smaller compared with MCPE because using simulated rollout from the environment model already eliminates part of the noises. The variances of DQN and DynaQ are smaller because onestep value backup is employed. Compared with the other methods, MBCAL embraces significantly lower variances, which shows that variance has been reduced as expected.
5. Conclusion
To conclude this work, we are focused on the sequential decisionmaking problems in the recommender systems. To maximize the longterm utility, we propose a sample efficient and variance reduced reinforcement learning method: MBCAL. It involves a masked environment model to capture the instant user behavior, and a future advantage model to capture the future utility. Through counterfactual comparison, MBCAL significantly reduces the variance in learning. Experiments on realdatadriven simulations show that the proposed method transcends the previous ones in both sample efficiency and asymptotic performances. Possible future extensions to this work may be to theoretically calculating the error bound, and to extend the fixed horizon settings to infinite and dynamic horizon recommender systems.