Abstract
There are great interests as well as many challenges in applyingreinforcement learning (RL) to recommendation systems. In this setting, anonline user is the environment; neither the reward function nor the environmentdynamics are clearly defined, making the application of RL challenging. In thispaper, we propose a novel modelbased reinforcement learning framework forrecommendation systems, where we develop a generative adversarial network toimitate user behavior dynamics and learn her reward function. Using this usermodel as the simulation environment, we develop a novel Cascading DQN algorithmto obtain a combinatorial recommendation policy which can handle a large numberof candidate items efficiently. In our experiments with real data, we show thisgenerative adversarial user model can better explain user behavior thanalternatives, and the RL policy based on this model can lead to a betterlongterm reward for the user and higher click rate for the system.
Quick Read (beta)
Generative Adversarial User Model for
Reinforcement Learning Based Recommendation System
Abstract
There are great interests as well as many challenges in applying reinforcement learning (RL) to recommendation systems. In this setting, an online user is the environment; neither the reward function nor the environment dynamics are clearly defined, making the application of RL challenging. In this paper, we propose a novel modelbased reinforcement learning framework for recommendation systems, where we develop a generative adversarial network to imitate user behavior dynamics and learn her reward function. Using this user model as the simulation environment, we develop a novel Cascading DQN algorithm to obtain a combinatorial recommendation policy which can handle a large number of candidate items efficiently. In our experiments with real data, we show this generative adversarial user model can better explain user behavior than alternatives, and the RL policy based on this model can lead to a better longterm reward for the user and higher click rate for the system.
1 Introduction
Recommendation systems have become a crucial part of almost all online service platforms. A typical interaction between the system and its users is — users are recommended a page of items and they provide feedback, and then the system recommends a new page of items. A common way of building recommendation systems is to estimate a model which minimizes the discrepancy between the model prediction and the immediate user response according to some loss function. In other words, these models do not explicitly take into account the longterm user interest. However, user’s interest can evolve over time based on what she observes, and the recommender’s action may significantly influence such evolution. In some sense, the recommender is guiding users’ interest by displaying particular items and hiding the rest. Thus, it is more favorable to design a recommendation strategy, such as one based on reinforcement learning (RL), which can take users’ longterm interest into account. However, it is challenging to apply RL framework to recommendation system setting since the environment will correspond to the logged online user.
First, a user’s interest (reward function) driving her behavior is typically unknown, yet it is critically important for the use of RL algorithms. In existing RL algorithms for recommendation systems, the reward functions are manually designed (e.g. $\pm 1$ for click/noclick) which may not reflect a user’s preference over different items.
Second, modelfree RL typically requires lots of interactions with the environment in order to learn a good policy. This is impractical in the recommendation system setting. An online user will quickly abandon the service if the recommendation looks random and do not meet her interests. Thus, to avoid the large sample complexity of the modelfree approach, a modelbased RL approach is more preferable. In a related but a different setting where one wants to train a robot policy, recent works showed that modelbased RL is much more sample efficient (Nagabandi et al., 2017; Deisenroth et al., 2015; Clavera et al., 2018). The advantage of modelbased approaches is that potentially large amount of offpolicy data can be pooled and used to learn a good environment dynamics model, whereas modelfree approaches can only use expensive onpolicy data for learning. However, previous modelbased approaches are typically designed based on physics or Gaussian processes, and not tailored for complex sequences of user behaviors.
To address the above challenges, we propose a novel modelbased RL framework for recommendation systems, where a user behavior model and the associated reward function are learned in unified minimax framework, and then RL policies are learned using this model. Our main technical contributions are:

1.
We develop a generative adversarial learning (GAN) formulation to model user behavior dynamics and recover her reward function. These two components are estimated simultaneously via a joint minimax optimization algorithm. The benefits of our formulation are: (i) a better predictive user model can be obtained, and the reward function are learned in a consistent way with the user model; (ii) the learned reward allows later reinforcement learning to be carried out in a more principled way, rather than relying on manually designed reward; (ii) the learned user model allows us to perform modelbased RL and online adaptation for new users to achieve better results.

2.
Using this model as the simulation environment, we develop a cascading DQN algorithm to obtain a combinatorial recommendation policy. The cascading design of actionvalue function allows us to find the best subset of items to display from a large pool of candidates with time complexity linear in number of candidates.
In our experiments with real data, we showed that this generative adversarial model is a better fit to user behavior in terms of heldout likelihood and click prediction. Based on the learned user model and reward, we show that the estimated recommendation policy leads to better cumulative longterm reward for the user. Furthermore, in the case of model mismatch, our modelbased policy can also quickly adapt to the new dynamics with a much fewer number of user interactions compared to modelfree approaches.
2 Related Work
Commonly used recommendation algorithms use a simple user model. For instance, Wide&Deep networks (Cheng et al., 2016) and other methods such as XGBOOST (Chen & Guestrin, 2016) and DFM (Guo et al., 2017) based on logistic regression assume a user chooses each item independently; Collaborative competitive filtering (Yang et al., 2011) takes into account the context of user’s choice but assumes that user’s choices in each page view are independent. Sessionbased RNN (Hidasi et al., 2016) and sessionbased KNN (Jannach & Ludewig, 2017) improve upon previous approaches by modeling users’ history, but this model does not recover a users’ reward function and can not be used subsequently for reinforcement learning. Bandit based approaches, such as LinUCB (Li et al., 2010), can deal with adversarial user behaviors, but the reward is updated in a Bayesian framework and can not be directly in a reinforcement learning framework.
Zhao et al. (2018b, a); Zheng et al. (2018) used modelfree RL for recommender systems, which may require many user interactions and the reward function is manually designed. Modelbased reinforcement learning has been commonly used in robotics applications and resulted in reduced sample complexity to obtain a good policy (Deisenroth et al., 2015; Nagabandi et al., 2017; Clavera et al., 2018). However, user behaviors are sequences of discrete choices under a complex session context, very different from robotics models. Views resonating our work also emerge from literature (Shi et al., 2019) that user model estimation is essential for the success of RL in recommendation systems.
3 Setting and RL Formulation
We focus on a simplistic yet typical setting where the recommendation system (RS) and its user interact as follows:
Setting: RS displays $k$ items in one page to a user. The user provides feedback by clicking on one or none of these items, and then the system recommends a new page of $k$ items.
Our model can be easily extended to settings with more complex page views and user interactions.
Since reinforcement learning can take into account longterm reward, it holds the promise to improve users’ longterm engagement with an online platform. In the RL framework, a recommendation system aims to find a policy $\pi (\bm{s},\mathcal{I})$ to choose from a set $\mathcal{I}$ of items in user state $\bm{s}$, such that the cumulative expected reward is maximized,
$${\pi}^{\ast}={\mathrm{arg}\mathrm{max}}_{\pi ({\bm{s}}^{t},{\mathcal{I}}^{t})}\mathbb{E}\left[{\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}r({\bm{s}}^{t},{a}^{t})\right],$$  (1) 
where ${\bm{s}}^{0}\sim {p}^{0}$, ${\mathcal{A}}^{t}\sim \pi ({\bm{s}}^{t},{\mathcal{I}}^{t})$, ${\bm{s}}^{t+1}\sim P(\cdot {\bm{s}}^{t},{\mathcal{A}}^{t})$, ${a}^{t}\in {\mathcal{A}}^{t}$. The overall RL framework for recommendation is illustrated in Figure 1. Several key aspects are as follows:

(1)
Environment: will correspond to a logged online user who can click on one of the $k$ items displayed by the recommendation system in each page view (or interaction);

(2)
State ${s}^{t}\mathrm{\in}\mathrm{S}$: will correspond to an ordered sequence of a user’s historical clicks;

(3)
Action ${\mathrm{A}}^{t}\mathrm{\in}\mathrm{\left(}\genfrac{}{}{0pt}{}{{\mathrm{I}}^{t}}{k}\mathrm{\right)}$ of the recommender: will correspond to a subset of $k$ items chosen by the recommender from ${\mathcal{I}}^{t}$ to display to the user. $\left(\genfrac{}{}{0pt}{}{{\mathcal{I}}^{t}}{k}\right)$ means the set of all subsets of $k$ items of ${\mathcal{I}}^{t}$, where ${\mathcal{I}}^{t}\subset \mathcal{I}$ are available items to recommend at time $t$.

(4)
State Transition $P\mathrm{(}\mathrm{\cdot}\mathrm{}{s}^{t}\mathrm{,}{\mathrm{A}}^{t}\mathrm{)}\mathrm{:}\mathrm{S}\mathrm{\times}\mathrm{\left(}\genfrac{}{}{0pt}{}{\mathrm{I}}{k}\mathrm{\right)}\mathrm{\mapsto}\mathrm{P}\mathrm{(}\mathrm{S}\mathrm{)}$: will correspond to a user behavior model which returns the transition probability for ${\bm{s}}^{t+1}$ given previous state ${\bm{s}}^{t}$ and the set of items ${\mathcal{A}}^{t}$ displayed by the system. It is equivalent to the distribution $\varphi ({\bm{s}}^{t},{\mathcal{A}}^{t})$ over a user’s actions, which is defined in our user model in section 4.1.

(5)
Reward Function $r$ $({\bm{s}}^{t},{\mathcal{A}}^{t},{a}^{t}):\mathcal{S}\times \left(\genfrac{}{}{0pt}{}{\mathcal{I}}{k}\right)\times \mathcal{I}\mapsto \mathbb{R}$: will correspond to a user’s utility or satisfaction after making her choice ${a}^{t}\in {\mathcal{A}}^{t}$ in state ${\bm{s}}^{t}$. Here we assume that the reward to the recommendation system is the same as the user’s utility. Thus, a recommendation algorithm which optimizes its longterm reward is designed to satisfy the user in a long run. One can also include the company’s benefit to the reward, but we will focus on users’ satisfaction.

(6)
Policy ${\mathrm{A}}^{t}\mathrm{\sim}\pi \mathbf{}\mathrm{(}{s}^{t}\mathrm{,}{\mathrm{I}}^{t}\mathrm{)}\mathrm{:}\mathrm{S}\mathrm{\times}{\mathrm{2}}^{\mathrm{I}}\mathrm{\mapsto}\mathrm{P}\mathbf{}\mathrm{(}\mathrm{\left(}\genfrac{}{}{0pt}{}{\mathrm{I}}{k}\mathrm{\right)}\mathrm{)}$: will correspond to a recommendation strategy which returns the probability of displaying a subset ${\mathcal{A}}^{t}$ of ${\mathcal{I}}^{t}$ in user state ${\bm{s}}^{t}$.
Remark. In the above mapping, Environment, State and State Transition are associated with the user, Action and Policy are associated with the recommendation system, and Reward Function is associated with both of them. Here we use the notation $r({\bm{s}}^{t},{\mathcal{A}}^{t},{a}^{t})$ to emphasize the dependency of the reward on the recommendation action, as the user can only choose from the display set. However, the value of the reward is determined by the user’s state and the clicked item once the item occurs in the display set ${\mathcal{A}}^{t}$. In fact, $r({\bm{s}}^{t},{\mathcal{A}}^{t},{a}^{t})=r({\bm{s}}^{t},{a}^{t})\cdot \mathrm{\U0001d7cf}({a}^{t}\in {\mathcal{A}}^{t})$. Thus, in section 4.1 where we discuss the user model, we simply denote $r({\bm{s}}^{t},{a}^{t})=r({\bm{s}}^{t},{\mathcal{A}}^{t},{a}^{t})$ and assume ${a}^{t}\in {\mathcal{A}}^{t}$ is true.
Since both the reward function and the state transition model are unknown, we need to learn them from data. Once they are learned, the optimal policy ${\pi}^{\ast}$ in Eq. (1) can be estimated by repeated querying the model using algorithms such as Qlearning (Watkins, 1989). In the next two sections, we will explain our formulation for the user behavior model and the reward function, and design an efficient algorithm for learning the RL recommendation policy.
4 Generative Adversarial User Model
We propose a model to imitate users’ sequential choices and discuss its parameterization and estimation. The formulation is inspired by imitation learning, a powerful tool for learning sequential decisionmaking policies from expert demonstrations (Abbeel & Ng, 2004; Ho et al., 2016; Ho & Ermon, 2016; Torabi et al., 2018). We formulate a unified minimax optimization to learn user behavior model and reward function simultaneously based on sample trajectories.
4.1 User Behavior As Reward Maximization
We model user behavior based on two realistic assumptions. (i) Users are not passive. Instead, given a set of $k$ items, a user will make a choice to maximize her own reward. The reward $r$ measures how much she will be satisfied with or interested in an item. Alternatively, the user can choose not to click on any items. Then she will receive the reward of not wasting time on boring items. (ii) The reward depends not only on the selected item but also on the user’s history. For example, a user may not be interested in Taylor Swift’s song at the beginning, but once she happens to listen to it, she may like it and then becomes interested in her other songs. Also, a user can get bored after listening to Taylor Swift’s songs repeatedly. In other words, a user’s evaluation of the items varies in accordance with her personal experience.
To formalize the model, we consider both the clicked item and the state of the user as the inputs to the reward function $r({\bm{s}}^{t},{a}^{t})$, where the clicked item is the user’s action ${a}^{t}$ and the user’s history is captured in her state ${\bm{s}}^{t}$ (nonclick is treated as a special item/action). Suppose in session $t$, the user is presented with $k$ items ${\mathcal{A}}^{t}=\{{a}_{1},\mathrm{\cdots},{a}_{k}\}$ and their associated features $\{{\bm{f}}_{1}^{t},\mathrm{\cdots},{\bm{f}}_{k}^{t}\}$ by the system. She will take an action ${a}^{t}\in {\mathcal{A}}^{t}$ according to a strategy ${\varphi}^{*}$ which can maximize her expected reward. More specifically, this strategy is a probability distribution over ${\mathcal{A}}^{t}$, which is the result of the optimization problem below
Generative User Model:
$${\varphi}^{*}({\bm{s}}^{t},{\mathcal{A}}^{t})=\mathrm{arg}\underset{\varphi \in {\mathrm{\Delta}}^{k1}}{\mathrm{max}}{\mathbb{E}}_{\varphi}\left[r({\bm{s}}^{t},{a}^{t})\right]R(\varphi )/\eta ,$$
(2)
where ${\mathrm{\Delta}}^{k1}$ is the probability simplex, and $R(\varphi )$ is a convex regularization function to encourage exploration, and $\eta $ controls the strength of the regularization.
Model interpretation. If we use the negative Shannon entropy as the regularizer, we can obtain an interpretation of our user model from the perspective of explorationexploitation tradeoff (See Appendix A for a proof).
Lemma 1.
Let the regularization term in Eq. (2) be $R\mathit{}\mathrm{(}\varphi \mathrm{)}\mathrm{=}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}{\varphi}_{i}\mathit{}\mathrm{log}\mathit{}{\varphi}_{i}$. Then the optimal solution ${\varphi}^{\mathrm{*}}$ for the problem in Eq. (2) has a closed form
$${\varphi}^{\ast}{({\bm{s}}^{t},{\mathcal{A}}^{t})}_{i}=\mathrm{exp}(\eta r({\bm{s}}^{t},{a}_{i}))/{\sum}_{{a}_{j}\in {\mathcal{A}}^{t}}\mathrm{exp}(\eta r({\bm{s}}^{t},{a}_{j})).$$ 
Furthermore, in each session $t$, the user’s optimal policy ${\varphi}^{\mathrm{*}}$ is equivalent to the following discrete choice model where ${\epsilon}^{t}$ follows a Gumbel distribution.
$${a}^{t}={\mathrm{arg}\mathrm{max}}_{a\in {\mathcal{A}}^{t}}\eta r({\bm{s}}^{t},a)+{\epsilon}^{t}.$$  (3) 
This lemma makes it clear that the user greedily picks an item according to the reward function (exploitation), and yet the Gumbel noise ${\epsilon}^{t}$ allows the user to explore other less rewarding items. Similar models have appeared in econometric choice models (Manski, 1975; McFadden, 1973). The regularization parameter $\eta $ is an explorationexploitation tradeoff parameter. It can be easily seen that with a smaller $\eta $, the user is more exploratory. Thus, $\eta $ reveals a part of users’ character. In practice, we simply set the value $\eta =1$ in our experiments, since it is implicitly learned in the reward $r$, which is a function of various features of a user.
Remark. (i) Other regularization $R(\varphi )$ can also be used in our framework, which may induce different user behaviors. In these cases, the relations between ${\varphi}^{*}$ and $r$ are also different, and may not appear in the closed form. (ii) The case where the user does not click any items can be regarded as a special item which is always in the display set ${\mathcal{A}}^{t}$. It can be defined as an item with zero feature vector, or, alternatively, its reward value can be defined as a constant to be learned.
4.2 Model Parameterization
We will represent the state ${\bm{s}}^{t}$ as an embedding of the historical sequence of items clicked by the user before session $t$, and then we will define the reward function $r({\bm{s}}^{t},{a}^{t})$ based on the state and the embedding of the current action ${a}^{t}$.
First, we will define the state of the user as ${\bm{s}}^{t}:=h({\bm{F}}_{\ast}^{1:t1}:=[{\bm{f}}_{\ast}^{1},\mathrm{\cdots},{\bm{f}}_{\ast}^{t1}])$, where each ${\bm{f}}_{\ast}^{\tau}\in {\mathbb{R}}^{d}$ is the feature vector of the clicked item at session $\tau $ and $h(\cdot )$ is an embedding function. One can also define a truncated $M$step sequence as ${\bm{F}}_{\ast}^{tm:t1}:=[{\bm{f}}_{\ast}^{tm},\mathrm{\cdots},{\bm{f}}_{\ast}^{t1}]$. For the state embedding function $h(\cdot )$, we propose a simple and effective position weighting scheme. Let $\bm{W}\in {\mathbb{R}}^{m\times n}$ be a matrix where the number of rows $m$ corresponds to a fixed number of historical steps, and each column corresponds to one set of importance weights on positions. Then the embedding function $h\in {\mathbb{R}}^{dn\times 1}$ can be designed as
$${\bm{s}}^{t}=h({\bm{F}}_{\ast}^{tm:t1}):=vec[\sigma \left({\bm{F}}_{*}^{tm:t1}\bm{W}+\bm{B}\right)],$$  (4) 
where $\bm{B}\in {\mathbb{R}}^{d\times n}$ is a bias matrix, and $\sigma (\cdot )$ is a nonlinear activation such as ReLU and ELU, and $vec[\cdot ]$ turns the input matrix into a long vector by concatenating the matrix columns. Alternatively, one can also use an LSTM to capture the history. However, the advantage of the position weighting scheme is that the history embedding is produced by a shallow network. It is more efficient for forwardcomputation and gradient backpropagation.
(a)  (b)  (c) 
Next, we define the reward function and the user behavior model. A user’s choice ${a}^{t}\in {\mathcal{A}}^{t}$ will correspond to an item with feature ${\bm{f}}_{{a}^{t}}^{t}$, which will be used to parameterize the reward function and user behavior model as
$$r({\bm{s}}^{t},{a}^{t}):={\bm{v}}^{\top}\sigma \left(\bm{V}{\left[\begin{array}{c}\hfill {({\bm{s}}^{t})}^{\top},{({\bm{f}}_{{a}^{t}}^{t})}^{\top}\hfill \end{array}\right]}^{\top}+\bm{b}\right)\text{and}$$ 
$$\varphi (\bm{s},{\mathcal{A}}^{t})\propto \mathrm{exp}\left(\bm{v}^{\prime}{}^{\top}\sigma \left({\bm{V}}^{\prime}{\left[\begin{array}{c}\hfill {({\bm{s}}^{t})}^{\top},{({\bm{f}}_{{a}^{t}}^{t})}^{\top}\hfill \end{array}\right]}^{\top}+{\bm{b}}^{\prime}\right)\right),$$ 
where $\bm{V},{\bm{V}}^{\prime}\in {\mathbb{R}}^{\mathrm{\ell}\times (dn+d)}$ are weight matrices, $\bm{b},{\bm{b}}^{\prime}\in {\mathbb{R}}^{\mathrm{\ell}}$ are bias vectors , and $\bm{v},{\bm{v}}^{\prime}\in {\mathbb{R}}^{\mathrm{\ell}}$ are the final regression parameters. See Figure 2 for an illustration of the overall parameterization. For simplicity of notation, we will denote the set of all parameters in the reward function as $\theta $ and the set of all parameters in the user model as $\alpha $, and hence the notation ${r}_{\theta}$ and ${\varphi}_{\alpha}$ respectively.
4.3 Generative Adversarial Training
Both the reward function $r({\bm{s}}^{t},{a}^{t})$ and the behavior model $\varphi ({\bm{s}}^{t},{\mathcal{A}}^{t})$ are unknown and need to be estimated from the data. The behavior model $\varphi $ tries to mimic the action sequences provided by a real user who acts to maximize her reward function $r$. In analogy to generative adversarial networks, (i) $\varphi $ acts as a generator which generates the user’s next action based on her history, and (ii) $r$ acts as a discriminator which tries to differentiate the user’s actual actions from those generated by the behavior model $\varphi $. Thus, inspired by the GAN framework, we estimate $\varphi $ and $r$ simultaneously via a minimax formulation. More precisely, given a trajectory of $T$ observed actions $\{{a}_{true}^{1},{a}_{true}^{2},\mathrm{\dots},{a}_{true}^{T}\}$ of a user and the corresponding clicked item features $\{{\bm{f}}_{\ast}^{1},{\bm{f}}_{\ast}^{2},\mathrm{\dots},{\bm{f}}_{\ast}^{T}\}$, we learn $r$ and $\varphi $ jointly by solving the following minimax optimization
Generative Adversarial Training:
$\underset{\theta}{\mathrm{min}}\underset{\alpha}{\mathrm{max}}({\mathbb{E}}_{{\varphi}_{\alpha}}$
$\left[{\sum}_{t=1}^{T}{r}_{\theta}({\bm{s}}_{true}^{t},{a}^{t})\right]R({\varphi}_{\alpha})/\eta )$
${\sum}_{t=1}^{T}{r}_{\theta}({\bm{s}}_{true}^{t},{a}_{true}^{t}),$
(5)
where
we use ${\bm{s}}_{true}^{t}$ to emphasize that this is observed in the data. From the above optimization, one can see that the reward ${r}_{\theta}$ will extract some statistics from both real user actions and model user actions, and try to magnify their difference (or make their negative gap larger). In contrast, the user model ${\varphi}_{\alpha}$ will try to make the difference smaller, and hence more similar to the real user behavior. Alternatively, the minimax optimization can also be interpreted as a game between an adversary and a learner where the adversary tries to minimize the reward of the learner by adjusting ${r}_{\theta}$, while the learner tries to maximize its reward by adjusting ${\varphi}_{\alpha}$ to counteract the adversarial moves. This gives the user behavior training process a largemargin training flavor, where we want to learn the best model even for the worst scenario.
For general regularization function $R({\varphi}_{\alpha})$, the optimal solution in Eq. (4.3) does not have a closed form, and typically needs to be solved by alternatively updating ${\varphi}_{\alpha}$ and ${r}_{\theta}$, e.g.
$$\{\begin{array}{cc}\alpha \leftarrow \alpha +{\gamma}_{1}{\nabla}_{\alpha}{\mathbb{E}}_{{\varphi}_{\alpha}}\left[{\sum}_{t=1}^{T}{r}_{\theta}\right]{\gamma}_{1}{\nabla}_{\alpha}R\left({\varphi}_{\alpha}\right)/\eta ;\hfill & \\ \theta \leftarrow \theta {\gamma}_{2}{\mathbb{E}}_{{\varphi}_{\alpha}}\left[{\sum}_{t=1}^{T}{\nabla}_{\theta}{r}_{\theta}\right]+{\gamma}_{2}{\sum}_{t=1}^{T}{\nabla}_{\theta}{r}_{\theta}.\hfill & \end{array}$$  (6) 
The process may be unstable due to the nonconvexity nature of the problem. To stabilize the training process, we will leverage a special regularization for initializing the training process. More specifically, for entropy regularization, we can obtain a closed form solution to the innermaximization for user behavior model, which makes the learning of reward function easy (See lemma 2 below and Appendix A for a proof). Once the reward function is learned for entropy regularization, it can be used to initialize the learning in the case of other regularization functions which may induce different user behavior models and final rewards.
Lemma 2.
When $R\mathit{}\mathrm{(}\varphi \mathrm{)}\mathrm{=}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}{\varphi}_{i}\mathit{}\mathrm{log}\mathit{}{\varphi}_{i}$, the optimization problem in Eq. (4.3) is equivalent to the following maximum likelihood estimation
$$\underset{\theta \in \mathrm{\Theta}}{\mathrm{max}}\prod _{t=1}^{T}\frac{\mathrm{exp}(\eta {r}_{\theta}({\bm{s}}_{true}^{t},{a}_{true}^{t}))}{{\sum}_{{a}^{t}\in {\mathcal{A}}^{t}}\mathrm{exp}(\eta {r}_{\theta}({\bm{s}}_{true}^{t},{a}^{t}))}.$$ 
5 Cascading RL Policy for Recommendation
Using the estimated user behavior model $\varphi $ and the corresponding reward function $r$ as the simulation environment, we can then use reinforcement learning to obtain a recommendation policy. The recommendation policy needs to choose from a combinatorial action space $\left(\genfrac{}{}{0pt}{}{\mathcal{I}}{k}\right)$, where each action is a subset of $k$ items chosen from a larger set $\mathcal{I}$ of $K$ candidates. Two challenges associated with this problem include the potentially high computational complexity of the combinatorial action space and the development of a framework for estimating the longterm reward (the Q function) from a combination of items. Our contribution is designing a novel cascading Qnetworks to handle the combinatorial action space, and an algorithm to estimate the parameters by interacting with the GAN user model.
5.1 Cascading QNetworks
We will use the Qlearning framework where an optimal actionvalue function ${Q}^{*}(\bm{s},\mathcal{A})$ will be learned and satisfies ${Q}^{*}({\bm{s}}^{t},{\mathcal{A}}^{t})=\mathbb{E}\left[r({\bm{s}}^{t},{\mathcal{A}}^{t},{a}^{t})+\gamma {\mathrm{max}}_{{\mathcal{A}}^{\prime}\subset \mathcal{I}}{Q}^{*}({\bm{s}}^{t+1},{\mathcal{A}}^{\prime})\right]$, ${a}^{t}\in {\mathcal{A}}^{t}$. Once the actionvalue function is learned, an optimal policy for recommendation can be obtained as
$${\pi}^{\ast}({\bm{s}}^{t},{\mathcal{I}}^{t})={\mathrm{arg}\mathrm{max}}_{{\mathcal{A}}^{t}\subset {\mathcal{I}}^{t}}{Q}^{*}({\bm{s}}^{t},{\mathcal{A}}^{t}),$$  (7) 
where ${\mathcal{I}}^{t}\subset \mathcal{I}$ is the set of items available at time $t$. The action space contains $\left(\genfrac{}{}{0pt}{}{K}{k}\right)$ many choices, which can be very large even for moderate $K$ (e.g. 1,000) and $k$ (e.g. 5). Furthermore, an item put in different combinations can have different probabilities of being clicked, which is indicated by the user model and is in line with reality. For instance, interesting items may compete with each other for a user’s attention. Thus, the policy in Eq. (7) will be very expensive to compute. To address this challenge, we will design not just one but a set of $k$ related Qfunctions which will be used in a cascading fashion for finding the maximum in Eq. (7).
Denote the recommender actions as $\mathcal{A}=\{{a}_{1:k}\}\subset \mathcal{I}$ and the optimal action as ${\mathcal{A}}^{*}=\{{a}_{1:k}^{*}\}=\mathrm{arg}{\mathrm{max}}_{\mathcal{A}}{Q}^{*}(\bm{s},\mathcal{A})$. Our cascading Qnetworks are inspired by the key fact:
$$\underset{{a}_{1:k}}{\mathrm{max}}{Q}^{*}(\bm{s},{a}_{1:k})=\underset{{a}_{1}}{\mathrm{max}}\left(\underset{{a}_{2:k}}{\mathrm{max}}{Q}^{*}(\bm{s},{a}_{1:k})\right).$$  (8) 
Also, there is a set of mutually consistent ${Q}^{1*},\mathrm{\dots},{Q}^{k*}$:
Cascading QNetworks:
${a}_{1}^{*}=\mathrm{arg}{\mathrm{max}}_{{a}_{1}}\{{Q}^{1*}(\bm{s},{a}_{1}):={\mathrm{max}}_{{a}_{2:k}}{Q}^{*}(\bm{s},{a}_{1:k})\},$
${a}_{2}^{*}=\mathrm{arg}{\mathrm{max}}_{{a}_{2}}\{{Q}^{2*}(\bm{s},{a}_{1}^{*},{a}_{2}):={\mathrm{max}}_{{a}_{3:k}}{Q}^{*}(\bm{s},{a}_{1:k})\},$
$\mathrm{\cdots}$
${a}_{k}^{*}=\mathrm{arg}{\mathrm{max}}_{{a}_{k}}\{{Q}^{k*}(\bm{s},{a}_{1:k1}^{*},{a}_{k}):={Q}^{*}(\bm{s},{a}_{1:k})\}.$
Thus, we can obtain an optimal action in $O$ $(k\mathcal{I})$ computations by applying these functions in a cascading manner. See Algorithm 1 and Figure 2(c) for a summary. However, this cascade of ${Q}^{j*}$ functions are usually not available and need to be estimated from the data.
5.2 Parameterization and Estimation
Each ${Q}^{j*}$ function is parameterized by a neural network
$${\bm{q}}_{j}^{\top}\sigma \left({\bm{L}}_{j}{[{\bm{s}}^{\top},{\bm{f}}_{{a}_{1}^{*}}^{\top},\mathrm{\dots},{\bm{f}}_{{a}_{j1}^{*}}^{\top},{\bm{f}}_{{a}_{j}}^{\top}]}^{\top}+{\bm{c}}_{j}\right),\forall j,$$  (9) 
where ${\bm{L}}_{j}\in {\mathbb{R}}^{\mathrm{\ell}\times (dn+dj)}$, ${\bm{c}}_{j}\in {\mathbb{R}}^{\mathrm{\ell}}$ and ${\bm{q}}_{j}\in {\mathbb{R}}^{\mathrm{\ell}}$ are the set ${\mathrm{\Theta}}_{j}$ of parameters, and we use the same embedding for the state $\bm{s}$ as in Eq. (4). Now the problem left is how we can estimate these functions. Note that the set of ${Q}^{j*}$ functions need to satisfy a large set of constraints. At the optimal point, the value of ${Q}^{j*}$ is the same as ${Q}^{*}$ for all $j$, i.e.,
$${Q}^{j*}(\bm{s},{a}_{1}^{*},\mathrm{\cdots},{a}_{j}^{*})={Q}^{*}(\bm{s},{a}_{1}^{*},\mathrm{\cdots},{a}_{k}^{*}),\forall j.$$  (10) 
It is not easy to strictly enforce these constraints, we take them into account in a soft and approximate way. That is, we define the loss as
${\left(y{Q}^{j}\right)}^{2},\text{where}$  
$y=r({\bm{s}}^{t},{\mathcal{A}}^{t},{a}^{t})+\gamma {Q}^{k}({\bm{s}}^{t+1},{a}_{1:k}^{*};{\mathrm{\Theta}}_{k}),\forall j.$  (11) 
All ${Q}^{j}$ networks are fitting against the same target $y$. Then the parameters ${\mathrm{\Theta}}_{k}$ can be updated by performing gradient steps over the above loss. We note that in our experiments the set of learned ${Q}^{j}$ networks satisfies the constraints nicely with a small error (Figure 5).
6 Experiments
We conduct three sets of experiments to evaluate our generative adversarial user model (called GAN user model) and the resulting RL recommendation policy. Our experiments are designed to investigate the following questions: (1) Can GAN user model lead to better user behavior prediction? (2) Can GAN user model lead to higher user reward and click rate? and (3) Can GAN user model help reduce the sample complexity of reinforcement learning?
Dataset and Feature Description. We use 6 realworld datasets: (1) MovieLens contains a large number of movie ratings, from which we randomly sample 1,000 active users. Each display set is simulated by collecting 39 movies released near the time the movie is rated. Movie features are collected from IMDB. Categorical and descriptive features are encoded as sparse and dense vectors respectively; (2) Last.fm contains listening records from 359,347 users. Each display set is simulated by collecting 9 songs with the nearest timestamp. (3) Yelp contains users’ reviews to various businesses. Each display set is simulated by collecting 9 businesses with the nearest location. (4) Taobao contains the clicking and buying records of users in 22 days. We consider the buying records as positive events. (5) RecSys15 YooChoose contains clickstreams that sometimes end with purchase events. (6) Ant Financial News dataset contains clicks records from 50,000 users for one month, involving dozens of thousands of news. On average each display set contains 5 news articles. It also contains useritem cross features which are widely used in this online platform.(More details in Appendix C)
6.1 Predictive Performance of User Model
To assess the predictive accuracy of GAN user model with position weight (GANPW) and LSTM (GANLSTM), we choose a series of most widely used or stateofthearts as the baselines, including: (1) W&DLR (Cheng et al., 2016), a wide & deep model with logistic regression loss function; (2) CCF (Yang et al., 2011), an advanced collaborative filtering model which takes into account the context information in the loss function; we further augment it with wide & deep feature layer (W&DCCF); (3) IKNN (Hidasi et al., 2015), one of the most popular itemtoitem solutions, which calculates items similarly according to the number of cooccurrences in sessions; (4) SRNN (Hidasi et al., 2016), a sessionbased RNN model with a pairwise ranking loss; (5) SCKNNC (Jannach & Ludewig, 2017), a strong methods which unify session based RNN and KNN by cascading combination; (6) XGBOOST (Chen & Guestrin, 2016), a parallel tree boosting; (7) DFM (Guo et al., 2017) is a deep neural factorizationmachine based on wide & deep features. SRNN (Hidasi et al., 2016), a sessionbased RNN model with a pairwise ranking loss; (8) SCKNNW (Jannach & Ludewig, 2017) and (9) SCKNNC (Jannach & Ludewig, 2017), two methods which unify SRNN and CKNN by weighted combination and cascading combination respectively; (10) XGBOOST (Chen & Guestrin, 2016), a parallel tree boosting, which is also known as GBDT and GBM.
Top$k$ precision ([email protected]$k$) is employed as the evaluation metric. It is the proportion of top$k$ ranked items at each page view that are actually clicked by the user, averaged across test page views and users. Users are randomly divided into train(50%), validation(12.5%) and test(37.5%) sets. The results in Table 1 show that GAN model performs significantly better than baselines. Moreover, GANPW performs nearly as well as GANLSTM, but it is more efficient to train. Thus we use GANPW for later experiments and refer to it as GAN.
(1) MovieLens  (2) LastFM  (3) Yelp  (4) Taobao  (5) YooChoose  (6) Ant Financial  
Model  prec(%)@1  prec(%)@2  prec(%)@1  prec(%)@2  prec(%)@1  prec(%)@2  prec(%)@1  prec(%)@2  prec(%)@1  prec(%)@2  prec(%)@1  prec(%)@2 
IKNN  38.8($\pm $1.9)  40.3($\pm $1.9)  20.4($\pm $0.6)  32.5($\pm $1.4)  57.7($\pm $1.8)  73.5($\pm $1.8)  32.8($\pm $2.6)  46.6($\pm $2.6)  39.3($\pm $1.5)  69.8($\pm $2.1)  20.6($\pm $0.2)  32.1($\pm $0.2) 
SRNN  39.3($\pm $2.7)  42.9($\pm $3.6)  9.4($\pm $1.6)  17.4($\pm $0.9)  67.8($\pm $1.4)  73.2($\pm $0.9)  32.7($\pm $1.7)  47.0($\pm $1.4)  41.8($\pm $1.2)  69.9($\pm $1.9)  32.2($\pm $0.9)  40.3($\pm $0.6) 
SCKNNC  49.4($\pm $1.9)  51.8($\pm $2.3)  21.4($\pm $0.5)  26.1($\pm $1.0)  60.3($\pm $4.5)  71.6($\pm $1.8)  35.7($\pm $0.4)  47.9($\pm $2.1)  40.8($\pm $2.5)  70.4($\pm $3.8)  34.6($\pm $0.7)  43.2($\pm $0.8) 
XGBOOST  66.7($\pm $1.1)  76.0($\pm $0.9)  10.2($\pm $2.6)  19.2($\pm $3.1)  64.1($\pm $2.1)  79.6($\pm $2.4)  30.2($\pm $2.5)  51.3($\pm $2.6)  60.8($\pm $0.4)  80.3($\pm $0.4)  41.9($\pm $0.1)  65.4($\pm $0.2) 
DFM  63.3($\pm $0.4)  75.9($\pm $0.3)  10.5($\pm $0.4)  20.4($\pm $0.1)  72.1($\pm $2.1)  80.3($\pm $2.1)  30.1($\pm $0.8)  48.5($\pm $1.1)  61.3($\pm $0.3)  82.5($\pm $1.5)  41.7($\pm $0.1)  64.2($\pm $0.2) 
W&DLR  61.5($\pm $0.7)  73.8($\pm $1.2)  7.6($\pm $2.9)  16.6($\pm $3.3)  62.7($\pm $0.8)  86.0($\pm $0.9)  34.0($\pm $1.1)  54.6($\pm $1.5)  51.9($\pm $0.8)  75.8($\pm $1.5)  37.5($\pm $0.2)  60.9($\pm $0.1) 
W&DCCF  65.7($\pm $0.8)  75.2($\pm $1.1)  15.4($\pm $2.4)  25.7($\pm $2.6)  73.2($\pm $1.8)  88.1($\pm $2.2)  34.9($\pm $1.1)  53.3($\pm $1.3)  52.1($\pm $0.5)  76.3($\pm $1.5)  37.7($\pm $0.1)  61.1($\pm $0.1) 
GANPW  66.6($\pm $0.7)  75.4($\pm $1.3)  24.1($\pm $0.8)  34.9($\pm $0.7)  72.0($\pm $0.2)  92.5($\pm $0.5)  34.7($\pm $0.6)  54.1($\pm $0.7)  52.9($\pm $0.7)  75.7($\pm $1.4)  41.9($\pm $0.1)  65.8($\pm $0.1) 
GANLSTM  67.4($\pm $0.5)  76.3($\pm $1.2)  24.0($\pm $0.9)  34.9($\pm $0.8)  73.0($\pm $0.2)  88.7($\pm $0.4)  35.9($\pm $0.6)  55.0($\pm $0.7)  52.7($\pm $0.3)  75.9($\pm $1.2)  42.1($\pm $0.2)  65.9($\pm $0.2) 
We also tested different types of regularization (Table 2). In general, Shannon entropy performs well and it is also favored for its closed form solution. However, on the Yelp dataset, we find that ${L}_{2}$ regularization $R(\varphi )={\parallel \varphi \parallel}_{2}^{2}$ leads to a better user model. It is noteworthy that the user model with ${L}_{2}$ regularization is trained with Shannon entropy initialization scheme proposed in section 4.3.
Split 1  Split 2  Split 3  

Model  [email protected]  [email protected]  [email protected]  [email protected]  [email protected]  [email protected] 
SE  73.1  88.8  72.8  89.0  73.1  88.2 
${L}_{2}$  73.5  89.0  78.8  91.5  76.1  91.1 
An interesting result on Movielens is shown in Figure 3 (see Appendix D.1 for similar figures). It shows GAN performs much better as time goes by, while the items predicted by W&DCCF are concentrated on several categories. This indicates a drawback of static models — it fails to capture user interest evolution.
6.2 Recommendation Policy Based on User Model
With a learned user model, we can immediately derive a greedy policy to recommend $k$ items with the highest estimated likelihood. We will compare the strongest baseline methods W&DLR, W&DCCF and GANGreedy in this setting. Furthermore, we will learn an RL policy using the cascading Qnetworks from section 5 (GANCDQN) and compare it with two RL methods: a cascading Qnetwork trained with $\pm 1$ reward (GANRWD1), and an additive Qnetwork policy (He et al., 2016), $Q(\bm{s},{a}_{1},\mathrm{\cdots},{a}_{k}):={\sum}_{j=1}^{k}Q(\bm{s},{a}_{j})$ trained with learned reward (GANGDQN).
Since we cannot perform online experiments at this moment, we use collected data from the online news platform to fit a user model, and then use it as a test environment. To make the experimental results trustful and solid, we fit the test model based on a randomly sampled test set of 1,000 users and keep this set isolated. The RL policies are learned from another set of 2,500 users without overlapping the test set. The performances are evaluated by two metrics: (1) Cumulative reward: For each recommendation action, we can observe a user’s behavior and compute her reward $r({\bm{s}}^{t},{a}^{t})$ using the test model. Note that we never use the reward of test users when we train the RL policy. The numbers shown in Table 3 are the cumulative rewards averaged over time horizon first and then averaged over all users. It can be formulated as $\frac{1}{N}{\sum}_{u=1}^{N}\frac{1}{T}{\sum}_{t=1}^{T}$${r}_{u}^{t}$, where ${r}_{u}^{t}$ is the reward received by user $u$ at time $t$. (2) CTR (click through rate): it is the ratio of the number of clicks and the number of steps it is run. The values displayed in Table 3 are also averaged over 1,000 test users.
$k=3$  $k=5$  
model  reward  ctr  reward  ctr 
W&DLR  14.46($\pm $0.42)  0.46($\pm $0.01)  15.18($\pm $0.38)  0.48($\pm $0.01) 
W&DCCF  19.93($\pm $1.09)  0.62($\pm $0.03)  20.94($\pm $1.03)  0.65($\pm $0.03) 
GANGreedy  21.37($\pm $1.24)  0.67($\pm $0.04)  22.97($\pm $1.22)  0.71($\pm $0.03) 
GANRWD1  22.17($\pm $1.07)  0.68($\pm $0.03)  25.15($\pm $1.04)  0.78($\pm $0.03) 
GANGDQN  23.60($\pm $1.06)  0.72($\pm $0.03)  23.19($\pm $1.17)  0.70($\pm $0.03) 
GANCDQN  24.05($\pm $0.98)  0.74($\pm $0.03)  25.36($\pm $1.10)  0.77($\pm $0.03) 
DQNOff  20.31($\pm $0.14)  0.63($\pm $0.01)  21.82($\pm $0.08)  0.67($\pm $0.01) 
Experiments with different numbers of items in each page view are conducted and the results are summarized in Table 3. Since users’ behaviors are not deterministic, each policy is evaluated repeatedly for 50 times on test users. The results show that: (1) Greedy policy built on GAN model is significantly better than the policies built on other models. (2) RL policy learned from GAN is better than the greedy policy. (3) Although GANCDQN is trained to optimize the cumulative reward, the recommendation policy also achieves a higher CTR compared to GANRWD1 which directly optimizes $\pm 1$ reward. The learning of GANCDQN may have benefited from the wellknown reward shaping effects of the learned continuous reward (Mataric, 1994; Ng et al., 1999; Matignon et al., 2006). (4) While the computational cost of GANCDQN is about the same as that of GANGDQN (both are linear in the total number of items), our proposed GANCDQN is a more flexible parametrization and achieved better results.
Since Table 3 only shows average values taken over test users, we compare the policies in user level and the results are shown in figure 4. GANCDQN policy results in higher averaged cumulative reward for most users. A similar figure which compares the CTR is deferred to Appendix D. Figure 5 shows that the learned cascading Qnetworks satisfy constraints in Eq. (10) well when $k=5$.
6.3 User Model Assisted Policy Adaptation
Former results in section 6.1 and 6.2 have demonstrated that GAN is a better user model and RL policy based on it can achieve higher CTR compared to other user models, but this user model may be misspecified. In this section, we show that our GAN model can help an RL policy to quickly adapt to a new user. The RL policy assisted by GAN user model is compared with other policies that are learned from and adapted to online users: (1) CDQN with GAN: cascading Qnetworks which are first trained using the learned GAN user model from other users and then adapted online to a new user using MAML (Finn et al., 2017). (2) CDQN model free: cascading Qnetworks without pretrained by the GAN model. It interacts with and adapts to online users directly. (3) LinUCB: a classic contextual bandit algorithm which assumes adversarial user behavior. We choose its stronger version  LinUCB with hybrid linear models (Li et al., 2010)  to compare with.
The experiment setting is similar to section 6.2. All policies are evaluated on a set of 1,000 test users associated with a test model. Two sets of results corresponding to different sizes of display set are plotted in Figure 6. It shows how the CTR increases as each policy interacts with and adapts to users over time. In fact, the performances of users’ cumulative reward according to different policies are also similar, and the corresponding figure is deferred to Appendix D.3.
The results show that the CDQN policy pretrained over a GAN user model can quickly achieve a high CTR even when it is applied to a new set of users (Figure 6). Without the user model, CDQN can also adapt to the users during its interaction with them. However, it takes around 1,000 iterations (i.e., 100,000 interactive data points) to achieve similar performance as the CDQN policy assisted by GAN user model. LinUCB(hybrid) is also capturing users’ interests during its interaction with users. Similarly, it takes too many interactions. In Appendix D.3, another figure is attached to compare the cumulative reward received by the user instead of CTR. Generally speaking, GAN user model provides a dynamical environment for RL policies to interact with. It helps the policy achieve a more satisfying status before applying to online users.
7 Conclusion and Future Work
We proposed a novel modelbased reinforcement learning framework for recommendation systems, where we developed a GAN formulation to model user behavior dynamics and her associated reward function. Using this user model as the simulation environment, we develop a novel cascading Qnetwork for combinatorial recommendation policy which can handle a large number of candidate items efficiently. Although the experiments show clear benefits of our method in an offline and realistic simulation setting, even stronger results could be obtained via future online A/B testing.
Acknowledgement
This project was supported in part by NSF IIS1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS1350983, NSF IIS1639792 EAGER, NSF IIS1841351 EAGER, NSF CNS1704701, ONR N000141512340, Intel ISTC, NVIDIA, Google and Amazon AWS.
References
 Abbeel & Ng (2004) Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
 Chen & Guestrin (2016) Chen, T. and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, 2016.
 Cheng et al. (2016) Cheng, H.T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., and Shah, H. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016.
 Clavera et al. (2018) Clavera, I., Nagabandi, A., Fearing, R. S., Abbeel, P., Levine, S., and Finn, C. Learning to adapt: Metalearning for modelbased control. arXiv preprint arXiv:1803.11347, 2018.
 Deisenroth et al. (2015) Deisenroth, M. P., Fox, D., and Rasmussen, C. E. Gaussian processes for dataefficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015. ISSN 01628828.
 Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Modelagnostic metalearning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 Guo et al. (2017) Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. Deepfm: a factorizationmachine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247, 2017.
 He et al. (2016) He, J., Ostendorf, M., He, X., Chen, J., Gao, J., Li, L., and Deng, L. Deep reinforcement learning with a combinatorial action space for predicting popular reddit threads. In EMNLP, 2016.
 Hidasi et al. (2015) Hidasi, B., Karatzoglou, A., Baltrunas, L., and Tikk, D. Sessionbased recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.
 Hidasi et al. (2016) Hidasi, B., Karatzoglou, A., Baltrunas, L., and Tikk, D. Sessionbased recommendations with recurrent neural networks. In ICLR, 2016.
 Ho & Ermon (2016) Ho, J. and Ermon, S. Generative adversarial imitation learning. In NIPS, 2016.
 Ho et al. (2016) Ho, J., Gupta, J. K., and Ermon, S. Modelfree imitation learning with policy optimization. In ICML, 2016.
 Jannach & Ludewig (2017) Jannach, D. and Ludewig, M. When recurrent neural networks meet the neighborhood for sessionbased recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 306–310. ACM, 2017.
 Li et al. (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. ACM, 2010.
 Manski (1975) Manski, C. F. Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics, pp. 205 – 228, 1975. ISSN 03044076.
 Mataric (1994) Mataric, M. J. Reward functions for accelerated learning. In ICML, 1994.
 Matignon et al. (2006) Matignon, L., Laurent, G. J., and FortPiat, N. L. Reward function and initial values: Better choices for accelerated goaldirected reinforcement learning. In ICANN, 2006.
 McFadden (1973) McFadden, D. Conditional logit analysis of qualitative choice behaviour. In Zarembka, P. (ed.), Frontiers in Econometrics, pp. 105–142. Academic Press New York, 1973.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Nagabandi et al. (2017) Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. Neural network dynamics for modelbased deep reinforcement learning with modelfree finetuning. arXiv preprint arXiv:1708.02596, 2017.
 Ng et al. (1999) Ng, A. Y., Harada, D., and Russell, S. J. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, 1999.
 Shi et al. (2019) Shi, J.C., Yu, Y., Da, Q., Chen, S.Y., and Zeng, A.X. Virtualtaobao: Virtualizing realworld online retail environment for reinforcement learning. In ThirtyThird AAAI Conference on Artificial Intelligence, 2019.
 Torabi et al. (2018) Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. In IJCAI, 2018.
 Watkins (1989) Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King’s College, Oxford, May 1989. (To be reprinted by MIT Press.).
 Yang et al. (2011) Yang, S.H., Long, B., Smola, A. J., Zha, H., and Zheng, Z. Collaborative competitive filtering: learning recommender using context of user choice. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 295–304. ACM, 2011.
 Zhao et al. (2018a) Zhao, X., Xia, L., Zhang, L., Ding, Z., Yin, D., and Tang, J. Deep reinforcement learning for pagewise recommendations. 2018a.
 Zhao et al. (2018b) Zhao, X., Zhang, L., Ding, Z., Yin, D., Zhao, Y., and Tang, J. Deep reinforcement learning for listwise recommendations. CoRR, 2018b.
 Zheng et al. (2018) Zheng, G., Zhang, F., Zheng, Z., Xiang, Y., Yuan, N. J., Xie, X., and Li, Z. Drn: A deep reinforcement learning framework for news recommendation. 2018.
Appendix A Lemma
A.1 Proof of lemma 1
See 1
Proof.
First, recall the problem defined in Eq. (2):
$${\varphi}^{*}({\bm{s}}^{t},{\mathcal{A}}^{t})=\mathrm{arg}\underset{\varphi \in {\mathrm{\Delta}}^{k1}}{\mathrm{max}}{\mathbb{E}}_{\varphi}\left[r({\bm{s}}^{t},{a}^{t})\right]\frac{1}{\eta}R(\varphi ).$$ 
Denote ${\varphi}^{t}=\varphi ({\bm{s}}^{t},{\mathcal{A}}^{t})$. Since $\varphi $ can be an arbitrary mapping (i.e., $\varphi $ is not limited in a specific parameter space), ${\varphi}^{t}$ can be an arbitrary vector in ${\mathrm{\Delta}}^{k1}$. Recall the notation ${\mathcal{A}}^{t}=\{{a}_{1},\mathrm{\cdots},{a}_{k}\}$. Then the expectation taken over random variable ${a}^{t}\in {\mathcal{A}}^{t}$ can be written as
$${\mathbb{E}}_{\varphi}\left[r({\bm{s}}^{t},{a}^{t})\right]\frac{1}{\eta}R(\varphi )=\sum _{i=1}^{k}{\varphi}_{i}^{t}r({\bm{s}}^{t},{a}_{i})\frac{1}{\eta}\sum _{i=1}^{k}{\varphi}_{i}^{t}\mathrm{log}{\varphi}_{i}^{t}.$$  (12) 
By simple computation, the optimal vector ${\varphi}^{t*}\in {\mathrm{\Delta}}^{k1}$ which maximizes Eq. (12) is
$${\varphi}_{i}^{t*}=\frac{\mathrm{exp}(\eta r({\bm{s}}^{t},{a}_{i}))}{{\sum}_{j=1}^{k}\mathrm{exp}(\eta r({\bm{s}}^{t},{a}_{j}))},$$  (13) 
which is equivalent to Eq. (2). Next, we show the equivalence of Eq. (13) to the discrete choice model interpreted by Eq. (3).
The cumulative distribution function for the Gumbel distribution is $F(\epsilon ;\alpha )=\mathbb{P}[\epsilon \u2a7d\alpha ]={e}^{{e}^{\alpha}}$ and the probability density is $f(\epsilon )={e}^{{e}^{\epsilon}}{e}^{\epsilon}$. Using the definition of the Gumbel distribution, the probability of the event $[{a}^{t}={a}_{i}]$ where ${a}^{t}$ is defined in Eq. (3) is
${P}_{i}:=\mathbb{P}[{a}^{t}={a}_{i}]$  $=\mathbb{P}[\eta r({\bm{s}}^{t},{a}_{i})+{\epsilon}_{i}\u2a7e\eta r({\bm{s}}^{t},{a}_{j})+{\epsilon}_{j},\text{for all}i\ne j]$  
$=\mathbb{P}[{\epsilon}_{j}\u2a7d{\epsilon}_{i}+\eta r({\bm{s}}^{t},{a}_{i})\eta r({\bm{s}}^{t},{a}_{j}),\text{for all}i\ne j].$ 
Suppose we know the random variable ${\epsilon}_{i}$. Then we can compute the choice probability ${P}_{i}$ conditioned on this information. Let ${B}_{ij}={\epsilon}_{i}+\eta r({\bm{s}}^{t},{a}_{i})\eta r({\bm{s}}^{t},{a}_{j})$ and ${P}_{i\mathcal{E}}$ be the conditional probability; then we have
$${P}_{i{\epsilon}_{i}}=\prod _{i\ne j}\mathbb{P}[{\epsilon}_{j}\u2a7d{B}_{ij}]=\prod _{i\ne j}{e}^{{e}^{{B}_{ij}}}.$$ 
In fact, we only know the density of ${\epsilon}_{i}$. Hence, using the Bayes theorem, we can express ${P}_{i}$ as
${P}_{i}$  $={\displaystyle {\int}_{\mathrm{\infty}}^{\mathrm{\infty}}}{P}_{i{\epsilon}_{i}}f({\epsilon}_{i})\text{d}{\epsilon}_{i}={\displaystyle {\int}_{\mathrm{\infty}}^{\mathrm{\infty}}}{\displaystyle \prod _{i\ne j}}{e}^{{e}^{{B}_{ij}}}f({\epsilon}_{i})\text{d}{\epsilon}_{i}$  
$={\displaystyle {\int}_{\mathrm{\infty}}^{\mathrm{\infty}}}{\displaystyle \prod _{j=1}^{k}}{e}^{{e}^{{B}_{ij}}}{e}^{{e}^{{\epsilon}_{i}}}{e}^{{e}^{{\epsilon}_{i}}}{e}^{{\epsilon}_{i}}\text{d}{\epsilon}_{i}={\displaystyle {\int}_{\mathrm{\infty}}^{\mathrm{\infty}}}\left({\displaystyle \prod _{j=1}^{k}}{e}^{{e}^{{B}_{ij}}}\right){e}^{{\epsilon}_{i}}\text{d}{\epsilon}_{i}$ 
Now, let us look at the product itself.
$\prod _{j=1}^{k}}{e}^{{e}^{{B}_{ij}}$  $=\mathrm{exp}\left({\displaystyle \sum _{j=1}^{k}}{e}^{{B}_{ij}}\right)$  
$=\mathrm{exp}\left({e}^{{\epsilon}_{i}}{\displaystyle \sum _{j=1}^{k}}{e}^{(\eta r({\bm{s}}^{t},{a}_{i})\eta r({\bm{s}}^{t},{a}_{j}))}\right)$ 
Hence
$${P}_{i}={\int}_{\mathrm{\infty}}^{\mathrm{\infty}}\mathrm{exp}({e}^{{\epsilon}_{i}}Q){e}^{{\epsilon}_{i}}\text{d}{\epsilon}_{i}$$ 
where $Q={\sum}_{j=1}^{k}{e}^{(\eta r({\bm{s}}^{t},{a}_{i})\eta r({\bm{s}}^{t},{a}_{j}))}=Z/\mathrm{exp}(\eta r({\bm{s}}^{t},{a}_{i}))$.
Next, we make a change of variable $y={e}^{{\epsilon}_{i}}$. The Jacobian of the inverse transform is $J=\frac{\text{d}{\epsilon}_{i}}{\text{d}y}=\frac{1}{y}$. Since $y>0$, the absolute of Jacobian is $J=\frac{1}{y}$. Therefore,
${P}_{i}$  $={\displaystyle {\int}_{0}^{\mathrm{\infty}}}\mathrm{exp}(Qy)yJ\text{d}y={\displaystyle {\int}_{0}^{\mathrm{\infty}}}\mathrm{exp}(Qy)\text{d}y$  
$={\displaystyle \frac{1}{Q}}={\displaystyle \frac{1}{\mathrm{exp}(\eta r({\bm{s}}^{t},{a}_{i})){\sum}_{j}\mathrm{exp}(\eta r({\bm{s}}^{t},{a}_{j}))}}$  
$={\displaystyle \frac{\mathrm{exp}(\eta r({\bm{s}}^{t},{a}_{i})}{{\sum}_{j=1}^{k}\mathrm{exp}(\eta r({\bm{s}}^{t},{a}_{j}))}}.$ 
∎
A.2 Proof of lemma 2
See 2
Proof.
This lemma is a straight forward result of lemma 1. First, recall the problem defined in Eq. (4.3):
$$\underset{\theta \in \mathrm{\Theta}}{\mathrm{min}}\left(\underset{\varphi \in \mathrm{\Phi}}{\mathrm{max}}{\mathbb{E}}_{\varphi}\left[\sum _{t=1}^{T}{r}_{\theta}({\bm{s}}_{true}^{t},{a}^{t})\right]\frac{1}{\eta}R(\varphi )\right)\sum _{t=1}^{T}{r}_{\theta}({\bm{s}}_{true}^{t},{a}_{true}^{t})$$ 
We make a assumption that there is no repeated pair $({\bm{s}}_{true}^{t},{a}^{t})$ in Eq. (4.3). This is a very soft assumption because ${\bm{s}}_{true}^{t}$ is updated overtime, and ${a}^{t}$ is in fact representing its feature vector ${\bm{f}}_{{a}^{t}}^{t}$, which is in space ${\mathbb{R}}^{d}$. With this assumption, we can let $\varphi $ map each pair $({\bm{s}}_{true}^{t},{a}^{t})$ to the optimal vector ${\varphi}^{t*}$ which maximize ${r}_{\theta}({\bm{s}}_{true}^{t},{a}^{t})\frac{1}{\eta}R({\varphi}^{t})$ since there is no repeated pair. Using Eq. (13), we have
$\underset{\varphi \in \mathrm{\Phi}}{\mathrm{max}}{\mathbb{E}}_{\varphi}\left[{\displaystyle \sum _{t=1}^{T}}{r}_{\theta}({\bm{s}}_{true}^{t},{a}^{t})\right]{\displaystyle \frac{1}{\eta}}R(\varphi )=\underset{\varphi \in \mathrm{\Phi}}{\mathrm{max}}{\displaystyle \sum _{t=1}^{T}}{\mathbb{E}}_{\varphi}\left[{r}_{\theta}({\bm{s}}_{true}^{t},{a}^{t})\right]{\displaystyle \frac{1}{\eta}}R(\varphi )$  
$=$  $\sum _{t=1}^{T}}\left({\displaystyle \sum _{i=1}^{k}}{\varphi}_{i}^{t*}r({\bm{s}}^{t},{a}_{i}){\displaystyle \frac{1}{\eta}}{\displaystyle \sum _{i=1}^{k}}{\varphi}_{i}^{t*}\mathrm{log}{\varphi}_{i}^{t*}\right)={\displaystyle \sum _{t=1}^{T}}{\displaystyle \frac{1}{\eta}}\mathrm{log}\left({\displaystyle \sum _{i=1}^{k}}\mathrm{exp}(\eta {r}_{\theta}({\bm{s}}_{true}^{t},{a}_{i}))\right).$ 
Eq. (4.3) can then be written as
$$\underset{\theta \in \mathrm{\Theta}}{\mathrm{min}}\sum _{t=1}^{T}\frac{1}{\eta}\mathrm{log}\left(\sum _{i=1}^{k}\mathrm{exp}(\eta {r}_{\theta}({\bm{s}}_{true}^{t},{a}_{i}))\right)\sum _{t=1}^{T}{r}_{\theta}({\bm{s}}_{true}^{t},{a}_{true}^{t}),$$ 
which is the negative loglikelihood function and is equivalent to lemma 2. ∎
Appendix B Alogrithm box
The following is the algorithm of learning the cascading deep Qnetworks. We employ the cascading $Q$ functions to search the optimal action efficiently (line 2). Besides, both the experience replay (Mnih et al., 2013) and $\epsilon $exploration techniques are applied. The system’s experiences at each timestep are stored in a replay memory set $\mathcal{M}$ (line 2) and then a minibatch of data will be sampled from the replay memory to update $\widehat{{Q}^{j}}$ (line 2 and 2). An exploration to the action space is executed with probability $\epsilon $ (line 2).
Appendix C Dataset description
(1) MovieLens public dataset^{1}^{1} 1 https://grouplens.org/datasets/movielens/ contains large amounts of movie ratings collected from their website. We randomly sample 1,000 active users from this dataset. On average, each of these active users rated more than 500 movies (including short films), so we assume they rated almost every movie that they watched and thus equate their rating behavior with watching behavior. MovieLens dataset is the most suitable public dataset for our experiments, but it is still not perfect. In fact, none of the public datasets provides the context in which a user’s choice is made. Thus, we simulate this missing information in a reasonable way. For each movie watched(rated) on the date $d$, we collect a list of movies released within a month before that day $d$. On average, movies run for about four weeks in theater. Even though we don’t know the actual context of user’s choice, at least the user decided to watch the rated movie instead of other movies in theater. Besides, we control the maximal size of each displayed set by 40. Features: In MovieLens dataset, only titles and IDs of the movies are given, so we collect detailed movie information from Internet Movie Database(IMDB). Categorical features as encoded as sparse vectors and descriptive features are encoded as dense vectors. The combination of such two types of vectors produces 722 dimensional raw feature vectors. To further reduce dimensionality, we use logistic regression to fit a wide&deep networks (Cheng et al., 2016) and use the learned input and hidden layers to reduce the feature to 10 dimension.
(2) An online news article recommendation dataset from Ant Financial is anonymously collected from Ant Financial news article online platform. It consists of 50,000 users’ clicks and impression logs for one month, involving dozens of thousands of news. It is a timestamped dataset which contains user features, news article features and the context where the user clicks the articles. The size of the display set is not fixed, since a user can browse the news article platform as she likes. On average a display set contains 5 new articles, but it actually various from 2 to 10. Features: The news article raw features are approximately of dimension 100 million because it summarizes the key words in the article. Apparently it is too expensive to use these raw features in practice. The features we use in the experiments are 20 dimensional dense vector embedding produced from the raw feature by wide&deep networks. The reduced 20 dimensional features are widely used in this online platform and revealed to be effective in practice.
(3) Last.fm^{2}^{2} 2 https://www.last.fm/api contains listening records from 359,347 users. Each display set is simulated by collecting 9 songs with nearest timestamp.
(4) Yelp^{3}^{3} 3 https://www.yelp.com/dataset/ contains users’ reviews to various businesses. Each display set is simulated by collecting 9 businesses with nearest location.
(5) RecSys15^{4}^{4} 4 https://2015.recsyschallenge.com/ contains clickstreams that sometimes end with purchase events.
(6) Taobao^{5}^{5} 5 https://tianchi.aliyun.com/datalab contains the clicking behavior and buying behavior of users in 22 days. We consider the buying behaviors as positive events.
Appendix D More figures for experimental results
D.1 Figures for section 6.1
An interesting comparison is shown in Figure 3 and more similar figures are provided here. The blue curve is the trajectory of a user’s actual choices of movies over time. The orange curves are simulated trajectories predicted by GAN and CCF, respectively. Similar to what we conclude in section 6.1, these figures reveal the good performances of GAN user model in terms of capturing the evolution of users’ interest.
D.2 Figures for section 6.2
We demonstrate the policy performance in user level in figure 4 by comparing the cumulative reward. Here we attach the figure which compares the click rate. In each subfigure, red curve represents GANDQN policy and blue curve represents the other. GANDQN policy contributes higher averaged click rate for most users.
D.3 Figures for section 6.3
This figure shows three sets of results corresponding to different sizes of display set. It reveals how users’ cumulative reward(averaged over 1,000 users) increases as each policy interacts with and adapts to 1,000 users over time. It can be easily that the CDQN policy pretrained over a GAN user model can adapt to online users much faster then other modelfree policies and can reduce the risk of losing the user at the beginning. The experiment setting is similar to section 6.2. All policies are evaluated on a separated set of 1,000 users associated with a test model. We need to emphasize that the GAN model which assists the CDQN policy is learned from a training set of users without overlapping test users. It is different from the test model which fits the 1,000 test users.