Abstract
Reinforcement learning is effective in optimizing policies for recommendersystems. Current solutions mostly focus on modelfree approaches, which requirefrequent interactions with a real environment, and thus are expensive in modellearning. Offline evaluation methods, such as importance sampling, canalleviate such limitations, but usually request a large amount of logged dataand do not work well when the action space is large. In this work, we propose amodelbased reinforcement learning solution which models the useragentinteraction for offline policy learning via a generative adversarial network.To reduce bias in the learnt policy, we use the discriminator to evaluate thequality of generated sequences and rescale the generated rewards. Ourtheoretical analysis and empirical evaluations demonstrate the effectiveness ofour solution in identifying patterns from given offline data and learningpolicies based on the offline and generated data.
Quick Read (beta)
A ModelBased Reinforcement Learning with Adversarial Training for Online Recommendation
Abstract
Reinforcement learning is effective in optimizing policies for recommender systems. Current solutions mostly focus on modelfree approaches, which require frequent interactions with a real environment, and thus are expensive in model learning. Offline evaluation methods, such as importance sampling, can alleviate such limitations, but usually request a large amount of logged data and do not work well when the action space is large. In this work, we propose a modelbased reinforcement learning solution which models the useragent interaction for offline policy learning via a generative adversarial network. To reduce bias in the learnt policy, we use the discriminator to evaluate the quality of generated sequences and rescale the generated rewards. Our theoretical analysis and empirical evaluations demonstrate the effectiveness of our solution in identifying patterns from given offline data and learning policies based on the offline and generated data.
A ModelBased Reinforcement Learning with Adversarial Training for Online Recommendation
Xueying Bai${}^{\mathrm{\ast}\mathrm{\u2021}}$, Jian Guan^{†}^{†}thanks: Both authors contributed equally. ${}^{\mathrm{\S}}$, Hongning Wang${}^{\mathrm{\u2020}}$ ${}^{\u2021}$Department of Computer Science, Stony Brook University ${}^{\mathrm{\S}}$ Department of Computer Science and Technology, Tsinghua University ${}^{\u2020}$ Department of Computer Science, University of Virginia [email protected], [email protected] [email protected]
noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]
1 Introduction
Recommender systems are widely used to recommend items that users would be interested in among the huge amount of content on the internet. However, because of users’ different interests and behaviors, only a small fraction of items are viewed by each user with even less feedback recorded. This gives relatively little information on userrecommender interactions for such a large state and action space chen2019top, and thus brings challenges to serve users with their favored items at the appropriate time based on historical interactions. Therefore, it is important to develop approaches to learn users’ preferences from the sparse user feedback such as clicks and purchases he2016fast, and explore unobserved interactions koren2009matrix to further improve recommenders.
Users’ interests can be shortterm or longterm and reflected by different types of feedback wu2017returning. For example, clicks are generally considered as shortterm feedback which reflects users’ immediate interests during the interaction, while purchase reveals users’ longterm interests which usually happen after several clicks. Considering both users’ shortterm and longterm interests, we frame the recommender system as a reinforcement learning (RL) agent, which aims to maximize users’ overall longterm satisfaction without sacrificing the recommendations’ shortterm utility shani2005mdp.
Classical modelfree RL methods require collecting large quantities of data by interacting with the environment, e.g., a population of users. Therefore, without interacting with real users, a recommender cannot easily probe for reward in previously unexplored regions in the state and action space. However, it is prohibitively expensive for a recommender to interact with users for reward and model updates, because bad recommendations (e.g., for exploration) hurt user satisfaction and increase the risk of user drop out. In this case, it is preferred for a recommender to learn a policy by fully utilizing the logged data that is acquired from other policies instead of direct interactions with users. For this purpose, in this work we take a modelbased learning approach, in which we simultaneously estimate a model of user behavior from the offline data and use it to interact with our learning agent to obtain an improved policy.
Modelbased RL has a strong advantage of being sample efficient and helping reduce noise in offline data. However, such an advantage can easily diminish due to the inherent bias in its model approximation of the real environment. Moreover, dramatic changes in subsequent policy updates impose the risk of decreased user satisfaction, i.e., inconsistent recommendations across model updates. To address these issues, we introduce adversarial training into a recommender’s policy learning from offline data. The discriminator is trained to differentiate simulated interaction trajectories from real ones so as to debias the user behavior model and improve policy learning. To the best of our knowledge, this is the first work to explore adversarial training over a modelbased RL framework for recommendation. We theoretically and empirically demonstrate the value of our proposed solution in policy evaluation. Together, the main contributions of our work are as follows:

•
To avoid the high interaction cost, we propose a unified solution to more effectively utilize the logged offline data with modelbased RL algorithms, integrated via adversarial training. It enables robust recommendation policy learning.

•
The proposed model is verified through theoretical analysis and extensive empirical evaluations. Experiment results demonstrate our solution’s better sample efficiency over the stateoftheart baselines ^{1}^{1} 1 Our implementation is available at https://github.com/JianGuanTHU/IRecGAN.
2 Related Work
Deep RL for recommendation There have been studies utilizing deep RL solutions in news, music and video recommendations lu2016partially; liebman2015dj; zheng2018drn. However, most of the existing solutions are modelfree methods and thus do not explicitly model the agentuser interactions. In these methods, valuebased approaches, such as deep Qlearning mnih2015human, present unique advantages such as seamless offpolicy learning, but are prone to instability with function approximation sutton2000policy; mnih2013playing. And the policy’s convergence in these algorithms is not wellstudied. In contrast, policybased methods such as policy gradient learning1998introduction remain stable but suffer from data bias without realtime interactive control due to learning and infrastructure constraints. Oftentimes, importance sampling munos2016safe is adopted to address the bias but instead results in huge variance chen2019top. In this work, we rely on a policy gradient based RL approach, in particular, REINFORCE williams1992simple; but we simultaneously estimate a user behavior model to provide a reliable environment estimate so as to update our agent on policy.
Modelbased RL Modelbased RL algorithms incorporate a model of the environment to predict rewards for unseen stateaction pairs. It is known in general to outperform modelfree solutions in terms of sample complexity deisenroth2013survey, and has been applied successfully to control robotic systems both in simulation and real world deisenroth2011pilco; meger2015learning; morimoto2003minimax; deisenroth2011learning. Furthermore, DynaQ sutton1990integrated; peng2018deep integrates modelfree and modelbased RL to generate samples for learning in addition to the real interaction data. gu2016continuous extended these ideas to neural network models, and peng2018deep further apply the method on taskcompletion dialogue policy learning. However, the most efficient modelbased algorithms have used relatively simple function approximations, which actually have difficulties in highdimensional space with nonlinear dynamics and thus lead to huge approximation bias.
Offline evaluation The problems of offpolicy learning munos2016safe; precup2000eligibility; precup2001off and offline policy evaluation are generally pervasive and challenging in RL, and in recommender systems in particular. As a policy evolves, so does the distribution under which the expectation of gradient is computed. Especially in the scenario of recommender systems, where item catalogues and user behavior change rapidly, substantial policy changes are required; and therefore it is not feasible to take the classic approaches schulman2015trust; achiam2017constrained to constrain the policy updates before new data is collected under an updated policy. Multiple offpolicy estimators leveraging inversepropensity scores, capped inversepropensity scores and various variance control measures have been developed thomas2016data; swaminathan2015self; swaminathan2015batch; gilotte2018offline for this purpose.
RL with adversarial training yu2017seqgan propose SeqGAN to extend GANs with an RLlike generator for the sequence generation problem, where the reward signal is provided by the discriminator at the end of each episode via a Monte Carlo sampling approach. The generator takes sequential actions and learns the policy using estimated cumulative rewards. In our solution, the generator consists of two components, i.e., our recommendation agent and the user behavior model, and we model the interactive process via adversarial training and policy gradient. Different from the sequence generation task which only aims to generate sequences similar to the given observations, we leverage adversarial training to help reduce bias in the user model and further reduce the variance in training our agent. The agent learns from both the interactions with the user behavior model and those stored in the logged offline data. To the best of our knowledge, this is the first work that utilizes adversarial training for improving both model approximation and policy learning on offline data.
3 Problem Statement
The problem is to learn a recommender that recommends items to maximize cumulative rewards of an online recommendation system by efficiently utilizing offline data. We address this problem with a modelbased reinforcement learning solution, which also needs to capture users’ behavior patterns.
Problem A recommender is formed as a learning agent to generate actions under a policy, where each action gives a recommendation list with $k$ items. Every time through interactions between the agent and the environment (i.e., users of the system), a set $\mathrm{\Omega}$ of $n$ sequences $\mathrm{\Omega}=\{{\tau}_{1},\mathrm{\dots},{\tau}_{n}\}$ is recorded, where ${\tau}_{i}$ is the $i$th sequence containing agent actions, user behaviors and rewards: ${\tau}_{i}=\{({a}_{0}^{i},{c}_{0}^{i},{r}_{0}^{i}),({a}_{1}^{i},{c}_{1}^{i},{r}_{1}^{i}),\mathrm{\dots},({a}_{t}^{i},{c}_{t}^{i},{r}_{t}^{i})\}$, ${r}_{t}^{i}$ represents the reward on ${a}_{t}^{i}$ (e.g., make a purchase), and ${c}_{t}^{i}$ is the associated user behavior corresponding to agent’s action ${a}_{t}^{i}$ (e.g., click on a recommended item). For simplicity, in the rest of paper, we drop the superscript $i$ to represent a general sequence $\tau $. Based on the observed sequences, a policy $\pi $ is learnt to maximize the expected cumulative reward ${\mathbb{E}}_{\tau \sim \pi}[{\sum}_{t=0}^{T}{r}_{t}]$, where $T$ is the end time of $\tau $.
Assumption To narrow the scope of our discussion, we study a typical type of user behavior, i.e., clicks, and make following assumptions: 1) at each time a user must click on one item from the recommendation list; 2) items not clicked in the recommendation list will not influence the user’s future behaviors; 3) rewards only relate to clicked items. For example, when taking the user’s purchase as reward, purchases can only happen on the clicked items.
Learning framework In a Markov Decision Process, an environment consists of a state set $S$, an action set $A$, a state transition distribution $P:S\times A\times S$, and a reward function ${f}_{r}:S\times A\to \mathbb{R}$, which maps a stateaction pair to a realvalued scalar. In this paper, the environment is modeled as a user behavior model $\mathcal{U}$, and learnt from offline log data. $S$ is reflected by the interaction history before time $t$, and $P$ captures the transition of user behaviors. In the meanwhile, based on the assumptions mentioned above, at each time $t$, the environment generates user’s click ${c}_{t}$ on items recommended by an agent $\mathcal{A}$ in ${a}_{t}$ based on his/her click probabilities under the current state; and the reward function ${f}_{r}$ generates reward ${r}_{t}$ for the clicked item ${c}_{t}$.
Our recommendation policy is learnt from both offline data and data sampled from the learnt user behavior model, i.e., a modelbased RL solution. We incorporate adversarial training in our modelbased policy learning to: 1) improve the user model to ensure the sampled data is close to true data distribution; 2) utilize the discriminator to scale rewards from generated sequences to further reduce bias in value estimation. Our proposed solution contains an interactive model constructed by $\mathcal{U}$ and $\mathcal{A}$, and an adversarial policy learning approach. We name the solution as InteractiveRecGAN, or IRecGAN in short. The overview of our proposed solution is shown in Figure 1.
4 Interactive Modeling for Recommendation
We present our interactive model for recommendation, which consists of two components: 1) the user behavior model $\mathcal{U}$ that generates user clicks over the recommended items with corresponding rewards; and 2) the agent $\mathcal{A}$ which generates recommendations according to its policy. $\mathcal{U}$ and $\mathcal{A}$ interact with each other to generate user behavior sequences for adversarial policy learning.
User behavior model Given users’ click observations $\{{c}_{0},{c}_{1},\mathrm{\dots},{c}_{t1}\}$, the user behavior model $\mathcal{U}$ first projects the clicked item into an embedding vector ${\mathbf{e}}^{u}$ at each time ^{2}^{2} 2 As we can use different embeddings on the user side and agent side, we use the superscript $u$ and $a$ to denote this difference accordingly.. The state ${\mathbf{s}}_{t}^{u}$ can be represented as a summary of click history, i.e., ${\mathbf{s}}_{t}^{u}={h}_{u}({\mathbf{e}}_{0}^{u},{\mathbf{e}}_{1}^{u},\mathrm{\dots}{\mathbf{e}}_{t1}^{u})$. We use a recurrent neural network to model the state transition $P$ on the user side, thus for the state ${\mathbf{s}}_{t}^{u}$ we have,
$${\mathbf{s}}_{t}^{u}={h}^{u}({\mathbf{s}}_{t1}^{u},{\mathbf{e}}_{t1}^{u}),$$ 
${h}^{u}(\cdot ,\cdot )$ can be functions in the RNN family like GRU chung2014empirical and LSTM hochreiter1997long cells. Given the action ${a}_{t}=\{{a}_{t(1)},\mathrm{\dots}{a}_{t(k)}\}$, i.e., the top$k$ recommendations at time $t$, we compute the probability of click among the recommended items via a softmax function,
$${\mathbf{V}}^{c}={({\mathbf{W}}^{c}{\mathbf{s}}_{t}^{u}+{\mathbf{b}}^{c})}^{\top}{\mathbf{E}}_{t}^{u},p({c}_{t}{\mathbf{s}}_{t}^{u},{a}_{t})=\mathrm{exp}({\mathbf{V}}_{i}^{c})/{\sum}_{j=1}^{{a}_{t}}\mathrm{exp}({\mathbf{V}}_{j}^{c})$$  (1) 
where ${\mathbf{V}}^{c}\in {\mathbb{R}}^{k}$ is a transformed vector indicating the evaluated quality of each recommended item ${a}_{t(i)}$ under state ${\mathbf{s}}_{t}^{u}$, ${\mathbf{E}}_{t}^{u}$ is the embedding matrix of recommended items, ${\mathbf{W}}^{c}$ is the click weight matrix, and ${\mathbf{b}}^{c}$ is the corresponding bias term. Under the assumption that target rewards only relate to clicked items, the reward ${r}_{t}$ for (${\mathbf{s}}_{t}^{u}$, ${a}_{t}$) is calculated by:
$${r}_{t}({\mathbf{s}}_{t}^{u},{a}_{t})={f}_{r}\left({({\mathbf{W}}^{r}{\mathbf{s}}_{t}^{u}+{\mathbf{b}}^{r})}^{\top}{\mathbf{e}}_{t}^{u}\right),$$  (2) 
where ${\mathbf{W}}^{r}$ is the reward weight matrix, ${\mathbf{b}}^{r}$ is the corresponding bias term, and ${f}_{r}$ is the reward mapping function and can be set according to the reward definition in specific recommender systems. For example, if we make ${r}_{t}$ to be the purchase of a clicked item ${c}_{t}$, where ${r}_{t}=1$ if it is purchased and ${r}_{t}=0$ otherwise, ${f}_{r}$ can be realized by a Sigmoid function with binary output.
Based on Eq (1) and (2), taking the categorical reward, the user behavior model $\mathcal{U}$ can be estimated from the offline data $\mathrm{\Omega}$ via maximum likelihood estimation:
$$\mathrm{max}\sum _{{\tau}_{i}\in \mathrm{\Omega}}{\sum}_{t=0}^{{T}_{i}}\mathrm{log}p({c}_{t}^{i}{\mathbf{s}}_{t}^{{u}_{i}},{a}_{t}^{i})+{\lambda}_{p}\mathrm{log}p({r}_{t}^{i}{\mathbf{s}}_{t}^{{u}_{i}},{c}_{t}^{i}),$$  (3) 
where ${\lambda}_{p}$ is a parameter balancing the loss between click prediction and reward prediction, and ${T}_{i}$ is the length of the observation sequence ${\tau}_{i}$. With a learnt user behavior model, user clicks and reward on the recommendation list can be sampled from Eq (1) and (2) accordingly.
Agent The agent should take actions based on the environment’s provided states. However, in practice, users’ states are not observable in a recommender system. Besides, as discussed in oh2017value, the states for the agent to take actions may be different from those for users to generate clicks and rewards. As a result, we build a different state model on the agent side in $\mathcal{A}$ to learn its states. Similar to that on the user side, given the projected click vectors $\{{\mathbf{e}}_{0}^{a},{\mathbf{e}}_{2}^{a},\mathrm{\dots}{\mathbf{e}}_{t1}^{a}\}$, we model states on the agent side by ${\mathbf{s}}_{t}^{a}={h}^{a}({\mathbf{s}}_{t1}^{a},{\mathbf{e}}_{t1}^{a})$, where ${\mathbf{s}}_{t}^{a}$ denotes the state maintained by the agent at time $t$, ${h}^{a}(\cdot ,\cdot )$ is the chosen RNN cell. The start state ${\mathbf{s}}_{0}^{a}$ for the first recommendation is drawn from a distribution $\rho $. We simply denote it as ${\mathbf{s}}_{0}$ in the rest of our paper. We should note that although the agent also models states based on users’ click history, it might create different state sequences than that on the user side.
Based on the current state ${\mathbf{s}}_{t}^{a}$, the agent generates a size$k$ recommendation list out of the entire set of items as its action ${a}_{t}$. The probability of item $i$ to be included in ${a}_{t}$ under the policy $\pi $ is:
$$\pi (i\in {a}_{t}{\mathbf{s}}_{t}^{a})=\frac{\mathrm{exp}({\mathbf{W}}_{i}^{a}{\mathbf{s}}_{t}^{a}+{\mathbf{b}}_{i}^{a})}{{\sum}_{j=1}^{C}\mathrm{exp}({\mathbf{W}}_{j}^{a}{\mathbf{s}}_{t}^{a}+{\mathbf{b}}_{j}^{a})},$$  (4) 
where ${\mathbf{W}}_{i}^{a}$ is the $i$th row of the action weight matrix ${\mathbf{W}}^{a}$, $C$ is the entire set of recommendation candidates, and ${\mathbf{b}}_{i}^{a}$ is the corresponding bias term. Following chen2019top, we generate ${a}_{t}$ by sampling without replacement according to Eq (4). Unlike pmlrv97chen19f, we do not consider the combinatorial effect among the $k$ items by simply assuming the users will evaluate them independently (as indicated in Eq (1)).
5 Adversarial Policy Learning
We use the policy gradient method REINFORCE williams1992simple for the agent’s policy learning, based on both generated and offline data. When generating ${\widehat{\tau}}_{0:t}=\{({\widehat{a}}_{0},{\widehat{c}}_{0},{\widehat{r}}_{0}),\mathrm{\dots},({\widehat{a}}_{t},{\widehat{c}}_{t},{\widehat{r}}_{t})\}$ for $t>0$, we obtain ${\widehat{a}}_{t}=\mathcal{A}({\widehat{\tau}}_{0:t1}^{c})$ by Eq (4), ${\widehat{c}}_{t}={\mathcal{U}}_{c}({\widehat{\tau}}_{0:t1}^{c},{\widehat{a}}_{t})$ by Eq (2), and ${\widehat{r}}_{t}={\mathcal{U}}_{r}({\widehat{\tau}}_{0:t1}^{c},{\widehat{c}}_{t})$ by Eq (1). ${\tau}^{c}$ represents clicks in the sequence $\tau $ and $({\widehat{a}}_{0},{\widehat{c}}_{0},{\widehat{r}}_{0})$ is generated by ${\mathbf{s}}_{0}^{a}$ and ${\mathbf{s}}_{0}^{u}$. The generation of a sequence ends at the time $t$ if ${\widehat{c}}_{t}={c}_{end}$, where ${c}_{end}$ is a stopping symbol. The distributions of generated and offline data are denoted as $g$ and $data$ respectively. In the following discussions, we do not explicitly differentiate $\tau $ and $\widehat{\tau}$ when the distribution of them is specified. Since we start the training of $\mathcal{U}$ from offline data, it introduces inherent bias from the observations and our specific modeling choices. The bias affects the sequence generation and thus may cause biased value estimation. To reduce the effect of bias, we apply adversarial training to control the training of both $\mathcal{U}$ and $\mathcal{A}$. The discriminator scores are also used to rescale the generated rewards $\widehat{r}$ for policy learning. Therefore, the learning of agent $\mathcal{A}$ considers both sequence generation and target rewards.
5.1 Adversarial training
We leverage adversarial training to encourage our IRecGAN model to generate highquality sequences that capture intrinsic patterns in the real data distribution. A discriminator $\mathcal{D}$ is used to evaluate a given sequence $\tau $, where $\mathcal{D}(\tau )$ represents the probability that $\tau $ is generated from the real recommendation environment. The discriminator can be estimated by minimizing the objective function:
$${\mathbb{E}}_{\tau \sim data}\mathrm{log}\left(\mathcal{D}(\tau )\right){\mathbb{E}}_{\tau \sim g}\mathrm{log}\left(1\mathcal{D}(\tau )\right).$$  (5) 
However, $\mathcal{D}$ only evaluates a completed sequence, while it cannot directly evaluate a partially generated sequence at a particular time step $t$. Inspired by yu2017seqgan, we utilize the MonteCarlo tree search algorithm with the rollout policy constructed by $\mathcal{U}$ and $\mathcal{A}$ to get sequence generation score at each time. At time $t$, the sequence generation score ${q}_{\mathcal{D}}$ of ${\tau}_{0:t}$ is defined as:
$$  (6) 
where $M{C}^{\mathcal{U},\mathcal{A}}({\tau}_{{}_{0:t}};N)$ is the set of $N$ sequences sampled from the interaction between $\mathcal{U}$ and $\mathcal{A}$.
Given the observations in offline data, $\mathcal{U}$ should generate clicks and rewards that reflect intrinsic patterns of the real data distribution. Therefore, $\mathcal{U}$ should maximize the sequence generation objective ${\mathbb{E}}_{{\mathbf{s}}_{0}^{u}\sim \rho}[{\sum}_{({a}_{0},{c}_{0},{r}_{0})\sim g}\mathcal{U}({c}_{0},{r}_{0}{\mathbf{s}}_{0}^{u},{a}_{0})\cdot {q}_{\mathcal{D}}({\tau}_{0:0})]$, which is the expected discriminator score for generating a sequence from the start state. $\mathcal{U}$ may not generate clicks and rewards exactly the same as those in offline data, but the closeness of its generated data to offline data is still an informative signal to evaluate its sequence generation quality. By setting ${q}_{\mathcal{D}}({\tau}_{0:t})=1$ at any time $t$ for offline data, we extend this objective to include offline data (it becomes the data likelihood function on offline data). Following yu2017seqgan, based on Eq (1) and Eq (2), the gradient of $\mathcal{U}$’s objective can be derived as,
$${\mathbb{E}}_{\tau \sim \{g,data\}}\left[{\sum}_{t=0}^{T}{q}_{\mathcal{D}}({\tau}_{0:t}){\nabla}_{{\mathrm{\Theta}}_{u}}\left(\mathrm{log}{p}_{{\mathrm{\Theta}}_{u}}({c}_{t}{\mathbf{s}}_{t}^{u},{a}_{t})+{\lambda}_{p}\mathrm{log}{p}_{{\mathrm{\Theta}}_{u}}({r}_{t}{\mathbf{s}}_{t}^{u},{c}_{t})\right)\right],$$  (7) 
where ${\mathrm{\Theta}}_{u}$ denotes the parameters of $\mathcal{U}$ and ${\mathrm{\Theta}}_{a}$ denotes those of $\mathcal{A}$. Based on our assumption, even when $\mathcal{U}$ can already capture users’ true behavior patterns, it still depends on $\mathcal{A}$ to provide appropriate recommendations to generate clicks and rewards that the discriminator will treat as authentic. Hence, $\mathcal{A}$ and $\mathcal{U}$ are coupled in this adversarial training. To encourage $\mathcal{A}$ to provide needed recommendations, we include ${q}_{\mathcal{D}}({\tau}_{0:t})$ as a sequence generation reward for $\mathcal{A}$ at time $t$ as well. As ${q}_{\mathcal{D}}({\tau}_{0:t})$ evaluates the overall generation quality of ${\tau}_{0:t}$, it ignores sequence generations after $t$. To evaluate the quality of a whole sequence, we require $\mathcal{A}$ to maximize the cumulative sequence generation reward ${\mathbb{E}}_{\tau \sim \{g,data\}}\left[{\sum}_{t=0}^{T}{q}_{\mathcal{D}}({\tau}_{0:t})\right]$. Because $\mathcal{A}$ does not directly generate the observations in the interaction sequence, we approximate ${\nabla}_{{\mathrm{\Theta}}_{a}}\left({\sum}_{t=0}^{T}{q}_{\mathcal{D}}({\tau}_{0:t})\right)$ as 0 when calculating the gradients. Putting these together, the gradient derived from sequence generations for $\mathcal{A}$ is estimated as,
$${\mathbb{E}}_{\tau \sim \{g,data\}}\left[{\sum}_{t=0}^{T}\left({\sum}_{{t}^{\prime}=t}^{T}{\gamma}^{{t}^{\prime}t}{q}_{\mathcal{D}}({\tau}_{0:t})\right){\nabla}_{{\mathrm{\Theta}}_{a}}\mathrm{log}{\pi}_{{\mathrm{\Theta}}_{a}}({c}_{t}\in {a}_{t}{\mathbf{s}}_{t}^{a})\right].$$  (8) 
Based on our assumption that only the clicked items influence user behaviors, and $\mathcal{U}$ only generates rewards on clicked items, we use ${\pi}_{{\mathrm{\Theta}}_{a}}({c}_{t}\in {a}_{t}{s}_{t}^{a})$ as an estimation of ${\pi}_{{\mathrm{\Theta}}_{a}}({a}_{t}{s}_{t}^{a})$, i.e., $\mathcal{A}$ should promote ${c}_{t}$ in its recommendation at time $t$. In practice, we add a discount factor $$ when calculating the cumulative rewards to reduce estimation variance chen2019top.
5.2 Policy learning
Because our adversarial training encourages IRecGAN to generate clicks and rewards with similar patterns as offline data, and we assume rewards only relate to the clicked items, we use offline data as well as generated data for policy learning and treat the offline rewards as an estimation of the rewards of ${\pi}_{{\mathrm{\Theta}}_{a}}$. Given data ${\tau}_{0:T}=\{({a}_{0},{c}_{0},{r}_{0}),\mathrm{\dots},({a}_{T},{c}_{T},{r}_{T})\}$, including both offline and generated data, the objective of the agent is to maximize the expected cumulative reward ${\mathbb{E}}_{\tau \sim \{g,data\}}[{R}_{T}]$, where ${R}_{T}={\sum}_{t=0}^{T}{r}_{t}$. In the generated data, due to the difference in distributions of the generated and offline sequences, the generated reward ${\widehat{r}}_{t}$ calculated by Eq (2) might be biased. To reduce such bias, we utilize the sequence generation score in Eq (6) to rescale generated rewards: ${r}_{t}^{s}={q}_{\mathcal{D}}({\tau}_{0:t}){\widehat{r}}_{t}$, and treat it as the reward for generated data. The gradient of the objective is thus estimated by:
$${\mathbb{E}}_{\tau \sim \{g,data\}}\left[{\sum}_{t=0}^{T}{R}_{t}{\nabla}_{{\mathrm{\Theta}}_{a}}\mathrm{log}{\pi}_{{\mathrm{\Theta}}_{a}}({c}_{t}\in {a}_{t}{\mathbf{s}}_{t}^{a})\right],{R}_{t}={\sum}_{{t}^{\prime}=t}^{T}{\gamma}^{{t}^{\prime}t}{q}_{\mathcal{D}}({\tau}_{0:t}){r}_{t}$$  (9) 
${R}_{t}$ is an approximation of ${R}_{T}$ with $\gamma $ as the discount factor. Overall, the user behavior model $\mathcal{U}$ is updated only by the sequence generation objective defined in Eq (7) on both offline and generated data; but the agent $\mathcal{A}$ is updated by both sequence generation and target rewards. Hence, the overall reward for $\mathcal{A}$ at time $t$ is ${q}_{\mathcal{D}}({\tau}_{0:t})(1+{\lambda}_{r}{r}_{t})$, where ${\lambda}_{r}$ is the weight for cumulative target rewards. The overall gradient for $\mathcal{A}$ is thus:
$${\mathbb{E}}_{\tau \sim \{g,data\}}\left[{\sum}_{t=0}^{T}{R}_{t}^{a}{\nabla}_{{\mathrm{\Theta}}_{a}}\mathrm{log}{\pi}_{{\mathrm{\Theta}}_{a}}({c}_{t}\in {a}_{t}{\mathbf{s}}_{t}^{a})\right],{R}_{t}^{a}={\sum}_{{t}^{\prime}=t}^{T}{\gamma}^{{t}^{\prime}t}{q}_{\mathcal{D}}({\tau}_{0:t})(1+{\lambda}_{r}{r}_{t})$$  (10) 
6 Theoretical Analysis
For one iteration of policy learning in IRecGAN, we first train the discriminator $\mathcal{D}$ with offline data, which follows ${P}_{data}$ and was generated by an unknown logging policy, and the data generated by IRecGAN under ${\pi}_{{\mathrm{\Theta}}_{a}}$ with the distribution of $g$. When ${\mathrm{\Theta}}_{u}$ and ${\mathrm{\Theta}}_{a}$ are learnt, for a given sequence $\tau $, by proposition 1 in goodfellow2014generative, the optimal discriminator $\mathcal{D}$ is ${\mathcal{D}}^{*}(\tau )=\frac{{P}_{data}(\tau )}{{P}_{data}(\tau )+{P}_{g}(\tau )}$.
Sequence generation Both $\mathcal{A}$ and $\mathcal{U}$ contribute to the sequence generation in IRecGAN. $\mathcal{U}$ is updated by the gradient in Eq (7) to maximize the sequence generation objective. At time $t$, the expected sequence generation reward for $\mathcal{A}$ on the generated data is: ${E}_{{\tau}_{0:t}\sim g}[{q}_{\mathcal{D}}({\tau}_{0:t})]={E}_{{\tau}_{0:t}\sim g}[\mathcal{D}({\tau}_{0:T}{\tau}_{0:t})].$ The expected value on ${\tau}_{0:t}$ is: ${\mathbb{E}}_{\tau \sim g}[{V}_{g}]={\mathbb{E}}_{\tau \sim g}\left[{\sum}_{t=0}^{T}{q}_{\mathcal{D}}({\tau}_{0:t})\right]={\sum}_{t=0}^{T}{\mathbb{E}}_{{\tau}_{0:t}\sim g}\left[\mathcal{D}({\tau}_{0:T}{\tau}_{0:t})\right].$ Given the optimal ${\mathcal{D}}^{*}$, the sequence generation value can be written as:
$${\mathbb{E}}_{\tau \sim g}[{V}_{g}]={\sum}_{t=0}^{T}{\mathbb{E}}_{{\tau}_{0:t}\sim g}\left[\frac{{P}_{data}({\tau}_{0:T}{\tau}_{0:t})}{{P}_{data}({\tau}_{0:T}{\tau}_{0:t})+{P}_{g}({\tau}_{0:T}{\tau}_{0:t})}\right].$$  (11) 
Maximizing each term in the summation of Eq (11) is an objective for the generator at time $t$ in GAN. According to goodfellow2014generative, the optimal solution for all such terms is ${P}_{g}({\tau}_{0:T}{s}_{0})={P}_{data}({\tau}_{0:T}{s}_{0})$. It means $\mathcal{A}$ can maximize the sequence generation value when it helps to generate sequences with the same distribution as $data$. Besides the global optimal, Eq (11) also encourages $\mathcal{A}$ to reward each ${P}_{g}({\tau}_{0:T}{\tau}_{0:t})={P}_{data}({\tau}_{0:T}{\tau}_{0:t})$, even if ${\tau}_{0:t}$ is less likely to be generated from ${P}_{g}$. This prevents IRecGAN to recommend items only considering users’ immediate preferences.
Value estimation The agent $\mathcal{A}$ should also be updated to maximize the expected value of target rewards ${V}_{a}$. To achieve this, we use discriminator $\mathcal{D}$ to rescale the estimation of ${V}_{a}$ on the generated sequences, and we also combine offline data to evaluate ${V}_{a}$ for policy ${\pi}_{{\mathrm{\Theta}}_{a}}$:
$${\mathbb{E}}_{\tau \sim {\pi}_{{}_{{\mathrm{\Theta}}_{a}}}}[{V}_{a}]={\lambda}_{1}{\sum}_{t=0}^{T}{\mathbb{E}}_{{\tau}_{0:t}\sim g}\frac{{P}_{data}({\tau}_{0:t})}{{P}_{data}({\tau}_{0:t})+{P}_{g}({\tau}_{0:t})}{\widehat{r}}_{t}+{\lambda}_{2}{\sum}_{t=0}^{T}{\mathbb{E}}_{{\tau}_{0:t}\sim data}{r}_{t},$$  (12) 
where ${\widehat{r}}_{t}$ is the generated reward by $\mathcal{U}$ at time $t$ and ${r}_{t}$ is the true reward. ${\lambda}_{1}$ and ${\lambda}_{2}$ represent the ratio of generated data and offline data during model training, and we require ${\lambda}_{1}+{\lambda}_{2}=1$. Here we simplify $P({\tau}_{0:T}{\tau}_{0:t})$ as $P({\tau}_{0:t})$. As a result, there are three sources of biases in this value estimation:
$$\mathrm{\Delta}={\widehat{r}}_{t}{r}_{t},\begin{array}{}\end{array}{\delta}_{1}=1{P}_{{\pi}_{{\mathrm{\Theta}}_{a}}}({\tau}_{0:t})/{P}_{g}({\tau}_{0:t}),\begin{array}{}\end{array}{\delta}_{2}=1{P}_{{\pi}_{{\mathrm{\Theta}}_{a}}}({\tau}_{0:t})/{P}_{data}({\tau}_{0:t}).$$ 
Based on different sources of biases, the expected value estimation in Eq (12) is:
${\mathbb{E}}_{\tau \sim {\pi}_{{}_{{\mathrm{\Theta}}_{a}}}}[{V}_{a}]=$  ${\lambda}_{1}{\displaystyle \sum _{t=0}^{T}}{\mathbb{E}}_{{\tau}_{0:t}\sim g}{\displaystyle \frac{{P}_{{\pi}_{{\mathrm{\Theta}}_{a}}}({\tau}_{0:t})}{{P}_{g}({\tau}_{0:t})}}{\displaystyle \frac{\mathrm{\Delta}+{r}_{t}}{2({\delta}_{1}+{\delta}_{2})}}+{\lambda}_{2}{\displaystyle \sum _{t=0}^{T}}{\mathbb{E}}_{{\tau}_{0:t}\sim data}\left({\displaystyle \frac{{P}_{{\pi}_{{\mathrm{\Theta}}_{a}}}({\tau}_{0:t})}{{P}_{data}({\tau}_{0:t})}}+{\delta}_{2}\right){r}_{t}$  
$=$  ${V}_{a}^{{\pi}_{{\mathrm{\Theta}}_{a}}}+{\displaystyle \sum _{t=0}^{T}}{\mathbb{E}}_{{\tau}_{0:t}\sim {\pi}_{{\mathrm{\Theta}}_{a}}}{w}_{t}\mathrm{\Delta}+{\displaystyle \sum _{t=0}^{T}}{\mathbb{E}}_{{\tau}_{0:t}\sim data}{\lambda}_{2}{\delta}_{2}{r}_{t}{\displaystyle \sum _{t=0}^{T}}{\mathbb{E}}_{{\tau}_{0:t}\sim {\pi}_{{\mathrm{\Theta}}_{a}}}({\lambda}_{1}{w}_{t}){r}_{t},$ 
where ${w}_{t}=\frac{{\lambda}_{1}}{2({\delta}_{1}+{\delta}_{2})}$. $\mathrm{\Delta}$ and ${\delta}_{1}$ come from the bias of user behavior model $\mathcal{U}$. Because the adversarial training helps to improve $\mathcal{U}$ to capture real data patterns, it decreases $\mathrm{\Delta}$ and ${\delta}_{2}$. Because we can adjust the sampling ratio ${\lambda}_{1}$ to reduce ${w}_{t}$, ${w}_{t}\mathrm{\Delta}$ can be small. The sequence generation rewards for agent $\mathcal{A}$ encourage distribution $g$ to be close to $data$. Because ${\delta}_{2}=1\frac{{P}_{{\pi}_{{\mathrm{\Theta}}_{a}}}({\tau}_{0:t})}{{P}_{g}({\tau}_{0:t})}\cdot \frac{{P}_{g}({\tau}_{0:t})}{{P}_{data}({\tau}_{0:t})}$, the bias ${\delta}_{2}$ can also be reduced. It shows our method has a bias controlling effect.
7 Experiments
In our theoretical analysis, we can find that reducing the model bias improves value estimation, and therefore improves policy learning. In this section, we conduct empirical evaluations on both realworld and synthetic datasets to demonstrate that our solution can effectively model the pattern of data for better recommendations, compared with stateoftheart baselines.
7.1 Simulated Online Test
Subject to the complexity and difficulty of deploying a recommender system with real users for online evaluation, we use simulationbased studies to first investigate the effectiveness of our approach following zhao2019model; pmlrv97chen19f.
Simulated Environment We synthesize an MDP to simulate an online recommendation environment. It has $m$ states and $n$ items for recommendation, with a randomly initialized transition probability matrix $P(s\in S{a}_{j}\in A,{s}_{i}\in S)$. Under each state ${s}_{i}$, an item ${a}_{j}$’s reward $r({a}_{j}\in A{s}_{i}\in S)$ is uniformly sampled from the range of 0 to 1. During the interaction, given a recommendation list including $k$ items selected from the whole item set by an agent, the simulator first samples an item proportional to its groundtruth reward under the current state ${s}_{i}$ as the click candidate. Denote the sampled item as ${a}_{j}$, a Bernoulli experiment is performed on this item with $r({a}_{j})$ as the success probability; then the simulator moves to the next state according to the state transition probability $p(s{a}_{j},{s}_{i})$. The special state ${s}_{0}$ is used to initialize all the sessions, which do not stop until the Bernoulli experiment fails. The immediate reward is 1 if the session continues to the next step; otherwise 0. In our experiment, $m,n$ and $k$ are set to 10, 50 and 10 respectively.
Offline Data Generation We generate offline user logs denoted by ${d}_{\text{off}}$ with the simulator. We especially control the bias and variance in ${d}_{\text{off}}$ by changing the logging policy and the size of ${d}_{\text{off}}$ to compare the performance of different models. We adopt three different logging policies: 1) uniformly random policy ${\pi}_{\text{random}}$, 2) maximum reward policy ${\pi}_{\text{max}}$, and 3) mixed reward policy ${\pi}_{\text{mix}}$. Specifically, ${\pi}_{\text{max}}$ recommends the top $k$ items with the highest groundtruth reward under the current simulator state at each step, while ${\pi}_{\text{mix}}$ randomly selects $k$ items with either the top 20%50% groundtruth reward or the highest groundtruth reward under a given state. In the meanwhile, we vary the size of ${d}_{\text{off}}$ from 200 to 10,000.
Baselines We compared our IRecGAN with the following baselines: 1). LSTM: only the user behavior model trained on offline data; 2). PG: only the agent model trained by policy gradient on offline data; 3). LSTMD: the user behavior model in IRecGAN, updated by adversarial training.
Experiment Settings The hyperparameters in all models are set as follows: the item embedding dimension is set to 50, the discount factor $\gamma $ in value calculation is set to 0.9, the scale factors ${\lambda}_{r}$ and ${\lambda}_{p}$ are set to 3 and 1. We apply 2layer LSTM units with 512dimension hidden states. The ratio of generated training samples and offline data for each training epoch is set to 1:10. We use an RNN based discriminator in all experiments with details provided in the appendix.
Online Evaluation After we finish model and baselines training on ${d}_{\text{off}}$, we deploy the learnt policy to interact with the simulator for online evaluation. We calculated [email protected] to measure the average proportion of the top $r$ relevant items in groundtruth that are actually recommended by an agent (i.e., in its top $k$ recommendations) across all the time steps. The results of [email protected] under different configurations of offline data generation are reported in Figure 2. LSTM outperforms policy gradient under ${\pi}_{\text{random}}$, because under this logging policy every item has an equal chance to be observed (i.e., full exploration), LSTM better recognizes each item’s true reward via maximum likelihood estimation. A similar comparison is observed between our user model and agent model: LSTMD can capture the true reward even with much fewer data. Under ${\pi}_{\text{max}}$ and ${\pi}_{\text{mix}}$, it is easy for all models to recognize items with the highest reward. But the low [email protected] suggests that they fail to capture the overall preference as the items with lower reward are less likely to be clicked. This becomes more serious under ${\pi}_{\text{mix}}$ that requires a model to differentiate top relevant items from those with moderate reward. By generating more training data via adversarial training, our model performed better than baselines.
After offline training, the average cumulative rewards for all methods are also evaluated and reported in the rightmost bars of Figure 2. The cumulative rewards are calculated by generating 1000 sequences with the environment and take average of their cumulative rewards. IRecGAN has a larger average cumulative reward than other methods under all configurations except ${\pi}_{\text{random}}$ with 200 offline sequences. However, on ${\pi}_{\text{random}}(200)$ it has less variance than other methods, which indicates the robustness of IRecGAN when the volume of offline data is limited.
Online Learning To evaluate our model’s effectiveness in reducing the bias of value estimation in a more practical setting, we execute online and offline learning alternately. Specifically, we separate the learning into two stages: first, the agents can directly interact with the simulator to update their policies, and we only allow them to generate 200 sequences in this stage; then they turn to the offline stage to reuse their generated data for offline learning. We iterate these two stages and record their recommendation performance during the online learning stage. We compare with the following baselines: 1) PGonline with only online learning, 2) PGonline&offline with online learning and reusing the generated data via policy gradient for offline learning, and 3) LSTMoffline with only offline learning. We train all the models from scratch and report the performance of [email protected] and [email protected] over 20 iterations in Figure 3. We can observe that LSTMoffline performs worse than all RL methods, especially in the later stage, due to its lack of exploration. PGonline improves slowly with a high variance, as it does not reuse the generated data. Compared with PGonline&offline, IRecGAN has better convergence and coverage because of its reduced value estimation bias. We also find that [email protected] is harder to improve. The key reason is that as the model identifies the items with high rewards, it tends to recommend them more often; this gives those less relevant items less chance to be explored, which is in a similar situation to our online evaluation experiments under ${\pi}_{\text{max}}$ and ${\pi}_{\text{mix}}$. Our modelbased RL training alleviates this bias to a certain extend by generating more training sequences, but it cannot totally alleviate it. This reminds us to focus on exploreexploit tradeoff in modelbased RL in our future work.
7.2 Realworld Offline Test
We also use a largescale realworld recommendation dataset from CIKM Cup 2016 to evaluate the effectiveness of our proposed solution for offline reranking. We filtered out sessions of length 1 or longer than 40 and items that have never been clicked. We selected the top 40,000 most popular items to construct our recommendation candidate set. We randomly selected 65,284/1,718/1,820 sessions for training/validation/testing purposes, where the average length of sessions is 2.81/2.80/2.77 respectively. The percentage of recorded recommendations that lead to a purchase is 2.31%/2.46%/2.45%. We followed the same setting as in our simulationbased study in this experiment.
Baselines In addition to the baselines we compared in our simulation based study, we also include the following stateoftheart solutions for recommendation: 1). PGIS: the agent model estimated with importance sampling on offline data to reduce bias; 2). AC: an LSTM model whose setting is the same as our agent model but trained with actorcritic algorithm lillicrap2015continuous to reduce variance; 3). PGU: the agent model trained using offline and generated data, without adversarial training; 4). ACU: AC model trained with both offline and generated data, without adversarial training.
Evaluation Metrics All the models were applied to rerank the given recommendation list at each step of testing sessions in offline data. We used [email protected] ([email protected] and [email protected]) to compare different models’ recommendation performance, where we define the clicked items as relevant. In addition, because the logged recommendation list was not ordered, we cannot assess the original logging policy’s performance in the offline data.
Model  LSTM  LSTMD  PG  PGIS  AC  PGU  ACU  IRecGAN 

[email protected] (%)  32.89$\pm $0.50  33.42$\pm $0.40  33.28$\pm $0.71  28.13$\pm $0.45  31.93$\pm $0.17  34.12$\pm $0.52  32.43$\pm $0.22  35.06$\mathrm{\pm}$0.48 
[email protected] (%)  8.20$\pm $0.65  8.55$\mathrm{\pm}$0.63  6.25$\pm $0.14  4.61$\pm $0.73  6.54$\pm $0.19  6.44$\pm $0.56  6.63$\pm $ 0.29  6.79$\pm $0.44 
Results
The results of the offline rerank evaluation are reported in Table 1. With the help of adversarial training, IRecGAN achieved encouraging improvement against all baselines. This verifies the effectiveness of our modelbased reinforcement learning, especially its adversarial training strategy for utilizing the offline data with reduced bias. Specially, we compare the results of PG, PGIS, PGU, and IRecGAN. PGIS did not perform as well as PG partially because of the high variance introduced by importance sampling. PGU was able to fit the given data more accurately than PG since there are many items for recommendation and the collected data is limited. However, PGU performed worse than IRecGAN because of the biased user behavior model. And with the help of the discriminator, IRecGAN reduces the bias in the user behavior model to improve value estimation and better policy learning. This is also reflected on its improved user behavior model: LSTMD outperformed LSTM, given both of them are for user behavior modeling.
8 Conclusion
In this work, we developed a practical solution for utilizing offline data to build a modelbased reinforcement learning agent for recommendation, with reduced model bias. We introduce adversarial training for joint user behavior model learning and policy update. Our theoretical analysis shows our solution’s promise in reducing bias; our empirical evaluations in both synthetic and realworld recommendation datasets verify the effectiveness of our solution. Several directions left open in our work, including balancing exploreexploit in policy learning with offline data, incorporating richer structures in user behavior modeling, and exploring the applicability of our solution in other offpolicy learning scenarios, such as conversational systems.
References
Appendix
1 Details of Discriminator Model
We adopt an RNNbased discriminator $\mathcal{D}$ for our IRecGAN framework, and model its hidden states by ${\mathbf{s}}_{t}^{d}={h}_{d}({\mathbf{s}}_{t1}^{d},{\mathbf{e}}_{t1}^{d})$, where ${\mathbf{s}}_{t}^{d}$ denotes the hidden states maintained by the discriminator at time $t$ and ${\mathbf{e}}_{t1}^{d}$ is the embeddings used in the discriminator side. And we add a multilayer perceptron which takes as input the hidden states to compute a score through a Sigmoid layer indicating whether the trajectory looks like being generated by real users as follows:
$D({\tau}_{0:T})$  $=\text{Sigmoid}\left[{\displaystyle \frac{1}{T}}{\displaystyle \sum _{t=0}^{T}}{\mathbf{e}}^{d}{({c}_{{a}_{t}}^{\mathrm{max}})}^{\top}{\mathbf{e}}^{d}({c}_{t})({\text{W}}^{p}{r}_{t}+{\text{b}}^{p})\right]$  
${c}_{{a}_{t}}^{\mathrm{max}}$  $={\text{argmax}}_{c\in {a}_{t}}{({\mathbf{W}}^{d}{\mathbf{s}}_{t}^{d}+{\mathbf{b}}^{d})}^{\top}\mathbf{e}(c)$ 
where ${c}_{{a}_{t}}^{\mathrm{max}}$ can be seen as the user’s favorite item of the given ${a}_{t}$, and should be as close to ${c}_{t}$ as possible for real user. To ensure the gradient backpropagation, we use Softmax with a temperature 0.1 to approximate the argmax function. Other hyper parameters are set the same with the experiment setting depicted in section 7.1. The optimization target of $\mathcal{D}$ is formulated as in Eq (5).
2 Algorithm