Abstract
Modelbased reinforcement learning (MBRL) aims to learn a dynamic model toreduce the number of interactions with realworld environments. However, due toestimation error, rollouts in the learned model, especially those of longhorizon, fail to match the ones in realworld environments. This mismatchinghas seriously impacted the sample complexity of MBRL. The phenomenon can beattributed to the fact that previous works employ supervised learning to learnthe onestep transition models, which has inherent difficulty ensuring thematching of distributions from multistep rollouts. Based on the claim, wepropose to learn the synthesized model by matching the distributions ofmultistep rollouts sampled from the synthesized model and the real ones viaWGAN. We theoretically show that matching the two can minimize the differenceof cumulative rewards between the real transition and the learned one. Ourexperiments also show that the proposed model imitation method outperforms thestateoftheart in terms of sample complexity and average return.
Quick Read (beta)
Model Imitation for ModelBased Reinforcement Learning
${}^{2}$ University of California San Diego
${}^{3}$ Princeton University
Abstract
Modelbased reinforcement learning (MBRL) aims to learn a dynamic model to reduce the number of interactions with realworld environments. However, due to estimation error, rollouts in the learned model, especially those of long horizon, fail to match the ones in realworld environments. This mismatching has seriously impacted the sample complexity of MBRL. The phenomenon can be attributed to the fact that previous works employ supervised learning to learn the onestep transition models, which has inherent difficulty ensuring the matching of distributions from multistep rollouts. Based on the claim, we propose to learn the synthesized model by matching the distributions of multistep rollouts sampled from the synthesized model and the real ones via WGAN. We theoretically show that matching the two can minimize the difference of cumulative rewards between the real transition and the learned one. Our experiments also show that the proposed model imitation method outperforms the stateoftheart in terms of sample complexity and average return.
C¿c¡ \newcolumntypeL¿l¡
1 Introduction
Reinforcement learning (RL) has become of great interest because plenty of realworld problems can be modeled as a sequential decisionmaking problem. Modelfree reinforcement learning (MFRL) is favored by its capability of learning complex tasks when interactions with environments are cheap. However, in the majority of realworld problems, such as autonomous driving, interactions are extremely costly, thus MFRL becomes infeasible. One critique about MFRL is that it does not fully exploit past queries over the environment, and this motivates us to consider the modelbased reinforcement learning (MBRL). In addition to learning an agent policy, MBRL also uses the queries to learn the dynamic of the environment that our agent is interacting with. If the learned dynamic is accurate enough, the agent can acquire the desired skill by simply interacting with the simulated environment, so that the number of samples to collect in the real world can be greatly reduced. As a result, MBRL has become one of the possible solutions to reduce the number of samples required to learn an optimal policy.
Most previous works of MBRL adopt supervised learning with ${\mathrm{\ell}}_{2}$based errors [Luo et al., 2019; Kurutach et al., 2018; Clavera et al., 2018] or maximum likelihood [Janner et al., 2019], to obtain an environment model that synthesizes real transitions. These nontrivial developments imply that optimizing a policy on a synthesized environment is a challenging task. Because the estimation error of model accumulates as the trajectory grows, it is hard to train a policy on a long synthesized trajectory. On the other hand, training on short trajectories makes the policy shortsighted. This issue is known as the planning horizon dilemma [Langlois et al., 2019]. As a result, despite having a strong intuition at first sight, MBRL has to be designed meticulously.
Intuitively, we would like to learn a transition model in a way that it can reproduce the trajectories that have been generated in the real world. Since the attained trajectories are sampled according to a certain policy, directly employing supervised learning may not necessarily lead to the mentioned result especially when the policy is stochastic. The resemblance in trajectories matters because we estimate policy gradient by generating rollouts; however, the onestep model learning adopted by many MBRL methods do not guarantee this. Some previous works propose multistep training [Luo et al., 2019]; however, experiments show that model learning fails to benefit much from the multistep loss. We attribute this outcome to the essence of supervised learning, which elementally preserves only onestep transition and the similarity between real trajectories and the synthesized ones cannot be guaranteed.
In this work, we propose to learn the transition model via distribution matching. Specifically, we use WGAN [Arjovsky et al., 2017] to match the distributions of stateactionnextstate triple $(s,a,{s}^{\prime})$ in real/learned models so that the agent policy can generate similar trajectories when interacting with either the true transition or the learned transition. Figure 1 illustrates the difference between methods based on supervised learning and distribution matching. Different from the ensemble methods proposed in previous works, our method is capable of generalizing to unseen transitions with only one dynamic model because merely incorporating multiple models does not alter the essence that onestep (or fewstep) supervised learning fails to imitate the distribution of multistep rollouts.
Concretely, we gather some transitions in the real world according to a policy. To learn the real transition, we then sample fake transitions from our synthesized model with the same policy. The synthesized model serves as the generator in the WGAN framework and there is a critic that discriminates the two transition data. We update the generator and the critic alternatively until the synthesized data cannot be distinguished from the real one, which we will show later that it gives $T\to {T}^{\prime}$ theoretically.
Our contributions are summarized below:

•
We propose an MBRL method called model imitation (MI), which enforces the learned transition model to generate similar rollouts to the real one so that policy gradient is accurate;

•
We theoretically show that the transition can be learned by MI in the sense that $T\to {T}^{\prime}$ by consistency and the difference in cumulative rewards $R(T)R({T}^{\prime})$ is small;

•
To stabilize model learning, we deduce guarantee for our sampling technique and investigate training across WGANs;

•
We experimentally show that MI is more sample efficient than stateoftheart MBRL and MFRL methods and outperforms them on four standard tasks.
2 Related work
In this section, we introduce our motivation inspired by learning from demonstration (LfD) [Schaal, 1997] and give a brief survey of MBRL methods.
2.1 Learning from Demonstration
A straightforward approach to LfD is to leverage behavior cloning (BC), which reduces LfD to a supervised learning problem. Even though learning a policy via BC is timeefficient, it cannot imitate a policy without sufficient demonstration because the error may accumulate without the guidance of expert [Ross et al., 2011]. Generative Adversarial Imitation Learning (GAIL) [Ho and Ermon, 2016] is another stateoftheart IfD method that learns an optimal policy by utilizing generative adversarial training to match occupancy measure [Syed et al., 2008b]. GAIL learns an optimal policy by matching the distribution of the trajectories generated from an agent policy with the distribution of the given demonstration. Ho and Ermon [2016] shows that the two distributions match if and only if the agent has learned the optimal policy. One of the advantages of GAIL is that it only requires a small amount of demonstration data to obtain an optimal policy but it requires a considerable number of interactions with environments for the generative adversarial training to converge.
Our intuition is that we analogize transition learning (TL) to learning from demonstration (LfD). In LfD, trajectories sampled from a fixed transition are given, and the goal is to learn a policy. On the other hand, in TL, trajectories sampled from a fixed policy are given, and we would like to imitate the underlying transition. That being said, from LfD to TL, we interchange the roles of the policy and the transition. It is therefore tempting to study the counterpart of GAIL in TL; i.e., learning the transition by distribution matching. Fortunately, by doing so, the pros of GAIL remain while the cons are insubstantial in MBRL because sampling with the learned model is considered to be much cheaper than sampling in the real one. That GAIL learns a better policy than what BC does suggests that distribution matching possess the potential to learn a better transition than supervised learning.
2.2 ModelBased Reinforcement Learning
For deterministic transition, it is usually optimized with ${\mathrm{\ell}}_{2}$based error. Nagabandi et al. [2018], an approach that uses supervised learning with meansquared error as its objective, is shown to perform well under finetuning. To alleviate model bias, some previous works adopt ensembles [Kurutach et al., 2018; Buckman et al., 2018], where multiple transition models with different initialization are trained at the same time. In a slightly more complicated manner, Clavera et al. [2018] utilizes metalearning to gather information from multiple models. Lastly, on the theoretical side, SLBO [Luo et al., 2019] is the first algorithm that develops from solid theoretical properties for modelbased deep RL via a joint modelpolicy optimization framework.
For the stochastic transition, maximum likelihood estimator or moment matching are natural ways to learn a synthesized transition, which is usually modeled by the Gaussian distribution. Following this idea, Gaussian process [Kupcsik et al., 2013; Deisenroth et al., 2015] and Gaussian process with model predictive control [Kamthe and Deisenroth, 2017] are introduced as an uncertaintyaware version of MBRL. Similar to the deterministic case, to mitigate model bias and foster stability, an ensemble method for probabilistic networks [Chua et al., 2018] is also studied. An important distinction between training a deterministic or stochastic transition is that although the stochastic transition can model the noise hidden within the real world, the stochastic model may also induce instability if the true transition is deterministic. This is a potential reason why an ensemble of models is adopted to reduce variance.
3 Background
3.1 Reinforcement Learning
We consider the standard Markov Decision Process (MDP) [Sutton and Barto, 1998]. MDP is represented by a tuple $\u27e8S,\mathcal{A},T,r,\gamma \u27e9$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $T({s}_{t+1}{s}_{t},{a}_{t})$ is the transition density of state ${s}_{t+1}$ at time step $t+1$ given action ${a}_{t}$ made under state ${s}_{t}$, $r(s,a)$ is the reward function, and $\gamma \in (0,1)$ is the discount factor.
A stochastic policy $\pi (as)$ is a density of action $a$ given state $s$. Let the initial state distribution be $\alpha $. The performance of the triple $(\alpha ,\pi ,T)$ is evaluated in the expectation of the cumulative reward in the $\gamma $discounted infinite horizon setting:
$R(\alpha ,\pi ,T)=\mathbb{E}[{\displaystyle \sum _{t=0}^{\mathrm{\infty}}}{\gamma}^{t}r({s}_{t},{a}_{t})\alpha ,\pi ,T]=\mathbb{E}[{\displaystyle \sum _{t=0}^{H1}}r({s}_{t},{a}_{t})\alpha ,\pi ,T].$  (1) 
Equivalently, $R(\alpha ,\pi ,T)$ is the expected cumulative rewards in a length$H$ trajectory ${\{{s}_{t},{a}_{t}\}}_{t=0}^{H1}$ generated by $(\alpha ,\pi ,T)$ with $H\sim \text{Geometric}(1\gamma )$. When $\alpha $ and $T$ are fixed, $R(\cdot )$ becomes a function that only depends on $\pi $, and reinforcement learning algorithms [Sutton and Barto, 1998] aim to find a policy $\pi $ to maximize $R(\pi )$.
3.2 Occupancy Measure
Given initial state distribution $\alpha (s)$, policy $\pi (as)$ and transition $T({s}^{\prime}s,a)$, the normalized occupancy measure ${\rho}_{T}^{\alpha ,\pi}(s,a)$ generated by $(\alpha ,\pi ,T)$ is defined as
$${\rho}_{T}^{\alpha ,\pi}(s,a)=\sum _{t=o}^{\mathrm{\infty}}(1\gamma ){\gamma}^{t}\mathbb{P}({s}_{t}=s,{a}_{t}=a\alpha ,\pi ,T)=(1\gamma )\sum _{t=0}^{H1}\mathbb{P}({s}_{t}=s,{a}_{t}=a\alpha ,\pi ,T),$$  (2) 
where $\mathbb{P}(\cdot )$ is the probability measure and will be replaced by a density function if $\mathcal{S}$ or $\mathcal{A}$ is continuous. Intuitively, ${\rho}_{T}^{\alpha ,\pi}(s,a)$ is a distribution of $(s,a)$ in a length$H$ trajectory ${\{{s}_{t},{a}_{t}\}}_{t=0}^{H1}$ with $H\sim \text{Geometric}(1\gamma )$ following the laws of $(\alpha ,\pi ,T)$. From Syed et al. [2008a], the relation between ${\rho}_{T}^{\alpha ,\pi}$ and $(\alpha ,\pi ,T)$ is characterized by the Bellman flow constraint. Specifically, $x={\rho}_{T}^{\alpha ,\pi}$ as defined in Eq. 2 is the unique solution to:
$$x(s,a)={\pi}_{\theta}(as)\left[(1\gamma )\alpha (s)+\gamma \int x({s}^{\prime},{a}^{\prime})T(s{s}^{\prime},{a}^{\prime})\mathit{d}{s}^{\prime}\mathit{d}{a}^{\prime}\right],x(s,a)\ge 0.$$  (3) 
In addition, Theorem 2 of Syed et al. [2008a] gives that $\pi (as)$ and ${\rho}_{T}^{\alpha ,\pi}(s,a)$ have an onetoone correspondence with $\alpha (s)$ and $T({s}^{\prime}s,a)$ fixed; i.e., $\pi (as)\triangleq \frac{\rho (s,a)}{{\scriptscriptstyle \int \rho (s,a)\mathit{d}a}}$ is the only policy whose occupancy measure is $\rho $.
With the occupancy measure, the cumulative reward Eq. 1 can be represented as
$$R(\alpha ,\pi ,T)={\mathbb{E}}_{(s,a)\sim {\rho}_{T}^{\alpha ,\pi}}[r(s,a)]/(1\gamma ).$$  (4) 
The goal of maximizing the cumulative reward can then be achieved by adjusting ${\rho}_{T}^{\alpha ,\pi}$, and this motivates us to adopt distribution matching approaches like WGAN [Arjovsky et al., 2017] to learn a transition model.
4 Theoretical Analysis for WGAN
In this section, we present a consistency result and error bounds for WGAN [Arjovsky et al., 2017]. All proofs of the following theorems and lemmas can be found in Appendix A.
In the setting of MBRL, the training objective for WGAN is
$$\underset{{T}^{\prime}}{\mathrm{min}}\underset{{\parallel f\parallel}_{L}\le 1}{\mathrm{max}}{\mathbb{E}}_{(s,a)\sim {\rho}_{T},{s}^{\prime}\sim T(\cdot s,a)}[f(s,a,{s}^{\prime})]{\mathbb{E}}_{(s,a)\sim {\rho}_{{T}^{\prime}},{s}^{\prime}\sim {T}^{\prime}(\cdot s,a)}[f(s,a,{s}^{\prime})].$$  (5) 
By KantorovichRubinstein duality [Villani, 2008], the optimal value of the inner maximization is exactly ${W}_{1}(p(s,a,{s}^{\prime}){p}^{\prime}(s,a,{s}^{\prime}))$ where $p(s,a,{s}^{\prime})={\rho}_{T}(s,a)T({s}^{\prime}s,a)$ is the discounted distribution of $(s,a,{s}^{\prime})$. Thus, by minimizing over the choice of ${T}^{\prime}$, we are essentially finding ${p}^{\prime}$ that minimizes ${W}_{1}(p(s,a,{s}^{\prime}){p}^{\prime}(s,a,{s}^{\prime}))$, which gives the consistency result.
Proposition 4.1 (Consistency for WGAN).
Let $T$ and ${T}^{\mathrm{\prime}}$ be the true and synthesized transitions respectively. If WGAN is trained to its optimal point, we have
$$T({s}^{\prime}s,a)={T}^{\prime}({s}^{\prime}s,a),\forall (s,a)\in \text{\mathit{S}\mathit{u}\mathit{p}\mathit{p}}({\rho}_{T}),$$ 
where $\text{\mathit{S}\mathit{u}\mathit{p}\mathit{p}}\mathit{}\mathrm{(}{\rho}_{T}\mathrm{)}$ is the support of ${\rho}_{T}$.
The support constraint is inevitable because the training data is sampled from ${\rho}_{T}$ and guaranteeing anything beyond it can be difficult. Still, we will empirically show that the support constraint is not an issue in our experiments because the performance boosts up in the beginning, indicating that $\text{Supp}({\rho}_{T})$ may be large enough initially.
Now that training with WGAN gives a consistent estimate of the true transition, it is sensible to train a synthesized transition upon it. However, the consistency result is too restrictive as it only discusses the optimal case. The next step is to analyze the nonoptimal situation and observe how the cumulative reward deviates w.r.t. the training error.
Theorem 4.2 (Error Bound for WGAN).
Let ${\rho}_{T}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{,}{\rho}_{{T}^{\mathrm{\prime}}}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}$ be the normalized occupancy measures generated by the true transition $T$ and the synthesized one ${T}^{\mathrm{\prime}}$. If the reward function is ${L}_{r}$Lipschitz and the training error of WGAN is $\u03f5$, we have $\mathrm{}R\mathit{}\mathrm{(}T\mathrm{)}\mathrm{}R\mathit{}\mathrm{(}{T}^{\mathrm{\prime}}\mathrm{)}\mathrm{}\mathrm{\le}\u03f5\mathit{}{L}_{r}\mathrm{/}\mathrm{(}\mathrm{1}\mathrm{}\gamma \mathrm{)}$.
Theorem 4.2 indicates that if WGAN is trained properly, i.e., having small $\u03f5$, the cumulative reward on the synthesized trajectory will be close to that on the true trajectory. As MBRL aims to train a policy on the synthesized trajectory, the accuracy of the cumulative reward over the synthesized trajectory is thus the bottleneck. Theorem 4.2 also implies that WGAN’s error is linear to the (expected) length of the trajectory ${(1\gamma )}^{1}$. This is a sharp contrast to the error bounds in most RL literature, as the dependency on the trajectory length is usually quadratic [Syed and Schapire, 2010; Ross et al., 2011], or of even higher order. Since WGAN gives us a better estimation of the cumulative reward in the learned model, the policy update becomes more accurate.
5 Model Imitation for ModelBased Reinforcement Learning
In this section, we present a practical MBRL method called model imitation (MI) that incorporates the transition learning mentioned in Section 4.
5.1 Sampling Technique for Transition Learning
Due to the longterm digression, it is hard to train the WGAN directly from a long synthesized trajectory. To tackle this issue, we use the synthesized transition ${T}^{\prime}$ to sample $N$ short trajectories with initial states sampled from the true trajectory.
To analyze this sampling technique, let $$ be the discount factor of the short trajectories so that the expected length is $\mathbb{E}[L]={(1\beta )}^{1}$. Let ${\rho}_{{T}^{\prime}}^{\beta}$, ${\widehat{\rho}}_{T}^{\beta}$, ${\rho}_{T}^{\beta}$, ${\rho}_{T}$ be the normalized occupancy measures of synthesized short trajectories, empirical true short trajectories, true short trajectories and the true long trajectories. The 1Wasserstein distance can be bounded by
$${W}_{1}({\rho}_{{T}^{\prime}}^{\beta}{\rho}_{T})\le {W}_{1}({\rho}_{{T}^{\prime}}^{\beta}{\widehat{\rho}}_{T}^{\beta})+{W}_{1}({\widehat{\rho}}_{T}^{\beta}{\rho}_{T}^{\beta})+{W}_{1}({\rho}_{T}^{\beta}{\rho}_{T}).$$ 
${W}_{1}({\rho}_{{T}^{\prime}}^{\beta}{\widehat{\rho}}_{T}^{\beta})$ is upper bounded by the training error of WGAN on short trajectories, which can be small empirically because the short ones are easier to imitate. ${W}_{1}({\widehat{\rho}}_{T}^{\beta}{\rho}_{T}^{\beta})={\mathbb{E}}_{L}[O({(NL)}^{1/d})]=O({((1\beta )/N)}^{1/d}/\beta )$ by Canas and Rosasco [2012] and Lemma A.3, where $d$ is the dimension of $(s,a)$. ${W}_{1}({\rho}_{T}^{\beta}{\rho}_{T})\le \text{diam}(\mathcal{S}\times \mathcal{A})(1\gamma )\beta /(\gamma \beta )$ by Lemma A.4 and ${W}_{1}\le {D}_{TV}\text{diam}(\mathcal{S}\times \mathcal{A})$ [Gibbs and Su, 2002], where $\text{diam}(\cdot )$ is the diameter. The second term encourages $\beta $ to be large while the third term does the opposite. Besides, $\beta $ need not be large if $N$ is large enough; in practice we may sample $N$ short trajectories to reduce the error from ${W}_{1}({\rho}_{{T}^{\prime}}{\rho}_{T})$ to ${W}_{1}({\rho}_{{T}^{\prime}}^{\beta}{\rho}_{T})$. Finally, since ${\rho}_{{T}^{\prime}}^{\beta}$ is the occupancy measure we train on, from the proof of Theorem 4.2 we deduce that
$$R(T)R({T}^{\prime})\le {W}_{1}({\rho}_{{T}^{\prime}}^{\beta}{\rho}_{T}){L}_{r}/(1\gamma ).$$ 
Thus, WGAN may perform better under this sampling technique.
5.2 Empirical Transition Learning
To learn the real transition based on the occupancy measure matching mentioned in Section 4, we employ a transition learning scheme by aligning the distribution of $(s,a,{s}^{\prime})$ between the real and the learned environments. Inspired by how GAIL [Ho and Ermon, 2016] learns to align $(s,a)$ via solving an MDP with rewards extracted from a discriminator, we formulate an MDP with rewards from a discriminator over $(s,a,{s}^{\prime})$. Specifically, the WGAN critic $f(s,a,{s}^{\prime})$ in Eq. 5 is used as the (psuedo) rewards $r(s,a,{s}^{\prime})$ of our MDP. Interestingly, there is a duality between GAIL and our transition learning: for GAIL, the transition is fixed and the objective is to train a policy to maximize the cumulative pseudo rewards, while for our transition learning, the policy is fixed and the objective is to train a synthesized transition to maximize the cumulative pseudo rewards.
In practice, since the policy is updated alternatively with the synthesized model, we are required to train a number of WGANs along with the change of the policy. Although the generators across WGANs correspond to the same transition and can be similar, we observe that WGAN may get stuck at a local optimum when we switch from one WGAN training to another. The reason is that, unlike GAN that mimics the JensenShannon divergence and hence its inner maximization is upper bounded by $\mathrm{log}(2)$, WGAN mimics the Wasserstein distance and the inner maximization is unbounded from above. Intuitively, such unboundedness makes the WGAN critic so strong that the WGAN generator (the synthesized transition) cannot find a way out and gets stuck at a local optimum. Thereby, we have to modify the WGAN objective to alleviate such situation. To ensure the boundedness, for a fixed $\delta >0$, we introduce cutoffs at the WGAN objective so that the inner maximization is upper bounded by $2\delta $:
$$\underset{{T}^{\prime}}{\mathrm{min}}\underset{{\parallel f\parallel}_{L}\le 1}{\mathrm{max}}{\mathbb{E}}_{\begin{array}{c}(s,a)\sim {\rho}_{T}\\ {s}^{\prime}\sim T(\cdot s,a)\end{array}}[\mathrm{min}(\delta ,f(s,a,{s}^{\prime})]+{\mathbb{E}}_{\begin{array}{c}(s,a)\sim {\rho}_{{T}^{\prime}}\\ {s}^{\prime}\sim {T}^{\prime}(\cdot s,a)\end{array}}[\mathrm{min}(\delta ,f(s,a,{s}^{\prime}))].$$  (6) 
As $\delta \to \mathrm{\infty}$, Eq. 6 recovers the WGAN objective, Eq. 5. Therefore, this is a truncated version of WGAN. To comprehend Eq. 6 further, notice that it is equivalent to
$$\begin{array}{cc}& \underset{{T}^{\prime}}{\mathrm{min}}\underset{{\parallel f\parallel}_{L}\le 1}{\mathrm{max}}{\mathbb{E}}_{\begin{array}{c}(s,a)\sim {\rho}_{T}\\ {s}^{\prime}\sim T(\cdot s,a)\end{array}}[\mathrm{min}(0,f(s,a,{s}^{\prime})\delta )]+{\mathbb{E}}_{\begin{array}{c}(s,a)\sim {\rho}_{{T}^{\prime}}\\ {s}^{\prime}\sim {T}^{\prime}(\cdot s,a)\end{array}}[\mathrm{min}(0,f(s,a,{s}^{\prime})\delta )]\hfill \\ \hfill \iff & \underset{{T}^{\prime}}{\mathrm{min}}\underset{{\parallel f\parallel}_{L}\le 1}{\mathrm{min}}{\mathbb{E}}_{\begin{array}{c}(s,a)\sim {\rho}_{T}\\ {s}^{\prime}\sim T(\cdot s,a)\end{array}}[\mathrm{max}(0,\delta f(s,a,{s}^{\prime}))]+{\mathbb{E}}_{\begin{array}{c}(s,a)\sim {\rho}_{{T}^{\prime}}\\ {s}^{\prime}\sim {T}^{\prime}(\cdot s,a)\end{array}}[\mathrm{max}(0,\delta +f(s,a,{s}^{\prime}))],\hfill \end{array}$$  (7) 
which is a hinge loss version of the generative adversarial objective. Such WGAN is introduced in Lim and Ye [2017], where the consistency result is provided and further experiments are evaluated in Zhang et al. [2018]. According to Lim and Ye [2017], the inner minimization can be interpreted as the softmargin SVM. Consequently, it provides a geometric intuition of maximizing margin, which potentially enhances robustness. Finally, because the objective of transition learning is to maximize the cumulative pseudo rewards on the MDP, ${T}^{\prime}$ does not directly optimize Eq. 7. Note that the truncation only takes part in the inner minimization:
$$\underset{{\parallel f\parallel}_{L}\le 1}{\mathrm{min}}{\mathbb{E}}_{\begin{array}{c}(s,a)\sim {\rho}_{T}\\ {s}^{\prime}\sim T(\cdot s,a)\end{array}}[\mathrm{max}(0,\delta f(s,a,{s}^{\prime}))]+{\mathbb{E}}_{\begin{array}{c}(s,a)\sim {\rho}_{{T}^{\prime}}\\ {s}^{\prime}\sim {T}^{\prime}(\cdot s,a)\end{array}}[\mathrm{max}(0,\delta +f(s,a,{s}^{\prime}))],$$  (8) 
which gives us a WGAN critic $f(s,a,{s}^{\prime})$. As mentioned, $f$ will be the pseudo reward function. Later, we will introduce a transition learning version of PPO [Schulman et al., 2017] to optimize the cumulative pseudo reward.
{algorithmic}[1] \StateInitialize policy ${\pi}_{\theta}$, transition model ${T}_{\varphi}$, WGAN critic ${f}_{w}$, environment dataset ${\mathcal{D}}_{\text{env}}$ \For$i=0,1,2,\mathrm{\dots}$ \StateTake actions in real environment according to ${\pi}_{\theta}$; ${\mathcal{D}}_{\text{env}}\leftarrow {\mathcal{D}}_{\text{env}}\cup {\mathcal{D}}_{i}$ \StatePretrain ${T}_{\varphi}$ and ${f}_{w}$ by optimizing Eq. 8 and 11 with ${\mathcal{D}}_{i}$ and ${\mathcal{D}}_{\text{env}}$ \For$N$ epochs \For${n}_{\text{transition}}$ epochs \Stateoptimize Eq. 8 and 11 over $\varphi $ and $w$ with ${\mathcal{D}}_{i}$ \EndFor\For${n}_{\text{policy}}$ epochs \Stateupdate ${\pi}_{\theta}$ by TRPO on the data generated by ${T}_{\varphi}$ \EndFor\EndFor\EndFor
After modifying the WGAN objective, to include both the stochastic and (approximately) deterministic scenarios, the synthesized transition is modeled by a Gaussian distribution ${T}^{\prime}({s}^{\prime}s,a)={T}_{\varphi}({s}^{\prime}s,a)\sim \mathcal{N}({\mu}_{\varphi}(s,a),{\mathrm{\Sigma}}_{\varphi}(s,a))$. Although the underlying transitions of tasks like MuJoCo [Todorov et al., 2012] are deterministic, modeling by a Gaussian does not harm the transition learning empirically.
Recall that the synthesized transition is trained on an MDP whose reward function is the critic of the truncated WGAN. To achieve this goal with proper stability, we employ PPO [Schulman et al., 2017], which is an efficient approximation of TRPO [Schulman et al., 2015]. Note that although the PPO is originally designed for policy optimization, it can be adapted to transition learning with a fixed sampling policy and the PPO objective (Eq. 7 of Schulman et al. [2017])
$${\mathcal{L}}_{\text{PPO}}(\varphi )={\widehat{\mathbb{E}}}_{t}\left[\mathrm{min}({r}_{t}(\varphi ){\widehat{A}}_{t},\text{clip}({r}_{t}(\varphi ),1\u03f5,1+\u03f5){\widehat{A}}_{t})\right],$$  (9) 
where
$${r}_{t}(\varphi )=\frac{{T}_{\varphi}({s}_{t+1}{s}_{t},{a}_{t})}{{T}_{{\varphi}_{\text{old}}}({s}_{t+1}{s}_{t},{a}_{t})},{\widehat{A}}_{t}:\text{advantage func. derived from the pseudo reward}f({s}_{t},{a}_{t},{s}_{t+1}).$$  (10) 
To enhance stability of the transition learning, in addition to PPO, we also optimize maximum likelihood, which can be regarded as a regularization. We empirically observe that jointly optimizing both maximum likelihood and the PPO objective attains better transition model for policy gradient. The overall loss of the transition learning becomes
${\mathcal{L}}_{\text{transition}}={\mathcal{L}}_{\text{PPO}}+\alpha {\mathcal{L}}_{\text{mle}},$  (11) 
where ${\mathcal{L}}_{\text{mle}}$ is the loss of MLE, which is policyagnostic and can be estimated with all collected real transitions. For more implementation details, please see Appendix B.1.
We consider a training procedure similar to SLBO [Luo et al., 2019], where they consider the fact that the value function is dependent on the varying transition model. As a result, unlike most of the MBRL methods that have only one pair of modelpolicy update for each real environment sampling, SLBO proposes to take multiple update pairs for each real environment sampling so that the objective composed of the model loss and the value loss can be optimized. Our proposed model imitation (MI) method is summarized in Algorithm 5.2.
In the experiment section, we would like to answer the following questions. (1) Does the proposed model imitation outperforms the stateoftheart in terms of sample complexity and average return? (2) Does the proposed model imitation benefit from distribution matching and is superior to its modelfree and modelbased counterparts, TRPO and SLBO?
To fairly compare algorithms and enhance reproducibility, we adopt opensourced environments released along with a modelbased benchmark paper [Langlois et al., 2019], which is based on a physical simulation engine, MuJoCo [Todorov et al., 2012]. Specifically, we evaluate the proposed algorithm MI on four continuous control tasks including Hopper, HalfCheetah, Ant, and Reacher. For hyperparameters mentioned in Algorithm 5.2 and coefficients such as entropy regularization $\lambda $, please refer to Appendix B.2.
We compare to two modelfree algorithms, TRPO [Schulman et al., 2015] and PPO [Schulman et al., 2017], to assess the benefit of utilizing the proposed model imitation since our MI (Algorithm 5.2) uses TRPO for policy gradient to update the agent policy. We also compare MI to four modelbased methods. SLBO [Luo et al., 2019] gives theoretical guarantee of monotonic improvement for modelbased deep RL and proposes to update a joint modelpolicy objective. PETS [Chua et al., 2018] propose to employ uncertaintyaware dynamic models with samplingbased uncertainty to capture both aleatoric and epistemic uncertainty. METRPO [Kurutach et al., 2018] shows that insufficient data may cause instability and propose to use an ensemble of models to regularize the learning process. STEVE [Buckman et al., 2018] dynamically interpolates among model rollouts of various horizon lengths and favors those whose estimates have lower error.
Figure 2 shows the learning curves for all methods. In Hopper, HalfCheetah, and Ant, MI converges fairly fast and learns a policy significantly better than competitors’. In Ant, even though MI does not improve the performance too much from the initial one, the fact that it maintains the average return at around 1,000 indicates that MI can capture a better transition than other methods do with only 5,000 transition data. Even though we do not employ an ensemble of models, the curves show that our learning does not suffer from high variance. In fact, the performance shown in Figure 2 indicates that the variance of MI is lower than that of methods incorporating ensembles such as METRPO and PETS.
The questions raised at the beginning of this section can now be answered. The learned model enables TRPO to explore the world without directly access real transitions and therefore TRPO equipped with MI needs much fewer interactions with the real world to learn a good policy. Even though MI is based on the training framework proposed in SLBO, the additional distribution matching component allows the synthesized model to generate similar rollouts to that of the real environments, which empirically gives superior performance because we rely on long rollouts to estimate policy gradient.
To better understand the performance presented in Figure 2, we further compare MI with benchmarked RL algorithms recorded in Langlois et al. [2019] including stateoftheart MFRL methods such as TD3 [Fujimoto et al., 2018] and SAC [Haarnoja et al., 2018]. It should be noted that the reported results of Langlois et al. [2019] are the final performance after 200k timesteps but we only use up to 100k timesteps to train MI. Table 1 indicates that MI significantly outperforms most of the MBRL and MFRL methods with $50\%$ fewer samples, which verifies that MI is more sampleefficient by incorporating distribution matching.
Hopper  HalfCheetah  Ant  Reacher  

MBRL  8/10  10/10  8/10  8/10 
MFRL  3/4  2/4  4/4  3/4 
6 Conclusion
We have pointed out that the stateoftheart methods concentrate on learning synthesized models in a supervised fashion, which does not guarantee that the policy is able to reproduce a similar trajectory in the learned model and therefore the model may not be accurate enough to estimate long rollouts. We have proposed to incorporate WGAN to achieve occupancy measure matching between the real transition and the synthesized model and theoretically shown that matching indicates the closeness in cumulative rewards between the synthesized model and the real environment.
To enable stable training across WGANs, we have suggested using a truncated version of WGAN to prevent training from getting stuck at local optimums. The empirical property of WGAN application such as imitation learning indicates its potential to learn the transition with fewer samples than supervised learning. We have confirmed it experimentally by further showing that MI converges much faster and obtains better policy than stateoftheart modelbased and modelfree algorithms.
Acknowledgement
The authors would like to acknowledge the National Science Founding for the grant RI1764078 and Qualcomm for the generous support.
References
 Wasserstein gan. External Links: arXiv:1701.07875 Cited by: §1, §3.2, §4.
 Sampleefficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett (Eds.), pp. 8224–8234. Cited by: §2.2, §5.2.
 Learning probability measures with respect to optimal transport metrics. In Proceedings of the 25th International Conference on Neural Information Processing Systems  Volume 2, NIPS’12, USA, pp. 2492–2500. Cited by: §5.1.
 Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett (Eds.), pp. 4754–4765. Cited by: §2.2, §5.2.
 Modelbased reinforcement learning via metapolicy optimization. In Conference on Robot Learning, pp. 617–629. Cited by: §1, §2.2.
 Gaussian processes for dataefficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2), pp. 408–423. Cited by: §2.2.
 Addressing function approximation error in actorcritic methods. arXiv preprint arXiv:1802.09477. Cited by: §5.2.
 On choosing and bounding probability metrics. International Statistical Review 70 (3), pp. 419–435. Cited by: §5.1.
 Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §B.1.
 Soft actorcritic: offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §5.2.
 Generative adversarial imitation learning. In NeurIPS, pp. 4565–4573. Cited by: §2.1, §5.2.
 When to trust your model: modelbased policy optimization. arXiv preprint arXiv:1906.08253. Cited by: §1.
 Dataefficient reinforcement learning with probabilistic model predictive control. In AISTATS, Cited by: §2.2.
 Dataefficient generalization of robot skills with contextual policy search. In Proceedings of the TwentySeventh AAAI Conference on Artificial Intelligence, AAAI’13, pp. 1401–1407. Cited by: §2.2.
 Modelensemble trustregion policy optimization. In International Conference on Learning Representations, External Links: Link Cited by: §B.1, §1, §2.2, §5.2.
 Benchmarking modelbased reinforcement learning. arXiv preprint arXiv:1907.02057. Cited by: §1, §5.2, §5.2, Table 1.
 Geometric gan. arXiv preprint arXiv:1705.02894. Cited by: §5.2.
 Algorithmic framework for modelbased deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations, Cited by: §B.1, §1, §1, §2.2, §5.2, §5.2.
 Neural network dynamics for modelbased deep reinforcement learning with modelfree finetuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), Vol. , pp. 7559–7566. External Links: Document, ISSN 2577087X Cited by: §2.2.
 A reduction of imitation learning and structured prediction to noregret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §2.1, §4.
 Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046. Cited by: §2.
 Trust region policy optimization. In ICML, pp. 1889–1897. Cited by: §5.2, §5.2.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §5.2, §5.2, §5.2.
 Introduction to reinforcement learning. Vol. 135, MIT press. Cited by: §3.1, §3.1.
 Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, pp. 1032–1039. External Links: ISBN 9781605582054, Link, Document Cited by: §3.2.
 Apprenticeship learning using linear programming. In ICML, pp. 1032–1039. Cited by: §2.1.
 A reduction from apprenticeship learning to classification. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. ShaweTaylor, R. S. Zemel, and A. Culotta (Eds.), pp. 2253–2261. Cited by: §4.
 Mujoco: a physics engine for modelbased control. In IROS, pp. 5026–5033. Cited by: §5.2, §5.2.
 Optimal transport – old and new. Vol. 338, pp. xxii+973. Cited by: §A.1, §4.
 The computation of polylogarithms. Technical report Technical Report 1592*, University of Kent, Computing Laboratory, University of Kent, Canterbury, UK. Cited by: §A.2.
 Selfattention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §5.2.
Appendix A Proofs
A.1 Proof for WGAN
Proposition A.1 (Consistency for WGAN).
Let $\alpha \mathit{}\mathrm{(}s\mathrm{)}\mathrm{,}\pi \mathit{}\mathrm{(}a\mathrm{}s\mathrm{)}\mathrm{,}{T}^{\mathrm{\prime}}\mathit{}\mathrm{(}{s}^{\mathrm{\prime}}\mathrm{}s\mathrm{,}a\mathrm{)}$ be initial state distribution, policy and synthesized transition. Let $T$ be the true transition, $p\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{,}{s}^{\mathrm{\prime}}\mathrm{)}\mathrm{=}{\rho}_{T}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathit{}T\mathit{}\mathrm{(}{s}^{\mathrm{\prime}}\mathrm{}s\mathrm{,}a\mathrm{)}$ be the discounted distribution of the triple $\mathrm{(}s\mathrm{,}a\mathrm{,}{s}^{\mathrm{\prime}}\mathrm{)}$. If the WGAN is trained to its optimal point, we have
$$T({s}^{\prime}s,a)={T}^{\prime}({s}^{\prime}s,a),\forall (s,a)\in \text{\mathit{S}\mathit{u}\mathit{p}\mathit{p}}({\rho}_{T}).$$ 
Proof.
Because the loss function of WGAN is the 1Wasserstein distance, we know $p(s,a,{s}^{\prime})={p}^{\prime}(s,a,{s}^{\prime})$ at its optimal points. Plug in to the Bellman flow constraint Eq. (3),
${\rho}_{{T}^{\prime}}(s,a)$  $=\pi (as)\left[(1\gamma )\alpha (s)+\gamma {\displaystyle \int {\rho}_{{T}^{\prime}}({s}^{\prime},{a}^{\prime}){T}^{\prime}(s{s}^{\prime},{a}^{\prime})\mathit{d}{s}^{\prime}\mathit{d}{a}^{\prime}}\right]$  
$=\pi (as)\left[(1\gamma )\alpha (s)+\gamma {\displaystyle \int {p}^{\prime}({s}^{\prime},{a}^{\prime},s)\mathit{d}{s}^{\prime}\mathit{d}{a}^{\prime}}\right]$  
$\stackrel{p={p}^{\prime}}{=}\pi (as)\left[(1\gamma )\alpha (s)+\gamma {\displaystyle \int p({s}^{\prime},{a}^{\prime},s)\mathit{d}{s}^{\prime}\mathit{d}{a}^{\prime}}\right]={\rho}_{T}(s,a)$ 
That is,
$$\text{WGANisopt.}\iff p(s,a,{s}^{\prime})={p}^{\prime}(s,a,{s}^{\prime})\stackrel{\text{Bellman}}{\iff}{\rho}_{T}(s,a)={\rho}_{{T}^{\prime}}(s,a).$$ 
Finally, recall $p(s,a,{s}^{\prime})\triangleq {\rho}_{T}(s,a)T({s}^{\prime}s,a)\text{and}{p}^{\prime}(s,a,{s}^{\prime})\triangleq {\rho}_{{T}^{\prime}}(s,a){T}^{\prime}({s}^{\prime}s,a)$, we arrive at
$$\text{WGANisopt.iff}T({s}^{\prime}s,a)={T}^{\prime}({s}^{\prime}s,a),\forall (s,a)\in \text{Supp}({\rho}_{T}).$$ 
∎
Theorem A.2 (Twosided Errors for WGAN).
Let ${\rho}_{T}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{,}{\rho}_{{T}^{\mathrm{\prime}}}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}$ be normalized occupancy measures generated by the true transition $T$ and the synthesized one ${T}^{\mathrm{\prime}}$. Suppose the reward is ${L}_{r}$Lipschitz. If the training error of WGAN is $\u03f5$, then $\mathrm{}R\mathit{}\mathrm{(}T\mathrm{)}\mathrm{}R\mathit{}\mathrm{(}{T}^{\mathrm{\prime}}\mathrm{)}\mathrm{}\mathrm{\le}\u03f5\mathit{}{L}_{r}\mathrm{/}\mathrm{(}\mathrm{1}\mathrm{}\gamma \mathrm{)}$.
Proof.
Observe that the occupancy measure ${\rho}_{T}(s,a)$ is a marginal distribution of $p(s,a,{s}^{\prime})={\rho}_{T}(s,a)T({s}^{\prime}s,a)$. Because the distance between the marginal is upper bounded by that of the joint, we have
$${W}_{1}({\rho}_{T}(s,a){\rho}_{{T}^{\prime}}(s,a))\le {W}_{1}(p(s,a,{s}^{\prime}){p}^{\prime}(s,a,{s}^{\prime}))=\u03f5,$$ 
where ${W}_{1}$ is the 1Wasserstein distance. Then, the cumulative reward is bounded by
$$\begin{array}{cc}\hfill R(T)& =\frac{1}{1\gamma}\int r(s,a){\rho}_{T}(s,a)\mathit{d}s\mathit{d}a=R({T}^{\prime})+\frac{1}{1\gamma}\int r(s,a)\left({\rho}_{T}(s,a){\rho}_{{T}^{\prime}}(s,a)\right)\mathit{d}s\mathit{d}a\hfill \\ & =R({T}^{\prime})+\frac{{L}_{r}}{1\gamma}\int \frac{r(s,a)}{{L}_{r}}\left({\rho}_{T}(s,a){\rho}_{{T}^{\prime}}(s,a)\right)\mathit{d}s\mathit{d}a\hfill \\ & \le R({T}^{\prime})+\frac{{L}_{r}}{1\gamma}\underset{{\parallel f\parallel}_{L}\le 1}{sup}\int f(s,a)\left({\rho}_{T}(s,a){\rho}_{{T}^{\prime}}(s,a)\right)\mathit{d}s\mathit{d}a\hfill \\ & =R({T}^{\prime})+\frac{{L}_{r}}{1\gamma}\underset{{\parallel f\parallel}_{L}\le 1}{sup}{\mathbb{E}}_{(s,a)\sim {\rho}_{T}}[f(s,a)]{\mathbb{E}}_{(s,a)\sim {\rho}_{{T}^{\prime}}}[f(s,a)]\hfill \\ & =R({T}^{\prime})+\frac{{L}_{r}}{1\gamma}{W}_{1}({\rho}_{T}{\rho}_{{T}^{\prime}})\le R({T}^{\prime})+\u03f5\frac{{L}_{r}}{1\gamma},\hfill \end{array}$$ 
where the first inequality holds because $r(s,a)/{L}_{r}$ is 1Lipschitz and the last equality follows from KantorovichRubinstein duality Villani [2008]. Since ${W}_{1}$ distance is symmetric, the same conclusion holds if we interchange $T$ and ${T}^{\prime}$, so we arrive at
$$R(T)R({T}^{\prime})\le \u03f5{L}_{r}/(1\gamma ).$$ 
∎
A.2 Lemmas for Sampling Techniques
Lemma A.3.
Let $L\mathrm{\sim}\text{\mathit{G}\mathit{e}\mathit{o}\mathit{m}\mathit{e}\mathit{t}\mathit{r}\mathit{i}\mathit{c}}\mathit{}\mathrm{(}\mathrm{1}\mathrm{}\beta \mathrm{)}$. If $d\mathrm{\ge}\mathrm{1}$, then $\mathrm{E}\mathit{}\mathrm{[}{L}^{\mathrm{}\mathrm{1}\mathrm{/}d}\mathrm{]}\mathrm{=}O\mathit{}\mathrm{(}{\mathrm{(}\mathrm{1}\mathrm{}\beta \mathrm{)}}^{\mathrm{1}\mathrm{/}d}\mathrm{/}\beta \mathrm{)}$.
Proof.
$$\mathbb{E}[{L}^{1/d}]=\sum _{i=1}^{\mathrm{\infty}}{i}^{1/d}(1\beta ){\beta}^{i1}=\frac{1\beta}{\beta}\sum _{i=1}^{\mathrm{\infty}}\frac{{\beta}^{i}}{{i}^{1/d}}=\frac{1\beta}{\beta}{\text{Li}}_{1/d}(\beta ),$$ 
where Li is the polylogarithm function. From Wood [1992], the limiting behavior of it is
$${\text{Li}}_{1/d}({e}^{\mu})=\mathrm{\Gamma}(11/d){\mu}^{1/d1},\text{as}\mu \to {0}^{+},$$ 
where $\mathrm{\Gamma}$ is the gamma function. Since ${e}^{\mu}\to 1\mu $ when $\mu \to {0}^{+}$, we know when $\beta \to {1}^{}$, ${\text{Li}}_{1/d}(\beta )\to \mathrm{\Gamma}(11/d){(1\beta )}^{1/d1}$. Finally, since $\mathrm{\Gamma}(11/d)\le 1$, we conclude that $\mathbb{E}[{L}^{1/d}]=O({(1\beta )}^{1/d}/\beta )$. ∎
Lemma A.4.
Let ${\rho}_{T}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}$ be a the normalized occupancy measure generated by the triple $\mathrm{(}\alpha \mathrm{,}\pi \mathrm{,}T\mathrm{)}$ with discount factor $\gamma $. Let ${\rho}_{T}^{\beta}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}$ be the normalized occupancy measure generated by the triple $\mathrm{(}{\rho}_{T}\mathrm{,}\pi \mathrm{,}T\mathrm{)}$ with discount factor $\beta $. If $\gamma \mathrm{>}\beta $, then ${D}_{T\mathit{}V}\mathit{}\mathrm{(}{\rho}_{T}\mathit{}\mathrm{}\mathrm{}\mathit{}{\rho}_{T}^{\beta}\mathrm{)}\mathrm{\le}\mathrm{(}\mathrm{1}\mathrm{}\gamma \mathrm{)}\mathit{}\beta \mathrm{/}\mathrm{(}\gamma \mathrm{}\beta \mathrm{)}$.
Proof.
By definition of the occupancy measure we have
$$\begin{array}{cc}& {\rho}_{T}(s,a)=\sum _{i=0}^{\mathrm{\infty}}(1\gamma ){\gamma}^{i}{f}_{i}(s,a).\hfill \\ & {\rho}_{T}^{\beta}(s,a)=\sum _{i=0}^{\mathrm{\infty}}\sum _{j=0}^{i}(1\gamma ){\gamma}^{ij}(1\beta ){\beta}^{j}{f}_{i}(s,a),\hfill \end{array}$$ 
where ${f}_{i}(s,a)$ is the density of $(s,a)$ at time $i$ if generated by the triple $(\alpha ,\pi ,T)$. The TV distance is bounded by
$$\begin{array}{cc}\hfill {D}_{TV}({\rho}_{T}{\rho}_{T}^{\beta})& \le \frac{1}{2}\sum _{i=0}^{\mathrm{\infty}}\left(1\gamma ){\gamma}^{i}\sum _{j=0}^{i}(1\gamma ){\gamma}^{ij}(1\beta ){\beta}^{j}\right=\frac{1}{2}\sum _{i=0}^{\mathrm{\infty}}(1\gamma ){\gamma}^{i}\left1\sum _{j=0}^{i}(1\beta ){\left(\frac{\beta}{\gamma}\right)}^{j}\right\hfill \\ & =\frac{1}{2}\sum _{i=0}^{\mathrm{\infty}}(1\gamma ){\gamma}^{i}\frac{1}{\gamma \beta}\left\beta (1\gamma )+{\left(\frac{\beta}{\gamma}\right)}^{i+1}(1\beta )\gamma \right\hfill \\ & \stackrel{(*)}{=}\frac{(1\gamma )\beta}{\gamma \beta}\sum _{i=0}^{M1}(1\gamma ){\gamma}^{i}+(1\beta ){\beta}^{i}=\frac{(1\gamma )\beta}{\gamma \beta}({\gamma}^{M}{\beta}^{M})\hfill \\ & \le \frac{(1\gamma )\beta}{\gamma \beta}.\hfill \end{array}$$ 
where $(*)$ comes from that $\beta (1\gamma )+{(\frac{\beta}{\gamma})}^{i}(1\beta )\gamma $ is a strictly decreasing function. Since $\gamma >\beta $, its sign flips from $+$ to $$ at some index; say $M$. Finally, the sum of the absolute value are the same from ${\sum}_{i=0}^{M1}$ and from ${\sum}_{i=M}^{\mathrm{\infty}}$ because the total probability is conservative, and the difference on one side is the same as that on the other. ∎
Appendix B Experiments
B.1 Implementation Details
We normalize states according to the statistics derived from the first batch of states from the real world. To ensure stability, we maintain the same mean ${\mu}_{0}$ and standard deviation ${\sigma}_{0}$ throughout the training process.
Instead of directly predicting the next state, we estimate the state difference ${s}_{t+1}{s}_{t}$ [Kurutach et al., 2018; Luo et al., 2019]. Since we incorporate state normalization, the transition network is trained to output $({s}_{t+1}{s}_{t}{\mu}_{0})/{\sigma}_{0}$.
To enhance state exploration, we sample real transitions according to policy $\beta \sim \mathcal{N}({\mu}_{\theta}(s),\sigma )$, where $\mu (s)$ is the mean of our Gaussian parameterized policy ${\pi}_{\theta}$ and $\sigma $ is a fixed standard deviation. In addition, since model the transition as a Gaussian distribution, we found that matching ${\rho}_{{T}^{\prime}}^{\alpha ,{\pi}_{\theta}}$ with ${\rho}_{T}^{\alpha ,\beta}$ is empirically more stable and more sampleefficient than matching ${\rho}_{{T}^{\prime}}^{\alpha ,\beta}$ with ${\rho}_{T}^{\alpha ,\beta}$.
For policy update, it is shown that using the mean ${\mu}_{\varphi}$ of the Gaussianparameterized transition can accelerate policy optimization and better balance exploration and exploitation. In order to enforce the Lipschitz constraint to the WGAN critic $f$, we employ gradient penalty [Gulrajani et al., 2017] with weight $10$.
B.2 Hyperparameters
HalfCheetah  Hopper  Reacher  Ant  
$N$  10  
$\alpha $  1  10  
${n}_{\text{transition}}$  100  
${n}_{\text{policy}}$  20  60  100  30 
horizon for model update  20  10  30  
entropy regularization  0.001  0.005 