Model Imitation for Model-Based Reinforcement Learning

  • 2019-10-02 06:19:21
  • Yueh-Hua Wu, Ting-Han Fan, Peter J. Ramadge, Hao Su
  • 0


Model-based reinforcement learning (MBRL) aims to learn a dynamic model toreduce the number of interactions with real-world environments. However, due toestimation error, rollouts in the learned model, especially those of longhorizon, fail to match the ones in real-world environments. This mismatchinghas seriously impacted the sample complexity of MBRL. The phenomenon can beattributed to the fact that previous works employ supervised learning to learnthe one-step transition models, which has inherent difficulty ensuring thematching of distributions from multi-step rollouts. Based on the claim, wepropose to learn the synthesized model by matching the distributions ofmulti-step rollouts sampled from the synthesized model and the real ones viaWGAN. We theoretically show that matching the two can minimize the differenceof cumulative rewards between the real transition and the learned one. Ourexperiments also show that the proposed model imitation method outperforms thestate-of-the-art in terms of sample complexity and average return.


Quick Read (beta)

Model Imitation for Model-Based Reinforcement Learning

Yueh-Hua Wu1,2, Ting-Han Fan311footnotemark: 1 , Peter J. Ramadge3, Hao Su2 Equal contributions.
1 National Taiwan University
2 University of California San Diego
3 Princeton University

Model-based reinforcement learning (MBRL) aims to learn a dynamic model to reduce the number of interactions with real-world environments. However, due to estimation error, rollouts in the learned model, especially those of long horizon, fail to match the ones in real-world environments. This mismatching has seriously impacted the sample complexity of MBRL. The phenomenon can be attributed to the fact that previous works employ supervised learning to learn the one-step transition models, which has inherent difficulty ensuring the matching of distributions from multi-step rollouts. Based on the claim, we propose to learn the synthesized model by matching the distributions of multi-step rollouts sampled from the synthesized model and the real ones via WGAN. We theoretically show that matching the two can minimize the difference of cumulative rewards between the real transition and the learned one. Our experiments also show that the proposed model imitation method outperforms the state-of-the-art in terms of sample complexity and average return.


C¿c¡ \newcolumntypeL¿l¡

1 Introduction

Reinforcement learning (RL) has become of great interest because plenty of real-world problems can be modeled as a sequential decision-making problem. Model-free reinforcement learning (MFRL) is favored by its capability of learning complex tasks when interactions with environments are cheap. However, in the majority of real-world problems, such as autonomous driving, interactions are extremely costly, thus MFRL becomes infeasible. One critique about MFRL is that it does not fully exploit past queries over the environment, and this motivates us to consider the model-based reinforcement learning (MBRL). In addition to learning an agent policy, MBRL also uses the queries to learn the dynamic of the environment that our agent is interacting with. If the learned dynamic is accurate enough, the agent can acquire the desired skill by simply interacting with the simulated environment, so that the number of samples to collect in the real world can be greatly reduced. As a result, MBRL has become one of the possible solutions to reduce the number of samples required to learn an optimal policy.

Most previous works of MBRL adopt supervised learning with 2-based errors [Luo et al., 2019; Kurutach et al., 2018; Clavera et al., 2018] or maximum likelihood [Janner et al., 2019], to obtain an environment model that synthesizes real transitions. These non-trivial developments imply that optimizing a policy on a synthesized environment is a challenging task. Because the estimation error of model accumulates as the trajectory grows, it is hard to train a policy on a long synthesized trajectory. On the other hand, training on short trajectories makes the policy short-sighted. This issue is known as the planning horizon dilemma [Langlois et al., 2019]. As a result, despite having a strong intuition at first sight, MBRL has to be designed meticulously.

Intuitively, we would like to learn a transition model in a way that it can reproduce the trajectories that have been generated in the real world. Since the attained trajectories are sampled according to a certain policy, directly employing supervised learning may not necessarily lead to the mentioned result especially when the policy is stochastic. The resemblance in trajectories matters because we estimate policy gradient by generating rollouts; however, the one-step model learning adopted by many MBRL methods do not guarantee this. Some previous works propose multi-step training [Luo et al., 2019]; however, experiments show that model learning fails to benefit much from the multi-step loss. We attribute this outcome to the essence of supervised learning, which elementally preserves only one-step transition and the similarity between real trajectories and the synthesized ones cannot be guaranteed.

In this work, we propose to learn the transition model via distribution matching. Specifically, we use WGAN [Arjovsky et al., 2017] to match the distributions of state-action-next-state triple (s,a,s) in real/learned models so that the agent policy can generate similar trajectories when interacting with either the true transition or the learned transition. Figure 1 illustrates the difference between methods based on supervised learning and distribution matching. Different from the ensemble methods proposed in previous works, our method is capable of generalizing to unseen transitions with only one dynamic model because merely incorporating multiple models does not alter the essence that one-step (or few-step) supervised learning fails to imitate the distribution of multi-step rollouts.

Concretely, we gather some transitions in the real world according to a policy. To learn the real transition, we then sample fake transitions from our synthesized model with the same policy. The synthesized model serves as the generator in the WGAN framework and there is a critic that discriminates the two transition data. We update the generator and the critic alternatively until the synthesized data cannot be distinguished from the real one, which we will show later that it gives TT theoretically.

Our contributions are summarized below:

  • We propose an MBRL method called model imitation (MI), which enforces the learned transition model to generate similar rollouts to the real one so that policy gradient is accurate;

  • We theoretically show that the transition can be learned by MI in the sense that TT by consistency and the difference in cumulative rewards |R(T)-R(T)| is small;

  • To stabilize model learning, we deduce guarantee for our sampling technique and investigate training across WGANs;

  • We experimentally show that MI is more sample efficient than state-of-the-art MBRL and MFRL methods and outperforms them on four standard tasks.

Figure 1: Distribution matching enables the learned transition to generate similar rollouts to the real ones even when the policy is stochastic or the initial states are close. On the other hand, training with supervised learning does not ensure rollout similarity and the resulting policy gradient may be inaccurate. This figure considers a fixed policy sampling in the real world and a transition model.

2 Related work

In this section, we introduce our motivation inspired by learning from demonstration (LfD) [Schaal, 1997] and give a brief survey of MBRL methods.

2.1 Learning from Demonstration

A straightforward approach to LfD is to leverage behavior cloning (BC), which reduces LfD to a supervised learning problem. Even though learning a policy via BC is time-efficient, it cannot imitate a policy without sufficient demonstration because the error may accumulate without the guidance of expert [Ross et al., 2011]. Generative Adversarial Imitation Learning (GAIL) [Ho and Ermon, 2016] is another state-of-the-art IfD method that learns an optimal policy by utilizing generative adversarial training to match occupancy measure [Syed et al., 2008b]. GAIL learns an optimal policy by matching the distribution of the trajectories generated from an agent policy with the distribution of the given demonstration. Ho and Ermon [2016] shows that the two distributions match if and only if the agent has learned the optimal policy. One of the advantages of GAIL is that it only requires a small amount of demonstration data to obtain an optimal policy but it requires a considerable number of interactions with environments for the generative adversarial training to converge.

Our intuition is that we analogize transition learning (TL) to learning from demonstration (LfD). In LfD, trajectories sampled from a fixed transition are given, and the goal is to learn a policy. On the other hand, in TL, trajectories sampled from a fixed policy are given, and we would like to imitate the underlying transition. That being said, from LfD to TL, we interchange the roles of the policy and the transition. It is therefore tempting to study the counterpart of GAIL in TL; i.e., learning the transition by distribution matching. Fortunately, by doing so, the pros of GAIL remain while the cons are insubstantial in MBRL because sampling with the learned model is considered to be much cheaper than sampling in the real one. That GAIL learns a better policy than what BC does suggests that distribution matching possess the potential to learn a better transition than supervised learning.

2.2 Model-Based Reinforcement Learning

For deterministic transition, it is usually optimized with 2-based error. Nagabandi et al. [2018], an approach that uses supervised learning with mean-squared error as its objective, is shown to perform well under fine-tuning. To alleviate model bias, some previous works adopt ensembles [Kurutach et al., 2018; Buckman et al., 2018], where multiple transition models with different initialization are trained at the same time. In a slightly more complicated manner, Clavera et al. [2018] utilizes meta-learning to gather information from multiple models. Lastly, on the theoretical side, SLBO [Luo et al., 2019] is the first algorithm that develops from solid theoretical properties for model-based deep RL via a joint model-policy optimization framework.

For the stochastic transition, maximum likelihood estimator or moment matching are natural ways to learn a synthesized transition, which is usually modeled by the Gaussian distribution. Following this idea, Gaussian process [Kupcsik et al., 2013; Deisenroth et al., 2015] and Gaussian process with model predictive control [Kamthe and Deisenroth, 2017] are introduced as an uncertainty-aware version of MBRL. Similar to the deterministic case, to mitigate model bias and foster stability, an ensemble method for probabilistic networks [Chua et al., 2018] is also studied. An important distinction between training a deterministic or stochastic transition is that although the stochastic transition can model the noise hidden within the real world, the stochastic model may also induce instability if the true transition is deterministic. This is a potential reason why an ensemble of models is adopted to reduce variance.

3 Background

3.1 Reinforcement Learning

We consider the standard Markov Decision Process (MDP) [Sutton and Barto, 1998]. MDP is represented by a tuple S,𝒜,T,r,γ, where 𝒮 is the state space, 𝒜 is the action space, T(st+1|st,at) is the transition density of state st+1 at time step t+1 given action at made under state st, r(s,a) is the reward function, and γ(0,1) is the discount factor.

A stochastic policy π(a|s) is a density of action a given state s. Let the initial state distribution be α. The performance of the triple (α,π,T) is evaluated in the expectation of the cumulative reward in the γ-discounted infinite horizon setting:

R(α,π,T)=𝔼[t=0γtr(st,at)|α,π,T]=𝔼[t=0H-1r(st,at)|α,π,T]. (1)

Equivalently, R(α,π,T) is the expected cumulative rewards in a length-H trajectory {st,at}t=0H-1 generated by (α,π,T) with HGeometric(1-γ). When α and T are fixed, R() becomes a function that only depends on π, and reinforcement learning algorithms [Sutton and Barto, 1998] aim to find a policy π to maximize R(π).

3.2 Occupancy Measure

Given initial state distribution α(s), policy π(a|s) and transition T(s|s,a), the normalized occupancy measure ρTα,π(s,a) generated by (α,π,T) is defined as

ρTα,π(s,a)=t=o(1-γ)γt(st=s,at=a|α,π,T)=(1-γ)t=0H-1(st=s,at=a|α,π,T), (2)

where () is the probability measure and will be replaced by a density function if 𝒮 or 𝒜 is continuous. Intuitively, ρTα,π(s,a) is a distribution of (s,a) in a length-H trajectory {st,at}t=0H-1 with HGeometric(1-γ) following the laws of (α,π,T). From Syed et al. [2008a], the relation between ρTα,π and (α,π,T) is characterized by the Bellman flow constraint. Specifically, x=ρTα,π as defined in Eq. 2 is the unique solution to:

x(s,a)=πθ(a|s)[(1-γ)α(s)+γx(s,a)T(s|s,a)𝑑s𝑑a],x(s,a)0. (3)

In addition, Theorem 2 of Syed et al. [2008a] gives that π(a|s) and ρTα,π(s,a) have an one-to-one correspondence with α(s) and T(s|s,a) fixed; i.e., π(a|s)ρ(s,a)ρ(s,a)𝑑a is the only policy whose occupancy measure is ρ.

With the occupancy measure, the cumulative reward Eq. 1 can be represented as

R(α,π,T)=𝔼(s,a)ρTα,π[r(s,a)]/(1-γ). (4)

The goal of maximizing the cumulative reward can then be achieved by adjusting ρTα,π, and this motivates us to adopt distribution matching approaches like WGAN [Arjovsky et al., 2017] to learn a transition model.

4 Theoretical Analysis for WGAN

In this section, we present a consistency result and error bounds for WGAN [Arjovsky et al., 2017]. All proofs of the following theorems and lemmas can be found in Appendix A.

In the setting of MBRL, the training objective for WGAN is

minTmaxfL1𝔼(s,a)ρT,sT(|s,a)[f(s,a,s)]-𝔼(s,a)ρT,sT(|s,a)[f(s,a,s)]. (5)

By Kantorovich-Rubinstein duality [Villani, 2008], the optimal value of the inner maximization is exactly W1(p(s,a,s)||p(s,a,s)) where p(s,a,s)=ρT(s,a)T(s|s,a) is the discounted distribution of (s,a,s). Thus, by minimizing over the choice of T, we are essentially finding p that minimizes W1(p(s,a,s)||p(s,a,s)), which gives the consistency result.

Proposition 4.1 (Consistency for WGAN).

Let T and T be the true and synthesized transitions respectively. If WGAN is trained to its optimal point, we have


where 𝑆𝑢𝑝𝑝(ρT) is the support of ρT.

The support constraint is inevitable because the training data is sampled from ρT and guaranteeing anything beyond it can be difficult. Still, we will empirically show that the support constraint is not an issue in our experiments because the performance boosts up in the beginning, indicating that Supp(ρT) may be large enough initially.

Now that training with WGAN gives a consistent estimate of the true transition, it is sensible to train a synthesized transition upon it. However, the consistency result is too restrictive as it only discusses the optimal case. The next step is to analyze the non-optimal situation and observe how the cumulative reward deviates w.r.t. the training error.

Theorem 4.2 (Error Bound for WGAN).

Let ρT(s,a),ρT(s,a) be the normalized occupancy measures generated by the true transition T and the synthesized one T. If the reward function is Lr-Lipschitz and the training error of WGAN is ϵ, we have |R(T)-R(T)|ϵLr/(1-γ).

Theorem 4.2 indicates that if WGAN is trained properly, i.e., having small ϵ, the cumulative reward on the synthesized trajectory will be close to that on the true trajectory. As MBRL aims to train a policy on the synthesized trajectory, the accuracy of the cumulative reward over the synthesized trajectory is thus the bottleneck. Theorem 4.2 also implies that WGAN’s error is linear to the (expected) length of the trajectory (1-γ)-1. This is a sharp contrast to the error bounds in most RL literature, as the dependency on the trajectory length is usually quadratic [Syed and Schapire, 2010; Ross et al., 2011], or of even higher order. Since WGAN gives us a better estimation of the cumulative reward in the learned model, the policy update becomes more accurate.

5 Model Imitation for Model-Based Reinforcement Learning

In this section, we present a practical MBRL method called model imitation (MI) that incorporates the transition learning mentioned in Section 4.

5.1 Sampling Technique for Transition Learning

Due to the long-term digression, it is hard to train the WGAN directly from a long synthesized trajectory. To tackle this issue, we use the synthesized transition T to sample N short trajectories with initial states sampled from the true trajectory.

To analyze this sampling technique, let β<γ be the discount factor of the short trajectories so that the expected length is 𝔼[L]=(1-β)-1. Let ρTβ, ρ^Tβ, ρTβ, ρT be the normalized occupancy measures of synthesized short trajectories, empirical true short trajectories, true short trajectories and the true long trajectories. The 1-Wasserstein distance can be bounded by


W1(ρTβ||ρ^Tβ) is upper bounded by the training error of WGAN on short trajectories, which can be small empirically because the short ones are easier to imitate. W1(ρ^Tβ||ρTβ)=𝔼L[O((NL)-1/d)]=O(((1-β)/N)1/d/β) by Canas and Rosasco [2012] and Lemma A.3, where d is the dimension of (s,a). W1(ρTβ||ρT)diam(𝒮×𝒜)(1-γ)β/(γ-β) by Lemma A.4 and W1DTVdiam(𝒮×𝒜) [Gibbs and Su, 2002], where diam() is the diameter. The second term encourages β to be large while the third term does the opposite. Besides, β need not be large if N is large enough; in practice we may sample N short trajectories to reduce the error from W1(ρT||ρT) to W1(ρTβ||ρT). Finally, since ρTβ is the occupancy measure we train on, from the proof of Theorem 4.2 we deduce that


Thus, WGAN may perform better under this sampling technique.

5.2 Empirical Transition Learning

To learn the real transition based on the occupancy measure matching mentioned in Section 4, we employ a transition learning scheme by aligning the distribution of (s,a,s) between the real and the learned environments. Inspired by how GAIL [Ho and Ermon, 2016] learns to align (s,a) via solving an MDP with rewards extracted from a discriminator, we formulate an MDP with rewards from a discriminator over (s,a,s). Specifically, the WGAN critic f(s,a,s) in Eq. 5 is used as the (psuedo) rewards r(s,a,s) of our MDP. Interestingly, there is a duality between GAIL and our transition learning: for GAIL, the transition is fixed and the objective is to train a policy to maximize the cumulative pseudo rewards, while for our transition learning, the policy is fixed and the objective is to train a synthesized transition to maximize the cumulative pseudo rewards.

In practice, since the policy is updated alternatively with the synthesized model, we are required to train a number of WGANs along with the change of the policy. Although the generators across WGANs correspond to the same transition and can be similar, we observe that WGAN may get stuck at a local optimum when we switch from one WGAN training to another. The reason is that, unlike GAN that mimics the Jensen-Shannon divergence and hence its inner maximization is upper bounded by log(2), WGAN mimics the Wasserstein distance and the inner maximization is unbounded from above. Intuitively, such unboundedness makes the WGAN critic so strong that the WGAN generator (the synthesized transition) cannot find a way out and gets stuck at a local optimum. Thereby, we have to modify the WGAN objective to alleviate such situation. To ensure the boundedness, for a fixed δ>0, we introduce cut-offs at the WGAN objective so that the inner maximization is upper bounded by 2δ:

minTmaxfL1𝔼(s,a)ρTsT(|s,a)[min(δ,f(s,a,s)]+𝔼(s,a)ρTsT(|s,a)[min(δ,-f(s,a,s))]. (6)

As δ, Eq. 6 recovers the WGAN objective, Eq. 5. Therefore, this is a truncated version of WGAN. To comprehend Eq. 6 further, notice that it is equivalent to

minTmaxfL1𝔼(s,a)ρTsT(|s,a)[min(0,f(s,a,s)-δ)]+𝔼(s,a)ρTsT(|s,a)[min(0,-f(s,a,s)-δ)]minTminfL1𝔼(s,a)ρTsT(|s,a)[max(0,δ-f(s,a,s))]+𝔼(s,a)ρTsT(|s,a)[max(0,δ+f(s,a,s))], (7)

which is a hinge loss version of the generative adversarial objective. Such WGAN is introduced in Lim and Ye [2017], where the consistency result is provided and further experiments are evaluated in Zhang et al. [2018]. According to Lim and Ye [2017], the inner minimization can be interpreted as the soft-margin SVM. Consequently, it provides a geometric intuition of maximizing margin, which potentially enhances robustness. Finally, because the objective of transition learning is to maximize the cumulative pseudo rewards on the MDP, T does not directly optimize Eq. 7. Note that the truncation only takes part in the inner minimization:

minfL1𝔼(s,a)ρTsT(|s,a)[max(0,δ-f(s,a,s))]+𝔼(s,a)ρTsT(|s,a)[max(0,δ+f(s,a,s))], (8)

which gives us a WGAN critic f(s,a,s). As mentioned, f will be the pseudo reward function. Later, we will introduce a transition learning version of PPO [Schulman et al., 2017] to optimize the cumulative pseudo reward.


Model Imitation for Model-Based Reinforcement Learning {algorithmic}[1] \StateInitialize policy πθ, transition model Tϕ, WGAN critic fw, environment dataset 𝒟env \Fori=0,1,2, \StateTake actions in real environment according to πθ; 𝒟env𝒟env𝒟i \StatePre-train Tϕ and fw by optimizing Eq. 8 and 11 with 𝒟i and 𝒟env \ForN epochs \Forntransition epochs \Stateoptimize Eq. 8 and 11 over ϕ and w with 𝒟i \EndFor\Fornpolicy epochs \Stateupdate πθ by TRPO on the data generated by Tϕ \EndFor\EndFor\EndFor

After modifying the WGAN objective, to include both the stochastic and (approximately) deterministic scenarios, the synthesized transition is modeled by a Gaussian distribution T(s|s,a)=Tϕ(s|s,a)𝒩(μϕ(s,a),Σϕ(s,a)). Although the underlying transitions of tasks like MuJoCo [Todorov et al., 2012] are deterministic, modeling by a Gaussian does not harm the transition learning empirically.

Recall that the synthesized transition is trained on an MDP whose reward function is the critic of the truncated WGAN. To achieve this goal with proper stability, we employ PPO [Schulman et al., 2017], which is an efficient approximation of TRPO [Schulman et al., 2015]. Note that although the PPO is originally designed for policy optimization, it can be adapted to transition learning with a fixed sampling policy and the PPO objective (Eq. 7 of Schulman et al. [2017])

PPO(ϕ)=𝔼^t[min(rt(ϕ)A^t,clip(rt(ϕ),1-ϵ,1+ϵ)A^t)], (9)


rt(ϕ)=Tϕ(st+1|st,at)Tϕold(st+1|st,at),A^t:advantage func. derived from the pseudo rewardf(st,at,st+1). (10)

To enhance stability of the transition learning, in addition to PPO, we also optimize maximum likelihood, which can be regarded as a regularization. We empirically observe that jointly optimizing both maximum likelihood and the PPO objective attains better transition model for policy gradient. The overall loss of the transition learning becomes

transition=-PPO+αmle, (11)

where mle is the loss of MLE, which is policy-agnostic and can be estimated with all collected real transitions. For more implementation details, please see Appendix B.1.

We consider a training procedure similar to SLBO [Luo et al., 2019], where they consider the fact that the value function is dependent on the varying transition model. As a result, unlike most of the MBRL methods that have only one pair of model-policy update for each real environment sampling, SLBO proposes to take multiple update pairs for each real environment sampling so that the objective composed of the model loss and the value loss can be optimized. Our proposed model imitation (MI) method is summarized in Algorithm 5.2.

In the experiment section, we would like to answer the following questions. (1) Does the proposed model imitation outperforms the state-of-the-art in terms of sample complexity and average return? (2) Does the proposed model imitation benefit from distribution matching and is superior to its model-free and model-based counterparts, TRPO and SLBO?

Figure 2: Learning curves of our MI versus two model-free and four model-based baselines. The solid lines indicate the mean of five trials and the shaded regions suggest standard deviation.

To fairly compare algorithms and enhance reproducibility, we adopt open-sourced environments released along with a model-based benchmark paper [Langlois et al., 2019], which is based on a physical simulation engine, MuJoCo [Todorov et al., 2012]. Specifically, we evaluate the proposed algorithm MI on four continuous control tasks including Hopper, HalfCheetah, Ant, and Reacher. For hyper-parameters mentioned in Algorithm 5.2 and coefficients such as entropy regularization λ, please refer to Appendix B.2.

We compare to two model-free algorithms, TRPO [Schulman et al., 2015] and PPO [Schulman et al., 2017], to assess the benefit of utilizing the proposed model imitation since our MI (Algorithm 5.2) uses TRPO for policy gradient to update the agent policy. We also compare MI to four model-based methods. SLBO [Luo et al., 2019] gives theoretical guarantee of monotonic improvement for model-based deep RL and proposes to update a joint model-policy objective. PETS [Chua et al., 2018] propose to employ uncertainty-aware dynamic models with sampling-based uncertainty to capture both aleatoric and epistemic uncertainty. METRPO [Kurutach et al., 2018] shows that insufficient data may cause instability and propose to use an ensemble of models to regularize the learning process. STEVE [Buckman et al., 2018] dynamically interpolates among model rollouts of various horizon lengths and favors those whose estimates have lower error.

Figure 2 shows the learning curves for all methods. In Hopper, HalfCheetah, and Ant, MI converges fairly fast and learns a policy significantly better than competitors’. In Ant, even though MI does not improve the performance too much from the initial one, the fact that it maintains the average return at around 1,000 indicates that MI can capture a better transition than other methods do with only 5,000 transition data. Even though we do not employ an ensemble of models, the curves show that our learning does not suffer from high variance. In fact, the performance shown in Figure 2 indicates that the variance of MI is lower than that of methods incorporating ensembles such as METRPO and PETS.

The questions raised at the beginning of this section can now be answered. The learned model enables TRPO to explore the world without directly access real transitions and therefore TRPO equipped with MI needs much fewer interactions with the real world to learn a good policy. Even though MI is based on the training framework proposed in SLBO, the additional distribution matching component allows the synthesized model to generate similar rollouts to that of the real environments, which empirically gives superior performance because we rely on long rollouts to estimate policy gradient.

To better understand the performance presented in Figure 2, we further compare MI with bench-marked RL algorithms recorded in Langlois et al. [2019] including state-of-the-art MFRL methods such as TD3 [Fujimoto et al., 2018] and SAC [Haarnoja et al., 2018]. It should be noted that the reported results of Langlois et al. [2019] are the final performance after 200k time-steps but we only use up to 100k time-steps to train MI. Table 1 indicates that MI significantly outperforms most of the MBRL and MFRL methods with 50% fewer samples, which verifies that MI is more sample-efficient by incorporating distribution matching.

Table 1: Proportion of bench-marked RL methods that are inferior to MI in terms of 5% t-test. x/y indicates that among y approaches, MI is significantly better than x approaches. The detailed performance can be found in Table 1 of Langlois et al. [2019]. It should be noted that the reported results in Langlois et al. [2019] are the final performance after 200k time-steps whereas ours are no more than 100k time-steps.
Hopper HalfCheetah Ant Reacher
MBRL 8/10 10/10 8/10 8/10
MFRL 3/4 2/4 4/4 3/4

6 Conclusion

We have pointed out that the state-of-the-art methods concentrate on learning synthesized models in a supervised fashion, which does not guarantee that the policy is able to reproduce a similar trajectory in the learned model and therefore the model may not be accurate enough to estimate long rollouts. We have proposed to incorporate WGAN to achieve occupancy measure matching between the real transition and the synthesized model and theoretically shown that matching indicates the closeness in cumulative rewards between the synthesized model and the real environment.

To enable stable training across WGANs, we have suggested using a truncated version of WGAN to prevent training from getting stuck at local optimums. The empirical property of WGAN application such as imitation learning indicates its potential to learn the transition with fewer samples than supervised learning. We have confirmed it experimentally by further showing that MI converges much faster and obtains better policy than state-of-the-art model-based and model-free algorithms.


The authors would like to acknowledge the National Science Founding for the grant RI-1764078 and Qualcomm for the generous support.


  • M. Arjovsky, S. Chintala, and L. on Bottou (2017) Wasserstein gan. External Links: arXiv:1701.07875 Cited by: §1, §3.2, §4.
  • J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 8224–8234. Cited by: §2.2, §5.2.
  • G. D. Canas and L. A. Rosasco (2012) Learning probability measures with respect to optimal transport metrics. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, NIPS’12, USA, pp. 2492–2500. Cited by: §5.1.
  • K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 4754–4765. Cited by: §2.2, §5.2.
  • I. Clavera, J. Rothfuss, J. Schulman, Y. Fujita, T. Asfour, and P. Abbeel (2018) Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, pp. 617–629. Cited by: §1, §2.2.
  • M. P. Deisenroth, D. Fox, and C. E. Rasmussen (2015) Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2), pp. 408–423. Cited by: §2.2.
  • S. Fujimoto, H. van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §5.2.
  • A. L. Gibbs and F. E. Su (2002) On choosing and bounding probability metrics. International Statistical Review 70 (3), pp. 419–435. Cited by: §5.1.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §B.1.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §5.2.
  • J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In NeurIPS, pp. 4565–4573. Cited by: §2.1, §5.2.
  • M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. arXiv preprint arXiv:1906.08253. Cited by: §1.
  • S. Kamthe and M. P. Deisenroth (2017) Data-efficient reinforcement learning with probabilistic model predictive control. In AISTATS, Cited by: §2.2.
  • A. G. Kupcsik, M. P. Deisenroth, J. Peters, and G. Neumann (2013) Data-efficient generalization of robot skills with contextual policy search. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI’13, pp. 1401–1407. Cited by: §2.2.
  • T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel (2018) Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, External Links: Link Cited by: §B.1, §1, §2.2, §5.2.
  • E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba (2019) Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057. Cited by: §1, §5.2, §5.2, Table 1.
  • J. H. Lim and J. C. Ye (2017) Geometric gan. arXiv preprint arXiv:1705.02894. Cited by: §5.2.
  • Y. Luo, H. Xu, Y. Li, Y. Tian, T. Darrell, and T. Ma (2019) Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations, Cited by: §B.1, §1, §1, §2.2, §5.2, §5.2.
  • A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), Vol. , pp. 7559–7566. External Links: Document, ISSN 2577-087X Cited by: §2.2.
  • S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §2.1, §4.
  • S. Schaal (1997) Learning from demonstration. In Advances in neural information processing systems, pp. 1040–1046. Cited by: §2.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In ICML, pp. 1889–1897. Cited by: §5.2, §5.2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §5.2, §5.2, §5.2.
  • R. S. Sutton and A. G. Barto (1998) Introduction to reinforcement learning. Vol. 135, MIT press. Cited by: §3.1, §3.1.
  • U. Syed, M. Bowling, and R. E. Schapire (2008a) Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, pp. 1032–1039. External Links: ISBN 978-1-60558-205-4, Link, Document Cited by: §3.2.
  • U. Syed, M. Bowling, and R. E. Schapire (2008b) Apprenticeship learning using linear programming. In ICML, pp. 1032–1039. Cited by: §2.1.
  • U. Syed and R. E. Schapire (2010) A reduction from apprenticeship learning to classification. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.), pp. 2253–2261. Cited by: §4.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In IROS, pp. 5026–5033. Cited by: §5.2, §5.2.
  • C. Villani (2008) Optimal transport – old and new. Vol. 338, pp. xxii+973. Cited by: §A.1, §4.
  • D. Wood (1992) The computation of polylogarithms. Technical report Technical Report 15-92*, University of Kent, Computing Laboratory, University of Kent, Canterbury, UK. Cited by: §A.2.
  • H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §5.2.

Appendix A Proofs

A.1 Proof for WGAN

Proposition A.1 (Consistency for WGAN).

Let α(s),π(a|s),T(s|s,a) be initial state distribution, policy and synthesized transition. Let T be the true transition, p(s,a,s)=ρT(s,a)T(s|s,a) be the discounted distribution of the triple (s,a,s). If the WGAN is trained to its optimal point, we have


Because the loss function of WGAN is the 1-Wasserstein distance, we know p(s,a,s)=p(s,a,s) at its optimal points. Plug in to the Bellman flow constraint Eq. (3),

ρT(s,a) =π(a|s)[(1-γ)α(s)+γρT(s,a)T(s|s,a)𝑑s𝑑a]

That is,

WGAN is opt.p(s,a,s)=p(s,a,s)BellmanρT(s,a)=ρT(s,a).

Finally, recall p(s,a,s)ρT(s,a)T(s|s,a)andp(s,a,s)ρT(s,a)T(s|s,a), we arrive at

WGAN is opt. iff T(s|s,a)=T(s|s,a),(s,a)Supp(ρT).

Theorem A.2 (Two-sided Errors for WGAN).

Let ρT(s,a),ρT(s,a) be normalized occupancy measures generated by the true transition T and the synthesized one T. Suppose the reward is Lr-Lipschitz. If the training error of WGAN is ϵ, then |R(T)-R(T)|ϵLr/(1-γ).


Observe that the occupancy measure ρT(s,a) is a marginal distribution of p(s,a,s)=ρT(s,a)T(s|s,a). Because the distance between the marginal is upper bounded by that of the joint, we have


where W1 is the 1-Wasserstein distance. Then, the cumulative reward is bounded by


where the first inequality holds because r(s,a)/Lr is 1-Lipschitz and the last equality follows from Kantorovich-Rubinstein duality Villani [2008]. Since W1 distance is symmetric, the same conclusion holds if we interchange T and T, so we arrive at


A.2 Lemmas for Sampling Techniques

Lemma A.3.

Let L𝐺𝑒𝑜𝑚𝑒𝑡𝑟𝑖𝑐(1-β). If d1, then E[L-1/d]=O((1-β)1/d/β).


where Li is the polylogarithm function. From Wood [1992], the limiting behavior of it is

Li1/d(e-μ)=Γ(1-1/d)μ1/d-1,as μ0+,

where Γ is the gamma function. Since e-μ1-μ when μ0+, we know when β1-, Li1/d(β)Γ(1-1/d)(1-β)1/d-1. Finally, since Γ(1-1/d)1, we conclude that 𝔼[L-1/d]=O((1-β)1/d/β). ∎

Lemma A.4.

Let ρT(s,a) be a the normalized occupancy measure generated by the triple (α,π,T) with discount factor γ. Let ρTβ(s,a) be the normalized occupancy measure generated by the triple (ρT,π,T) with discount factor β. If γ>β, then DTV(ρT||ρTβ)(1-γ)β/(γ-β).


By definition of the occupancy measure we have


where fi(s,a) is the density of (s,a) at time i if generated by the triple (α,π,T). The TV distance is bounded by


where (*) comes from that -β(1-γ)+(βγ)i(1-β)γ is a strictly decreasing function. Since γ>β, its sign flips from + to - at some index; say M. Finally, the sum of the absolute value are the same from i=0M-1 and from i=M because the total probability is conservative, and the difference on one side is the same as that on the other. ∎

Appendix B Experiments

B.1 Implementation Details

We normalize states according to the statistics derived from the first batch of states from the real world. To ensure stability, we maintain the same mean μ0 and standard deviation σ0 throughout the training process.

Instead of directly predicting the next state, we estimate the state difference st+1-st [Kurutach et al., 2018; Luo et al., 2019]. Since we incorporate state normalization, the transition network is trained to output (st+1-st-μ0)/σ0.

To enhance state exploration, we sample real transitions according to policy β𝒩(μθ(s),σ), where μ(s) is the mean of our Gaussian parameterized policy πθ and σ is a fixed standard deviation. In addition, since model the transition as a Gaussian distribution, we found that matching ρTα,πθ with ρTα,β is empirically more stable and more sample-efficient than matching ρTα,β with ρTα,β.

For policy update, it is shown that using the mean μϕ of the Gaussian-parameterized transition can accelerate policy optimization and better balance exploration and exploitation. In order to enforce the Lipschitz constraint to the WGAN critic f, we employ gradient penalty [Gulrajani et al., 2017] with weight 10.

B.2 Hyperparameters

HalfCheetah Hopper Reacher Ant
N 10
α 1 10
ntransition 100
npolicy 20 60 100 30
horizon for model update 20 10 30
entropy regularization 0.001 0.005
Table 2: List of hyper-parameters adopted in our experiments.