Variational Inference MPC for Bayesian Model-based Reinforcement Learning

  • 2019-07-08 01:54:08
  • Masashi Okada, Tadahiro Taniguchi

Abstract

In recent studies on model-based reinforcement learning (MBRL), incorporating uncertainty in forward dynamics is a state-of-the-art strategy to enhance learning performance, making MBRLs competitive to cutting-edge model-free methods, especially in simulated robotics tasks. Probabilistic ensembles with trajectory sampling (PETS) is a leading type of MBRL, which employs Bayesian inference to dynamics modeling and model predictive control (MPC) with stochastic optimization via the cross entropy method (CEM). In this paper, we propose a novel extension to the uncertainty-aware MBRL. Our main contributions are twofold: Firstly, we introduce a variational inference MPC (VI-MPC), which reformulates various stochastic methods, including CEM, in a Bayesian fashion. Secondly, we propose a novel instance of the framework, called probabilistic action ensembles with trajectory sampling (PaETS). As a result, our Bayesian MBRL can involve multimodal uncertainties both in dynamics and optimal trajectories. In comparison to PETS, our method consistently improves asymptotic performance on several challenging locomotion tasks.

 


Masashi Okada
Panasonic Corp., Japan
Tadahiro Taniguchi
Ritsumeikan Univ. & Panasonic Corp., Japan

Keywords: model predictive control, variational inference, model-based reinforcement learning

1 Introduction

Model predictive control (MPC) is a powerful and widely accepted technology for advanced control systems such as manufacturing processes [vargas2000multilayer], HVAC systems [afram2014theory], power electronics [vazquez2014model], autonomous vehicles [paden2016survey], and humanoids [kuindersma2016optimization]. MPC uses specified models of system dynamics to predict future states and rewards (or costs), and plans future actions that maximize the total reward over the predicted trajectories. Especially for industrial applications, the clear explainability of such a decision-making process is advantageous. Furthermore, in some tasks (e.g., games [silver2016mastering]), planning-based policies of this nature can outperform reactive policies (e.g., full neural network policies).

Model-based reinforcement learning (MBRL) methods that employ expressive function approximators (e.g., deep neural networks: DNNs) [deisenroth2011pilco, williams2017information, nagabandi2018neural] present appealing approaches for MPC. The main difficulty in introducing MPC to practical systems is specifying the forward dynamics models of target systems, and accurate system identification is challenging in many advanced applications. Take robotics for example: robots make contact with floors and walls and must manipulate objects, which makes the dynamics highly non-linear. The main objective of MBRL is to train approximators of such complex dynamics through experience in real systems. The general procedure of MBRL can be summarized as follows: (1. training step) train the approximate model with a given training dataset; then (2. test step) execute the actions (or policies) optimized with the dynamics model in a real environment and augment the dataset with the observed results. These training and test steps are conducted iteratively to collect sufficient and diverse data so as to achieve the desired performance.

One feature of MBRL is its considerable sample efficiency compared to model-free reinforcement learning (MFRL), which directly trains policies through experiences. In other words, MBRL requires much less test time in real environments. In addition, MBRL benefits from the generalizability of the trained model, which can be easily applied to new tasks in the same system. However, the asymptotic performance of MBRL is generally inferior to that of model-free methods. This discrepancy is primarily due to the overfitting of dynamics models to the few data available during initial MBRL steps, which is called the model-bias problem [deisenroth2011pilco]. Several studies have demonstrated that incorporating uncertainty in dynamics models can alleviate this issue. The uncertainty-aware modeling is realized by Bayesian inference employing a Gaussian Process [deisenroth2011pilco], dropout as variational inference [gal2016dropout, gal2017concrete, kahn2017uncertainty], or neural network ensembles [chua2018deep, kurutach2018model, clavera2018model].

Probabilistic ensembles with trajectory sampling (PETS) [chua2018deep] is one type of uncertainty-aware MBRL. As an MPC-oriented MBRL method, PETS conducts trajectory optimization via the cross entropy method (CEM) [botev2013cross] by using trajectories probabilistically sampled from the ensemble networks. Experiments have demonstrated that PETS can achieve performance competitive with state-of-the-art MFRL methods like Soft Actor Critic (SAC) [haarnoja2018soft], while yielding much higher sample efficiency. Since our primary interest is MPC and its application to practical systems, this paper mainly focuses on PETS and treats this method as a strong baseline.

Considering the success of probabilistic dynamics modeling, incorporating uncertainty in optimal trajectories appears very promising for MBRL. However, an optimization scheme that can utilize uncertainty has not yet been discussed. Although several stochastic approaches, including CEM, model predictive path integral (MPPI) [williams2016aggressive, williams2017information], covariance matrix adaptation evolution strategy (CMA-ES) [hansen2003reducing], and proportional CEM (Prop-CEM) [goschin2013cross], have been proposed, they are not uncertainty-aware and tend to underestimate uncertainty. In addition, although their optimization procedures are very similar, they have been independently derived. Consequently, theoretical relations among these methods are unclear, preventing us from systematically understanding and reformulating them to be uncertainty-aware in a Bayesian fashion.

Motivated by these observations, in this paper we propose a novel MPC concept for Bayesian MBRL. The organization and contributions of this paper are summarized as follows. (1) In Sec. 3, we introduce a novel MPC framework, variational inference MPC (VI-MPC), which generalizes and reformulates various stochastic MPC methods in a Bayesian fashion. The key observations for deriving this framework are organized in Sec. 2, where we point out that general stochastic optimization methods can be regarded as the moment matching of the optimal trajectory posterior, which appears in a Bayesian MBRL formulation. (2) In Sec. 4, we propose a novel instance of the framework, called probabilistic action ensembles with trajectory sampling (PaETS). Toy task examples and the concept of our method are exhibited in Fig. 1. (3) In Sec. 5, we demonstrate that our method consistently outperforms PETS via experiments with challenging locomotion tasks in the MuJoCo physics simulator [todorov2012mujoco].

Figure 1: Toy task examples that illustrate the concept of our method. Panel (a): vanilla CEM used in PETS [chua2018deep], VIMPC(‘CEM’, ‘Gaussian’, False). Panel (b): PaETS (ours), VIMPC(‘CEM’, ‘GMM(M=5)’, True). The objective of this task is to navigate a point mass on the x-y plane by actuating $\bm{a}_t = (\Delta x, \Delta y)$ with maximum magnitude $\|\bm{a}_t\| = 0.05$, while avoiding obstacles. This task is designed to have multiple (sub-)optimal trajectories. (a) A trajectory found by vanilla CEM. (b) Multiple trajectories found by PaETS, which approximates the trajectory posterior via variational inference with a Gaussian mixture model. The line width indicates the magnitude of the mixture coefficients. Exploiting diverse plans encourages active exploration in state-action spaces, improving the optimization performance and training dataset diversity. The notation VIMPC() is introduced in Sec. 3.

2 Model-based Reinforcement Learning as Bayesian Inference

In this section, we describe MBRL as a Bayesian inference problem using the control-as-inference framework [levine2018reinforcement]. Fig. 2 displays the graphical model for the formulation, with which an MBRL procedure can be rewritten in a Bayesian fashion: (1. training step) infer $p(\theta|\mathcal{D})$; (2. test step) infer $p(\tau|\mathcal{O}_{1:T}=\bm{1})$, then sample actions from the posterior and execute them in a real environment. We denote a trajectory as $\tau := \{(\bm{s}_t, \bm{a}_t)\}_{t=1}^{T}$, where $\bm{s}_t$ and $\bm{a}_t$ respectively represent the state and action. Given a state-action pair at time $t$, the next state can be predicted by a forward-dynamics model $\bm{s}_{t+1} \sim p(\bm{s}_{t+1}|\bm{s}_t, \bm{a}_t, \theta)$ parameterized by $\theta$. The posterior of $\theta$ is inferred from the training dataset $\mathcal{D} = \{(\bm{s}_t, \bm{a}_t, \bm{s}_{t+1})\}$, which consists of states and actions observed during the test step. To formulate optimal control as inference, we introduce an auxiliary binary random variable $\mathcal{O}_t \in \{0, 1\}$ to represent the optimality of $(\bm{s}_t, \bm{a}_t)$. Given $p(\theta|\mathcal{D})$, trajectory optimization can be expressed as an inference problem:

\[
p(\tau|\mathcal{O}) \propto \int \underbrace{\Big\{\prod_{t=1}^{T} p(\mathcal{O}_t=1|\bm{s}_t,\bm{a}_t)\Big\}}_{:=\,p(\mathcal{O}|\tau)} \; \underbrace{p(\bm{s}_1)\Big\{\prod_{t=1}^{T} p(\bm{s}_{t+1}|\bm{s}_t,\bm{a}_t,\theta)\Big\}}_{:=\,p(\bm{s}|\bm{a},\theta)} \; \underbrace{p(\theta|\mathcal{D})}_{:=\,p_{\mathcal{D}}(\theta)} \, d\theta, \tag{1}
\]

where an uninformative action prior (i.e., $p(\bm{a}_t) = \mathcal{U}$, a uniform distribution) is assumed. For readability, $\mathcal{O}_{1:T} = \bm{1}$ is simply denoted as $\mathcal{O}$. For the same reason, we omit the subscripts of the sequences $\bm{a}_{1:T}$ and $\bm{s}_{1:T}$. This simplified notation is employed in the remainder of the paper. In Secs. 2.1 and 2.2, we review how these inference problems have been approximately handled in previous works.

2.1 Inference of Forward-dynamics Posterior p𝒟(θ)

Figure 2: Graphical model for Bayesian MBRL.

Given a sufficiently expressive parameterized model, e.g., DNNs, one of the most practical and promising schemes for approximating the posterior $p_{\mathcal{D}}(\theta)$ is to use neural network ensembles [chua2018deep, kurutach2018model, clavera2018model]. This process approximates the posterior as a set of particles, $p_{\mathcal{D}}(\theta) \simeq \frac{1}{E}\sum_{i=1}^{E} \delta(\theta - \theta_i)$, where $\delta$ is the Dirac delta function and $E$ is the number of networks. Each particle $\theta_i$ is independently trained by stochastic gradient descent so as to (sub-)optimize the log posterior, $\log p_{\mathcal{D}}(\theta) = \log p(\mathcal{D}|\theta)p(\theta) + \mathrm{const}$. Although this approximation is not fully Bayesian, the scheme has several useful features. First, it is simple to implement in standard deep learning frameworks. Furthermore, the ensemble model successfully captures multimodal uncertainty in the exact posterior.
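The particle approximation can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy 1-D linear system, and it replaces SGD-trained DNNs with least-squares fits on bootstrap resamples purely for brevity. The names `fit_ensemble` and `sample_theta` are hypothetical.

```python
import numpy as np

def fit_ensemble(S, A, S_next, n_models=5, seed=0):
    """Fit E linear models s' = [s, a, 1] @ theta, one per bootstrap resample.

    Assumption: least squares stands in for SGD training of neural networks.
    """
    rng = np.random.default_rng(seed)
    X = np.column_stack([S, A, np.ones(len(S))])
    thetas = []
    for _ in range(n_models):
        idx = rng.integers(0, len(S), size=len(S))  # bootstrap resample
        theta, *_ = np.linalg.lstsq(X[idx], S_next[idx], rcond=None)
        thetas.append(theta)
    return np.stack(thetas)  # shape (E, 3): one particle theta_i per row

def sample_theta(thetas, rng):
    """Draw theta from the particle posterior (1/E) * sum_i delta(theta - theta_i)."""
    return thetas[rng.integers(0, len(thetas))]

# Toy data from hypothetical dynamics s' = 0.9 s + 0.5 a + noise.
rng = np.random.default_rng(1)
S = rng.normal(size=200)
A = rng.normal(size=200)
S_next = 0.9 * S + 0.5 * A + 0.05 * rng.normal(size=200)

thetas = fit_ensemble(S, A, S_next)
theta = sample_theta(thetas, rng)
```

Each row of `thetas` plays the role of one particle $\theta_i$; drawing a row uniformly at random corresponds to sampling from the approximate posterior.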

Another possible way to infer $p_{\mathcal{D}}(\theta)$ is dropout as variational inference [gal2016dropout, gal2017concrete, kahn2017uncertainty], which approximates $p_{\mathcal{D}}(\theta)$ as a Gaussian distribution $q(\theta)$. It has been proven that the variational inference problem $\arg\min_q \mathrm{KL}(q(\theta)\,\|\,p_{\mathcal{D}}(\theta))$ is approximately equivalent to training networks with dropout, where $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler (KL) divergence. Although this scheme is also simple and theoretically supported, approximation by a single Gaussian distribution tends to underestimate uncertainty (or multimodality) in the posterior. To remedy this problem, $\alpha$-divergence dropout has been proposed [li2017dropout], which replaces the KL divergence with the $\alpha$-divergence so as to prevent $q(\theta)$ from overfitting a single mode. However, as long as $q(\theta)$ is Gaussian, multimodality cannot be handled well.

In our preliminary MBRL experiments, we tested the above two schemes and observed that the ensemble performs much better than ($\alpha$-)dropout. This result suggests that capturing multimodality in the posterior has a crucial effect in MBRL. Therefore, in this paper, we also employ the ensemble scheme to approximate $p_{\mathcal{D}}(\theta)$, in the same way as our baseline PETS [chua2018deep]. In Sec. 4, we also attempt to incorporate multimodality in the posterior $p(\tau|\mathcal{O})$.

2.2 Moment Matching of Trajectory Posterior p(τ|𝒪)

This section clarifies the connection between trajectory optimization and the posterior approximation problem. The key observation delineated here is that several MPC methods, including CEM used in PETS and MPPI, can be regarded as the moment matching of the posterior.

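Given an inferred model posterior $p_{\mathcal{D}}(\theta)$, we can sample trajectories from (1). (Footnote 1: Trajectory sampling methods with $p_{\mathcal{D}}(\theta)$ have been discussed and evaluated in [chua2018deep]. In this paper, we employ the TS1 method suggested in the reference; see lines 3–6 in Alg. 1.) Let us approximate the action posterior with a Gaussian distribution $q(\bm{a}; \bm{\mu}, \bm{\Sigma})$. The mean of the posterior action sequence $\bm{\mu}$ can be estimated by moment matching:

\[
\bm{\mu} = \mathbb{E}_{p(\tau|\mathcal{O})}[\bm{a}]
= \frac{\mathbb{E}_{\bm{s}\sim p(\bm{s}|\bm{a},\theta),\,\theta\sim p_{\mathcal{D}}(\theta),\,\bm{a}\sim\mathcal{U}}[\bm{a}\,p(\mathcal{O}|\tau)]}{\mathbb{E}_{\bm{s}\sim p(\bm{s}|\bm{a},\theta),\,\theta\sim p_{\mathcal{D}}(\theta),\,\bm{a}\sim\mathcal{U}}[p(\mathcal{O}|\tau)]}
= \frac{\mathbb{E}_{\bm{a}\sim\mathcal{U}}[\bm{a}\,\mathcal{W}(\bm{a})]}{\mathbb{E}_{\bm{a}\sim\mathcal{U}}[\mathcal{W}(\bm{a})]}, \tag{2}
\]

where

\[
\mathcal{W}(\bm{a}) := \mathbb{E}_{\bm{s}_{t+1}\sim p(\bm{s}_{t+1}|\bm{s}_t,\bm{a}_t,\theta),\,\theta\sim p_{\mathcal{D}}(\theta)}[p(\mathcal{O}|\tau)]. \tag{3}
\]

Eq. (2) can be viewed as a weighted average in which each sampled action is weighted by the likelihood of optimality $\mathcal{W}(\bm{a})$. In the same way, we can also estimate the variance of the posterior, $\bm{\Sigma} = \mathbb{E}_{\bm{a}\sim\mathcal{U}}[(\bm{a}-\bm{\mu})^2\,\mathcal{W}(\bm{a})] / \mathbb{E}_{\bm{a}\sim\mathcal{U}}[\mathcal{W}(\bm{a})]$.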

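In practice, sampling from the uniform distribution $\mathcal{U}$ is quite inefficient and requires an almost infinite number of samples. Hence, let us consider iteratively estimating the parameters by incorporating importance sampling. Let $\bm{\mu}^{(j)}$, $\bm{\Sigma}^{(j)}$ be the estimated parameters at iteration $j$; we can rearrange (2) as

\[
\bm{\mu}^{(j+1)} \leftarrow \{\text{RHS of (2)}\} \times \frac{q(\bm{a};\bm{\mu}^{(j)},\bm{\Sigma}^{(j)})}{q(\bm{a};\bm{\mu}^{(j)},\bm{\Sigma}^{(j)})}
= \frac{\mathbb{E}_{\bm{a}\sim q(\bm{a};\bm{\mu}^{(j)},\bm{\Sigma}^{(j)})}[\bm{a}\,\mathcal{W}(\bm{a})]}{\mathbb{E}_{\bm{a}\sim q(\bm{a};\bm{\mu}^{(j)},\bm{\Sigma}^{(j)})}[\mathcal{W}(\bm{a})]}. \tag{4}
\]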

It is worth noting that a similar iterative law can also be derived by solving the optimization problem $\arg\max_{q(\bm{a};\bm{\mu},\bm{\Sigma})} \mathbb{E}[\log p(\mathcal{O}|\tau)]$ by mirror descent [miyashita2018mirror, okada2018acceleration]. To connect this inference problem to trajectory optimization, we define the optimality likelihood with the trajectory reward $r(\tau)$ and a monotonically increasing function $f(\cdot)$ as $p(\mathcal{O}|\tau) := f(r(\tau))$. If we define $f(r(\tau)) \propto e^{r(\tau)}$, as in [levine2018reinforcement, piche2018probabilistic], an optimization algorithm similar to MPPI [williams2016aggressive, williams2017information, okada2018acceleration] is recovered. As summarized in Table 1, other similarities to well-known optimization algorithms, including CEM, can be observed with different optimality definitions. (Footnote 2: We implicitly assume the existence of a step-wise likelihood $p(\mathcal{O}_t|\bm{s}_t,\bm{a}_t)$ corresponding to each definition. Since another graphical model with a single unified optimality can be defined, this existence is not critical.)

Table 1: Optimization algorithms derived by moment matching of $p(\tau|\mathcal{O})$ with different definitions of $f$; $\mathbb{1}[\cdot]$ is the indicator function and $g$ denotes a rank-preserving transformation.

Algorithm | $f(r(\tau))$
MPPI [williams2016aggressive] | $e^{r(\tau)}$
CEM [botev2013cross] | $\mathbb{1}[r(\tau) > r_{\mathrm{thd}}]$
Prop-CEM [goschin2013cross] | $\dfrac{r(\tau) - r_{\min}}{r_{\max} - r_{\min}}$
CMA-ES [hansen2003reducing] | $\log g(r(\tau)) \cdot \mathbb{1}[r(\tau) > r_{\mathrm{thd}}]$

There is a discrepancy between (4) and the CEM implementation in [chua2018deep], in which $\mathcal{W}'(\bm{a}) = f(\mathbb{E}[r(\tau)])$ is used instead of $\mathcal{W}(\bm{a}) = \mathbb{E}[f(r(\tau))]$. Since $f$ is a convex function, Jensen's inequality holds in this case, thus $\mathcal{W} \geq \mathcal{W}'$. Equality holds when $r(\tau)$ has no variance (or $f$ is affine), implying that $\mathcal{W} \simeq \mathcal{W}'$ for low-variance $r(\tau)$ and $\mathcal{W} > \mathcal{W}'$ for high-variance (i.e., more uncertain) $r(\tau)$. Namely, $\mathcal{W}'(\bm{a})$ underestimates the optimality likelihood if $\bm{a}$ generates uncertain trajectories. Since we have experimentally observed that this filtering effect of $\mathcal{W}'$ yields higher optimization performance than $\mathcal{W}$ (see Sec. A), this paper heuristically employs $\mathcal{W}'$.
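The gap between the two weights can be checked numerically. The sketch below assumes the MPPI-style choice $f = \exp$ and Gaussian-distributed rewards (both are illustrative assumptions, not the paper's experimental setting), and compares $\mathbb{E}[f(r)]$ with $f(\mathbb{E}[r])$ for a low-variance and a high-variance reward.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp  # convex optimality function (MPPI-style); an illustrative choice

def weights(rewards):
    """Return (W, W') = (E[f(r)], f(E[r])) estimated from reward samples."""
    W = f(rewards).mean()
    W_prime = f(rewards.mean())
    return W, W_prime

# Same mean reward, different uncertainty.
low_var = rng.normal(1.0, 0.01, size=1000)
high_var = rng.normal(1.0, 1.0, size=1000)

W_low, Wp_low = weights(low_var)
W_high, Wp_high = weights(high_var)
gap_low = W_low - Wp_low      # small: W and W' nearly coincide
gap_high = W_high - Wp_high   # larger: W' discounts the uncertain action
```

For equal mean reward, the uncertain action receives a relatively smaller weight under $\mathcal{W}'$ than under $\mathcal{W}$, which is the filtering effect described above.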

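In practice, the expectation operators $\mathbb{E}[\cdot]$ are implemented through Monte Carlo integration with $K$ sampled actions and $P$ trajectories for each action: $\bm{\mu}^{(j+1)} \simeq \sum_{k=1}^{K}[\bm{a}_k\,\mathcal{W}'(\bm{a}_k)] / \sum_{k=1}^{K}[\mathcal{W}'(\bm{a}_k)]$ and $\mathcal{W}'(\bm{a}_k) \simeq f\big(\frac{1}{P}\sum_{i=1}^{P} r(\tau_{k,i})\big)$, the Monte Carlo estimate of $f(\mathbb{E}[r(\tau)])$.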
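The Monte Carlo update can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the reward is a hypothetical noisy quadratic (`toy_reward`) instead of rollouts through a learned model, the Gaussian is diagonal, and the CEM threshold $r_{\mathrm{thd}}$ is taken as an elite quantile (with `>=` to guard against ties).

```python
import numpy as np

def optimality(mean_reward, kind, elite_frac=0.1):
    """f applied to the P-averaged reward of each candidate (W' in the text)."""
    if kind == "MPPI":
        # Subtracting the max is for numerical stability; it cancels on normalization.
        return np.exp(mean_reward - mean_reward.max())
    if kind == "CEM":
        r_thd = np.quantile(mean_reward, 1.0 - elite_frac)
        return (mean_reward >= r_thd).astype(float)
    raise ValueError(f"unknown optimality definition: {kind}")

def moment_matching_step(mu, sigma, reward_fn, kind, K=500, P=5, rng=None):
    """One iteration of Eq. (4) with K action samples and P reward samples each."""
    rng = np.random.default_rng() if rng is None else rng
    a = mu + sigma * rng.standard_normal((K, mu.size))          # a_k ~ q(a; mu, Sigma)
    r = np.mean([reward_fn(a, rng) for _ in range(P)], axis=0)  # (K,) averaged rewards
    w = optimality(r, kind)
    w = w / w.sum()
    mu_new = w @ a                                              # weighted mean
    sigma_new = np.sqrt(w @ (a - mu_new) ** 2 + 1e-12)          # weighted std (diagonal)
    return mu_new, sigma_new

TARGET = np.array([0.5, -0.3])  # hypothetical optimum of the toy reward

def toy_reward(a, rng):
    """Noisy quadratic reward; the noise mimics uncertainty over sampled trajectories."""
    return -np.sum((a - TARGET) ** 2, axis=1) + 0.05 * rng.standard_normal(len(a))

def optimize(kind, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(2), np.ones(2)
    for _ in range(n_iter):
        mu, sigma = moment_matching_step(mu, sigma, toy_reward, kind, rng=rng)
    return mu

mu_cem = optimize("CEM")
mu_mppi = optimize("MPPI")
```

Only the `optimality` function differs between the two runs, which mirrors how Table 1 distinguishes the algorithms by their definition of $f$.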

3 Variational Inference MPC: From Moment Matching to Inference

Given uncertainty in a dynamics model, it is natural to suppose that optimal trajectories are also uncertain. However, as shown in the previous section, PETS employs moment matching of the trajectory posterior, ignoring most of the uncertainty in optimal trajectories. In this section, we introduce a variational inference MPC (VI-MPC) framework to formulate MBRL as fully Bayesian and to involve uncertainty both in the dynamics and in the optimalities.

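Let us consider the variational inference problem of minimizing $\mathrm{KL}(q_\theta(\tau)\,\|\,p(\tau,\theta|\mathcal{O}))$. We assume the variational distribution is decomposed as $q_\theta(\tau) = q(\bm{a})\,p(\bm{s}|\bm{a},\theta)\,p_{\mathcal{D}}(\theta)$; accordingly, we introduce $p(\tau,\theta|\mathcal{O})\,(= p(\mathcal{O}|\tau)\,p(\bm{s}|\bm{a},\theta)\,p_{\mathcal{D}}(\theta))$ as a posterior, which takes the same decomposable form as $q_\theta(\tau)$. This assumption forces optimal state transitions to be controlled only by $p(\bm{s}_{t+1}|\bm{s}_t,\bm{a}_t,\theta)$ [levine2018reinforcement]. As shown in Sec. B.1, this inference problem can be transformed into the maximization problem $\arg\max_{q_\theta(\tau)} \mathbb{E}_{q_\theta(\tau)}[\log p(\mathcal{O}|\tau) - \log q(\bm{a})]$. A notable property is that this objective has an entropy regularization term $-\log q(\bm{a})$, which encourages $q(\bm{a})$ to have a broader shape and thus capture more uncertainty. For convenience, we introduce a tunable hyperparameter $\alpha\,(>0)$ by replacing the optimality likelihood $p(\mathcal{O}|\tau)$ with $p^{1/\alpha}(\mathcal{O}|\tau)$. Multiplying the resulting objective by $\alpha$, which does not change the maximizer, gives $\arg\max_{q_\theta(\tau)} \mathbb{E}_{q_\theta(\tau)}[\log p(\mathcal{O}|\tau) - \alpha\log q(\bm{a})]$. By applying mirror descent [bubeck2015convex] to this optimization problem, we can derive an update law for $q(\bm{a})$ (see Sec. C for the detailed derivation):

\[
q^{(j+1)}(\bm{a}) \leftarrow \frac{q^{(j)}(\bm{a})\,\mathcal{W}'(\bm{a})^{\frac{1}{\lambda}}\,\big(q^{(j)}(\bm{a})\big)^{-\kappa}}{\mathbb{E}_{q^{(j)}(\bm{a})}\Big[\mathcal{W}'(\bm{a})^{\frac{1}{\lambda}}\,\big(q^{(j)}(\bm{a})\big)^{-\kappa}\Big]}, \tag{5}
\]

where $\lambda\,(>0)$ and $\kappa\,(>0)$ are hyperparameters into which $\alpha$ is absorbed, and $\mathcal{W}'(\bm{a}) = f(\mathbb{E}[r(\tau)])$ is the optimality weight adopted in Sec. 2.2. $\lambda$ is the inverse step size, which controls the optimization speed, and $\kappa$ is the weight of the entropy regularization term $q^{-\kappa}$.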

Eq. (5) suggests a novel and general MPC framework, which we call variational inference MPC (VI-MPC). To realize a specific VI-MPC method, we specify the following parameters: (1) optimality definition (or f(); see Table 1), (2) variational distribution model q, and (3) entropy regularization κ>0 or κ=0. We did not include λ into the specifications since it is highly dependent on the optimality definition (see Sec. G). In this paper, we describe the above specifications as VIMPC(<optimality_def>, <variational_dist>, <max_ent>). For example, we respectively express vanilla CEM and MPPI as VIMPC(‘CEM’, ‘Gaussian’, False) and VIMPC(‘MPPI’, ‘Gaussian’, False). In Sec. 4, we propose a new instance of VI-MPC to incorporate multimodal uncertainty in the posterior.

4 Probabilistic Action Ensembles with Trajectory Sampling

As reviewed in Sec. 2.1, previous methods have successfully involved multimodality in p𝒟(θ) with network ensembles. If this multimodality in p𝒟(θ) is given, other distributions depending on p𝒟(θ), including p(𝒪|τ), would also be multimodal. In other words, there are various possible optimal trajectories (or actions) like Fig. 1. It is obvious that VIMPC(*, ‘Gaussian’, *) will still easily fail to capture multimodality because of overfitting to a single mode. Inspired by the success of the ensemble approach for dynamics modeling, we propose a novel VI-MPC method that introduces action ensembles with a Gaussian mixture model (GMM), i.e., VIMPC(*, ‘GMM(M=*)’, *), which we call PaETS (Probabilistic Action Ensembles with Trajectory Sampling).

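PaETS defines the variational distribution $q(\bm{a})$ as

\[
q^{(j)}(\bm{a}) := q(\bm{a};\phi^{(j)}) = \sum_{m=1}^{M} \pi_m^{(j)}\,\mathcal{N}(\bm{a};\bm{\mu}_m^{(j)},\bm{\Sigma}_m^{(j)}), \tag{6}
\]

where $\phi^{(j)} := \{(\pi_m^{(j)},\bm{\mu}_m^{(j)},\bm{\Sigma}_m^{(j)})\}_{m=1}^{M}$ and $M$ is the number of components of the mixture model. We now derive the iteration scheme that updates the GMM parameters. First, drawing $K$ samples from $q^{(j)}(\bm{a})$, we approximate $q^{(j)}(\bm{a})$ as a discretized distribution (or a set of particles):

\[
q^{(j)}(\bm{a};\phi) \simeq q(\bm{a};\mathbf{W}^{(j)}) := \sum_{k=1}^{K} w_k^{(j)}\,\delta(\bm{a}-\bm{a}_k), \tag{7}
\]

where $\mathbf{W}^{(j)} := \{w_k^{(j)}\}_{k=1}^{K}$. Just after sampling, the weight of each particle is uniform: $\mathbf{W}^{(j)} = \bm{1}/K$. By substituting this approximated distribution into (5), the update law for the particle weights is derived as

\[
w_k^{(j+1)} \leftarrow \frac{\mathcal{W}'(\bm{a}_k)^{\frac{1}{\lambda}}\,\big(q^{(j)}(\bm{a}_k)\big)^{-\kappa}}{\sum_{k'=1}^{K} \mathcal{W}'(\bm{a}_{k'})^{\frac{1}{\lambda}}\,\big(q^{(j)}(\bm{a}_{k'})\big)^{-\kappa}}. \tag{8}
\]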
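A minimal sketch of the weight update (8) is given below. Working in log space is an implementation choice assumed here for numerical stability, and the function name `particle_weights` is hypothetical.

```python
import numpy as np

def particle_weights(W_prime, q_density, lam=1.0, kappa=0.5):
    """Eq. (8): w_k proportional to W'(a_k)^(1/lam) * q(a_k)^(-kappa).

    W_prime:   (K,) optimality weights of the sampled actions
    q_density: (K,) density q^(j)(a_k) of each sample under the current q
    """
    tiny = 1e-300  # avoids log(0), e.g. for CEM's zero-valued indicator weights
    log_w = np.log(W_prime + tiny) / lam - kappa * np.log(q_density + tiny)
    log_w -= log_w.max()  # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

# Equal optimality, different densities: low-density samples gain weight when kappa > 0.
W_prime = np.ones(3)
q_density = np.array([0.1, 0.2, 0.4])
w_no_entropy = particle_weights(W_prime, q_density, kappa=0.0)
w_entropy = particle_weights(W_prime, q_density, kappa=1.0)
```

With $\kappa = 0$ the weights depend only on optimality; with $\kappa > 0$, samples drawn from low-density regions of $q^{(j)}$ are up-weighted, which is how the entropy term broadens the distribution.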

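Then we estimate $\phi^{(j+1)}$, which maximizes the observation probability of the weighted particles:

\[
\phi^{(j+1)} = \arg\max_{\phi} \log p\big(\{(w_k^{(j+1)},\bm{a}_k)\}_{k=1}^{K}\,\big|\,\phi\big) = \arg\max_{\phi} \sum_{k=1}^{K} w_k^{(j+1)} \log q(\bm{a}_k;\phi). \tag{9}
\]

By setting the derivative to zero, $\nabla_\phi \log p(\cdot|\phi) = \bm{0}$, and borrowing the concept of the EM algorithm [bilmes1998gentle], we obtain the update laws for $\phi^{(j+1)}$, which take a weighted-average form like (4) (see Sec. D for the complete definition):

\[
\big(\bm{\mu}_m^{(j+1)},\,\bm{\Sigma}_m^{(j+1)},\,\pi_m^{(j+1)}\big) \leftarrow \Big(\sum_{k=1}^{K}\omega_{m,k}^{(j+1)}\bm{a}_k,\;\; \sum_{k=1}^{K}\omega_{m,k}^{(j+1)}\big(\bm{a}_k-\bm{\mu}_m^{(j+1)}\big)^2,\;\; \frac{N_m}{\sum_{m'=1}^{M} N_{m'}}\Big). \tag{10}
\]

Figure 3: Evaluated locomotion tasks simulated in MuJoCo.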

Fig. 7 in Sec. E illustrates how this method works in a toy optimization task.

In summary, PaETS and the MPC utilizing it are described in Algs. 1 and 2, respectively, where $U$ is the number of optimization iterations and $H$ is the length of the task episode. At line 2 of Alg. 2, the means $\bm{\mu}_m$ are initialized independently at random. At the end of each control step (the last line of Alg. 2), $\bm{\Sigma}_m$ and $\pi_m$ are reset to their initial values, encouraging exploration for the next time step and preventing $q(\bm{a};\phi)$ from degenerating to a single mode. If we set $M=1$, these procedures are almost equivalent to those of PETS. The use of a GMM ($M>1$) does not increase computational complexity significantly (see Sec. F).

Algorithm 1: PaETS
Input: State $\bm{s}_1$, GMM param. $\phi^{(1)}$ and $p_{\mathcal{D}}(\theta)$
Output: Optimized GMM param. $\phi^{(U+1)}$
1: for $j \leftarrow 1$ to $U$ do
2:   Sample actions $\{\bm{a}_k \sim q(\bm{a};\phi^{(j)})\}_{k=1}^{K}$
3:   Sample states $\{\{\{$
4:     $\theta_{k,i,t} \sim p_{\mathcal{D}}(\theta)$   // TS1 method
5:     $\bm{s}_{k,i,t+1} \sim p(\bm{s}_{t+1}|\bm{a}_{k,t},\bm{s}_{k,i,t},\theta_{k,i,t})$,
6:   $\}_{t=1}^{T-1}\}_{i=1}^{P}\}_{k=1}^{K}$
7:   Eval. $\{\mathcal{W}'(\bm{a}_k) \leftarrow f(\frac{1}{P}\sum_{i=1}^{P} r(\tau_{k,i}))\}_{k=1}^{K}$
8:   Calc. $\{w_k^{(j+1)}\}_{k=1}^{K}$ by (8)
9:   Update $\phi^{(j+1)}$ by (10)

Algorithm 2: MPC with PaETS
Input: Initial state $\bm{s}_1$
Data: Training data $\mathcal{D}$, initial variance $\bm{\Sigma}_{\mathrm{init}}$
1: Infer $p_{\mathcal{D}}(\theta)$   // train ensemble neural networks
2: $\{\bm{\mu}_m \sim \mathcal{N}(\bm{a};\bm{0},\bm{\Sigma}_{\mathrm{init}})\}_{m=1}^{M}$   // random init.
3: $\{(\bm{\Sigma}_m,\pi_m) \leftarrow (\bm{\Sigma}_{\mathrm{init}},1/M)\}_{m=1}^{M}$
4: for $n \leftarrow 1$ to $H$ do   // control loop
5:   $\phi \leftarrow$ Exec. Alg. 1$(\bm{s}_n,\phi,p_{\mathcal{D}}(\theta))$
6:   Sample $\bm{a} \sim q(\bm{a};\phi)$
7:   Send $\bm{a}_1$ to actuators and observe $\bm{s}_{n+1}$
8:   $\mathcal{D} \leftarrow \mathcal{D} \cup \{(\bm{s}_n,\bm{a}_1,\bm{s}_{n+1})\}$
9:   $\{\bm{\mu}_m \leftarrow \{\bm{\mu}_{m,2:T},\bm{0}\}\}_{m=1}^{M}$   // warm startup
10:  $\{(\bm{\Sigma}_m,\pi_m) \leftarrow (\bm{\Sigma}_{\mathrm{init}},1/M)\}_{m=1}^{M}$
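The warm start and reset at the end of each control step of Alg. 2 can be sketched as below. The array layout `(M, T, action_dim)`, the diagonal variance, and the name `warm_start` are assumptions made for illustration.

```python
import numpy as np

def warm_start(mu, sigma_init):
    """Shift each component mean one step forward and reset Sigma and pi.

    mu:         (M, T, action_dim) component means over the planning horizon
    sigma_init: initial variance, broadcastable to mu.shape (diagonal assumption)
    """
    M = mu.shape[0]
    # mu_m <- {mu_{m,2:T}, 0}: drop the executed first action, pad the tail with zeros.
    mu_next = np.concatenate([mu[:, 1:], np.zeros_like(mu[:, :1])], axis=1)
    # (Sigma_m, pi_m) <- (Sigma_init, 1/M): reset to keep the mixture from collapsing.
    sigma = np.broadcast_to(sigma_init, mu.shape).copy()
    pi = np.full(M, 1.0 / M)
    return mu_next, sigma, pi

# Two components, horizon T = 3, one action dimension.
mu = np.arange(6, dtype=float).reshape(2, 3, 1)
mu_next, sigma, pi = warm_start(mu, sigma_init=0.25)
```

Only the means carry information across control steps; the variances and mixture coefficients are re-initialized every step, as described in the text.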

5 Experiments

5.1 Comparison to State-of-the-art Methods

The main objective of this experiment is to demonstrate that PaETS has advantages over the state-of-the-art MBRL baseline PETS [chua2018deep]. In this experiment, PaETS and PETS (or vanilla CEM) were implemented using the same codebase with different parameters, i.e., VIMPC(‘CEM’, ‘GMM(M=5)’, True) for PaETS and VIMPC(‘CEM’, ‘GMM(M=1)’, False) for PETS. We also evaluated another MBRL baseline with MPPI [williams2017information], realized as VIMPC(‘MPPI’, ‘GMM(M=1)’, False). These methods share the settings for $p_{\mathcal{D}}(\theta)$ inference (training of network ensembles). The state-of-the-art MFRL method SAC [haarnoja2018soft] was also evaluated to compare asymptotic performance. (Footnote 3: We used the open-source code: https://github.com/pranz24/pytorch-soft-actor-critic) Fig. 3 illustrates the simulated locomotion tasks evaluated in this experiment, which are complex and challenging due to their high non-linearity. None of the tasks except HalfCheetah were evaluated in the original PETS paper [chua2018deep]. Other details about our implementation and experimental settings are described in Sec. G and Sec. H. Fig. 4 presents the experimental results, in which PaETS consistently exhibits better asymptotic performance than the MBRL baselines. In addition, PaETS outperforms or is comparable to SAC while requiring significantly fewer samples (about 10 times more sample efficient).

Figure 4: Learning curves for different tasks and algorithms. These are averaged results of 8 (for MBRL) and 20 (for SAC) independent training trials with different random seeds. We stopped the training when convergence was observed or after reaching the specified test steps (500 for MBRL and 5,000 for SAC). The asymptotic performances (averages of the last 10 test steps) are depicted in dashed lines.

5.2 Ablation Study

This experiment clarifies which component of PaETS (GMM or entropy regularization) contributed to the overall improvement. Fig. 5 shows the results of this ablation study and Welch's t-test for some selected representative pairs. From this figure, one can observe that the use of a GMM ($M=5$) significantly improves performance. The effect of the regularization ($\kappa>0$) is relatively small, but not negligible. In certain tasks, setting $\kappa$ to particular values could improve the performance. In the case of $M>1$, the regularization gives more weight to actions sampled from components with low $\pi_m$, thus encouraging $q(\bm{a};\phi)$ to be multimodal. In some tasks that require rather delicate control (e.g., Hopper, Walker2d), the effect of $\kappa$ seems less significant. Fig. 6 examines sensitivity to the number of mixture components $M$, for which $M=5$ achieved the highest performance. If infinite or sufficiently many samples are given ($K \gg 0$), it would be reasonable to set $M$ large enough to capture multimodality. However, in practice, $K$ is finite and may be small due to computational constraints. In this case, a larger $M$ makes it difficult to approximate $q(\bm{a};\phi)$ as a set of particles $q(\bm{a};\mathbf{W})$, resulting in degradation of the optimization performance.

Figure 5: Asymptotic performance comparison with varying $M$ and $\kappa$. These are averaged results over 8 different MBRL trials and the last 10 test steps. The error bars denote confidence intervals (95%). Symbols ‘*’, ‘**’ and ‘n.s.’ respectively mean $p<0.05$, $p<0.01$ and $p \geq 0.05$ in Welch's t-test.
Figure 6: Asymptotic performance comparison with varying $M \in \{1,3,5,7\}$ and fixed $\kappa\,(=0.5)$. Only the HalfCheetah task is evaluated in this test.

6 Related Work

Dynamics Posterior Inference. Recent MBRL methods, MB-MPO (Model-Based Meta-Policy-Optimization) [clavera2018model] and ME-TRPO (Model Ensemble Trust Region Optimization) [kurutach2018model], also employ network ensembles to model dynamics, but they use the ensembles to train policy networks rather than for MPC.

Trajectory Optimization. Sequential Monte Carlo based MPC, described as VIMPC(*, ‘Particles’, False), has been introduced in [kantas2009sequential], but it requires a well-designed proposal distribution to sample particles for the next iteration $j+1$. Another particle-based method has been derived [piche2018probabilistic] by utilizing the control-as-inference framework. However, this method relies not only on a dynamics model but also on policy and value functions to manage particles, so MFRL methods must be incorporated.

Recent studies have demonstrated that entropy regularization is a promising strategy in policy training  [abdolmaleki2015model, abdolmaleki2017deriving, haarnoja2017reinforcement, haarnoja2018soft]. However, to the best of our knowledge, the introduction of entropy regularization to MPC is novel along with explicit multimodal expression to successfully realize their synergistic effect.

Ref. [wagener2019online] also systematically organizes the stochastic MPC methods from the perspective of online learning, but uncertainty-aware discussions from a Bayesian viewpoint are not conducted.

Bayesian Reformulation. Ref. [jeon2018bayesian] proposes a novel approach to generative adversarial imitation learning (GAIL) [ho2016generative], which reformulates general GAIL in a Bayesian fashion and utilizes ensembles to infer discriminator posteriors. Another Bayesian reformulation of GAIL integrates imitation and reinforcement learning by introducing another optimality (i.e., imitation optimality $\mathcal{O}_t^{I}$) [kinose2019integration].

7 Conclusion & Discussions

This paper introduces a novel VI-MPC framework that systematically generalizes and reformulates various stochastic MPC methods in a Bayesian fashion. We also devise a novel instance of this framework, called PaETS, which can successfully incorporate multimodal uncertainty in optimal trajectories. By combining our method and the recent uncertainty-aware dynamics modeling with neural network ensembles, our Bayesian MBRL is able to involve multimodalities both in dynamics and optimalities. In addition, our method is a quite simple extension of general stochastic methods and requires no significant additional computational complexity. Our experiments demonstrate that PaETS can improve asymptotic performance compared to the leading MBRL baseline PETS, and thus substantially enhances MBRL potential to be more competitive to the state-of-the-art MFRL.

Considering the simplicity and generalizability of VI-MPC and PaETS, we expect that our concept is applicable to a variety of tasks, such as traditional MPC with deterministic dynamics and advanced MPC with latent dynamics from pixels by Deep Planning Network [hafner2018learning]. By introducing a categorical mixture model as a variational distribution, application to combinatorial optimization is also feasible. In fact, our ongoing work includes experiments with discrete MPC for a practical system.

A question that remains is how to determine VI-MPC specifications. As implied in Fig. 4, the best optimality definition could be task dependent (e.g., MPPI outperformed vanilla CEM in the Ant but not in other tasks). The regularization weight κ also has task dependency as shown in Fig. 5. It would be challenging but interesting future work to add the parameters to the graphical model in Fig. 2 as latent variables to infer promising parameters along with optimal trajectories, like infinite GMM [rasmussen2000infinite]. Another appealing endeavor for future work is to introduce the concept of parallel tempering [brooks2011handbook] in Markov Chain Monte Carlo. By adaptively varying different temperatures (λ in our case) of ensemble actions, we can expect the ensemble diversity to improve.

Acknowledgments

We thank Vishwajeet Singh, Hiroki Nakamura and Akira Kinose for their cooperation in this study during their student-internship periods. Most of the experiments were conducted in ABCI (AI Bridging Cloud Infrastructure), built by the National Institute of Advanced Industrial Science and Technology, Japan.

References

Appendix A Comparison Between $\mathcal{W}$ and $\mathcal{W}'$

We evaluated the impact of $\mathcal{W}(\bm{a}) = \mathbb{E}[f(r(\tau))]$ and $\mathcal{W}'(\bm{a}) = f(\mathbb{E}[r(\tau)])$ on the optimization performance of (vanilla) CEM and MPPI. The results are summarized in Table 2, where $\mathcal{W}'$ gained much higher rewards than $\mathcal{W}$.

Table 2: Episode reward of the HalfCheetah task with $\mathcal{W}$ and $\mathcal{W}'$. A common dynamics model (an ensemble neural network sufficiently trained by MBRL) was employed for this test. Ten different trials were conducted and the results were averaged.

Method | $\mathcal{W}$ | $\mathcal{W}'$
CEM | 5603.24±541.31 | 11843.05±295.80
MPPI | 2789.03±647.82 | 9765.27±231.04

Appendix B Derivations

B.1 Derivation of the Variational Inference Objective

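By using the assumption $q_\theta(\tau) = q(\bm{a})\,p(\bm{s}|\bm{a},\theta)\,p_{\mathcal{D}}(\theta)$, the KL divergence can be transformed as

\[
\mathrm{KL}(q_\theta(\tau)\,\|\,p(\tau,\theta|\mathcal{O})) = \iint q_\theta(\tau) \log\frac{q_\theta(\tau)}{p(\tau,\theta|\mathcal{O})}\,d\tau\,d\theta \tag{11}
\]
\[
= \iint q_\theta(\tau) \log\frac{q(\bm{a})\,p(\bm{s}|\bm{a},\theta)\,p_{\mathcal{D}}(\theta)}{p(\mathcal{O}|\tau)\,p(\bm{s}|\bm{a},\theta)\,p_{\mathcal{D}}(\theta)}\,d\tau\,d\theta \tag{12}
\]
\[
= -\mathbb{E}_{q_\theta(\tau)}[\log p(\mathcal{O}|\tau) - \log q(\bm{a})]. \tag{13}
\]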

Appendix C Derivation of (5)

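In this section, we write $q(\boldsymbol{a})$ as $q_{\boldsymbol{a}}$ and $q(\tau)\,\big(= q_{\boldsymbol{a}}\,p(\boldsymbol{s}|\boldsymbol{a},\theta)\,p_{\mathcal{D}}(\theta)\big)$ as $q_\tau$ for readability. Let us consider the optimization problem:

$$
\operatorname*{argmin}_{q_\tau}\mathcal{J} = \operatorname*{argmin}_{q_\tau}\,\mathbb{E}_{q_\tau}\big[-\log p(\mathcal{O}|\tau) + \alpha\log q_{\boldsymbol{a}}\big]. \qquad (14)
$$

By applying mirror descent [bubeck2015convex], the iterative update law of $q_\tau^{(j+1)}$ is given as

$$
q_\tau^{(j+1)} = \operatorname*{argmin}_{q_\tau}\;\big\langle\nabla_{q_\tau}\mathcal{J},\,q_\tau\big\rangle + \beta\,\mathrm{KL}\big(q_\tau\,\|\,q_\tau^{(j)}\big) + \gamma\Big(\iint q_\tau\,d\tau\,d\theta - 1\Big), \qquad (15)
$$

where $\langle\cdot,\cdot\rangle$ is the inner-product operator, $\beta$ is a hyperparameter related to the step size, and $\gamma$ is the Lagrange multiplier for the constraint $\iint q_\tau\,d\tau\,d\theta = 1$. The argument of the argmin operator can be rearranged as

$$
\iint q_\tau\Big(-\log p(\mathcal{O}|\tau) + \alpha\log q_{\boldsymbol{a}} + \beta\log q_{\boldsymbol{a}} - \beta\log q_{\boldsymbol{a}}^{(j)} + \gamma\Big)\,d\tau\,d\theta - \gamma, \qquad (16)
$$

where we used the relations:

$$
\begin{aligned}
\big\langle\nabla_{q_\tau}\mathcal{J},\,q_\tau\big\rangle &= \mathcal{J}, && (17)\\
\mathrm{KL}\big(q_\tau\,\|\,q_\tau^{(j)}\big) &= \iint q_\tau\log\frac{q_\tau}{q_\tau^{(j)}}\,d\tau\,d\theta = \iint q_\tau\log\frac{q_{\boldsymbol{a}}}{q_{\boldsymbol{a}}^{(j)}}\,d\tau\,d\theta. && (18)
\end{aligned}
$$

The integrand of (16) can be organized as

$$
\begin{aligned}
q_\tau\log\frac{q_{\boldsymbol{a}}^{\alpha+\beta}}{p(\mathcal{O}|\tau)\,e^{-\gamma}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{\beta}}
&\propto q_\tau\log\frac{q_{\boldsymbol{a}}}{p(\mathcal{O}|\tau)^{\frac{1}{\alpha+\beta}}\,e^{-\frac{\gamma}{\alpha+\beta}}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{\frac{\beta}{\alpha+\beta}}} && (19)\\
&= q_\tau\log\frac{p(\boldsymbol{s}|\boldsymbol{a},\theta)\,p_{\mathcal{D}}(\theta)\,q_{\boldsymbol{a}}}{\big(p(\boldsymbol{s}|\boldsymbol{a},\theta)\,p_{\mathcal{D}}(\theta)\,q_{\boldsymbol{a}}^{(j)}\big)\,p(\mathcal{O}|\tau)^{\frac{1}{\alpha+\beta}}\,e^{-\frac{\gamma}{\alpha+\beta}}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{-\frac{\alpha}{\alpha+\beta}}} && (20)\\
&= q_\tau\log\frac{q_\tau}{q_\tau^{(j)}\,p(\mathcal{O}|\tau)^{\frac{1}{\alpha+\beta}}\,e^{-\frac{\gamma}{\alpha+\beta}}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{-\frac{\alpha}{\alpha+\beta}}}. && (21)
\end{aligned}
$$

Integrating the above expression and restoring the factor $(\alpha+\beta)$ dropped in (19) yields

$$
(16) = (\alpha+\beta)\,\mathrm{KL}\Big(q_\tau\,\Big\|\,q_\tau^{(j)}\,p(\mathcal{O}|\tau)^{\frac{1}{\alpha+\beta}}\,e^{-\frac{\gamma}{\alpha+\beta}}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{-\frac{\alpha}{\alpha+\beta}}\Big) - \gamma. \qquad (22)
$$

Minimizing this expression with respect to $q_\tau$, we get:

$$
q_\tau^{(j+1)} = q_\tau^{(j)}\,p(\mathcal{O}|\tau)^{\frac{1}{\alpha+\beta}}\,e^{-\frac{\gamma}{\alpha+\beta}}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{-\frac{\alpha}{\alpha+\beta}}. \qquad (23)
$$

The Lagrange multiplier can be removed using the constraint $\iint q_\tau^{(j+1)}\,d\tau\,d\theta = 1$:

$$
\begin{aligned}
e^{\frac{\gamma}{\alpha+\beta}} &= \mathbb{E}_{q_\tau^{(j)}}\Big[p(\mathcal{O}|\tau)^{\frac{1}{\alpha+\beta}}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{-\frac{\alpha}{\alpha+\beta}}\Big] && (24)\\
&= \mathbb{E}_{\boldsymbol{a}\sim q_{\boldsymbol{a}}^{(j)}}\Big[\underbrace{\mathbb{E}_{\boldsymbol{s}\sim p(\boldsymbol{s}|\boldsymbol{a},\theta),\,\theta\sim p_{\mathcal{D}}(\theta)}\Big[p(\mathcal{O}|\tau)^{\frac{1}{\alpha+\beta}}\Big]}_{(*)}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{-\frac{\alpha}{\alpha+\beta}}\Big]. && (25)
\end{aligned}
$$

Considering the discussion in Sec. 2.2 and Appendix A, we compute $(*)$ as

$$
(*) \approx f\big(\mathbb{E}[r(\tau)]\big)^{\frac{1}{\alpha+\beta}} = \mathcal{W}(\boldsymbol{a})^{\frac{1}{\alpha+\beta}}. \qquad (26)
$$

Substituting (25) and (26) into (23) results in:

$$
q_\tau^{(j+1)} = \frac{q_\tau^{(j)}\,p(\mathcal{O}|\tau)^{\frac{1}{\alpha+\beta}}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{-\frac{\alpha}{\alpha+\beta}}}{\mathbb{E}_{\boldsymbol{a}\sim q_{\boldsymbol{a}}^{(j)}}\Big[\mathcal{W}(\boldsymbol{a})^{\frac{1}{\alpha+\beta}}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{-\frac{\alpha}{\alpha+\beta}}\Big]}. \qquad (27)
$$

Marginalizing over $(\boldsymbol{s},\theta)$, we finally obtain:

$$
q_{\boldsymbol{a}}^{(j+1)} = \frac{q_{\boldsymbol{a}}^{(j)}\,\mathcal{W}(\boldsymbol{a})^{\frac{1}{\alpha+\beta}}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{-\frac{\alpha}{\alpha+\beta}}}{\mathbb{E}_{\boldsymbol{a}\sim q_{\boldsymbol{a}}^{(j)}}\Big[\mathcal{W}(\boldsymbol{a})^{\frac{1}{\alpha+\beta}}\,\big(q_{\boldsymbol{a}}^{(j)}\big)^{-\frac{\alpha}{\alpha+\beta}}\Big]}. \qquad (28)
$$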

In (5), we substituted $\lambda := \alpha+\beta$ and $\kappa := \alpha/(\alpha+\beta)$.
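To make the final update rule concrete, the following is a minimal sketch of one iteration of (28) on a finite action set, written with λ and κ. The function name `vi_mpc_update` and the restriction to a discrete set of candidate actions are illustrative assumptions and are not part of the original method, which operates on continuous action sequences.

```python
import numpy as np

def vi_mpc_update(q, W, lam=1.0, kappa=0.0):
    """One step of (28) on a finite action set (illustrative assumption).

    q: current probabilities q^{(j)}(a), shape (N,), strictly positive
    W: optimality weights W(a), shape (N,), strictly positive
    """
    # Numerator of (28) in log space: q * W^(1/lam) * q^(-kappa)
    log_new = (1.0 - kappa) * np.log(q) + np.log(W) / lam
    log_new -= log_new.max()      # stabilize the exponentials
    new = np.exp(log_new)
    # The sum plays the role of the expectation in the denominator of (28)
    return new / new.sum()

q0 = np.full(4, 0.25)
W = np.array([1.0, 2.0, 4.0, 8.0])
q1 = vi_mpc_update(q0, W, lam=1.0, kappa=0.0)
```

With κ=0 the update simply reweights the current distribution by 𝒲(𝒂)^{1/λ}. With κ=1 the dependence on q^{(j)} cancels, so the result is proportional to 𝒲(𝒂)^{1/λ} regardless of the starting distribution.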

Appendix D Complete Definition of PaETS

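$$
\begin{aligned}
\eta_m(\boldsymbol{a}_k) &:= \frac{\pi_m^{(j)}\,\mathcal{N}\big(\boldsymbol{a}_k;\boldsymbol{\mu}_m^{(j)},\boldsymbol{\Sigma}_m^{(j)}\big)}{\sum_{m'=1}^{M}\pi_{m'}^{(j)}\,\mathcal{N}\big(\boldsymbol{a}_k;\boldsymbol{\mu}_{m'}^{(j)},\boldsymbol{\Sigma}_{m'}^{(j)}\big)} && (29)\\
\omega_{m,k}^{(j+1)} &:= \frac{\eta_m(\boldsymbol{a}_k)\,w_k^{(j+1)}}{\underbrace{\sum_{k'=1}^{K}\eta_m(\boldsymbol{a}_{k'})\,w_{k'}^{(j+1)}}_{:=\,N_m}} && (30)\\
\boldsymbol{\mu}_m^{(j+1)} &\leftarrow \sum_{k=1}^{K}\omega_{m,k}^{(j+1)}\,\boldsymbol{a}_k && (31)\\
\boldsymbol{\Sigma}_m^{(j+1)} &\leftarrow \sum_{k=1}^{K}\omega_{m,k}^{(j+1)}\,\big(\boldsymbol{a}_k-\boldsymbol{\mu}_m^{(j+1)}\big)^2 && (32)\\
\pi_m^{(j+1)} &\leftarrow \frac{N_m}{\sum_{m'=1}^{M}N_{m'}}. && (33)
\end{aligned}
$$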
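The update in (29)–(33) can be sketched in NumPy as follows. This is a minimal illustration, assuming diagonal covariances (so that the square in (32) is elementwise) and flattened action sequences of dimension D; the function name `gmm_update` and the array layout are assumptions made here, not the authors' implementation.

```python
import numpy as np

def gmm_update(actions, w, pi, mu, var):
    """Weighted GMM parameter update following (29)-(33).

    actions: (K, D) sampled actions a_k
    w:       (K,)   sample weights w_k^{(j+1)}
    pi:      (M,)   mixture weights pi_m^{(j)}
    mu, var: (M, D) component means and diagonal variances
    """
    # (29): responsibilities eta_m(a_k), computed in log space for stability
    diff = actions[None, :, :] - mu[:, None, :]                       # (M, K, D)
    log_pdf = -0.5 * np.sum(
        diff**2 / var[:, None, :] + np.log(2.0 * np.pi * var[:, None, :]),
        axis=2,
    )                                                                 # (M, K)
    log_eta = np.log(pi)[:, None] + log_pdf
    log_eta -= log_eta.max(axis=0, keepdims=True)
    eta = np.exp(log_eta)
    eta /= eta.sum(axis=0, keepdims=True)
    # (30): per-component normalizer N_m and normalized weights omega_{m,k}
    N = (eta * w[None, :]).sum(axis=1)                                # (M,)
    omega = eta * w[None, :] / N[:, None]                             # (M, K)
    # (31)-(33): new means, diagonal variances, and mixture weights
    new_mu = omega @ actions                                          # (M, D)
    sq = (actions[None, :, :] - new_mu[:, None, :]) ** 2
    new_var = np.einsum("mk,mkd->md", omega, sq)
    new_pi = N / N.sum()
    return new_pi, new_mu, new_var
```

For M=1 and uniform weights, the update reduces to the sample mean and (population) variance of the sampled actions, which is a convenient sanity check. A practical implementation would likely also guard against components whose normalizer N_m becomes numerically zero.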

Appendix E Optimization of Toy Objective Function by PaETS

Fig. 7 illustrates how PaETS optimizes q(𝒂;ϕ(j)) in a toy multimodal objective function.

Figure 7: The optimization process of a 2D multimodal objective function by PaETS (VIMPC(‘MPPI’, ‘GMM(M=2)’, True)), in which the two distribution components are successfully optimized to fit the two modes. The markers depict particles that approximate q(𝒂;ϕ(j)).

Appendix F Computational Complexity

The main computational bottleneck of PaETS (and PETS) is the execution of lines 3–6 in Alg. 1, in which a total of K×P trajectories must be sampled. In our experiments, K and P were set to K=500 and P=20, respectively, as in [chua2018deep]. Compared to PETS, PaETS requires additional procedures such as action sampling from the GMM (2) and the GMM parameter update (9). However, these additional procedures are easily parallelizable on GPUs, and their computation times are much shorter than that of the above-mentioned bottleneck. In experiments with our early prototype in TensorFlow, one iteration of the for-loop in Alg. 1 took about 57 ms for M=5 and 55 ms for M=1 (equivalent to PETS) on a single NVIDIA RTX2080 GPU. This execution time does not meet real-time constraints (e.g., 30 Hz). However, considering the success of the real-time implementation of MPPI in [williams2016aggressive, williams2017information], we believe that a real-time implementation of our method is feasible with an optimized implementation using a compiled language, low-level GPU APIs, and thorough tuning of hyperparameters (e.g., K, P, and DNN complexity).

Appendix G Implementation Notes

Cross Entropy Method

It is a common technique to determine $r_{\mathrm{thd}}$ in Table 1 adaptively so that only the top-e% of the samples satisfy the threshold condition. We employ this technique and set the elite ratio to e=10%. λ has no effect on CEM optimization since f(·) takes binary values.
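The adaptive threshold can be sketched as below. This is a minimal illustration that assumes the threshold is taken as an empirical quantile of the K sampled rewards; the function name `cem_weights` is hypothetical, and with tied rewards slightly more than the top e% may pass the threshold.

```python
import numpy as np

def cem_weights(rewards, elite_ratio=0.1):
    """Binary CEM weights: 1 for samples at or above the adaptive threshold."""
    # r_thd is chosen so that roughly the top `elite_ratio` fraction passes
    r_thd = np.quantile(rewards, 1.0 - elite_ratio)
    return (rewards >= r_thd).astype(float)

rewards = np.arange(100, dtype=float)
w = cem_weights(rewards)   # the ten largest rewards receive weight 1
```

Because the weights are 0 or 1, raising them to the power 1/λ leaves them unchanged, which is why λ has no effect in the CEM case.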

MPPI

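Reward normalization heuristics, as suggested in [theodorou2010generalized], were also introduced in our MPPI implementation as

$$
\mathcal{W}(\boldsymbol{a}_k)^{\frac{1}{\lambda}} = \exp\left\{\frac{1}{\lambda}\cdot\frac{r(\tau_k)-\min\{r(\tau_k)\}_{k=1}^{K}}{\max\{r(\tau_k)\}_{k=1}^{K}-\min\{r(\tau_k)\}_{k=1}^{K}}\right\}, \qquad (34)
$$

where $r(\tau_k) = \frac{1}{P}\sum_{i=1}^{P} r(\tau_{k,i})$. λ was set to λ=0.1, as also suggested in [theodorou2010generalized].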
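A minimal NumPy sketch of the normalized MPPI weight in (34) is given below. The function name `mppi_weights` and the (K, P) array layout are assumptions for illustration; the guard for the degenerate case where all rewards are equal is also an addition not discussed in the text.

```python
import numpy as np

def mppi_weights(traj_rewards, lam=0.1):
    """Compute W(a_k)^(1/lam) from (34).

    traj_rewards: (K, P) array holding r(tau_{k,i}) for each action k
                  and each of its P sampled trajectories.
    """
    r = traj_rewards.mean(axis=1)                # r(tau_k): average over P
    span = r.max() - r.min()
    # Min-max normalization to [0, 1]; fall back to zeros if all rewards tie
    r_norm = (r - r.min()) / span if span > 0 else np.zeros_like(r)
    return np.exp(r_norm / lam)

traj_rewards = np.array([[0.0, 2.0], [4.0, 6.0], [10.0, 12.0]])
w = mppi_weights(traj_rewards, lam=0.1)
```

After normalization the weights always lie in [1, e^{1/λ}], so the same λ can be reused across tasks whose raw reward scales differ widely.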

Entropy Regularization

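The appropriate value of κ is very sensitive to the task settings, especially to the dimensionality of the action space. To make κ less sensitive, we introduced the following normalization trick inspired by the above heuristic. First, we rearrange (8) as

$$
w_k^{(j+1)} \propto \mathcal{W}(\boldsymbol{a}_k)^{\frac{1}{\lambda}}\exp\big\{\kappa\big(-\log q^{(j)}(\boldsymbol{a}_k)\big)\big\}. \qquad (35)
$$

Then, we replace $-\log q^{(j)}(\boldsymbol{a}_k)$ with its normalized counterpart:

$$
-\log q^{(j)}(\boldsymbol{a}_k) \leftarrow \frac{-\log q^{(j)}(\boldsymbol{a}_k)-\min\{-\log q^{(j)}(\boldsymbol{a}_k)\}_{k=1}^{K}}{\max\{-\log q^{(j)}(\boldsymbol{a}_k)\}_{k=1}^{K}-\min\{-\log q^{(j)}(\boldsymbol{a}_k)\}_{k=1}^{K}} \in [0,1]. \qquad (36)
$$

By applying this heuristic, the range of the entropy bonus is limited to $[1, e^{\kappa}]$, where the action with the lowest probability among the K samples gains the highest entropy bonus of $e^{\kappa}$.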
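The normalized entropy bonus can be sketched as follows. This is an illustrative reading of (35) and (36); the function names `entropy_bonus` and `regularized_weights`, and the final normalization of the weights to sum to one, are assumptions made for this sketch.

```python
import numpy as np

def entropy_bonus(log_q, kappa):
    """exp(kappa * normalized(-log q)), following (36).

    log_q: (K,) values of log q^{(j)}(a_k) for the K sampled actions.
    """
    neg = -log_q
    span = neg.max() - neg.min()
    # Min-max normalization to [0, 1]; zeros if all samples are equally likely
    neg_norm = (neg - neg.min()) / span if span > 0 else np.zeros_like(neg)
    return np.exp(kappa * neg_norm)

def regularized_weights(W_pow, log_q, kappa):
    """Combine W(a_k)^(1/lam) with the entropy bonus as in (35), then normalize."""
    w = W_pow * entropy_bonus(log_q, kappa)
    return w / w.sum()

log_q = np.log(np.array([0.5, 0.3, 0.2]))
bonus = entropy_bonus(log_q, kappa=0.5)
```

In this example the most probable sample receives a bonus of 1 and the least probable sample receives e^{0.5}, matching the stated range [1, e^κ].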

Appendix H Experimental Setup

We used MuJoCo tasks modified from the standard OpenAI Gym tasks (https://github.com/openai/gym). Table 3 summarizes the task settings, where $v_x$, $\varphi$, and $z$ respectively denote the velocity, orientation angle, and height of the agents. The penalty functions Φ and Ψ are newly introduced to encourage the agents to move forward in the proper form. In exchange, the done flags originally used for early task termination are removed. Φ and Ψ are defined as

$$
\begin{aligned}
\Phi(z, z_{\mathrm{des}}) &= e^{-(z-z_{\mathrm{des}})^2}, && (37)\\
\Psi(\varphi) &= \frac{1+\cos(2\varphi)}{2}. && (38)
\end{aligned}
$$

We modified the range of actions (i.e., torques) from [-1,1] to [-5,5] to exaggerate uncertainties in the optimal trajectory posteriors.

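Table 3: MuJoCo task settings.

| Task | Reward function | Dim. of $\boldsymbol{s}_t$ | Dim. of $\boldsymbol{a}_t$ | Misc. |
|---|---|---|---|---|
| HalfCheetah | $v_x\,\frac{1+\mathrm{sign}(\cos(\varphi))}{2} - 0.1\lVert\boldsymbol{a}_t\rVert^2$ | 18 | 6 | |
| Ant | $v_x\,\Phi(z,z_{\mathrm{des}}) - 10^{-3}\lVert\boldsymbol{a}_t\rVert^2$ | 28 | 8 | $z_{\mathrm{des}}=0.75$ |
| Hopper | $v_x\,\Phi(z,z_{\mathrm{des}})\,\Psi(\varphi) - 10^{-3}\lVert\boldsymbol{a}_t\rVert^2$ | 12 | 3 | $z_{\mathrm{des}}=1.2$ |
| Walker2d | $v_x\,\Phi(z,z_{\mathrm{des}})\,\Psi(\varphi) - 10^{-3}\lVert\boldsymbol{a}_t\rVert^2$ | 18 | 6 | $z_{\mathrm{des}}=1.2$ |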
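The reward definitions of Table 3 can be written out as a short sketch. The function names and argument conventions below are assumptions for illustration; in particular, how $v_x$, $\varphi$, and $z$ are extracted from the MuJoCo state is not specified here and is left to the caller.

```python
import numpy as np

def phi(z, z_des):
    """Height penalty (37): equals 1 at the desired height, decays away from it."""
    return np.exp(-(z - z_des) ** 2)

def psi(angle):
    """Orientation penalty (38): equals 1 when upright, 0 at a quarter turn."""
    return (1.0 + np.cos(2.0 * angle)) / 2.0

def reward_halfcheetah(vx, angle, action):
    # Forward velocity counts only while cos(angle) is positive
    upright = (1.0 + np.sign(np.cos(angle))) / 2.0
    return vx * upright - 0.1 * np.sum(np.square(action))

def reward_ant(vx, z, action, z_des=0.75):
    return vx * phi(z, z_des) - 1e-3 * np.sum(np.square(action))

def reward_hopper_walker2d(vx, z, angle, action, z_des=1.2):
    return vx * phi(z, z_des) * psi(angle) - 1e-3 * np.sum(np.square(action))
```

Since Φ and Ψ both lie in [0, 1], they scale down the velocity reward smoothly when the agent deviates from the desired posture, rather than ending the episode as the removed done flags did.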

Table 4 summarizes the shared parameter settings for MBRL (PaETS, PETS, and MPPI). For SAC, we used the default parameters from the original codebase.

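Table 4: MBRL parameters.

| Parameter | HalfCheetah | Ant | Hopper | Walker2d |
|---|---|---|---|---|
| T: prediction horizon | 30 | 30 | 60 | 45 |
| κ: weight of entropy regularizer | 0.5 | 0.25 | 0.5 | 0.5 |
| K: # sampled actions | 500 | 500 | 500 | 500 |
| P: # trajectories for each action | 20 | 20 | 20 | 20 |
| U: # optimization iterations | 5 | 5 | 5 | 5 |
| H: episode length | 1000 | 1000 | 1000 | 1000 |
| E: # neural networks | 5 | 5 | 5 | 5 |
| hidden nodes | (200, 200, 200, 200) | (200, 200, 200, 200) | (200, 200, 200, 200) | (200, 200, 200, 200) |
| activation function | Swish | Swish | Swish | Swish |
| optimizer | Adam | Adam | Adam | Adam |
| learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| batch size | 160 | 160 | 160 | 160 |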
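For reference, the settings of Table 4 could be collected in a small configuration object such as the one below. The class name `MBRLConfig`, the field names, and the dictionary `CONFIGS` are hypothetical and only illustrate one way to organize the values; only the numbers themselves come from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MBRLConfig:
    horizon: int                   # T: prediction horizon (task specific)
    kappa: float                   # weight of entropy regularizer (task specific)
    num_actions: int = 500         # K: sampled actions
    num_particles: int = 20        # P: trajectories for each action
    num_iterations: int = 5        # U: optimization iterations
    episode_length: int = 1000     # H
    ensemble_size: int = 5         # E: neural networks in the ensemble
    hidden_nodes: tuple = (200, 200, 200, 200)
    activation: str = "swish"
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    batch_size: int = 160

CONFIGS = {
    "HalfCheetah": MBRLConfig(horizon=30, kappa=0.5),
    "Ant": MBRLConfig(horizon=30, kappa=0.25),
    "Hopper": MBRLConfig(horizon=60, kappa=0.5),
    "Walker2d": MBRLConfig(horizon=45, kappa=0.5),
}

# Number of trajectories sampled per optimization iteration (K x P)
trajectories_per_iteration = {
    name: cfg.num_actions * cfg.num_particles for name, cfg in CONFIGS.items()
}
```

With the shared values K=500 and P=20, each optimization iteration samples 10,000 trajectories for every task.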