# Variational Inference MPC for Bayesian Model-based Reinforcement Learning

###### Abstract

In recent studies on model-based reinforcement learning (MBRL), incorporating uncertainty in forward dynamics is a state-of-the-art strategy to enhance learning performance, making MBRL competitive with cutting-edge model-free methods, especially in simulated robotics tasks. Probabilistic ensembles with trajectory sampling (PETS) is a leading MBRL method, which applies Bayesian inference to dynamics modeling and performs model predictive control (MPC) with stochastic optimization via the cross entropy method (CEM). In this paper, we propose a novel extension to uncertainty-aware MBRL. Our main contributions are twofold. First, we introduce a variational inference MPC (VI-MPC), which reformulates various stochastic methods, including CEM, in a Bayesian fashion. Second, we propose a novel instance of the framework, called probabilistic action ensembles with trajectory sampling (PaETS). As a result, our Bayesian MBRL can capture multimodal uncertainties both in dynamics and in optimal trajectories. In comparison to PETS, our method consistently improves asymptotic performance on several challenging locomotion tasks.


Masashi Okada (Panasonic Corp., Japan)

Tadahiro Taniguchi (Ritsumeikan Univ. & Panasonic Corp., Japan)

Keywords: model predictive control, variational inference, model-based reinforcement learning

## 1 Introduction

Model predictive control (MPC) is a powerful and widely accepted technology for advanced control systems such as manufacturing processes [vargas2000multilayer], HVAC systems [afram2014theory], power electronics [vazquez2014model], autonomous vehicles [paden2016survey], and humanoids [kuindersma2016optimization]. MPC uses specified models of system dynamics to predict future states and rewards (or costs), and plans future actions that maximize the total reward over the predicted trajectories. Especially for industrial applications, the clear explainability of such a decision-making process is advantageous. Furthermore, in some tasks (e.g., games) [silver2016mastering], planning-based policies of this nature can outperform reactive policies (e.g., full neural network policies).

Model-based reinforcement learning (MBRL) methods that employ expressive function approximators (e.g., deep neural networks, DNNs) [deisenroth2011pilco, williams2017information, nagabandi2018neural] are appealing approaches for MPC. The main difficulty in introducing MPC to practical systems is specifying the forward dynamics models of the target systems, and accurate system identification is challenging in many advanced applications. Take robotics, for example: robots come into contact with floors and walls and must be able to manipulate objects, which makes the dynamics highly non-linear. The main objective of MBRL is to train approximators of such complex dynamics through experience in real systems. The general procedure of MBRL can be summarized as follows: (1) in the training step, train the approximate model with a given training dataset; then (2) in the test step, execute the actions (or policies) optimized with the dynamics model in a real environment and augment the dataset with the observed results. These training and test steps are repeated to collect sufficient and diverse data so as to achieve the desired performance.

One feature of MBRL is its considerable sample efficiency compared to model-free reinforcement learning (MFRL), which directly trains policies through experiences. In other words, MBRL requires much less test time in real environments. In addition, MBRL benefits from the generalizability of the trained model, which can be easily applied to new tasks in the same system. However, the asymptotic performance of MBRL is generally inferior to that of model-free methods. This discrepancy is primarily due to the overfitting of dynamics models to the few data available during initial MBRL steps, which is called the model-bias problem [deisenroth2011pilco]. Several studies have demonstrated that incorporating uncertainty in dynamics models can alleviate this issue. The uncertainty-aware modeling is realized by Bayesian inference employing a Gaussian Process [deisenroth2011pilco], dropout as variational inference [gal2016dropout, gal2017concrete, kahn2017uncertainty], or neural network ensembles [chua2018deep, kurutach2018model, clavera2018model].

Probabilistic ensembles with trajectory sampling (PETS) [chua2018deep] is one type of uncertainty-aware MBRL. As an MPC-oriented MBRL method, PETS conducts trajectory optimization via the cross entropy method (CEM) [botev2013cross] using trajectories probabilistically sampled from the ensemble networks. Experiments have demonstrated that PETS can achieve performance competitive with state-of-the-art MFRL methods such as Soft Actor Critic (SAC) [haarnoja2018soft], while yielding much higher sample efficiency. Since our primary interest is MPC and its application to practical systems, this paper mainly focuses on PETS and treats this method as a strong baseline.

Considering the success of probabilistic dynamics modeling, incorporating uncertainty in optimal trajectories appears very promising for MBRL. However, an optimization scheme that can utilize uncertainty has not yet been discussed. Although several stochastic approaches, including CEM, model predictive path integral (MPPI) [williams2016aggressive, williams2017information], covariance matrix adaptation evolution strategy (CMA-ES) [hansen2003reducing], and proportional CEM (Prop-CEM) [goschin2013cross], have been proposed, they are not uncertainty-aware and tend to underestimate uncertainty. In addition, although their optimization procedures are very similar, they have been independently derived. Consequently, theoretical relations among these methods are unclear, preventing us from systematically understanding and reformulating them to be uncertainty-aware in a Bayesian fashion.

Motivated by these observations, in this paper we propose a novel MPC concept for Bayesian MBRL. The organization and contributions of this paper are summarized as follows. (1) In Sec. 3, we introduce a novel MPC framework, variational inference MPC (VI-MPC), which generalizes and reformulates various stochastic MPC methods in a Bayesian fashion. The key observations for deriving this framework are organized in Sec. 2, where we point out that general stochastic optimization methods can be regarded as moment matching of the optimal trajectory posterior, which appears in a Bayesian MBRL formulation. (2) In Sec. 4, we propose a novel instance of the framework, called probabilistic action ensembles with trajectory sampling (PaETS). Toy task examples and the concept of our method are shown in Fig. 1. (3) In Sec. 5, we demonstrate through experiments on challenging locomotion tasks in the MuJoCo physics simulator [todorov2012mujoco] that our method consistently outperforms PETS.

## 2 Model-based Reinforcement Learning as Bayesian Inference

In this section, we describe MBRL as a Bayesian inference problem using the control-as-inference framework [levine2018reinforcement]. Fig. 2 displays the graphical model for the formulation, with which an MBRL procedure can be rewritten in a Bayesian fashion: (1) in the training step, infer $p(\theta |\mathcal{D})$; (2) in the test step, infer $p(\tau |{\mathcal{O}}_{1:T}=\mathbf{1})$, then sample actions from the posterior and execute them in a real environment. We denote a trajectory as $\tau :={\{({\bm{s}}_{t},{\bm{a}}_{t})\}}_{t=1}^{T}$, where ${\bm{s}}_{t}$ and ${\bm{a}}_{t}$ respectively represent the state and the action. Given a state-action pair at time $t$, the next state can be predicted by a forward-dynamics model ${\bm{s}}_{t+1}\sim p({\bm{s}}_{t+1}|{\bm{s}}_{t},{\bm{a}}_{t},\theta )$ parameterized by $\theta $. The posterior of $\theta $ is inferred from the training dataset $\mathcal{D}$, where $\mathcal{D}=\{({\bm{s}}_{t},{\bm{a}}_{t},{\bm{s}}_{t+1})\}$ consists of states and actions observed during the test step. To formulate optimal control as inference, we introduce an auxiliary binary random variable ${\mathcal{O}}_{t}\in \{0,1\}$ that represents the optimality of $({\bm{s}}_{t},{\bm{a}}_{t})$. Given $p(\theta |\mathcal{D})$, trajectory optimization can be expressed as an inference problem:

$$p(\tau |\mathcal{O})\propto \int \underbrace{\left\{\prod _{t=1}^{T}p({\mathcal{O}}_{t}=1|{\bm{s}}_{t},{\bm{a}}_{t})\right\}}_{:=p(\mathcal{O}|\tau )}\cdot \underbrace{p\left({\bm{s}}_{1}\right)\left\{\prod _{t=1}^{T}p\left({\bm{s}}_{t+1}|{\bm{s}}_{t},{\bm{a}}_{t},\theta \right)\right\}}_{:=p(\bm{s}|\bm{a},\theta )}\cdot \underbrace{p(\theta |\mathcal{D})}_{:={p}_{\mathcal{D}}(\theta )}\,d\theta ,$$ | (1) |

where an uninformative action prior (i.e., $p({\bm{a}}_{t})=\mathcal{U}$, a uniform distribution) is assumed. For readability, ${\mathcal{O}}_{1:T}=\mathbf{1}$ is simply denoted as $\mathcal{O}$. For the same reason, we omit the subscripts of the sequences ${\bm{a}}_{1:T}$ and ${\bm{s}}_{1:T}$. This simplified notation is used in the remainder of the paper. In Sec. 2.1–2.2, we review how these inference problems have been approximately handled in previous works.

### 2.1 Inference of Forward-dynamics Posterior ${p}_{\mathcal{D}}(\theta )$

Given a sufficiently expressive parameterized model such as a DNN, one of the most practical and promising schemes for approximating the posterior ${p}_{\mathcal{D}}(\theta )$ is to use neural network ensembles [chua2018deep, kurutach2018model, clavera2018model]. This scheme approximates the posterior as a set of particles ${p}_{\mathcal{D}}(\theta )\simeq \frac{1}{E}{\sum}_{i=1}^{E}\delta (\theta -{\theta}_{i})$, where $\delta $ is the Dirac delta function and $E$ is the number of networks. Each particle ${\theta}_{i}$ is independently trained by stochastic gradient descent so as to (sub-)optimize $\mathrm{log}{p}_{\mathcal{D}}(\theta )=\mathrm{log}p(\mathcal{D}|\theta )p(\theta )+\mathrm{const}$. Although this approximation is not fully Bayesian, the scheme has several useful features. First, it can be implemented simply in standard deep learning frameworks. Second, the ensemble model successfully captures multimodal uncertainty in the exact posterior.

Another possible way to infer ${p}_{\mathcal{D}}(\theta )$ is dropout as variational inference [gal2016dropout, gal2017concrete, kahn2017uncertainty], which approximates ${p}_{\mathcal{D}}(\theta )$ as a Gaussian distribution $q(\theta )$. It has been shown that the variational inference problem ${\mathrm{argmin}}_{q}\mathrm{KL}(q(\theta )||{p}_{\mathcal{D}}(\theta ))$ is approximately equivalent to training networks with dropout, where $\mathrm{KL}(\cdot ||\cdot )$ denotes the Kullback-Leibler (KL) divergence. Although this scheme is also simple and theoretically supported, approximation by a single Gaussian distribution tends to underestimate uncertainty (or multimodality) in the posterior. To remedy this problem, $\alpha $-divergence dropout has been proposed [li2017dropout], which replaces the KL divergence with the $\alpha $-divergence so as to prevent $q(\theta )$ from overfitting a single mode. However, as long as $q(\theta )$ is Gaussian, the multimodality cannot be managed well.

In our preliminary MBRL experiments, we tested the above two schemes and observed that the ensemble performs much better than ($\alpha $-)dropout. This result suggests that capturing multimodality in the posterior is crucial in MBRL. Therefore, in this paper, we also employ the ensemble scheme to approximate ${p}_{\mathcal{D}}(\theta )$, in the same way as our baseline, PETS [chua2018deep]. In Sec. 4, we also attempt to incorporate multimodality in the posterior $p(\tau |\mathcal{O})$.

### 2.2 Moment Matching of Trajectory Posterior $p(\tau |\mathcal{O})$

This section clarifies the connection between trajectory optimization and the posterior approximation problem. The key observation delineated here is that several MPC methods, including CEM used in PETS and MPPI, can be regarded as the moment matching of the posterior.

Given an inferred model posterior ${p}_{\mathcal{D}}(\theta )$, we can sample trajectories from (1). (Footnote 1: Trajectory sampling methods with ${p}_{\mathcal{D}}(\theta )$ have been discussed and evaluated in [chua2018deep]. In this paper, we employ the TS1 method suggested in that reference; see $\ell$3–6 in Alg. 1.)
Let us approximate the action posterior with a Gaussian distribution $q(\bm{a};\bm{\mu},\mathbf{\Sigma})$.
The mean of the posterior action sequence $\bm{\mu}$ can be estimated by moment matching:

$$\bm{\mu}=\mathbb{E}\left[\bm{a}\cdot p(\tau |\mathcal{O})\right]=\frac{{\mathbb{E}}_{\bm{s}\sim p(\bm{s}|\bm{a},\theta ),\theta \sim {p}_{\mathcal{D}}(\theta ),\bm{a}\sim \mathcal{U}}\left[\bm{a}\cdot p(\mathcal{O}|\tau )\right]}{{\mathbb{E}}_{\bm{s}\sim p(\bm{s}|\bm{a},\theta ),\theta \sim {p}_{\mathcal{D}}(\theta ),\bm{a}\sim \mathcal{U}}\left[p(\mathcal{O}|\tau )\right]}=\frac{{\mathbb{E}}_{\bm{a}\sim \mathcal{U}}\left[\bm{a}\cdot \mathcal{W}(\bm{a})\right]}{{\mathbb{E}}_{\bm{a}\sim \mathcal{U}}\left[\mathcal{W}(\bm{a})\right]},$$ | (2) |

where

$$\mathcal{W}(\bm{a}):={\mathbb{E}}_{{\bm{s}}_{t+1}\sim p({\bm{s}}_{t+1}|{\bm{s}}_{t},{\bm{a}}_{t},\theta ),\theta \sim {p}_{\mathcal{D}}(\theta )}\left[p(\mathcal{O}|\tau )\right].$$ | (3) |

Eq. (2) can be viewed as a weighted average where each sampled action is weighted by the likelihood of optimality $\mathcal{W}(\bm{a})$. In the same way, we can also estimate the variance of the posterior $\mathbf{\Sigma}={\mathbb{E}}_{\bm{a}\sim \mathcal{U}}\left[{(\bm{a}-\bm{\mu})}^{2}\mathcal{W}(\bm{a})\right]/{\mathbb{E}}_{\bm{a}\sim \mathcal{U}}\left[\mathcal{W}(\bm{a})\right]$.
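
As a rough illustration of Eq. (2), the sketch below estimates $\bm{\mu}$ and $\mathbf{\Sigma}$ for a scalar action by weighting uniformly sampled actions with their optimality likelihood. The quadratic reward `toy_reward`, the action bounds, and the choice $f(r)\propto e^{r}$ are assumptions made for this toy example only, and the dynamics expectation in Eq. (3) is dropped by treating the reward as deterministic.

```python
import math
import random


def toy_reward(action):
    """Hypothetical deterministic reward, peaked at action = 2."""
    return -(action - 2.0) ** 2


def moment_match_uniform(num_samples, low, high, rng):
    """Estimate the posterior mean and variance of a scalar action via Eq. (2).

    Actions are drawn from the uniform prior U(low, high) and weighted by
    W(a) = exp(r(a)), an assumed optimality likelihood.
    """
    actions = [rng.uniform(low, high) for _ in range(num_samples)]
    weights = [math.exp(toy_reward(a)) for a in actions]
    total = sum(weights)
    mu = sum(w * a for w, a in zip(weights, actions)) / total
    var = sum(w * (a - mu) ** 2 for w, a in zip(weights, actions)) / total
    return mu, var


rng = random.Random(0)
mu, var = moment_match_uniform(20000, -10.0, 10.0, rng)
print(f"mu = {mu:.3f}, var = {var:.3f}")
```

For this reward the weighted samples follow a density proportional to $e^{-(a-2)^{2}}$, so the estimates should approach a mean of 2 and a variance of 0.5 as the sample count grows. Most of the uniform samples receive negligible weight, which is the inefficiency addressed next.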

In practice, sampling from the uniform distribution $\mathcal{U}$ is quite inefficient and requires an impractically large number of samples. Hence, let us consider iteratively estimating the parameters by incorporating importance sampling. Let ${\bm{\mu}}^{(j)}$, ${\mathbf{\Sigma}}^{(j)}$ be the estimated parameters at iteration $j$; we can rearrange (2) as

$${\bm{\mu}}^{(j+1)}\leftarrow \{\text{RHS of (2)}\}\times \frac{q(\bm{a};{\bm{\mu}}^{(j)},{\mathbf{\Sigma}}^{(j)})}{q(\bm{a};{\bm{\mu}}^{(j)},{\mathbf{\Sigma}}^{(j)})}=\frac{{\mathbb{E}}_{\bm{a}\sim q(\bm{a};{\bm{\mu}}^{(j)},{\mathbf{\Sigma}}^{(j)})}\left[\bm{a}\cdot \mathcal{W}(\bm{a})\right]}{{\mathbb{E}}_{\bm{a}\sim q(\bm{a};{\bm{\mu}}^{(j)},{\mathbf{\Sigma}}^{(j)})}\left[\mathcal{W}(\bm{a})\right]}.$$ | (4) |

It is worth noting that a similar iterative law can also be derived by solving the optimization problem
${\mathrm{argmax}}_{q(\bm{a};\bm{\mu},\mathbf{\Sigma})}\mathbb{E}\left[\mathrm{log}p(\mathcal{O}|\tau )\right]$
by mirror descent [miyashita2018mirror, okada2018acceleration].
To connect this inference problem to trajectory optimization, we define the optimality likelihood with trajectory reward $r(\tau )$ and a monotonically increasing function $f(\cdot )$, as $p(\mathcal{O}|\tau ):=f(r(\tau ))$.
If we define $f(r(\tau ))\propto {e}^{r(\tau )}$ as in [levine2018reinforcement, piche2018probabilistic], an optimization algorithm similar to MPPI [williams2016aggressive, williams2017information, okada2018acceleration] is recovered.
As summarized in Table 1, similar correspondences to other well-known optimization algorithms, including CEM, can be observed with different optimality definitions. (Footnote 2: We implicitly assume the existence of a step-wise likelihood $p({\mathcal{O}}_{t}|{\bm{s}}_{t},{\bm{a}}_{t})$ corresponding to each definition. Since another graphical model with a single unified optimality can be defined, this existence is not critical.)

Table 1: Optimality definitions $f(r(\tau ))$ corresponding to well-known stochastic optimization methods.

| | MPPI [williams2016aggressive] | CEM [botev2013cross] | Prop-CEM [goschin2013cross] | CMA-ES [hansen2003reducing] |
|---|---|---|---|---|
| $f(r(\tau ))$ | $\propto {e}^{r(\tau )}$ | $\mathbb{1}[r(\tau )>{r}_{thd}]$ | $\frac{r(\tau )-{r}_{min}}{{r}_{max}-{r}_{min}}$ | $\propto \mathrm{log}g(r(\tau ))\cdot \mathbb{1}[r(\tau )>{r}_{thd}]$ |
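
The first three optimality definitions in Table 1 can be sketched as plain functions that map a batch of trajectory rewards to unnormalized weights. The function names, the `elite_frac` parameter, and the way the CEM threshold ${r}_{thd}$ is chosen are assumptions for illustration; the CMA-ES entry is omitted because $g$ is not specified here.

```python
import math


def f_mppi(rewards):
    """MPPI-style optimality, proportional to exp(r).

    The maximum reward is subtracted before exponentiating; this rescales
    all weights by a common factor that cancels after normalization.
    """
    r_max = max(rewards)
    return [math.exp(r - r_max) for r in rewards]


def f_cem(rewards, elite_frac=0.1):
    """CEM-style optimality: indicator of belonging to the elite set.

    Assumption: r_thd is set to the reward of the weakest elite sample,
    and ties with the threshold are accepted (>= instead of >).
    """
    n_elite = max(1, int(len(rewards) * elite_frac))
    r_thd = sorted(rewards, reverse=True)[n_elite - 1]
    return [1.0 if r >= r_thd else 0.0 for r in rewards]


def f_prop_cem(rewards):
    """Prop-CEM-style optimality: rewards rescaled linearly to [0, 1]."""
    r_min, r_max = min(rewards), max(rewards)
    if r_max == r_min:
        return [1.0] * len(rewards)  # all samples equally good
    return [(r - r_min) / (r_max - r_min) for r in rewards]


rewards = [1.0, 3.0, 2.0, 5.0]
print(f_mppi(rewards))
print(f_cem(rewards, elite_frac=0.5))
print(f_prop_cem(rewards))
```

Any of these weight lists can be plugged into the weighted average of Eq. (4); only the shape of the weighting differs between the methods.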

There is a discrepancy between (4) and the CEM implementation in [chua2018deep], in which ${\mathcal{W}}^{\prime}(\bm{a})=f(\mathbb{E}[r(\tau )])$ is used instead of $\mathcal{W}(\bm{a})=\mathbb{E}[f(r(\tau ))]$. Since $f$ is a convex function, Jensen’s inequality holds in this case, thus $\mathcal{W}\ge {\mathcal{W}}^{\prime}$. Equality holds when $r(\tau )$ is constant (i.e., has zero variance), implying that $\mathcal{W}\simeq {\mathcal{W}}^{\prime}$ for low-variance $r(\tau )$ and $\mathcal{W}>{\mathcal{W}}^{\prime}$ for high-variance (or more uncertain) $r(\tau )$. Namely, ${\mathcal{W}}^{\prime}(\bm{a})$ underestimates the optimality likelihood if $\bm{a}$ generates uncertain trajectories. Since we have experimentally observed that this filtering effect of ${\mathcal{W}}^{\prime}$ yields higher optimization performance than $\mathcal{W}$ (see Sec. A), this paper heuristically employs ${\mathcal{W}}^{\prime}$.
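
The gap between $\mathcal{W}$ and ${\mathcal{W}}^{\prime}$ can be checked numerically. The sketch below assumes the convex choice $f(r)={e}^{r}$ and Gaussian-distributed trajectory rewards with a hypothetical mean and standard deviation; it compares the two quantities for a low-variance and a high-variance action.

```python
import math
import random


def optimality_weights(reward_samples, f=math.exp):
    """Return (W, W_prime) for one action sequence.

    W       = mean of f(r) over the sampled trajectories
    W_prime = f applied to the mean reward of the sampled trajectories
    """
    n = len(reward_samples)
    w = sum(f(r) for r in reward_samples) / n
    w_prime = f(sum(reward_samples) / n)
    return w, w_prime


rng = random.Random(0)
# Hypothetical reward samples for two actions with equal mean reward.
low_var = [rng.gauss(1.0, 0.01) for _ in range(1000)]
high_var = [rng.gauss(1.0, 1.0) for _ in range(1000)]

for name, samples in [("low variance", low_var), ("high variance", high_var)]:
    w, w_prime = optimality_weights(samples)
    print(f"{name}: W / W' = {w / w_prime:.4f}")
```

The ratio stays close to 1 for the low-variance action and grows with the reward variance, so using ${\mathcal{W}}^{\prime}$ in place of $\mathcal{W}$ relatively down-weights actions whose predicted outcomes are uncertain.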

In practice, the expectation operators $\mathbb{E}[\cdot ]$ are implemented by Monte Carlo integration with $K$ sampled actions and $P$ trajectories for each action: ${\bm{\mu}}^{(j+1)}\simeq {\sum}_{k=1}^{K}\left[{\bm{a}}_{k}\cdot {\mathcal{W}}^{\prime}({\bm{a}}_{k})\right]/{\sum}_{k=1}^{K}\left[{\mathcal{W}}^{\prime}({\bm{a}}_{k})\right]$ and ${\mathcal{W}}^{\prime}({\bm{a}}_{k})\simeq f\left(\frac{1}{P}{\sum}_{i=1}^{P}r({\tau}_{k,i})\right)$.
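
A minimal sketch of this Monte Carlo update for a scalar action is given below. The noisy reward `toy_reward`, the sample counts, and the choice $f(r)\propto {e}^{r}$ are assumptions for illustration; the reward noise is a stand-in for the variability of trajectories sampled from the dynamics ensemble.

```python
import math
import random


def toy_reward(action, rng):
    """Hypothetical noisy reward peaked at action = 2.

    The Gaussian noise stands in for uncertainty of the sampled dynamics.
    """
    return -(action - 2.0) ** 2 + rng.gauss(0.0, 0.1)


def update_gaussian(mu, sigma, rng, num_actions=200, num_trajs=5):
    """One Monte Carlo iteration of the weighted-average update."""
    actions = [rng.gauss(mu, sigma) for _ in range(num_actions)]
    # W'(a_k) = f(mean reward over P trajectories), f(r) proportional to exp(r).
    mean_rewards = [
        sum(toy_reward(a, rng) for _ in range(num_trajs)) / num_trajs
        for a in actions
    ]
    r_max = max(mean_rewards)  # shift for numerical stability; cancels below
    weights = [math.exp(r - r_max) for r in mean_rewards]
    total = sum(weights)
    new_mu = sum(w * a for w, a in zip(weights, actions)) / total
    new_var = sum(w * (a - new_mu) ** 2 for w, a in zip(weights, actions)) / total
    return new_mu, math.sqrt(new_var) + 1e-3  # small floor keeps sigma positive


rng = random.Random(0)
mu, sigma = 0.0, 2.0
for j in range(10):
    mu, sigma = update_gaussian(mu, sigma, rng)
print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
```

With these settings the mean drifts toward the reward peak while the standard deviation shrinks at every iteration, which illustrates how a single Gaussian collapses onto one mode.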

## 3 Variational Inference MPC: From Moment Matching to Inference

Given uncertainty in a dynamics model, it is natural to suppose that optimal trajectories are also uncertain. However, as shown in the previous section, PETS employs moment matching of the trajectory posterior, ignoring almost all uncertainty in optimal trajectories. In this section, we introduce a variational inference MPC (VI-MPC) framework that formulates MBRL in a fully Bayesian manner and captures uncertainty both in the dynamics and in the optimalities.

Let us consider the variational inference problem of minimizing $\mathrm{KL}({q}_{\theta}(\tau )||p(\tau ,\theta |\mathcal{O}))$. We assume that the variational distribution ${q}_{\theta}(\tau )$ decomposes as ${q}_{\theta}(\tau )=q(\bm{a})p(\bm{s}|\bm{a},\theta ){p}_{\mathcal{D}}(\theta )$; accordingly, we introduce $p(\tau ,\theta |\mathcal{O})\;(=p(\mathcal{O}|\tau )p(\bm{s}|\bm{a},\theta ){p}_{\mathcal{D}}(\theta ))$ as the posterior, which takes a similarly decomposable form. This assumption forces optimal state transitions to be controlled only by $p({\bm{s}}_{t+1}|{\bm{s}}_{t},{\bm{a}}_{t},\theta )$ [levine2018reinforcement]. As shown in Sec. B.1, this inference problem can be transformed into the maximization problem ${\mathrm{argmax}}_{{q}_{\theta}(\tau )}\mathbb{E}\left[\mathrm{log}p(\mathcal{O}|\tau )-\mathrm{log}q(\bm{a})\right]$. A notable property is that this objective has an entropy regularization term $-\mathrm{log}q(\bm{a})$, which encourages $q(\bm{a})$ to have a broader shape and thus capture more uncertainty. For convenience, we introduce a tunable hyperparameter $\alpha \;(>0)$ into the optimality likelihood: $p(\mathcal{O}|\tau )\to {p}^{\frac{1}{\alpha}}(\mathcal{O}|\tau )$. After multiplying by $\alpha $, which does not change the maximizer, the above objective becomes ${\mathrm{argmax}}_{{q}_{\theta}(\tau )}\mathbb{E}\left[\mathrm{log}p(\mathcal{O}|\tau )-\alpha \mathrm{log}q(\bm{a})\right]$. By applying mirror descent [bubeck2015convex] to this optimization problem, we can derive an update law for $q(\bm{a})$ (see Sec. C for the detailed derivation):

$${q}^{(j+1)}(\bm{a})\leftarrow {q}^{(j)}(\bm{a})\cdot {\mathcal{W}}^{\prime}{(\bm{a})}^{\frac{1}{\lambda}}\cdot {({q}^{(j)}(\bm{a}))}^{-\kappa}/{\mathbb{E}}_{{q}^{(j)}(\bm{a})}\left[{\mathcal{W}}^{\prime}{(\bm{a})}^{\frac{1}{\lambda}}\cdot {({q}^{(j)}(\bm{a}))}^{-\kappa}\right],$$ | (5) |

where $\lambda \;(>0)$ and $\kappa \;(\ge 0)$ are hyperparameters into which $\alpha $ is absorbed. $\lambda $ is the inverse step size, which controls the optimization speed, and $\kappa $ is the weight of the entropy regularization term ${q}^{-\kappa}$.
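
To make the role of $\kappa $ concrete, the sketch below applies the update of Eq. (5) to a distribution over a small, finite grid of candidate actions. The grid, the values of ${\mathcal{W}}^{\prime}$, and the helper name `vi_mpc_step` are assumptions for illustration; the optimality values have two near-equal peaks, at indices 1 and 3.

```python
def vi_mpc_step(q, w_prime, lam=1.0, kappa=0.0):
    """One update of Eq. (5) for a distribution q over a finite action grid.

    Each entry is multiplied by W'^(1/lam) * q^(-kappa), and the result is
    renormalized; the normalizer equals the expectation in Eq. (5).
    """
    unnorm = [
        qi * (wi ** (1.0 / lam)) * (qi ** (-kappa))
        for qi, wi in zip(q, w_prime)
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical optimality values with two near-equal modes.
w_prime = [0.1, 1.0, 0.1, 0.9, 0.1]
q_plain = [0.2] * 5  # kappa = 0: no entropy regularization
q_reg = [0.2] * 5    # kappa > 0: entropy-regularized
for _ in range(50):
    q_plain = vi_mpc_step(q_plain, w_prime, lam=1.0, kappa=0.0)
    q_reg = vi_mpc_step(q_reg, w_prime, lam=1.0, kappa=0.5)

print([round(p, 3) for p in q_plain])
print([round(p, 3) for p in q_reg])
```

Without regularization, repeated updates keep shifting mass toward the single best grid point. With $\kappa >0$ the iteration settles at a fixed point proportional to ${\mathcal{W}}^{\prime}$ raised to the power $1/(\lambda \kappa )$, so the second mode keeps a substantial share of the probability.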

Eq. (5) suggests a novel and general MPC framework, which we call variational inference MPC (VI-MPC). To realize a specific VI-MPC method, we specify the following: (1) the optimality definition (or $f(\cdot )$; see Table 1), (2) the variational distribution model $q$, and (3) whether entropy regularization is used ($\kappa >0$) or not ($\kappa =0$). We do not include $\lambda $ in the specification since it is highly dependent on the optimality definition (see Sec. G). In this paper, we denote the above specification as VIMPC(<optimality_def>, <variational_dist>, <max_ent>). For example, vanilla CEM and MPPI are respectively expressed as VIMPC(‘CEM’, ‘Gaussian’, False) and VIMPC(‘MPPI’, ‘Gaussian’, False). In Sec. 4, we propose a new instance of VI-MPC to incorporate multimodal uncertainty in the posterior.

## 4 Probabilistic Action Ensembles with Trajectory Sampling

As reviewed in Sec. 2.1, previous methods have successfully involved multimodality in ${p}_{\mathcal{D}}(\theta )$ with network ensembles. If this multimodality in ${p}_{\mathcal{D}}(\theta )$ is given, other distributions depending on ${p}_{\mathcal{D}}(\theta )$, including $p(\mathcal{O}|\tau )$, would also be multimodal. In other words, there are various possible optimal trajectories (or actions) like Fig. 1. It is obvious that VIMPC(*, ‘Gaussian’, *) will still easily fail to capture multimodality because of overfitting to a single mode. Inspired by the success of the ensemble approach for dynamics modeling, we propose a novel VI-MPC method that introduces action ensembles with a Gaussian mixture model (GMM), i.e., VIMPC(*, ‘GMM(M=*)’, *), which we call PaETS (Probabilistic Action Ensembles with Trajectory Sampling).

PaETS defines the variational distribution $q(\bm{a})$ as

$${q}^{(j)}(\bm{a}):=q(\bm{a};{\varphi}^{(j)})=\sum _{m=1}^{M}{\pi}_{m}^{(j)}\mathcal{N}(\bm{a};{\bm{\mu}}_{m}^{(j)},{\mathbf{\Sigma}}_{m}^{(j)}),$$ | (6) |

where ${\varphi}^{(j)}:={\{({\pi}_{m}^{(j)},{\bm{\mu}}_{m}^{(j)},{\mathbf{\Sigma}}_{m}^{(j)})\}}_{m=1}^{M}$ and $M$ is the number of mixture components. We now derive the iteration scheme for updating the GMM parameters. First, drawing $K$ samples from ${q}^{(j)}(\bm{a})$, we approximate ${q}^{(j)}(\bm{a})$ as a discretized distribution (or a set of particles):

$${q}^{(j)}(\bm{a};\varphi )\simeq q(\bm{a};{\mathbf{W}}^{(j)}):=\sum _{k=1}^{K}{w}_{k}^{(j)}\delta (\bm{a}-{\bm{a}}_{k}),$$ | (7) |

where ${\mathbf{W}}^{(j)}:={\{{w}_{k}^{(j)}\}}_{k=1}^{K}$. Just after sampling, the weight of each particle is uniform: ${\mathbf{W}}^{(j)}=\mathbf{1}/K$. By substituting this approximated distribution into (5), the update law for the particle weights is derived as

$${w}_{k}^{(j+1)}\leftarrow {\mathcal{W}}^{\prime}{({\bm{a}}_{k})}^{\frac{1}{\lambda}}\cdot {({q}^{(j)}({\bm{a}}_{k}))}^{-\kappa}/\sum _{{k}^{\prime}=1}^{K}{\mathcal{W}}^{\prime}{({\bm{a}}_{{k}^{\prime}})}^{\frac{1}{\lambda}}\cdot {({q}^{(j)}({\bm{a}}_{{k}^{\prime}}))}^{-\kappa}.$$ | (8) |
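
The weight update of Eq. (8) can be sketched for scalar actions as follows. The mixture parameters, the three particles, and the constant ${\mathcal{W}}^{\prime}$ values are hypothetical; they are chosen so that only the entropy term differentiates the particles.

```python
import math


def gmm_pdf(a, pis, mus, sigmas):
    """Density of a 1-D Gaussian mixture (Eq. (6) with scalar actions)."""
    return sum(
        pi * math.exp(-0.5 * ((a - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for pi, mu, s in zip(pis, mus, sigmas)
    )


def particle_weights(w_prime, q_values, lam=1.0, kappa=0.0):
    """Normalized particle weights of Eq. (8)."""
    unnorm = [
        (w ** (1.0 / lam)) * (q ** (-kappa))
        for w, q in zip(w_prime, q_values)
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical mixture with a dominant and a minor component.
pis, mus, sigmas = [0.9, 0.1], [0.0, 3.0], [1.0, 1.0]
actions = [-1.0, 0.0, 3.0]
q_values = [gmm_pdf(a, pis, mus, sigmas) for a in actions]
w_prime = [1.0, 1.0, 1.0]  # equal optimality for every particle

w_plain = particle_weights(w_prime, q_values, kappa=0.0)
w_ent = particle_weights(w_prime, q_values, kappa=0.5)
print(w_plain)
print(w_ent)
```

With $\kappa =0$ the weights depend on ${\mathcal{W}}^{\prime}$ only and are uniform here. With $\kappa >0$ the particle near the minor component, where ${q}^{(j)}$ is small, receives the largest weight.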

Then we estimate ${\varphi}^{(j+1)}$, which maximizes the observation probability of the weighted particles:

$${\varphi}^{(j+1)}={\mathrm{argmax}}_{\varphi}\mathrm{log}p({\{({w}_{k}^{(j+1)},{\bm{a}}_{k})\}}_{k=1}^{K}|\varphi )={\mathrm{argmax}}_{\varphi}\sum _{k=1}^{K}{w}_{k}^{(j+1)}\mathrm{log}q({\bm{a}}_{k};\varphi ).$$ | (9) |

By setting the derivative ${\nabla}_{\varphi}\mathrm{log}p(\cdot |\varphi )=\mathbf{0}$ and borrowing the concept of the EM algorithm [bilmes1998gentle], we obtain update laws for ${\varphi}^{(j+1)}$ that take a weighted-average form like (4) (see Sec. D for the complete definition):

$$({\bm{\mu}}_{m}^{(j+1)},{\mathbf{\Sigma}}_{m}^{(j+1)},{\pi}_{m}^{(j+1)})\leftarrow (\sum _{k=1}^{K}{\omega}_{m,k}^{(j+1)}{\bm{a}}_{k},\sum _{k=1}^{K}{\omega}_{m,k}^{(j+1)}{({\bm{a}}_{k}-{\bm{\mu}}_{m}^{(j+1)})}^{2},\frac{{N}_{m}}{{\sum}_{m=1}^{M}{N}_{m}}).$$ | (10) |
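
One way to read Eq. (10) is as a weighted EM step on the particles. The sketch below is written under that assumption for scalar actions: since the complete definitions of ${\omega}_{m,k}$ and ${N}_{m}$ are deferred to Sec. D, the standard EM responsibilities are used here, with ${N}_{m}$ taken as the weighted responsibility mass of component $m$ and ${\omega}_{m,k}$ as the particle weights normalized within each component. The function names and the toy particles are hypothetical.

```python
import math


def normal_pdf(a, mu, sigma):
    """Density of a 1-D Gaussian."""
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def weighted_em_step(actions, weights, pis, mus, sigmas, min_sigma=1e-3):
    """One weighted EM update of a 1-D GMM fitted to particles (a_k, w_k)."""
    num_components = len(pis)
    # E-step: responsibility of component m for particle k (assumed form).
    resp = []
    for a in actions:
        dens = [pis[m] * normal_pdf(a, mus[m], sigmas[m]) for m in range(num_components)]
        total = max(sum(dens), 1e-300)  # guard against underflow to zero
        resp.append([d / total for d in dens])

    # N_m: weighted responsibility mass of each component.
    counts = [
        max(sum(w * r[m] for w, r in zip(weights, resp)), 1e-300)
        for m in range(num_components)
    ]
    new_pis = [n / sum(counts) for n in counts]

    new_mus, new_sigmas = [], []
    for m in range(num_components):
        # omega_{m,k}: per-component particle weights, summing to 1 over k.
        omega = [w * r[m] / counts[m] for w, r in zip(weights, resp)]
        mu = sum(o * a for o, a in zip(omega, actions))
        var = sum(o * (a - mu) ** 2 for o, a in zip(omega, actions))
        new_mus.append(mu)
        new_sigmas.append(max(math.sqrt(var), min_sigma))
    return new_pis, new_mus, new_sigmas


# Hypothetical particles forming two clusters, with uniform weights.
actions = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
weights = [1.0 / len(actions)] * len(actions)
pis, mus, sigmas = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(20):
    pis, mus, sigmas = weighted_em_step(actions, weights, pis, mus, sigmas)
print(pis, mus, sigmas)
```

In this toy case each component settles on one cluster, so both modes of the particle set survive the refit. In PaETS the uniform weights would be replaced by the output of Eq. (8).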

In summary, PaETS and the MPC utilizing it are respectively described in Algs. 1 and 2, where $U$ is the number of iterations for optimization and $H$ is the length of the task episode. At $\ell 2$ in Alg. 2, the ${\bm{\mu}}_{m}$ are initialized independently at random. At $\ell 11$, the ${\mathbf{\Sigma}}_{m}$ and ${\pi}_{m}$ are reset to their initial values, encouraging exploration for the next time-step and preventing $q(\bm{a};\varphi )$ from degenerating to a single mode. If we set $M=1$, these procedures are almost equivalent to those of PETS. The use of GMM ($M>1$) does not increase computational complexity significantly (see Sec. F).

## 5 Experiments

### 5.1 Comparison to State-of-the-art Methods

The main objective of this experiment is to demonstrate that PaETS has advantages over the state-of-the-art MBRL baseline, PETS [chua2018deep].
In this experiment, PaETS and PETS (or vanilla CEM) were implemented
using the same codebase with different parameters, i.e.,
VIMPC(‘CEM’, ‘GMM(M=5)’, True) for PaETS, and VIMPC(‘CEM’, ‘GMM(M=1)’, False) for PETS.
We also evaluated another MBRL baseline with MPPI [williams2017information], realized as VIMPC(‘MPPI’, ‘GMM(M=1)’, False).
These methods share the settings for ${p}_{\mathcal{D}}(\theta )$ inference (training of network ensembles).
The state-of-the-art MFRL method SAC [haarnoja2018soft] was also evaluated to compare asymptotic performance. (Footnote 3: We used the open-source code: https://github.com/pranz24/pytorch-soft-actor-critic)
Fig. 3 illustrates the simulated locomotion tasks evaluated in this experiment, which are complex and challenging due to their high non-linearity.
None of the tasks except HalfCheetah were evaluated in the original PETS paper [chua2018deep].
Other details about our implementation and experimental settings are described in Sec. G and Sec. H.
Fig. 4 presents the experimental results, in which PaETS consistently exhibits better asymptotic performance than the MBRL baselines.
In addition, PaETS outperforms or is comparable to SAC while requiring significantly fewer samples (about 10 times more sample-efficient).

### 5.2 Ablation Study

This experiment clarifies which components of PaETS (GMM and entropy regularization) contribute to the overall improvement. Fig. 5 shows the results of this ablation study, together with Welch’s $t$-test for selected representative pairs. From this figure, one can observe that the use of GMM ($M=5$) significantly improves performance. The effect of the regularization ($\kappa >0$) is relatively small, but not negligible. In certain tasks, setting $\kappa $ to particular values can improve the performance. In the case of $M>1$, the regularization gives more weight to actions sampled from components with low ${\pi}_{m}$, thus encouraging $q(\bm{a};\varphi )$ to be multimodal. In some tasks that require rather delicate control (e.g., Hopper, Walker2d), the effect of $\kappa $ seems less significant. Fig. 6 examines sensitivity to the number of mixture components $M$, for which $M=5$ achieved the highest performance. If sufficiently many samples are available ($K\gg 0$), it would be reasonable to set $M$ large enough to capture multimodality. However, in practice, $K$ is finite and may be small due to computational constraints. In this case, a larger $M$ makes it difficult to approximate $q(\bm{a};\varphi )$ as a set of particles $q(\bm{a};\mathbf{W})$, resulting in degraded optimization performance.

## 6 Related Work

Dynamics Posterior Inference: Recent MBRL methods, MB-MPO (Model-Based Meta-Policy-Optimization) [clavera2018model] and ME-TRPO (Model-Ensemble Trust-Region Policy Optimization) [kurutach2018model], also employ network ensembles to model dynamics, but they use the ensembles differently than we do: to train policy networks rather than for MPC.

Trajectory Optimization: Sequential Monte Carlo based MPC, described as VIMPC(*, ‘Particles’, False), has been introduced in [kantas2009sequential], but it requires a well-designed proposal distribution to sample particles for the next iteration $j+1$. Another particle-based method has been derived [piche2018probabilistic] using the control-as-inference framework. However, this method relies not only on a dynamics model but also on policy and value functions to manage particles, so MFRL methods must be incorporated.

Recent studies have demonstrated that entropy regularization is a promising strategy in policy training [abdolmaleki2015model, abdolmaleki2017deriving, haarnoja2017reinforcement, haarnoja2018soft]. However, to the best of our knowledge, introducing entropy regularization to MPC together with an explicit multimodal representation, so as to realize their synergistic effect, is novel.

Ref. [wagener2019online] also systematically organizes stochastic MPC methods from the perspective of online learning, but it does not discuss uncertainty awareness from a Bayesian viewpoint.

Bayesian Reformulation: Ref. [jeon2018bayesian] proposes a novel approach to generative adversarial imitation learning (GAIL) [ho2016generative], which reformulates general GAIL in a Bayesian fashion and utilizes ensembles to infer discriminator posteriors. Another Bayesian reformulation of GAIL integrates imitation and reinforcement learning by introducing another optimality (i.e., imitation optimality ${\mathcal{O}}_{t}^{I}$) [kinose2019integration].

## 7 Conclusion & Discussions

This paper introduces a novel VI-MPC framework that systematically generalizes and reformulates various stochastic MPC methods in a Bayesian fashion. We also devise a novel instance of this framework, called PaETS, which can successfully incorporate multimodal uncertainty in optimal trajectories. By combining our method with recent uncertainty-aware dynamics modeling based on neural network ensembles, our Bayesian MBRL is able to capture multimodalities both in dynamics and in optimalities. In addition, our method is a simple extension of general stochastic methods and requires no significant additional computational complexity. Our experiments demonstrate that PaETS can improve asymptotic performance compared to the leading MBRL baseline PETS, and thus substantially enhances the potential of MBRL to be more competitive with state-of-the-art MFRL.

Considering the simplicity and generalizability of VI-MPC and PaETS, we expect that our concept is applicable to a variety of tasks, such as traditional MPC with deterministic dynamics and advanced MPC with latent dynamics learned from pixels by Deep Planning Network [hafner2018learning]. By introducing a categorical mixture model as the variational distribution, application to combinatorial optimization is also feasible. In fact, our ongoing work includes experiments on discrete MPC for a practical system.

A remaining question is how to determine the VI-MPC specifications. As implied in Fig. 4, the best optimality definition could be task-dependent (e.g., MPPI outperformed vanilla CEM in Ant but not in the other tasks). The regularization weight $\kappa $ is also task-dependent, as shown in Fig. 5. It would be challenging but interesting future work to add these parameters to the graphical model in Fig. 2 as latent variables and infer promising values along with the optimal trajectories, as in the infinite GMM [rasmussen2000infinite]. Another appealing direction for future work is to introduce the concept of parallel tempering [brooks2011handbook] from Markov chain Monte Carlo. By adaptively varying the temperatures ($\lambda $ in our case) of the ensemble actions, we can expect the ensemble diversity to improve.

#### Acknowledgments

We thank Vishwajeet Singh, Hiroki Nakamura and Akira Kinose for their cooperation in this study during their student-internship periods. Most of the experiments were conducted in ABCI (AI Bridging Cloud Infrastructure), built by the National Institute of Advanced Industrial Science and Technology, Japan.

## References

## Appendix A Comparison Between $\mathcal{W}$ and ${\mathcal{W}}^{\prime}$

We evaluated the impact of $\mathcal{W}$ and ${\mathcal{W}}^{\prime}$ on the optimization performance of (vanilla) CEM and MPPI, the results of which are summarized in Table 2, where ${\mathcal{W}}^{\prime}$ gained much higher rewards than $\mathcal{W}$.

Method | $\mathcal{W}$ | ${\mathcal{W}}^{\prime}$ |
---|---|---|
CEM | $5603.24\pm 541.31$ | $\mathbf{11843.05}\pm \mathbf{295.80}$ |
MPPI | $2789.03\pm 647.82$ | $\mathbf{9765.27}\pm \mathbf{231.04}$ |

## Appendix B Derivations

### B.1 Derivation of the Variational Inference Objective

By using the assumption of ${q}_{\theta}(\tau )=q(\bm{a})p(\bm{s}|\bm{a},\theta ){p}_{\mathcal{D}}(\theta )$, the KL-divergence can be transformed as

$$\mathrm{KL}({q}_{\theta}(\tau )||p(\tau ,\theta |\mathcal{O}))=\int {q}_{\theta}(\tau )\mathrm{log}\frac{{q}_{\theta}(\tau )}{p(\tau ,\theta |\mathcal{O})}d\tau d\theta $$ | (11) |

$$=\int {q}_{\theta}(\tau )\mathrm{log}\frac{q(\bm{a})p(\bm{s}|\bm{a},\theta ){p}_{\mathcal{D}}(\theta )}{p(\mathcal{O}|\tau )p(\bm{s}|\bm{a},\theta ){p}_{\mathcal{D}}(\theta )}d\tau d\theta $$ | (12) |

$$=-{\mathbb{E}}_{{q}_{\theta}(\tau )}\left[\mathrm{log}p(\mathcal{O}|\tau )-\mathrm{log}q(\bm{a})\right].$$ | (13) |
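As a sanity check on (11)–(13), the following minimal sketch enumerates a tiny discrete model and compares the exact KL divergence with the objective in (13). The binary variables, the probability tables, the likelihood `p_opt`, and the uniform action prior `P_A` are hypothetical choices made for illustration and are not taken from our experiments. Because the normalizer of the posterior is dropped in (12), the two quantities are expected to differ only by a constant that does not depend on $q(\bm{a})$.

```python
import math
from itertools import product

# Hypothetical toy model: binary action a, state s, and model index theta.
P_THETA = [0.6, 0.4]  # p_D(theta)
P_A = [0.5, 0.5]      # uniform action prior (assumption)


def p_state(s, a, theta):
    """p(s | a, theta): an arbitrary stochastic transition table."""
    p_one = 0.8 if a == theta else 0.3
    return p_one if s == 1 else 1.0 - p_one


def p_opt(s, a):
    """p(O | tau): an arbitrary positive optimality likelihood."""
    return math.exp(-(1 - s) - 0.1 * a)


def q_joint(q_a, a, s, theta):
    """q_theta(tau) = q(a) p(s | a, theta) p_D(theta)."""
    return q_a[a] * p_state(s, a, theta) * P_THETA[theta]


def vi_objective(q_a):
    """Right-hand side of (13): -E_q[log p(O | tau) - log q(a)]."""
    total = 0.0
    for a, s, theta in product(range(2), repeat=3):
        total += q_joint(q_a, a, s, theta) * (
            math.log(q_a[a]) - math.log(p_opt(s, a))
        )
    return total


def kl_to_posterior(q_a):
    """Exact KL(q_theta(tau) || p(tau, theta | O)) by enumeration."""
    configs = list(product(range(2), repeat=3))  # (a, s, theta)
    unnorm = {
        c: p_opt(c[1], c[0]) * p_state(c[1], c[0], c[2]) * P_THETA[c[2]] * P_A[c[0]]
        for c in configs
    }
    z = sum(unnorm.values())  # normalizer of the posterior
    return sum(
        q_joint(q_a, *c) * math.log(q_joint(q_a, *c) * z / unnorm[c])
        for c in configs
    )


# The gap between the exact KL and the objective should not depend on q(a).
gap_1 = kl_to_posterior([0.5, 0.5]) - vi_objective([0.5, 0.5])
gap_2 = kl_to_posterior([0.9, 0.1]) - vi_objective([0.9, 0.1])
```

In this toy setting the gap equals the log-normalizer of the posterior minus the log of the uniform action prior, so minimizing (13) over $q(\bm{a})$ is equivalent to minimizing the KL divergence.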

## Appendix C Derivation of (5)

In this section, for readability, we write $q(\bm{a})$ as ${q}_{\bm{a}}$ and $q(\tau )$ as ${q}_{\tau}\;(={q}_{\bm{a}}p(\bm{s}|\bm{a},\theta ){p}_{\mathcal{D}}(\theta ))$. Let us consider the optimization problem:

$${\mathrm{argmin}}_{{q}_{\tau}}\mathcal{J}={\mathrm{argmin}}_{{q}_{\tau}}{\mathbb{E}}_{{q}_{\tau}}\left[-\mathrm{log}p(\mathcal{O}|\tau )+\alpha \mathrm{log}{q}_{\bm{a}}\right].$$ | (14) |

By applying mirror descent [bubeck2015convex], the iterative update law of ${q}_{\tau}^{(j+1)}$ is given as

$${q}_{\tau}^{(j+1)}={\mathrm{argmin}}_{{q}_{\tau}}\langle {\nabla}_{{q}_{\tau}}\mathcal{J},{q}_{\tau}\rangle +\beta \cdot \mathrm{KL}({q}_{\tau}||{q}_{\tau}^{(j)})+\gamma (1-\int {q}_{\tau}\cdot d\tau d\theta ),$$ | (15) |

where $\langle \cdot ,\cdot \rangle $ is the inner-product operator, $\beta $ is a hyper-parameter related to the step-size, and $\gamma $ is the Lagrange multiplier for the constraint $\int {q}_{\tau}\cdot \mathit{d}\tau \mathit{d}\theta =1$. The arguments in the $\mathrm{argmin}$ operator can be rearranged as

$$\int {q}_{\tau}\cdot \left(-\mathrm{log}p(\mathcal{O}|\tau )+\alpha \mathrm{log}{q}_{\bm{a}}+\beta \mathrm{log}{q}_{\bm{a}}-\beta \mathrm{log}{q}_{\bm{a}}^{(j)}-\gamma \right)\mathit{d}\tau \mathit{d}\theta +\gamma ,$$ | (16) |

where we used the relations:

$$\langle {\nabla}_{{q}_{\tau}}\mathcal{J},{q}_{\tau}\rangle =\mathcal{J},$$ | (17) |

$$\mathrm{KL}({q}_{\tau}||{q}_{\tau}^{(j)})=\int {q}_{\tau}\mathrm{log}\frac{{q}_{\tau}}{{q}_{\tau}^{(j)}}d\tau d\theta =\int {q}_{\tau}\mathrm{log}\frac{{q}_{\bm{a}}}{{q}_{\bm{a}}^{(j)}}d\tau d\theta .$$ | (18) |

The integrand of (16) can be organized as

$${q}_{\tau}\cdot \mathrm{log}\frac{{q}_{\bm{a}}^{\alpha +\beta}}{p(\mathcal{O}|\tau )\cdot {e}^{\gamma}\cdot {({q}_{\bm{a}}^{(j)})}^{\beta}}\propto {q}_{\tau}\cdot \mathrm{log}\frac{{q}_{\bm{a}}}{p{(\mathcal{O}|\tau )}^{\frac{1}{\alpha +\beta}}\cdot {e}^{\frac{\gamma}{\alpha +\beta}}\cdot {({q}_{\bm{a}}^{(j)})}^{\frac{\beta}{\alpha +\beta}}}$$ | (19) |

$$={q}_{\tau}\cdot \mathrm{log}\frac{p(\bm{s}|\bm{a},\theta ){p}_{\mathcal{D}}(\theta ){q}_{\bm{a}}}{(p(\bm{s}|\bm{a},\theta ){p}_{\mathcal{D}}(\theta ){q}_{\bm{a}}^{(j)})\cdot p{(\mathcal{O}|\tau )}^{\frac{1}{\alpha +\beta}}\cdot {e}^{\frac{\gamma}{\alpha +\beta}}\cdot {({q}_{\bm{a}}^{(j)})}^{\frac{-\alpha}{\alpha +\beta}}}$$ | (20) |

$$={q}_{\tau}\cdot \mathrm{log}\frac{{q}_{\tau}}{{q}_{\tau}^{(j)}\cdot p{(\mathcal{O}|\tau )}^{\frac{1}{\alpha +\beta}}\cdot {e}^{\frac{\gamma}{\alpha +\beta}}\cdot {({q}_{\bm{a}}^{(j)})}^{\frac{-\alpha}{\alpha +\beta}}}.$$ | (21) |

Here the term $-\gamma $ in (16) appears as the factor ${e}^{\gamma}$ in the denominator, and the proportionality constant in (19) is $(\alpha +\beta )$. Integrating the above expression and restoring this constant yields

$$(\mathit{\text{16}})=(\alpha +\beta )\cdot \mathrm{KL}\left({q}_{\tau}||{q}_{\tau}^{(j)}\cdot p{(\mathcal{O}|\tau )}^{\frac{1}{\alpha +\beta}}\cdot {e}^{\frac{\gamma}{\alpha +\beta}}\cdot {({q}_{\bm{a}}^{(j)})}^{\frac{-\alpha}{\alpha +\beta}}\right)+\gamma .$$ | (22) |

For $\alpha +\beta >0$, minimizing this expression with respect to ${q}_{\tau}$ gives:

$${q}_{\tau}^{(j+1)}={q}_{\tau}^{(j)}\cdot p{(\mathcal{O}|\tau )}^{\frac{1}{\alpha +\beta}}\cdot {e}^{\frac{\gamma}{\alpha +\beta}}\cdot {({q}_{\bm{a}}^{(j)})}^{\frac{-\alpha}{\alpha +\beta}}.$$ | (23) |

The Lagrange multiplier can be removed using the constraint $\int {q}_{\tau}^{(j+1)}\cdot \mathit{d}\tau \mathit{d}\theta =1$:

$${e}^{\frac{-\gamma}{\alpha +\beta}}={\mathbb{E}}_{{q}_{\tau}^{(j)}}\left[p{(\mathcal{O}|\tau )}^{\frac{1}{\alpha +\beta}}\cdot {({q}_{\bm{a}}^{(j)})}^{\frac{-\alpha}{\alpha +\beta}}\right]$$ | (24) |

$$={\mathbb{E}}_{\bm{a}\sim {q}_{\bm{a}}^{(j)}}\left[\underbrace{{\mathbb{E}}_{\bm{s}\sim p(\bm{s}|\bm{a},\theta ),\theta \sim {p}_{\mathcal{D}}(\theta )}\left[p{(\mathcal{O}|\tau )}^{\frac{1}{\alpha +\beta}}\right]}_{(*)}\cdot {({q}_{\bm{a}}^{(j)})}^{\frac{-\alpha}{\alpha +\beta}}\right].$$ | (25) |

Considering the discussion in Sec. 2.2 and Appendix A, we compute $(*)$ as

$$(*)\simeq f{(\mathbb{E}[r(\tau )])}^{\frac{1}{\alpha +\beta}}={\mathcal{W}}^{\prime}{(\bm{a})}^{\frac{1}{\alpha +\beta}}.$$ | (26) |

Substituting (25) and (26) into (23) results in:

$${q}_{\tau}^{(j+1)}=\frac{{q}_{\tau}^{(j)}\cdot p{(\mathcal{O}|\tau )}^{\frac{1}{\alpha +\beta}}\cdot {({q}_{\bm{a}}^{(j)})}^{\frac{-\alpha}{\alpha +\beta}}}{{\mathbb{E}}_{\bm{a}\sim {q}_{\bm{a}}^{(j)}}\left[{\mathcal{W}}^{\prime}{(\bm{a})}^{\frac{1}{\alpha +\beta}}\cdot {({q}_{\bm{a}}^{(j)})}^{\frac{-\alpha}{\alpha +\beta}}\right]}.$$ | (27) |

Marginalizing over $(\bm{s},\theta )$, we finally obtain:

$${q}_{\bm{a}}^{(j+1)}=\frac{{q}_{\bm{a}}^{(j)}\cdot {\mathcal{W}}^{\prime}{(\bm{a})}^{\frac{1}{\alpha +\beta}}\cdot {({q}_{\bm{a}}^{(j)})}^{\frac{-\alpha}{\alpha +\beta}}}{{\mathbb{E}}_{\bm{a}\sim {q}_{\bm{a}}^{(j)}}\left[{\mathcal{W}}^{\prime}{(\bm{a})}^{\frac{1}{\alpha +\beta}}\cdot {({q}_{\bm{a}}^{(j)})}^{\frac{-\alpha}{\alpha +\beta}}\right]}.$$ | (28) |

In (5), we set $\lambda :=\alpha +\beta $ and $\kappa :=\alpha /(\alpha +\beta )$.
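When actions ${\bm{a}}_{k}$ are sampled from ${q}_{\bm{a}}^{(j)}$, the update (28) can be approximated by re-weighting the samples in proportion to ${\mathcal{W}}^{\prime}{({\bm{a}}_{k})}^{1/\lambda}\cdot {q}_{\bm{a}}^{(j)}{({\bm{a}}_{k})}^{-\kappa}$. The sketch below illustrates this under stated assumptions: the function name `update_weights` and its list-based interface are hypothetical, and at least one sample is assumed to have a positive ${\mathcal{W}}^{\prime}$ value. It is an illustration, not the implementation used in our experiments.

```python
import math


def update_weights(w_prime, q_prev, lam, kappa):
    """Normalized weights w_k proportional to W'(a_k)^(1/lam) * q_prev(a_k)^(-kappa).

    w_prime: non-negative optimality values W'(a_k), one per sampled action.
    q_prev:  positive densities q^(j)(a_k) of the sampled actions.
    """
    # Work in log space for numerical stability; W' = 0 maps to -inf.
    log_w = [
        (math.log(wp) if wp > 0.0 else -math.inf) / lam - kappa * math.log(q)
        for wp, q in zip(w_prime, q_prev)
    ]
    shift = max(log_w)  # assumes at least one finite entry
    unnorm = [math.exp(lw - shift) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# kappa = 0 recovers plain optimality weighting.
plain = update_weights([1.0, 1.0, 2.0], [0.5, 0.1, 0.5], lam=1.0, kappa=0.0)
# kappa > 0 additionally favors samples with low density under q^(j).
regularized = update_weights([1.0, 1.0, 2.0], [0.5, 0.1, 0.5], lam=1.0, kappa=0.5)
```

With $\kappa =0$ the weights depend only on ${\mathcal{W}}^{\prime}$, whereas $\kappa >0$ shifts weight toward low-density samples, which is the entropy-regularization effect described in the main text.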

## Appendix D Complete Definition of PaETS

$${\eta}_{m}({\bm{a}}_{k}):=\frac{{\pi}_{m}^{(j)}\mathcal{N}({\bm{a}}_{k};{\bm{\mu}}_{m}^{(j)},{\mathbf{\Sigma}}_{m}^{(j)})}{\sum _{{m}^{\prime}=1}^{M}{\pi}_{{m}^{\prime}}^{(j)}\mathcal{N}({\bm{a}}_{k};{\bm{\mu}}_{{m}^{\prime}}^{(j)},{\mathbf{\Sigma}}_{{m}^{\prime}}^{(j)})}$$ | (29) |

$${\omega}_{m,k}^{(j+1)}:={\eta}_{m}({\bm{a}}_{k}){w}_{k}^{(j+1)}/\underbrace{\sum _{{k}^{\prime}=1}^{K}{\eta}_{m}({\bm{a}}_{{k}^{\prime}}){w}_{{k}^{\prime}}^{(j+1)}}_{:={N}_{m}}$$ | (30) |

$${\bm{\mu}}_{m}^{(j+1)}\leftarrow \sum _{k=1}^{K}{\omega}_{m,k}^{(j+1)}{\bm{a}}_{k}$$ | (31) |

$${\mathbf{\Sigma}}_{m}^{(j+1)}\leftarrow \sum _{k=1}^{K}{\omega}_{m,k}^{(j+1)}{({\bm{a}}_{k}-{\bm{\mu}}_{m}^{(j+1)})}^{2}$$ | (32) |

$${\pi}_{m}^{(j+1)}\leftarrow {N}_{m}/\sum _{{m}^{\prime}=1}^{M}{N}_{{m}^{\prime}}.$$ | (33) |
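The updates (29)–(33) can be sketched for scalar actions as follows. This is a minimal illustration under simplifying assumptions, not our TensorFlow implementation: actions are one-dimensional, the names `normal_pdf` and `gmm_update` are hypothetical, and no safeguards (e.g., variance floors or handling of ${N}_{m}=0$) are included.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian N(x; mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_update(actions, weights, pis, mus, variances):
    """One weighted GMM update following (29)-(33) for scalar actions.

    actions: sampled actions a_k; weights: sample weights w_k^(j+1).
    pis, mus, variances: current mixture parameters of q^(j).
    """
    n_comp, n_samples = len(pis), len(actions)
    # (29): responsibility of each component m for each sample a_k.
    eta = []
    for a in actions:
        dens = [pis[m] * normal_pdf(a, mus[m], variances[m]) for m in range(n_comp)]
        total = sum(dens)
        eta.append([d / total for d in dens])
    # N_m: total weighted responsibility of component m.
    n_m = [
        sum(eta[k][m] * weights[k] for k in range(n_samples)) for m in range(n_comp)
    ]
    new_mus, new_vars = [], []
    for m in range(n_comp):
        omega = [eta[k][m] * weights[k] / n_m[m] for k in range(n_samples)]  # (30)
        mu = sum(o * a for o, a in zip(omega, actions))                      # (31)
        var = sum(o * (a - mu) ** 2 for o, a in zip(omega, actions))         # (32)
        new_mus.append(mu)
        new_vars.append(var)
    total_n = sum(n_m)
    new_pis = [n / total_n for n in n_m]                                     # (33)
    return new_pis, new_mus, new_vars


# Two symmetric clusters of equally weighted samples.
actions = [-1.1, -0.9, 0.9, 1.1]
weights = [0.25, 0.25, 0.25, 0.25]
pis, mus, variances = gmm_update(actions, weights, [0.5, 0.5], [-0.5, 0.5], [1.0, 1.0])
```

In this symmetric example the mixing coefficients stay equal while each component mean moves toward its nearby cluster.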

## Appendix E Optimization of Toy Objective Function by PaETS

Fig. 7 illustrates how PaETS optimizes $q(\bm{a};{\varphi}^{(j)})$ in a toy multimodal objective function.

## Appendix F Computational Complexity

The main computational bottleneck of PaETS (and PETS) is the execution of $\mathrm{\ell}$3–6 in Alg. 1, in which a total of $K\times P$ trajectories must be sampled. In our experiments, these were set to $K=500$ and $P=20$, as in [chua2018deep]. Compared to PETS, PaETS requires additional procedures such as action sampling from the GMM ($\mathrm{\ell}$2) and the GMM parameter update ($\mathrm{\ell}$9). However, these additional procedures are easily parallelizable on GPUs, and their computation times are much shorter than that of the above-mentioned bottleneck. In experiments with our early prototype in TensorFlow, executing one iteration of the for-loop in Alg. 1 on a single NVIDIA RTX2080 GPU took about 57 ms for $M=5$ and 55 ms for $M=1$ (equivalent to PETS). This execution time does not meet real-time constraints (e.g., 30 Hz). However, considering the success of the real-time implementation of MPPI in [williams2016aggressive, williams2017information], we believe a real-time implementation of our method is feasible with an optimized implementation using a compiled language, low-level GPU APIs, and thorough tuning of hyperparameters (e.g., $K$, $P$, and DNN complexity).
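The figures above can be put side by side with a short calculation. The sketch only restates the reported numbers; the variable names are illustrative, and it makes no assumption about how many loop iterations a single control step requires.

```python
# Reported settings and timings from this appendix.
K, P = 500, 20
trajectories_per_iteration = K * P  # trajectories sampled per loop iteration

iteration_ms = {"M=1 (PETS)": 55.0, "M=5 (PaETS)": 57.0}

# Time budget of one control period at 30 Hz, in milliseconds.
period_ms_30hz = 1000.0 / 30.0

# Relative overhead of PaETS (M=5) over PETS (M=1) per loop iteration.
relative_overhead = (iteration_ms["M=5 (PaETS)"] - iteration_ms["M=1 (PETS)"]) / iteration_ms["M=1 (PETS)"]

# Both configurations exceed a single 30 Hz period even for one iteration.
exceeds_budget = {name: t > period_ms_30hz for name, t in iteration_ms.items()}
```

The overhead of the GMM-related steps is about 2 ms per iteration (under 4%), while a 30 Hz control period allows roughly 33 ms in total.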

## Appendix G Implementation Notes

##### Cross Entropy Method

It is a common technique to adaptively determine ${r}_{thd}$ in Table 1 so that only the top-$e\%$ of samples satisfy the threshold condition. We employ this technique and set the elite ratio to $e=10\%$. $\lambda $ has no effect on CEM optimization since $f(\cdot )$ takes binary values.
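The adaptive threshold can be sketched as below. The indicator form $f(r)=\mathbb{1}[r\ge {r}_{thd}]$ is an assumption based on the description above (Table 1 gives the exact definition), and the function name `cem_elite_weights` is hypothetical. Ties at the threshold may admit slightly more than the top-$e\%$ of samples.

```python
def cem_elite_weights(rewards, elite_ratio=0.10):
    """Pick r_thd so that roughly the top elite_ratio of samples pass it.

    Returns the threshold and binary weights f(r_k) in {0, 1}.
    """
    n_elite = max(1, int(round(elite_ratio * len(rewards))))
    # The n_elite-th largest reward becomes the threshold.
    r_thd = sorted(rewards, reverse=True)[n_elite - 1]
    weights = [1.0 if r >= r_thd else 0.0 for r in rewards]
    return r_thd, weights


rewards = [float(i) for i in range(20)]
r_thd, weights = cem_elite_weights(rewards, elite_ratio=0.10)

# Raising binary weights to the power 1/lambda leaves them unchanged,
# which is why lambda has no effect on CEM.
tempered = [w ** (1.0 / 0.1) for w in weights]
```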

##### MPPI

Reward normalization heuristics, as suggested in [theodorou2010generalized], were also introduced for our MPPI implementation as

$${\mathcal{W}}^{\prime}{({\bm{a}}_{k})}^{\frac{1}{\lambda}}=\mathrm{exp}\left\{\frac{1}{\lambda}\cdot \frac{r({\tau}_{k})-\mathrm{min}{\{r({\tau}_{{k}^{\prime}})\}}_{{k}^{\prime}=1}^{K}}{\mathrm{max}{\{r({\tau}_{{k}^{\prime}})\}}_{{k}^{\prime}=1}^{K}-\mathrm{min}{\{r({\tau}_{{k}^{\prime}})\}}_{{k}^{\prime}=1}^{K}}\right\},$$ | (34) |

where $r({\tau}_{k})=\frac{1}{P}{\sum}_{i=1}^{P}r({\tau}_{k,i})$. $\lambda $ was set to be $\lambda =0.1$ as also suggested in [theodorou2010generalized].
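A minimal sketch of the normalization in (34) is given below. The function names and the handling of the degenerate case where all returns are equal are assumptions made for illustration, since (34) is undefined in that case.

```python
import math


def mean_return(particle_returns):
    """r(tau_k): average return over the P particles of one action sequence."""
    return sum(particle_returns) / len(particle_returns)


def mppi_weights(returns, lam=0.1):
    """Unnormalized weights W'(a_k)^(1/lam) with min-max normalized returns, as in (34)."""
    r_min, r_max = min(returns), max(returns)
    span = r_max - r_min
    if span == 0.0:
        # Assumption: fall back to uniform weights when all returns coincide.
        return [1.0] * len(returns)
    return [math.exp((r - r_min) / span / lam) for r in returns]


# Three action sequences, each evaluated with P = 2 particles.
returns = [mean_return(p) for p in ([1.0, 3.0], [2.0, 4.0], [5.0, 7.0])]
weights = mppi_weights(returns, lam=0.1)
```

After normalization the exponent lies in $[0,1/\lambda ]$, so the worst sample always receives weight 1 and the best receives ${e}^{1/\lambda}$, regardless of the reward scale of the task.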

##### Entropy Regularization

The value of $\kappa $ is very sensitive to task settings, especially to the dimensionality of the action space. To make $\kappa $ less sensitive, we introduced the following normalization trick inspired by the above heuristics. First, we rearrange (8) as

$${w}_{k}^{(j+1)}\propto {\mathcal{W}}^{\prime}{({\bm{a}}_{k})}^{\frac{1}{\lambda}}\mathrm{exp}\left\{\kappa \cdot (-\mathrm{log}{q}^{(j)}({\bm{a}}_{k}))\right\}.$$ | (35) |

Then, we replace $-\mathrm{log}{q}^{(j)}({\bm{a}}_{k})$ with its normalized counterpart:

$$-\mathrm{log}{q}^{(j)}({\bm{a}}_{k})\to \frac{-\mathrm{log}{q}^{(j)}({\bm{a}}_{k})-\mathrm{min}{\{-\mathrm{log}{q}^{(j)}({\bm{a}}_{{k}^{\prime}})\}}_{{k}^{\prime}=1}^{K}}{\mathrm{max}{\{-\mathrm{log}{q}^{(j)}({\bm{a}}_{{k}^{\prime}})\}}_{{k}^{\prime}=1}^{K}-\mathrm{min}{\{-\mathrm{log}{q}^{(j)}({\bm{a}}_{{k}^{\prime}})\}}_{{k}^{\prime}=1}^{K}}\in [0,1].$$ | (36) |

By applying this heuristic, the range of the entropy bonus is limited to $[1,{e}^{\kappa}]$, where the action with the lowest probability among the $K$ samples gains the highest entropy bonus of ${e}^{\kappa}$.
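The normalized entropy bonus of (35) and (36) can be sketched as follows. The function name `entropy_bonus` and the fallback for identical log-densities are assumptions for illustration.

```python
import math


def entropy_bonus(log_q, kappa):
    """exp(kappa * normalized(-log q(a_k))) for each sample, following (35)-(36)."""
    neg_log_q = [-lq for lq in log_q]
    lo, hi = min(neg_log_q), max(neg_log_q)
    span = hi - lo
    if span == 0.0:
        # Assumption: no bonus differences when all densities coincide.
        return [1.0] * len(log_q)
    return [math.exp(kappa * (n - lo) / span) for n in neg_log_q]


# Log-densities of three sampled actions under q^(j).
log_q = [math.log(0.5), math.log(0.3), math.log(0.2)]
bonus = entropy_bonus(log_q, kappa=0.5)
```

The most probable sample receives a bonus of exactly 1 and the least probable one receives ${e}^{\kappa}$, independent of the dimensionality of the action space.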

## Appendix H Experimental Setup

We used MuJoCo tasks modified from the standard OpenAI Gym tasks (https://github.com/openai/gym).
Table 3 summarizes the task settings, where ${v}_{x}$, $\phi $ and $z$ respectively denote the velocity, orientation angle, and height of the agents.
Penalty functions $\mathrm{\Phi}$, $\mathrm{\Psi}$ are newly introduced to encourage the agents to move forward in the proper form.
They replace the done flags originally used for early episode termination, which are removed.
$\mathrm{\Phi}$, $\mathrm{\Psi}$ are defined as

$$\mathrm{\Phi}(z,{z}_{des})={e}^{-{(z-{z}_{des})}^{2}},$$ | (37) |

$$\mathrm{\Psi}(\phi )=\frac{1+\mathrm{cos}(2\phi )}{2}.$$ | (38) |

We modified the range of actions (i.e., torques) from $[-1,1]$ to $[-5,5]$ to exaggerate uncertainties in the optimal trajectory posteriors.

Task | Reward Function | ${\bm{s}}_{t}\in $ | ${\bm{a}}_{t}\in $ | Misc. |
---|---|---|---|---|
HalfCheetah | ${v}_{x}\cdot \frac{1+\mathrm{sign}(\mathrm{cos}(\phi ))}{2}-0.1\cdot {||{\bm{a}}_{t}||}^{2}$ | ${\mathbb{R}}^{18}$ | ${\mathbb{R}}^{6}$ | - |
Ant | ${v}_{x}\cdot \mathrm{\Phi}(z,{z}_{des})-{10}^{-3}\cdot {||{\bm{a}}_{t}||}^{2}$ | ${\mathbb{R}}^{28}$ | ${\mathbb{R}}^{8}$ | ${z}_{des}=0.75$ |
Hopper | ${v}_{x}\cdot \mathrm{\Phi}(z,{z}_{des})\cdot \mathrm{\Psi}(\phi )-{10}^{-3}\cdot {||{\bm{a}}_{t}||}^{2}$ | ${\mathbb{R}}^{12}$ | ${\mathbb{R}}^{3}$ | ${z}_{des}=1.2$ |
Walker2d | ${v}_{x}\cdot \mathrm{\Phi}(z,{z}_{des})\cdot \mathrm{\Psi}(\phi )-{10}^{-3}\cdot {||{\bm{a}}_{t}||}^{2}$ | ${\mathbb{R}}^{18}$ | ${\mathbb{R}}^{6}$ | ${z}_{des}=1.2$ |
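The shaping functions (37), (38) and the reward functions in Table 3 can be written out as below. This is an illustrative sketch: the function names and argument conventions are hypothetical and do not correspond to the OpenAI Gym API or to our modified environments.

```python
import math


def phi(z, z_des):
    """Height shaping (37): equals 1 at the desired height and decays away from it."""
    return math.exp(-((z - z_des) ** 2))


def psi(angle):
    """Orientation shaping (38): equals 1 when upright and 0 at a quarter turn."""
    return (1.0 + math.cos(2.0 * angle)) / 2.0


def squared_norm(action):
    return sum(a * a for a in action)


def half_cheetah_reward(v_x, angle, action):
    c = math.cos(angle)
    sign = math.copysign(1.0, c) if c != 0.0 else 0.0
    # Forward velocity counts only while the body is not flipped over.
    return v_x * (1.0 + sign) / 2.0 - 0.1 * squared_norm(action)


def ant_reward(v_x, z, action, z_des=0.75):
    return v_x * phi(z, z_des) - 1e-3 * squared_norm(action)


def hopper_reward(v_x, z, angle, action, z_des=1.2):
    # Walker2d uses the same form with z_des = 1.2 and a 6-dimensional action.
    return v_x * phi(z, z_des) * psi(angle) - 1e-3 * squared_norm(action)
```

For example, a Hopper moving at ${v}_{x}=2$ at the desired height, upright, with zero torque obtains a reward of 2, while a flipped HalfCheetah ($\phi =\pi $) obtains no velocity reward.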

Table 4 summarizes the shared parameter settings for MBRL (PaETS, PETS, and MPPI). For SAC, we used the default parameters from the original codebase.

Parameter | HalfCheetah | Ant | Hopper | Walker2d |
---|---|---|---|---|
$T$: prediction horizon | 30 | 30 | 60 | 45 |
$\kappa $: weight of entropy regularizer | 0.5 | 0.25 | 0.5 | 0.5 |
$K$: # sampled actions | 500 | 500 | 500 | 500 |
$P$: # trajectories for each action | 20 | 20 | 20 | 20 |
$U$: # optimization-iterations | 5 | 5 | 5 | 5 |
$H$: episode length | 1000 | 1000 | 1000 | 1000 |
$E$: # neural networks | 5 | 5 | 5 | 5 |
hidden nodes | (200, 200, 200, 200) | (200, 200, 200, 200) | (200, 200, 200, 200) | (200, 200, 200, 200) |
activation function | Swish | Swish | Swish | Swish |
optimizer | Adam | Adam | Adam | Adam |
learning rate | ${10}^{-3}$ | ${10}^{-3}$ | ${10}^{-3}$ | ${10}^{-3}$ |
batch-size | 160 | 160 | 160 | 160 |