Tutorial and Survey on Probabilistic Graphical Model and Variational Inference in Deep Reinforcement Learning
Abstract
Probabilistic Graphical Models and Variational Inference play an important role in recent advances in Deep Reinforcement Learning. As a self-contained tutorial survey, this article illustrates the basic concepts of reinforcement learning with Probabilistic Graphical Models and offers derivations of some basic formulas as a recap. Recent advances in deep reinforcement learning are reviewed and compared from various aspects. We offer Probabilistic Graphical Models, detailed explanations and derivations for several use cases of graphical models and Variational Inference, which serve as complementary material on top of the original contributions.
I Introduction
Reinforcement Learning, powered by Deep Neural Networks, has been gaining increasing attention recently due to its great success in complicated tasks like games [19] and robot locomotion [23], as well as optimization tasks like Automatic Machine Learning [28]. In contrast to the existing survey [2] on Deep Reinforcement Learning, this paper focuses on the application of Probabilistic Graphical Models and Variational Inference, especially Amortized Variational Inference [14], in Deep Reinforcement Learning. Specifically, we make the following contributions:
• We cover a taxonomy of Graphical Models and Variational Inference [4] used in Deep Reinforcement Learning and give detailed derivations for many of the critical equations, which are either not given in the original contributions [30, 15, 11] or are given in a slightly different form. This makes the paper relatively standalone, so that it can serve as a self-contained tutorial from beginner to advanced level.
IA Organization of the paper
In section IB, we first introduce the fundamentals of Probabilistic Graphical Models and Variational Inference. We then review the basics of reinforcement learning by connecting it to probabilistic graphical models (PGM) in section IIA, followed by the basics and an incomplete overview of deep reinforcement learning, accompanied by a comparison of different methods, in section IIE. In section III, we discuss how undirected graphs could be used to model both the value function and the policy, which works well on high dimensional discrete state and action spaces. In section IV, we introduce the directed acyclic graph framework that treats the policy as a posterior on actions, while adding many proofs that do not exist in the original contributions. In section V, we introduce works that use variational inference to approximate the environment model, while adding graphical models and proofs that do not exist in the original contributions.
IB Prerequisite on Probabilistic Graphical Models and Variational Inference, Terminologies and Conventions
Directed Acyclic Graphs (DAG) [3] as a PGM offer an intuitive way of defining factorized joint distributions of Random Variables (RV) by assuming conditional independence [3] through d-separation [3]. We use a capital letter to denote an RV and the corresponding lower case letter to denote its realization. To avoid a symbol collision with $A$, which represents the advantage in much of the RL literature, we use ${A}^{act}$ explicitly to represent the action. We use $(B\perp\!\!\!\perp C)\mid A$ to denote that $B$ is conditionally independent of $C$ given $A$, which is equivalent to writing $p(B\mid A,C)=p(B\mid A)$ or $p(B,C\mid A)=p(B\mid A)p(C\mid A)$.
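As a minimal numerical sketch of the conditional independence statement above, the following Python snippet builds a joint distribution from the DAG factorization $B\leftarrow A\to C$ and checks that $p(b\mid a,c)=p(b\mid a)$ for every assignment. The three binary variables and all probability tables are hypothetical values chosen for illustration only.

```python
from itertools import product

# Hypothetical binary variables A, B, C with B and C independent given A.
p_a = {0: 0.3, 1: 0.7}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_c_given_a = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.6, 1: 0.4}}

# Joint distribution built from the DAG factorization B <- A -> C.
joint = {(a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]
         for a, b, c in product((0, 1), repeat=3)}


def prob(**fixed):
    """Marginal probability of the assignment given by keywords a, b, c."""
    return sum(p for (a, b, c), p in joint.items()
               if all({"a": a, "b": b, "c": c}[k] == v
                      for k, v in fixed.items()))


def max_ci_violation():
    """Largest |p(b | a, c) - p(b | a)| over all assignments."""
    worst = 0.0
    for a, b, c in product((0, 1), repeat=3):
        lhs = prob(a=a, b=b, c=c) / prob(a=a, c=c)
        rhs = prob(a=a, b=b) / prob(a=a)
        worst = max(worst, abs(lhs - rhs))
    return worst
```

Calling `max_ci_violation()` returns a value at the level of floating point round-off, since the joint was constructed to satisfy the independence by design.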
Variational Inference (VI) approximates an intractable posterior distribution, usually specified by a probabilistic graphical model, with a variational proposal posterior distribution by optimizing the Evidence Lower Bound (ELBO) [4], which infers the values of the latent unobserved variables at the same time. Variational Inference is widely used in the deep learning community, for example in variational resampling [27]. VI is also used to approximate the posterior over the weights of neural networks for Thompson Sampling, to tackle the exploration-exploitation trade-off in bandit problems [5], as well as to approximate the distribution over activations, as in the Variational AutoEncoder using Amortized VI [14].
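To make the role of the ELBO concrete, the sketch below evaluates $\mathrm{ELBO}(q)={E}_{q}[\mathrm{log}\,p(x,z)-\mathrm{log}\,q(z)]$ for a discrete latent variable, where the evidence can still be computed exactly for comparison. The prior, the likelihood values and the proposal `uniform_q` are hypothetical; the point is only that the gap between the log evidence and the ELBO equals the KL divergence from the proposal to the true posterior and vanishes when the proposal is exact.

```python
import math

# Hypothetical model: latent z in {0, 1, 2}, one fixed observation x.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p(x | z) for the observed x

joint = [pz * px for pz, px in zip(prior, likelihood)]  # p(x, z)
log_evidence = math.log(sum(joint))                      # log p(x)
posterior = [j / sum(joint) for j in joint]              # p(z | x)


def elbo(q):
    """ELBO(q) = E_q[log p(x, z) - log q(z)] for a discrete proposal q."""
    return sum(qz * (math.log(j) - math.log(qz))
               for qz, j in zip(q, joint) if qz > 0.0)


uniform_q = [1.0 / 3.0] * 3
gap_uniform = log_evidence - elbo(uniform_q)  # KL(q || posterior), positive
gap_exact = log_evidence - elbo(posterior)    # zero when q is the posterior
```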
II Reinforcement Learning and Deep Reinforcement Learning
IIA Basics about Reinforcement Learning with graphical model
IIA1 RL Concepts, Terminology and Convention
As shown in Figure 1, Reinforcement Learning (RL) involves optimizing the behavior of an agent via interaction with the environment. At time $t$, the agent is in state ${S}_{t}$. By executing an action ${a}_{t}$ according to a policy [30] $\pi (a\mid {S}_{t})$, the agent jumps to another state ${S}_{t+1}$ while receiving a reward ${R}_{t}$. The discount factor $\gamma $ decides how much the immediate reward is favored over the longer term return; it also allows tractability in infinite horizon reinforcement learning [30] and reduces variance in the Monte Carlo setting [15]. The goal is to maximize the accumulated rewards, $G={\sum}_{t=0}^{T}{\gamma}^{t}{R}_{t}$, which is usually termed the return in the RL literature.
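As a small illustration of the return $G={\sum}_{t=0}^{T}{\gamma}^{t}{R}_{t}$, the sketch below accumulates a reward sequence backwards in time, which avoids computing powers of $\gamma $ explicitly. The function name `discounted_return` and the example rewards are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return G = sum_t gamma**t * R_t, accumulated from the last step back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Rewards 1, 0, 2 with gamma = 0.5 give 1 + 0.5 * 0 + 0.25 * 2 = 1.5.
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```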
For simplicity, we interchangeably use two conventions whenever convenient. In the first, an episode lasts from $t=0:T$, with $T\to \mathrm{\infty}$ corresponding to continuing, non-episodic reinforcement learning. In the second, $t\in \{0,\mathrm{\cdots},\mathrm{\infty}\}$, assuming that when an episode ends, the agent stays in a self-absorbing state with a null action while receiving a null reward.
By unrolling Figure 1, we get a sequence of state, action and reward tuples $\{({S}_{t},{A}_{t}^{act},{R}_{t})\}$ in an episode, which is termed a trajectory $\tau $ [33, 6]. Figure 2 illustrates part of a trajectory in one rollout. The state space $\mathcal{S}$ and action space $\mathcal{A}$, which can be either discrete or continuous and multidimensional, are each represented with one continuous dimension in Figure 2 and plotted orthogonally with different colors, while we use the thickness of the plate to represent the reward space $\mathcal{R}$.
IIA2 DAGs for (Partially Observed) Markov Decision Process
Reinforcement Learning is a stochastic decision process, which usually comes with three sources of uncertainty. That is, under a particular stochastic policy characterized by $\pi (a\mid s)=p(a\mid s)$, within a particular environment characterized by the state transition probability $p({s}_{t+1}\mid {s}_{t},a)$ and the reward distribution function $p({r}_{t}\mid {s}_{t},{a}_{t})$, a learning agent could observe different trajectories with different unrolling realizations. This is usually modeled as a Markov Decision Process [30], with its graphical model shown in Figure 3, where we could define a joint probability distribution over the trajectory of state, action and reward RVs. In Figure 3, we use dashed arrows connecting state and action to represent the policy. For a fixed policy, we have the trajectory likelihood in Equation (1)
$p(\tau )=\prod_{t=0}^{T}p({s}_{t+1}\mid {s}_{t},{a}_{t})\,p({r}_{t}\mid {s}_{t},{a}_{t})\,\pi ({a}_{t}\mid {s}_{t})$  (1) 
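Equation (1) can be evaluated directly for a tabular problem. The sketch below computes the log of the trajectory likelihood as a sum of the logs of its three factors per step; the two-state, two-action MDP, the policy table and the example trajectory `tau` are all hypothetical numbers chosen for illustration.

```python
import math

# Hypothetical two-state, two-action MDP; all numbers are illustrative.
policy = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}                # pi(a | s)
transition = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.3, 1: 0.7},
              (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.0, 1: 1.0}}  # p(s' | s, a)
reward_dist = {(0, 0): {0.0: 1.0}, (0, 1): {0.0: 0.5, 1.0: 0.5},
               (1, 0): {1.0: 1.0}, (1, 1): {0.0: 0.2, 1.0: 0.8}}   # p(r | s, a)


def trajectory_log_likelihood(steps):
    """Sum over steps (s, a, r, s_next) of the log of the factors in Eq. (1)."""
    total = 0.0
    for s, a, r, s_next in steps:
        total += math.log(transition[(s, a)][s_next])
        total += math.log(reward_dist[(s, a)][r])
        total += math.log(policy[s][a])
    return total


# One hypothetical rollout of three steps.
tau = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 1.0, 1)]
log_p_tau = trajectory_log_likelihood(tau)
```

Working in log space is a design choice for numerical stability on long trajectories, where the product in Equation (1) quickly underflows.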
Upon observation of a state ${s}_{t}$ in Figure 3, the action at the time step in question is conditionally independent of the state and action history ${\mathcal{E}}_{t}=\{{S}_{0},{A}_{0}^{act},\mathrm{\cdots},{S}_{t-1}\}$, which could be denoted as $({A}_{t}^{act}\perp\!\!\!\perp {\mathcal{E}}_{t})\mid {S}_{t}$.
A more realistic model, however, is the Partially Observable Markov Decision Process [12], with its Directed Acyclic Graph [3] representation shown in Figure 4, where the agent could only observe the state partially, obtaining ${O}_{t}$ through a non-invertible function of the latent state ${S}_{t}$ and the previous action ${a}_{t-1}$, as indicated in the figure by $p({o}_{t}\mid {s}_{t},{a}_{t-1})$, while the distributions on other edges are omitted since they are the same as in Figure 3. Under the graph specification of Figure 4, the observable ${O}_{t}$ is no longer Markov, but depends on the whole history. However, by introducing a probability distribution $b(S)$ over the hidden state $S$, termed the belief state [12], with ${\sum}_{\mathcal{S}}b(S)=1$ and $S$ taking values in $\mathcal{S}$, the decision process can again be treated as Markov over belief states [12].
IIB Value Function, Bellman Equation, Policy Iteration
Define state value function of state $s\in \mathcal{S}$ in Equation (2), where the corresponding Bellman Equation is derived in Equation (3).
${V}^{\pi}(s)$  
$=$  ${E}_{\pi ,\epsilon}\left[\sum_{i=0}^{\mathrm{\infty}}{\gamma}^{i}{R}_{t+i}({S}_{t+i},{A}_{t+i}^{act})\mid \forall {S}_{t}=s\right]$  (2)  
$=$  ${E}_{\pi ,\epsilon}\left[{R}_{t}({S}_{t},{A}_{t}^{act})+\gamma \sum_{i=1}^{\mathrm{\infty}}{\gamma}^{i-1}{R}_{t+i}({S}_{t+i},{A}_{t+i}^{act})\right]$  
$=$  ${E}_{\pi ,\epsilon}\left[{R}_{t}({S}_{t},{A}_{t}^{act})+\gamma \sum_{i'=0}^{\mathrm{\infty}}{\gamma}^{i'}{R}_{t+1+i'}({S}_{t+i'+1},{A}_{t+i'+1}^{act})\right]$  
$=$  ${E}_{\pi ,\epsilon}\left[{R}_{t}({S}_{t},{A}_{t}^{act})+\gamma {V}^{\pi}({S}_{t+1})\right]$  (3) 
where ${S}_{t+i+1}\sim p({s}_{t+i+1}\mid {s}_{t+i},{a}_{t+i})$ takes values in $\mathcal{S}$, ${A}_{t+i}^{act}\sim \pi (a\mid {S}_{t+i})$ takes values in $\mathcal{A}$, and the subscripts $\pi $ and $\epsilon $ of the expectation operator $E$ denote the probability distributions of the policy and of the environment (including transition probability and reward probability), respectively. The state-action value function [30] is defined in Equation (4),
${Q}^{\pi}(s,a)(\forall {S}_{t}=s,{A}_{t}^{act}=a)$  
$=$  ${E}_{\pi ,\epsilon}[{R}_{t}({S}_{t}=s,{A}_{t}^{act}=a)+{\displaystyle \sum _{i=1}^{\mathrm{\infty}}}{\gamma}^{i}{R}_{t+i}({S}_{t+i},{A}_{t+i}^{act})]$  (4)  
$=$  ${E}_{\pi ,\epsilon}[{R}_{t}({S}_{t}=s,{A}_{t}^{act}=a)+\gamma {V}^{\pi}({S}_{t+1})]$  (5) 
where Equation (5) states its relationship to the state value function.
Combining Equation (3) and Equation (4), we have
$$V(s)=\sum _{a}\pi (a\mid s)Q(s,a)$$  (6) 
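The Bellman equation (3) and the identities (5) and (6) can be checked numerically on a small tabular problem. The sketch below iterates Equation (3) as a fixed-point update to obtain ${V}^{\pi}$, then builds ${Q}^{\pi}$ through Equation (5). The MDP tables `P`, `R`, the policy `PI` and the discount are hypothetical, and the reward is taken to be its expectation $R(s,a)$ for brevity.

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
# Hypothetical tabular MDP and policy; all numbers are illustrative.
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}               # P[(s, a)][s_next]
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}  # expected reward
PI = {0: [0.5, 0.5], 1: [0.3, 0.7]}                        # PI[s][a] = pi(a | s)


def q_from_v(v, s, a):
    """Equation (5): Q(s, a) = E[R + gamma * V(S')]."""
    return R[(s, a)] + GAMMA * sum(P[(s, a)][s2] * v[s2] for s2 in STATES)


def evaluate_policy(sweeps=2000):
    """Iterate the Bellman equation (3) until it reaches its fixed point."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        v = [sum(PI[s][a] * q_from_v(v, s, a) for a in ACTIONS)
             for s in STATES]
    return v


V = evaluate_policy()
Q = {(s, a): q_from_v(V, s, a) for s in STATES for a in ACTIONS}
```

At the fixed point, `V[s]` agrees with `sum(PI[s][a] * Q[(s, a)] for a in ACTIONS)` for every state, which is Equation (6).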
Define optimal policy [30] to be
${\pi}^{*}$  $=\underset{\mathit{\pi}}{argmax}{V}^{\pi}(s),\forall s\in S$  
$=\underset{\mathit{\pi}}{argmax}{E}_{\pi}[{R}_{t}+\gamma {V}^{\pi}({S}_{t+1})]$  (7) 
Taking the optimal policy ${\pi}^{*}$ into the Bellman Equation in Equation (3), we have
${V}^{{\pi}^{*}}(s)={E}_{{\pi}^{*},\epsilon}\left[{R}_{t}(s,{A}_{t}^{act})+\gamma {V}^{{\pi}^{*}}({S}_{t+1})\right]$  (8) 
Taking the optimal policy ${\pi}^{*}$ into Equation (4), we have
${Q}^{{\pi}^{*}}(s,a)={E}_{{\pi}^{*},\epsilon}[{R}_{t}(s,a)+{\displaystyle \sum _{i=1}^{\mathrm{\infty}}}{\gamma}^{i}{R}_{t+i}({S}_{t+i},{A}_{t+i}^{act})]$  (9) 
Based on Equation (9) and Equation (8), we get
${V}^{{\pi}^{*}}(s)=\underset{\mathit{a}}{max}{Q}^{{\pi}^{*}}(s,a)$  (10) 
and
${Q}^{{\pi}^{*}}(s,a)={E}_{\epsilon ,{\pi}^{*}}\left[{R}_{t}(s,a)+\gamma \underset{\overline{a}}{max}{Q}^{{\pi}^{*}}({S}_{t+1},\overline{a})\right]$  (11) 
For learning the optimal policy and value function, General Policy Iteration [30] can be conducted, as shown in Figure 5, where a contracting process [30] is drawn. Starting from initial policy ${\pi}_{0}$, the corresponding value function ${V}^{{\pi}_{0}}$ could be estimated, which could result in improved policy ${\pi}_{1}$ by greedy maximization over actions. The contracting process is supposed to converge to the optimal policy ${\pi}^{*}$.
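The alternation between evaluation and greedy improvement described above can be sketched for a tabular MDP as follows. The tables `P` and `R`, the discount and the choice of a deterministic policy stored as a list are assumptions made for illustration; at convergence the returned value function satisfies Equation (10).

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
# Hypothetical tabular MDP; all numbers are illustrative.
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}               # P[(s, a)][s_next]
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}  # expected reward


def q_value(v, s, a):
    """One-step lookahead R(s, a) + gamma * E[V(S')]."""
    return R[(s, a)] + GAMMA * sum(P[(s, a)][s2] * v[s2] for s2 in STATES)


def evaluate(policy, sweeps=2000):
    """Policy evaluation for a deterministic policy, policy[s] = a."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        v = [q_value(v, s, policy[s]) for s in STATES]
    return v


def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = [0 for _ in STATES]                 # initial policy pi_0
    while True:
        v = evaluate(policy)                     # evaluation step
        improved = [max(ACTIONS, key=lambda a: q_value(v, s, a))
                    for s in STATES]             # greedy improvement step
        if improved == policy:
            return policy, v
        policy = improved


optimal_policy, optimal_v = policy_iteration()
```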
As the theoretical foundations of learning algorithms, dynamic programming and Monte Carlo learning serve as two extremes, namely complete knowledge of the environment and completely model-free learning [30], while temporal difference learning [30] is more ubiquitously used and acts as a bridge connecting the two. Temporal difference learning is based on the Bellman update error ${\delta}_{t}=Q({s}_{t},{a}_{t})-\left({R}_{t}({s}_{t},{a}_{t})+\gamma \underset{a}{\max}Q({s}_{t+1},a)\right)$.
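A single temporal difference update with the error ${\delta}_{t}$ defined above can be sketched as follows. The sign convention follows the text, so the table entry moves against ${\delta}_{t}$; the step size `alpha`, the table `q_table` and the example transition are hypothetical.

```python
def td_error(q, s, a, r, s_next, gamma):
    """delta = Q(s, a) - (r + gamma * max_a' Q(s_next, a')), sign as in the text."""
    target = r + gamma * max(q[s_next])
    return q[s][a] - target


def q_learning_step(q, s, a, r, s_next, gamma=0.9, alpha=0.5):
    """Move Q(s, a) a fraction alpha towards the bootstrapped target."""
    delta = td_error(q, s, a, r, s_next, gamma)
    q[s][a] -= alpha * delta
    return delta


q_table = [[0.0, 0.0], [1.0, 2.0]]   # hypothetical Q[s][a]
delta = q_learning_step(q_table, s=0, a=1, r=1.0, s_next=1)
```

Here the target is $1+0.9\cdot 2=2.8$, so ${\delta}_{t}=-2.8$ and the entry `q_table[0][1]` moves from 0 to 1.4.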
IIC Policy Gradient and Actor Critic
Reinforcement Learning could be viewed as a functional optimization process. We could define an objective function over a policy ${\pi}_{\theta}(a\mid s)$, as a functional, characterized by parameter $\theta $, which could correspond to the neural network weights, for example.
Suppose all episodes start from an auxiliary initial state ${s}_{0}$, which with probability $h(s)$, jumps to different state $s\in \mathcal{S}$ without reward. $h(s)$ characterizes the initial state distribution which only depends on the environment. Let $\eta (s)$ represent the expected number of steps spent on state $s$, which can be calculated by summing up the $\gamma $ discounted probability ${P}^{\pi}({s}_{0}\to s,k+1)$ of entering state $s$ with $k+1$ steps from auxiliary state ${s}_{0}$, as stated in Equation (12), which can be thought of as the expectation of the R.V. ${\gamma}^{k}$ conditional on state $s$.
$\eta (s)$  $=\sum_{k=0}{\gamma}^{k}{P}^{\pi}({s}_{0}\to s,k+1)$  (12)  
$=h(s)+\sum_{\overline{s},a}\gamma \eta (\overline{s}){\pi}_{\theta}(a\mid \overline{s})P(s\mid \overline{s},a)$  (13) 
In Equation (13), the quantity is calculated by either directly starting from state $s$, which corresponds to $k=0$ in Equation (12), or entering state $s$ from state $\overline{s}$ with one step, which corresponds to $k+1\ge 2$ in Equation (12).
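The agreement between the series form in Equation (12) and the recursive form in Equation (13) can be verified numerically. In the sketch below, the policy and the environment are folded into a single state-to-state kernel `P_PI[s_bar][s]`, standing for ${\sum}_{a}{\pi}_{\theta}(a\mid \overline{s})P(s\mid \overline{s},a)$; this kernel, the initial distribution `H` and the discount are hypothetical numbers.

```python
GAMMA = 0.9
H = [0.6, 0.4]                      # hypothetical initial distribution h(s)
# Hypothetical kernel under the policy:
# P_PI[s_bar][s] = sum_a pi(a | s_bar) * P(s | s_bar, a)
P_PI = [[0.7, 0.3], [0.2, 0.8]]
N = len(H)


def eta_by_series(terms=2000):
    """Equation (12): sum over k of gamma^k * P(s_0 -> s, k + 1)."""
    eta = [0.0] * N
    reach = H[:]                     # P(s_0 -> s, 1) = h(s)
    for k in range(terms):
        eta = [e + (GAMMA ** k) * p for e, p in zip(eta, reach)]
        reach = [sum(reach[sb] * P_PI[sb][s] for sb in range(N))
                 for s in range(N)]
    return eta


def eta_by_recursion(sweeps=2000):
    """Equation (13) iterated as a fixed-point update."""
    eta = [0.0] * N
    for _ in range(sweeps):
        eta = [H[s] + GAMMA * sum(eta[sb] * P_PI[sb][s] for sb in range(N))
               for s in range(N)]
    return eta
```

Both routines return the same vector up to round-off, and its entries sum to $1/(1-\gamma )$, the total discounted time spent over all states.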
For an arbitrary state $s\in \mathcal{S}$, using ${s}^{{}^{\prime}}$ and ${s}^{{}^{\prime \prime}}$ to represent subsequent states as dummy index, we have
${\nabla}_{\theta}{V}^{\pi (\theta )}(s)$  
$=$  ${\nabla}_{\theta}\left[\sum_{a}{Q}^{\pi (\theta )}(s,a){\pi}_{\theta}(a\mid s)\right]$  (14)  
$=$  $\sum_{a}\left[{\nabla}_{\theta}{Q}^{\pi (\theta )}(s,a){\pi}_{\theta}(a\mid s)+{\nabla}_{\theta}{\pi}_{\theta}(a\mid s){Q}^{\pi (\theta )}(s,a)\right]$  
$=$  $\sum_{a}{\nabla}_{\theta}\left[\sum_{s',R}P(s',R\mid s,a)\left(R+\gamma {V}^{\pi (\theta )}(s')\right)\right]{\pi}_{\theta}(a\mid s)+\sum_{a}{\nabla}_{\theta}{\pi}_{\theta}(a\mid s){Q}^{\pi (\theta )}(s,a)$  (15)  
$=$  $\sum_{a}\sum_{s'}\gamma P(s'\mid s,a){\nabla}_{\theta}{V}^{\pi (\theta )}(s'){\pi}_{\theta}(a\mid s)+\sum_{a}{\nabla}_{\theta}{\pi}_{\theta}(a\mid s){Q}^{\pi (\theta )}(s,a)$  (16)  
$=$  $\sum_{a}{\nabla}_{\theta}{\pi}_{\theta}(a\mid s){Q}^{\pi (\theta )}(s,a)+\sum_{a}\sum_{s'}\gamma P(s'\mid s,a){\pi}_{\theta}(a\mid s)\left[\sum_{a'}\sum_{s''}\gamma P(s''\mid s',a'){\nabla}_{\theta}{V}^{\pi (\theta )}(s''){\pi}_{\theta}(a'\mid s')+\sum_{a'}{\nabla}_{\theta}{\pi}_{\theta}(a'\mid s'){Q}^{\pi (\theta )}(s',a')\right]$  (17) 
The terms in square brackets in Equation (17) are simply Equation (16) with $a$ and ${s}^{{}^{\prime}}$ replaced by ${a}^{{}^{\prime}}$ and ${s}^{{}^{\prime \prime}}$. Since ${\nabla}_{\theta}{V}^{\pi (\theta )}({s}^{\mathrm{\infty}})=0$, Equation (17) could be written as Equation (18),
${\nabla}_{\theta}{V}^{\pi (\theta )}(s)$  
$=$  $\sum_{a}{\nabla}_{\theta}{\pi}_{\theta}(a\mid s){Q}^{\pi (\theta )}(s,a)+\sum_{k=1}\sum_{{s}_{k}}\sum_{{a}_{k}}{\gamma}^{k}{P}^{\pi}(s\to {s}_{k},k){\nabla}_{\theta}{\pi}_{\theta}({a}_{k}\mid {s}_{k}){Q}^{\pi (\theta )}({s}_{k},{a}_{k})$  (18) 
where ${s}_{k}$ represents the state $k$ steps after $s$, and ${P}^{\pi}(s\to {s}_{k},k)$ already includes the integration over the intermediate states ${s}_{k-1},\mathrm{\dots},{s}_{1}$ before reaching state ${s}_{k}$.
Let objective function with respect to policy be defined to be the value function starting from auxiliary state ${s}_{0}$ as in Equation (19).
$J({\pi}_{\theta})={V}^{\pi}({s}_{0})={E}_{\pi ,\epsilon}{\displaystyle \sum _{t=0}^{\mathrm{\infty}}}{\gamma}^{t}{R}_{t}({S}_{0}=s)$  (19) 
The optimal policy could be obtained by gradient ascent optimization, leading to the policy gradient algorithm [30], as in Equation (24).
${\nabla}_{\theta}J({\pi}_{\theta})$  
$=$  ${\nabla}_{\theta}{V}^{\pi}({s}_{0})$  
$=$  $\sum_{k=0}\sum_{{s}_{k}}\sum_{{a}_{k}}{\gamma}^{k}{P}^{\pi}({s}_{0}\to {s}_{k},k){\nabla}_{\theta}{\pi}_{\theta}({a}_{k}\mid {s}_{k}){Q}^{\pi (\theta )}({s}_{k},{a}_{k})$  
$=$  $\sum_{s}\sum_{a}\eta (s){\nabla}_{\theta}{\pi}_{\theta}(a\mid s){Q}^{\pi (\theta )}(s,a)$  (20)  
$=$  $\frac{{\sum}_{\overline{s}}\eta (\overline{s})}{{\sum}_{\overline{s}}\eta (\overline{s})}\sum_{s}\sum_{a}\eta (s){\nabla}_{\theta}{\pi}_{\theta}(a\mid s){Q}^{\pi (\theta )}(s,a)$  (21)  
$=$  $\sum_{\overline{s}}\eta (\overline{s})\sum_{s}\sum_{a}\mu (s){\nabla}_{\theta}{\pi}_{\theta}(a\mid s){Q}^{\pi (\theta )}(s,a)$  (22)  
$=$  $\sum_{\overline{s}}\eta (\overline{s})\sum_{s}\sum_{a}\mu (s)\frac{{\pi}_{\theta}(a\mid s)}{{\pi}_{\theta}(a\mid s)}{\nabla}_{\theta}{\pi}_{\theta}(a\mid s){Q}^{\pi (\theta )}(s,a)$  (23)  
$\propto $  ${E}_{\pi}\left[\frac{{\nabla}_{\theta}{\pi}_{\theta}(A\mid S)}{{\pi}_{\theta}(A\mid S)}{\widehat{Q}}^{\pi (\theta )}(S,A)\right]$  (24) 
where $\mu (s)=\eta (s)/{\sum}_{\overline{s}}\eta (\overline{s})$ is the relative occupancy of state $s$. The summation weighted by $\mu (s)$ over $s$, together with the summation over $a$ weighted by the ${\pi}_{\theta}(a\mid s)$ in the numerator of Equation (23), is replaced in Equation (24) by an expectation with respect to interaction with the environment, and ${Q}^{\pi (\theta )}(s,a)$ is replaced by an estimator ${\widehat{Q}}^{\pi (\theta )}(S,A)$, which is usually ${G}_{t}={\sum}_{i=0}^{\mathrm{\infty}}{\gamma}^{i}{R}_{t+i}$. A function approximation ${f}_{w}(s,a)$ to $Q$ should be compatible with the policy parameterization in the sense that $\frac{\partial {f}_{w}(s,a)}{\partial w}=\frac{\partial \pi (s,a)}{\partial \theta}\frac{1}{\pi (s,a)}$ [31].
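The likelihood-ratio form in Equation (24) can be checked numerically in the simplest setting of a single state with fixed action values, where $J(\theta )={\sum}_{a}{\pi}_{\theta}(a)Q(a)$. The sketch below assumes a softmax policy over logits $\theta $, for which ${\nabla}_{\theta}\mathrm{log}\,{\pi}_{\theta}(a)$ is the one-hot vector of $a$ minus ${\pi}_{\theta}$, and compares the exact expectation of the score-function estimator with a finite difference of $J$. The logits `theta0` and the values `q_hat` are hypothetical.

```python
import math


def softmax(theta):
    """Softmax policy over logits, shifted by the max for stability."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def objective(theta, q):
    """One-state objective J(theta) = sum_a pi_theta(a) * Q(a)."""
    return sum(p * qa for p, qa in zip(softmax(theta), q))


def score_function_gradient(theta, q):
    """E_pi[grad log pi(A) * Q(A)], with grad log pi(a) = onehot(a) - pi."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (pa, qa) in enumerate(zip(pi, q)):
        for i in range(len(theta)):
            score_i = (1.0 if i == a else 0.0) - pi[i]
            grad[i] += pa * score_i * qa
    return grad


def finite_difference_gradient(theta, q, eps=1e-6):
    """Central finite difference of the objective, used only as a check."""
    grad = []
    for i in range(len(theta)):
        up = theta[:]
        up[i] += eps
        down = theta[:]
        down[i] -= eps
        grad.append((objective(up, q) - objective(down, q)) / (2 * eps))
    return grad


theta0 = [0.1, -0.3, 0.5]    # hypothetical logits
q_hat = [1.0, 2.0, 0.5]      # hypothetical action values, held fixed
```

In practice the expectation is replaced by samples from interaction and `q_hat` by the return ${G}_{t}$, which is where the variance discussed next comes from.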
The policy gradient could be augmented with a baseline $b(s)$ that contributes zero gradient to the objective function $J({\pi}_{\theta})$ in Equation (23). The baseline is a function of state $s$ that does not involve the policy parameters $\theta $, and it leaves the gradient unchanged since ${\sum}_{a}{\nabla}_{\theta}{\pi}_{\theta}(a\mid s)=0$. To reduce the variance of the gradient, the baseline is usually chosen to be the state value function estimator ${\widehat{V}}_{w}(s)$, which smooths out the variation of $Q(s,a)$ at each state, while ${\widehat{V}}_{w}(s)$ is updated in a Monte Carlo way by comparing with ${\widehat{Q}}^{{\pi}_{\theta}}(S,A)={G}_{t}$.
The actor-critic algorithm [30] replaces ${G}_{t}-{V}_{w}({s}_{t})$ with ${R}_{t}+\gamma {V}_{w}({s}_{t+1})-{V}_{w}({s}_{t})$, so bootstrapping is used instead of Monte Carlo estimation.
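The effect of a baseline can be seen exactly in a single-state example. The sketch below computes, by enumeration over actions, the mean and the total variance of the per-sample estimator ${\nabla}_{\theta}\mathrm{log}\,\pi (A)\,(Q(A)-b)$ for a softmax policy, once without a baseline and once with the state value as baseline. The probabilities `pi_s` and values `q_s` are hypothetical, and the variance reduction shown holds for these numbers rather than as a general guarantee.

```python
def score(pi, a):
    """grad log pi(a) with respect to softmax logits: onehot(a) - pi."""
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(pi)]


def estimator_stats(pi, q, baseline):
    """Exact mean and total variance of g(A) = score(A) * (Q(A) - b), A ~ pi."""
    n = len(pi)
    mean = [0.0] * n
    second_moment = 0.0
    for a in range(n):
        g = [s_i * (q[a] - baseline) for s_i in score(pi, a)]
        mean = [m + pi[a] * g_i for m, g_i in zip(mean, g)]
        second_moment += pi[a] * sum(g_i * g_i for g_i in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


pi_s = [0.2, 0.5, 0.3]        # hypothetical pi(. | s)
q_s = [10.0, 11.0, 12.0]      # hypothetical Q(s, .)
v_s = sum(p * q for p, q in zip(pi_s, q_s))   # state value used as baseline

mean_plain, var_plain = estimator_stats(pi_s, q_s, baseline=0.0)
mean_base, var_base = estimator_stats(pi_s, q_s, baseline=v_s)
```

The two means coincide, reflecting the zero-gradient property of the baseline, while `var_base` is far smaller than `var_plain` because the action values share a large common offset.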
IID Policy Optimization and Trust Region Method
Another category of policy optimization methods is built on the identity from [13] in Equation (25), where ${\eta}_{{\pi}^{new}}(s)$ means the state visitation frequency under policy ${\pi}^{new}$, and the advantage is ${A}^{{\pi}^{old}}({a}_{t},{s}_{t})={Q}^{{\pi}^{old}}({a}_{t},{s}_{t})-{V}^{{\pi}^{old}}({s}_{t})$.
$J({\pi}^{new})$  $=J({\pi}^{old})+\sum_{s}{\eta}_{{\pi}^{new}}(s)\sum_{a}{\pi}^{new}(a\mid s){A}^{{\pi}^{old}}({a}_{t},{s}_{t})$  (25) 
Based on the Policy Advantage [13] ${A}_{{\pi}^{old}}({\pi}^{new})={\sum}_{s}{\eta}_{{\pi}^{old}}(s){\sum}_{a}{\pi}^{new}(a\mid s){A}^{{\pi}^{old}}({a}_{t},{s}_{t})$, a local approximation ${L}_{{\pi}^{old}}({\pi}^{new})$ could be defined as in Equation (26), which differs from Equation (25) only in using the visitation frequency of the old policy. It could form a surrogate function $M$ in Equation (27) that minorizes ${J}^{{\pi}^{new}}$ at ${\pi}^{old}$, where ${D}_{KL}^{max}({\pi}^{old},{\pi}^{new})=\underset{s}{\max}{D}_{KL}({\pi}^{old}(a\mid s),{\pi}^{new}(a\mid s))$ is the maximum KL divergence, so the minorize-maximization (MM) algorithm could be used to improve the policy, leading to the trust region method [23].
${L}_{{\pi}^{old}}({\pi}^{new})$  
$=$  $J({\pi}^{old})+\sum_{s}{\eta}_{{\pi}^{old}}(s)\sum_{a}{\pi}^{new}(a\mid s){A}^{{\pi}^{old}}({a}_{t},{s}_{t})$  (26) 
$M={L}_{{\pi}^{old}}({\pi}^{new})-\frac{4\underset{a,s}{\max}\left|{A}^{{\pi}^{old}}(s,a)\right|\gamma}{{(1-\gamma )}^{2}}{D}_{KL}^{max}({\pi}^{old},{\pi}^{new})$  (27) 
IIE Basics of Deep Reinforcement Learning
Deep Q learning [19] made a breakthrough in using neural networks as function approximators on complicated tasks. It solves the experience correlation problem with a replay memory and the unstable target problem with a frozen target network. Specifically, reinforcement learning is transformed into a supervised learning task by fitting the target ${R}_{t}+\gamma \underset{a}{\max}Q({s}_{t+1},a)$ on samples from the replay memory, with state ${s}_{t}$ as input. However, the target can drift easily, which leads to unstable learning. In [19], a target network, which is itself updated only occasionally, is used to provide a stable target for the network being learned. Double Deep Q learning [32], in contrast, solves the problem by having two Q networks and updating their parameters in an alternating way.
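The role of the frozen target network can be illustrated with a tabular stand-in for the two networks. In the sketch below, `online_q` and `target_q` are hypothetical tables in place of neural networks, the batch layout `(s, a, r, s_next, done)` is an assumed replay format, and treating terminal transitions as having no bootstrap term is an assumption consistent with the null reward convention after the end of an episode.

```python
import copy

GAMMA = 0.99


def dqn_targets(batch, target_q):
    """Targets r + gamma * max_a Q_target(s_next, a) for a replay batch.

    Terminal transitions (done=True) use the reward alone.
    """
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else GAMMA * max(target_q[s_next])
        targets.append(r + bootstrap)
    return targets


online_q = [[0.0, 1.0], [2.0, 0.5]]      # hypothetical online estimates Q[s][a]
target_q = copy.deepcopy(online_q)        # frozen copy, refreshed only occasionally

batch = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 0, True)]
before = dqn_targets(batch, target_q)
online_q[1][0] = 5.0                      # an online update does not move the targets
after = dqn_targets(batch, target_q)
```

Because the targets are computed from `target_q`, they stay fixed while `online_q` changes, which is the stabilizing effect described above.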
IIE1 On Policy methods
A3C [18] stands out among the asynchronous methods in deep reinforcement learning [18], which can be run in parallel on a single multicore CPU. Trust Region Policy Optimization [23] and Proximal Policy Optimization [24] assimilate the natural policy gradient and use a local approximation to the expected return. The local approximation could serve as a lower bound for the expected return, which can be optimized safely subject to a KL divergence constraint between two subsequent policies, while in practice the constraint is relaxed to a regularization term.
IIE2 Off Policy methods
IIE3 Goal based Reinforcement Learning
In robot manipulation tasks, the goal could be represented with a state in some cases [33]. The Universal Value Function Approximator (UVFA) [21] incorporates the goal into the deep neural network, which lets the neural network function approximator also generalize to goal changes in tasks, similar to a recommendation system. Work in this direction includes [1, 33], for example.
IIE4 Replay Memory Manipulation based Method
Replay memory is a critical component in Deep Reinforcement Learning, which solves the problem of correlated transitions within one episode. Beyond the uniform sampling of replay memory in Deep Q Network [19], Prioritized Experience Replay [22] improves the performance by giving priority to those transitions with larger TD error, while Hindsight Experience Replay (HER) [1] manipulates the replay memory by changing the goals of stored transitions, which changes their rewards and promotes exploration. Maximum entropy regularized multi-goal reinforcement learning [33] gives priority to rarely occurring trajectories in sampling, which has been shown to improve over HER [33].
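Prioritization by TD error can be sketched as sampling transitions with probability proportional to a power of the absolute TD error. The snippet below is a minimal illustration of that proportional scheme only; the exponent `alpha`, the small constant `eps` that keeps every transition reachable, and the example errors are hypothetical values, and importance-sampling corrections are left out.

```python
import random


def priorities_to_probs(td_errors, alpha=0.6, eps=1e-3):
    """Sampling probability proportional to (|delta_i| + eps) ** alpha."""
    scaled = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(td_errors, batch_size, rng):
    """Draw replay indices with replacement according to their priorities."""
    probs = priorities_to_probs(td_errors)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)


td_errors = [0.1, -2.0, 0.5, 0.0]          # hypothetical TD errors
probs = priorities_to_probs(td_errors)
batch = sample_indices(td_errors, batch_size=8, rng=random.Random(0))
```

Setting `alpha` to zero recovers the uniform sampling used in the original Deep Q Network.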
IIE5 Exploration with sparse reward
In complicated real environments, an agent has to explore along a long trajectory before it can get any reward as feedback. Due to the lack of sufficient reward signal, traditional Reinforcement Learning methods perform poorly, which has led to many contributions on methods for sufficient exploration. The methods using graphical models and variational inference that we introduce later each use a different mechanism to explore the environment. In the following sections, we give detailed explanations of how graphical models and variational inference could be used to model and optimize the reinforcement learning process, with one section per category. Together with the methods mentioned above, we compare them in Table I.
IIE6 Surrogate optimization
Like the surrogate models used in Bayesian Optimization [26], lower bound surrogates are also used in Reinforcement Learning, including the Evidence Lower Bound (ELBO) based methods introduced below and Trust Region Policy Optimization (TRPO) [23].
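Algorithm  S (state space)  A (action space)  standalone  var (variational distribution)  p (on/off policy)

Deep Q  c  d  y  na  off
A3C  c  c/d  y  na  on
TRPO/PPO  c  c/d  y  na  on
DDPG  c  c  y  na  off
Boltzmann  d  d  y  na  on
VIME  c  c  n  ${p}_{\theta}\left({s}_{t+1}\mid {s}_{t},{a}_{t}\right)$  na
VAST  c  d  n  $p\left({s}_{t}\mid {o}_{t-k}\right)$  na
SoftQ  c  c/d  y  $p\left({a}_{t}\mid {s}_{t}\right)$  on
c: continuous, d: discrete, y/n: yes/no, na: not applicable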
III Policy and value function with undirected graphs
We first discuss the application of undirected graphs in deep reinforcement learning, taking the deep belief network as the example. Rather than modeling conditional distributions, as in directed acyclic graphs, undirected graphs model the joint distribution of the variables in question and focus on cliques [3] with an associated free energy, which could be used to model the value function in reinforcement learning. The Restricted Boltzmann Machine has the nice property of a tractable factorized posterior distribution over the latent variables conditioned on the observables, instead of requiring Gibbs sampling as in a general Boltzmann Machine.
In [20], the authors use the Restricted Boltzmann Machine to deal with MDPs with large state and action spaces, by modeling the state-action value function with the negative free energy of the graph, where the free energy of the graph could be easily calculated through the product of experts [20]. Specifically, the visible nodes of the Restricted Boltzmann Machine [20] consist of both state $s$ and action $a$ binary variables, as shown in Figure 6, and the hidden nodes consist of $L$ binary variables. The state variables $s$ are dark colored to indicate that they are observed, and the action variables are light colored to indicate that they need to be sampled. Together with the auxiliary hidden variables, the undirected graph defines a joint probability distribution over state and action pairs, which defines a stochastic policy network from which actions can be sampled for on-policy learning. Since it is easy to calculate the derivative of the free energy $F(s,a,h)$ with respect to the coefficients ${w}_{k,j}$ of the network, one could use temporal difference learning to update the coefficients in the network. Thanks to the properties of the Boltzmann Machine, the conditional distribution of actions given the state, $p(a\mid s)$, is still Boltzmann distributed and governed by the free energy. By adjusting the temperature, one could also vary the strength of exploration.
The conditional distribution of actions under state could serve as the policy, which is
$p(a\mid s)=\frac{1}{Z}{e}^{-{\sum}_{h}F(s,a,h)/T}=\frac{1}{Z}{e}^{{\sum}_{h}Q(s,a)/T}$  (28) 
where $Z$ is the partition function [3] and we use the negative free energy to approximate the state-action value function. Once the state-action value function $Q(s,a)$ in Equation (28) is learned as a critic [30], so that its associated policy is defined, MCMC sampling [3] could be used to sample actions, as an actor [30]. With the sampled actions, a temporal difference learning method like SARSA [30] could be carried out to update the state-action value function estimate. Such an on-policy process has been shown to be empirically effective in large state and action spaces [20].
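The effect of the temperature $T$ in Equation (28) can be illustrated for a small action set that can be enumerated, so that the partition function is a plain sum. This is a simplification made for illustration: with many binary action variables the enumeration is intractable, which is why MCMC sampling is used in [20]. The values `q_sa` are hypothetical.

```python
import math


def boltzmann_policy(q_values, temperature):
    """p(a | s) = exp(Q(s, a) / T) / Z, computed with a max-shift for stability."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    z = sum(weights)
    return [w / z for w in weights]


q_sa = [1.0, 2.0, 0.0]                 # hypothetical Q(s, .) = -F(s, .)
cold = boltzmann_policy(q_sa, 0.1)     # low temperature: nearly greedy
hot = boltzmann_policy(q_sa, 100.0)    # high temperature: nearly uniform
```

A low temperature concentrates the policy on the action with the largest value, while a high temperature spreads probability almost evenly and therefore explores more.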
IV Variational Inference on Policies
IVA Policy as "optimal" posterior
The Boltzmann Machine defined Product of Experts model in [20] works well for large state and action spaces, but is limited to discrete, specifically binary, state and action variables. For continuous state and action spaces, the authors of [9] proposed deep energy based models with Directed Acyclic Graphs (DAG) [3], which we reorganize in a different form in Figure 7 with annotations added. The difference with respect to Figure 3 is that, in Figure 7, the reward is not explicitly expressed in the directed graphical model. Instead, an auxiliary binary observable $O$ is used to define whether the corresponding action at the current step is optimal or not. The conditional probability of the action being optimal is $p({O}_{t}=1\mid {s}_{t},{a}_{t})=\mathrm{exp}(r({s}_{t},{a}_{t}))$, which connects conditional optimality with the amount of reward received, encouraging the agent to take highly rewarded actions in an exponential manner. Note that the reward here must be non-positive to ensure a valid probability, which does not hurt generality since the reward range can be translated [15].
The Graphical Model in Figure 7 in total defines the trajectory likelihood or the evidence in Equation (29):
$p(\tau )$  $=p({s}_{1})\prod_{t}\left[p({s}_{t+1}\mid {s}_{t},{a}_{t})p({O}_{t}=1\mid {s}_{t},{a}_{t})\right]$  
$=\left[p({s}_{1})\prod_{t}p({s}_{t+1}\mid {s}_{t},{a}_{t})\right]\mathrm{exp}\left(\sum_{t}r({s}_{t},{a}_{t})\right)$  (29) 
By doing so, the author imposes a functional form on top of the conditional independence structure of the graph by assigning a likelihood. In this way, computing the optimal policy becomes an inference problem of computing the posterior $p({a}_{t}\mid {s}_{t},{O}_{t:T}=1)$, which reads as the distribution of action ${a}_{t}$ conditional on optimality from the current time step until the end of the episode and on the current state being ${s}_{t}$; this posterior corresponds to the optimal policy. Observing the d-separation in Figure 7, ${O}_{1:t-1}$ is conditionally independent of ${a}_{t}$ given ${s}_{t}$, that is $({O}_{1:t-1}\perp\!\!\!\perp {A}_{t}^{act})\mid {S}_{t}$, so $p({a}_{t}\mid {s}_{t},{O}_{1:t-1}=1,{O}_{t:T}=1)=p({a}_{t}\mid {s}_{t},{O}_{t:T}=1)$.
IVB Message passing for exact inference on the posterior
In this section, we give a detailed derivation of exact inference on the policy posterior, which is not given in [15]. Although the results are not used due to unexpected behavior, there are theoretical insights worth noting.
The graph in Figure 7 is similar to Hidden Markov Models (HMM) [3], if we treat the tuple of variables $({a}_{t},{s}_{t})$ as the latent variable counterpart of an HMM, with emission probability $p({O}_{t}=1\mid {s}_{t},{a}_{t})=\mathrm{exp}(r({s}_{t},{a}_{t}))$, while the transition probability is from the variable tuple $({a}_{t},{s}_{t})$ to a subcomponent ${s}_{t+1}$ of the "latent" variable tuple $({a}_{t+1},{s}_{t+1})$.
Similar to the forward-backward message passing algorithm [3] in Hidden Markov Models [3], the posterior $p({a}_{t}\mid {s}_{t},{O}_{t:T}=1)$ could also be calculated by passing messages. We offer a detailed derivation of the decomposition of the posterior $p({a}_{t}\mid {s}_{t},{O}_{t:T}=1)$ in Equation (30), which is not available in [15].
$p({a}_{t}{s}_{t},{O}_{t:T}=1)$  
$=$  $\frac{p({a}_{t},{s}_{t},{O}_{t:T}=1)}{p({s}_{t},{O}_{t:T}=1)}$  
$=$  $\frac{p({O}_{t:T}=1{a}_{t},{s}_{t})p({a}_{t},{s}_{t})}{p({s}_{t},{O}_{t:T}=1)}$  
$=$  $\frac{p({O}_{t:T}=1{a}_{t},{s}_{t})p({a}_{t}{s}_{t})p({s}_{t})}{{\int}_{{a}_{t}{}^{\prime}}p({s}_{t},{a}_{t}^{{}^{\prime}},{O}_{t:T}=1)d\{{a}_{t}^{{}^{\prime}}\}}$  
$=$  $\frac{p({O}_{t:T}=1{a}_{t},{s}_{t})p({a}_{t}{s}_{t})p({s}_{t})}{{\int}_{{a}_{t}^{{}^{\prime}}}p({O}_{t:T}=1{a}_{t}{}^{\prime},{s}_{t})p({a}_{t}{}^{\prime}{s}_{t})p({s}_{t})d\{{a}_{t}^{{}^{\prime}}\}}$  
$=$  $\frac{p({O}_{t:T}=1{a}_{t},{s}_{t})p({a}_{t}{s}_{t})}{{\int}_{{a}_{t}^{{}^{\prime}}}p({O}_{t:T}=1{a}_{t}{}^{\prime},{s}_{t})p({a}_{t}{}^{\prime}{s}_{t})d\{{a}_{t}^{{}^{\prime}}\}}$  
$=$  $\frac{\beta ({a}_{t},{s}_{t})}{{\int}_{{a}_{t}^{{}^{\prime}}}\beta ({a}_{t}^{{}^{\prime}},{s}_{t})d\{{a}_{t}^{{}^{\prime}}\}}$  
$=$  $\frac{\beta ({a}_{t},{s}_{t})}{\beta ({s}_{t})}$  (30) 
In Equation (30), we define the message $\beta(a_t, s_t)=p(O_{t:T}=1 \mid a_t, s_t)\,p(a_t \mid s_t)$ and the message $\beta(s_t)=\int_{a_t'}\beta(a_t', s_t)\,d\{a_t'\}$. If we consider $p(a_t \mid s_t)$ as a prior with a trivial form [15], the only policy-related term becomes $p(O_{t:T}=1 \mid a_t, s_t)$.
In Hidden Markov Models (HMM) [3], if we use $O$ to represent the visible observed state, $S$ to represent the hidden latent state, and $T$ for the series length, then it is essential to calculate the posteriors $p(S_t \mid O_{1:T})$ and $p(S_t, S_{t+1} \mid O_{1:T})$, which are marginals of the complete posterior $p(S_{1:T} \mid O_{1:T})$. These posterior marginals can be computed from the forward message $\alpha(S_t)=p(O_{1:t}, S_t)$ and the backward message $\beta(S_t)=p(O_{t:T} \mid S_t)$, which is the probability distribution of observables from the current time step until the end of the sequence, conditional on the current latent state.
In contrast, here only the backward messages are relevant. Additionally, the backward message $\beta(a_t, s_t)$ here is not a probability distribution as in an HMM, but just a probability. In Figure 7, the backward message $\beta(a_t, s_t)$ can be decomposed recursively. Since [15] only gives the conclusion without derivation, we give a detailed derivation of this recursion in Equation (31).
$$\begin{aligned}
\beta(s_t, a_t) &= p(O_t=1, O_{t+1:T}=1 \mid s_t, a_t) \\
&= \frac{\int p(O_t=1, O_{t+1:T}=1, s_t, a_t, s_{t+1}, a_{t+1})\,d\{s_{t+1}, a_{t+1}\}}{p(s_t, a_t)} \\
&= \int p(O_{t+1:T}=1, s_{t+1}, a_{t+1}, O_t=1 \mid s_t, a_t)\,d\{s_{t+1}, a_{t+1}\} \\
&= \int p(O_{t+1:T}=1, s_{t+1}, a_{t+1} \mid s_t, a_t)\,p(O_t=1 \mid s_t, a_t)\,d\{s_{t+1}, a_{t+1}\} \\
&\qquad \text{since } (O_{t+1:T}, S_{t+1}, A_{t+1}) \perp\!\!\!\perp O_t \mid S_t, A_t \\
&= \int \frac{p(O_{t+1:T}=1, s_{t+1}, a_{t+1})}{p(s_{t+1}, a_{t+1})}\,\frac{p(s_{t+1}, s_t, a_t)}{p(s_t, a_t)}\,p(O_t=1 \mid s_t, a_t)\,d\{s_{t+1}, a_{t+1}\} \\
&= \int p(O_{t+1:T}=1 \mid s_{t+1}, a_{t+1})\,p(s_{t+1} \mid s_t, a_t)\,p(O_t=1 \mid s_t, a_t)\,d\{s_{t+1}, a_{t+1}\} \\
&= \int \beta(s_{t+1})\,p(s_{t+1} \mid s_t, a_t)\,p(O_t=1 \mid s_t, a_t)\,ds_{t+1}
\end{aligned}$$  (31) 
The recursion in Equation (31) starts from the last time step $T$ of an episode, where $\beta(s_T, a_T)=p(O_T=1 \mid s_T, a_T)=\exp(r(s_T, a_T))$.
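As a minimal sketch of the recursion in Equation (31), the following Python snippet computes the backward messages on a small tabular problem. It assumes finite state and action spaces (so integrals become sums), a time-independent reward with $r \le 0$ so that $\exp(r)$ is a valid probability, and a trivial action prior that is dropped from the messages. The function name `backward_messages` and the random toy problem are illustrative assumptions and do not come from [15].

```python
import numpy as np

def backward_messages(reward, transition, horizon):
    """Backward messages for a tabular problem (illustrative sketch).

    reward:     array [S, A], r(s, a) <= 0 so that exp(r) is a probability
    transition: array [S, A, S], p(s' | s, a)
    horizon:    number of time steps T
    """
    n_states, n_actions = reward.shape
    emission = np.exp(reward)                      # p(O_t = 1 | s_t, a_t)
    beta_sa = np.zeros((horizon, n_states, n_actions))
    beta_s = np.zeros((horizon, n_states))
    # Base case at the last time step: only O_T remains to be explained.
    beta_sa[-1] = emission
    beta_s[-1] = beta_sa[-1].sum(axis=1)
    for t in range(horizon - 2, -1, -1):
        # Equation (31): sum over s_{t+1} of beta(s_{t+1}) p(s_{t+1} | s_t, a_t)
        expected_next = transition @ beta_s[t + 1]  # shape [S, A]
        beta_sa[t] = emission * expected_next
        beta_s[t] = beta_sa[t].sum(axis=1)
    # Equation (30): the action posterior is the normalized message.
    posterior = beta_sa / beta_s[:, :, None]
    return beta_sa, beta_s, posterior

rng = np.random.default_rng(0)
reward = -rng.uniform(0.0, 1.0, size=(3, 2))          # toy rewards in [-1, 0]
transition = rng.dirichlet(np.ones(3), size=(3, 2))   # toy dynamics, shape [3, 2, 3]
beta_sa, beta_s, posterior = backward_messages(reward, transition, horizon=4)
```

The last line of the function normalizes the messages as in Equation (30), so each `posterior[t, s]` sums to one over actions.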
IVC Connection between Message Passing and Bellman equation
If we define the Q function as in Equation (32) and the V function as in Equation (33),
$$Q({s}_{t},{a}_{t})=\mathrm{log}(\beta ({a}_{t},{s}_{t}))$$  (32) 
$V({s}_{t})$  $=\mathrm{log}\beta ({s}_{t})$  
$=\mathrm{log}{\displaystyle \int \beta ({s}_{t},{a}_{t})\mathit{d}{a}_{t}}$  
$=\mathrm{log}{\displaystyle \int exp(Q({s}_{t},{a}_{t}))\mathit{d}{a}_{t}}\approx \underset{{a}_{t}}{max}Q({s}_{t},{a}_{t})$  (33) 
then the corresponding policy could be written as Equation (34).
$$\pi(a_t \mid s_t)=p(a_t \mid s_t, O_{t:T}=1)=\exp(Q(s_t, a_t)-V(s_t))$$  (34) 
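The relation between Equations (32) to (34) can be checked numerically for discrete actions. The sketch below assumes a small hypothetical table of Q values; `soft_value` and `soft_policy` are illustrative names. It computes $V$ as a log-sum-exp of $Q$ and the policy as $\exp(Q-V)$, which is a softmax over actions.

```python
import numpy as np

def soft_value(q_values):
    """V(s) = log sum_a exp(Q(s, a)), computed in a numerically stable way."""
    q_max = q_values.max(axis=-1, keepdims=True)
    log_sum = np.log(np.exp(q_values - q_max).sum(axis=-1, keepdims=True))
    return (q_max + log_sum).squeeze(-1)

def soft_policy(q_values):
    """pi(a | s) = exp(Q(s, a) - V(s)), as in Equation (34)."""
    return np.exp(q_values - soft_value(q_values)[..., None])

# Hypothetical Q table: two states, three actions.
q_values = np.array([[1.0, 2.0, 3.0],
                     [10.0, 0.0, -5.0]])
v = soft_value(q_values)
pi = soft_policy(q_values)
```

Each row of `pi` sums to one, and `v` is never smaller than the row maximum of `q_values`, with the gap bounded by the logarithm of the number of actions. This is the sense in which the soft maximum in Equation (33) approximates the hard maximum.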
Taking the logarithm of Equation (31), we get Equation (35)
$$\begin{aligned}
\log \beta(s_t, a_t) &= \log \int \beta(s_{t+1})\,p(s_{t+1} \mid s_t, a_t)\,p(O_t=1 \mid s_t, a_t)\,ds_{t+1} \\
&= \log \int \exp[r(s_t, a_t)+V(s_{t+1})]\,p(s_{t+1} \mid s_t, a_t)\,ds_{t+1} \\
&= r(s_t, a_t)+\log \int \exp(V(s_{t+1}))\,p(s_{t+1} \mid s_t, a_t)\,ds_{t+1}
\end{aligned}$$  (35) 
which reduces to the risk-seeking backup in Equation (36) as mentioned in [15]:
$$Q(s_t, a_t)=r(s_t, a_t)+\log E_{s_{t+1}\sim p(s_{t+1} \mid s_t, a_t)}[\exp(V(s_{t+1}))]$$  (36) 
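To illustrate why the backup in Equation (36) is called risk-seeking, the sketch below compares it with the usual expected-value backup $r+E[V]$ on a toy problem. The toy numbers and the function names are assumptions made for illustration. By Jensen's inequality, $\log E[\exp(V)] \ge E[V]$, so the backup in Equation (36) weights lucky transitions more heavily.

```python
import numpy as np

def risk_seeking_backup(reward, transition, next_value):
    """Q(s, a) = r(s, a) + log E_{s'}[exp(V(s'))], as in Equation (36)."""
    v_max = next_value.max()  # subtract the maximum for numerical stability
    log_expectation = v_max + np.log(transition @ np.exp(next_value - v_max))
    return reward + log_expectation

def expected_backup(reward, transition, next_value):
    """Q(s, a) = r(s, a) + E_{s'}[V(s')], the standard Bellman backup."""
    return reward + transition @ next_value

# Toy problem: two states, one action.
# From state 0 the agent reaches the bad state 0 with probability 0.9
# and the good state 1 with probability 0.1; state 1 is absorbing.
transition = np.array([[[0.9, 0.1]],
                       [[0.0, 1.0]]])     # shape [S, A, S]
reward = np.zeros((2, 1))
next_value = np.array([-10.0, 0.0])

q_risk = risk_seeking_backup(reward, transition, next_value)
q_expected = expected_backup(reward, transition, next_value)
```

For state 0, `q_expected` is -9 while `q_risk` is close to $\log 0.1 \approx -2.3$: the unlikely good outcome dominates the backup. For the deterministic state 1, both backups agree.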
IVD Variational approximation to ”optimal” policy
Since exact inference leads to unexpected behavior, approximate inference can be used instead. The optimization of the policy can be cast as a variational inference problem: we use a variational policy $q(a_t \mid s_t)$ for the action posterior, which can be represented by a neural network, to compose the variational proposal distribution over trajectories in Equation (37):
$$q(\tau)=p(s_1)\prod_t\left[p(s_{t+1} \mid s_t, a_t)\,q(a_t \mid s_t)\right]$$  (37) 
where the initial state distribution $p(s_1)$ and the environmental dynamics of state transition are kept intact. Using the proposal trajectory distribution as a pivot, we can derive the Evidence Lower Bound (ELBO) of the optimal trajectory as in Equation (38), which corresponds to an interesting objective function of reward plus entropy return, as in Equation (39).
$$\begin{aligned}
\log p(O_{1:T}=1) &= \log \int p(O_{1:T}=1, s_{1:T}, a_{1:T})\,\frac{q(s_{1:T}, a_{1:T})}{q(s_{1:T}, a_{1:T})}\,ds_{1:T}\,da_{1:T} \\
&= \log E_{q(s_{1:T}, a_{1:T})}\frac{p(O_{1:T}=1, s_{1:T}, a_{1:T})}{q(s_{1:T}, a_{1:T})} \\
&\ge E_{q(s_{1:T}, a_{1:T})}\left[\log p(O_{1:T}=1, s_{1:T}, a_{1:T})-\log q(s_{1:T}, a_{1:T})\right] \qquad (38) \\
&= -D_{KL}(q(\tau)\,\|\,p(\tau)) \qquad \text{(take } q(a_t \mid s_t)=\pi(a_t \mid s_t)\text{)} \\
&= E_{q(s_{1:T}, a_{1:T})}\left[\sum_{t=1}^{T}\left[r(s_t, a_t)-\log q(a_t \mid s_t)\right]\right] \\
&= \sum_{t=1}^{T} E_{s_t, a_t}\left[r(s_t, a_t)+H(\pi(a_t \mid s_t))\right]
\end{aligned}$$  (39) 
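The bound in Equations (38) and (39) can be checked by brute force on a tiny tabular problem. The sketch below is an illustration under several assumptions: finite states and actions, a time-independent policy, a trivial (constant) action prior that is dropped, and a short horizon so that all trajectories can be enumerated. The name `log_evidence_and_elbo` and the toy problem are hypothetical.

```python
import itertools
import numpy as np

def log_evidence_and_elbo(init, transition, reward, policy, horizon):
    """Enumerate all trajectories of a small tabular problem.

    Returns (log Z, ELBO) with
      Z    = sum_tau p(s_1) prod_t p(s_{t+1} | s_t, a_t) exp(sum_t r(s_t, a_t))
      ELBO = E_{q(tau)}[sum_t (r(s_t, a_t) - log q(a_t | s_t))]
    where q(tau) uses the same dynamics as in Equation (37).
    """
    n_states, n_actions = reward.shape
    evidence, elbo = 0.0, 0.0
    for states in itertools.product(range(n_states), repeat=horizon):
        for actions in itertools.product(range(n_actions), repeat=horizon):
            # Probability of the state sequence under the fixed dynamics.
            dyn = init[states[0]]
            for t in range(horizon - 1):
                dyn = dyn * transition[states[t], actions[t], states[t + 1]]
            total_reward = sum(reward[s, a] for s, a in zip(states, actions))
            log_q_actions = sum(np.log(policy[s, a]) for s, a in zip(states, actions))
            q_tau = dyn * np.exp(log_q_actions)
            evidence += dyn * np.exp(total_reward)
            elbo += q_tau * (total_reward - log_q_actions)
    return np.log(evidence), elbo

rng = np.random.default_rng(1)
init = np.array([0.6, 0.4])
transition = rng.dirichlet(np.ones(2), size=(2, 2))   # p(s' | s, a), shape [2, 2, 2]
reward = -rng.uniform(0.0, 1.0, size=(2, 2))          # r(s, a) <= 0
uniform_policy = np.full((2, 2), 0.5)
log_z, elbo_uniform = log_evidence_and_elbo(
    init, transition, reward, uniform_policy, horizon=3)
```

For any strictly positive policy the returned ELBO stays below `log_z`. The enumeration grows exponentially with the horizon, so this is only a sanity check and not a practical algorithm.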
IVE Examples
$$Q_{soft}^{\pi}(s, a)=r_0+E_{r\sim\pi,\,s_0=s,\,a_0=a}\left[\sum_{t=1}^{\infty}\gamma^t\left(r_t+\alpha H(\pi(\cdot \mid s_t))\right)\right]$$  (40) 
and a soft version of the Bellman update similar to Q-learning [30] is carried out, which leads to policy improvement with respect to the corresponding functional objective in Equation (41).
$$\begin{aligned}
J(\pi) &= \sum_t E_{(s_t, a_t)\sim\rho_\pi}\left[\sum_{l=t}^{\infty}\gamma^{l-t} E_{(s_l, a_l)}\left[r(s_l, a_l)+\alpha H(\pi(\cdot \mid s_l)) \mid s_t, a_t\right]\right] \\
&= \sum_t E_{(s_t, a_t)\sim\rho_\pi}\left[Q_{soft}^{\pi}(s_t, a_t)+\alpha H(\pi(\cdot \mid s_t))\right]
\end{aligned}$$  (41) 
Setting the policy as in Equation (34) leads to policy improvement. We offer a detailed proof of a key formula in Equation (42), which is stated in Equation (19) of [9] without proof. In Equation (42), we use $\pi(\cdot \mid s)$ to implicitly represent $\pi(a \mid s)$ to avoid symbol aliasing whenever necessary.
$$\begin{aligned}
H(\pi(\cdot \mid s))+E_{a\sim\pi}[Q_{soft}^{\pi}(s, a)] &= -\int_a \pi(a \mid s)\left[\log\pi(a \mid s)-Q_{soft}^{\pi}(s, a)\right]da \\
&= -\int_a \pi(a \mid s)\left[\log\pi(a \mid s)-\log\left[\exp(Q_{soft}^{\pi}(s, a))\right]\right]da \\
&= -\int_a \pi(a \mid s)\left[\log\pi(a \mid s)-\log\left[\frac{\exp(Q_{soft}^{\pi}(s, a))}{\int\exp(Q_{soft}^{\pi}(s, a'))\,da'}\int\exp(Q_{soft}^{\pi}(s, a'))\,da'\right]\right]da \\
&= -\int_a \pi(a \mid s)\left[\log\pi(a \mid s)-\log\tilde{\pi}(a \mid s)-\log\int\exp(Q_{soft}^{\pi}(s, a'))\,da'\right]da \\
&= -D_{KL}(\pi(\cdot \mid s)\,\|\,\tilde{\pi}(\cdot \mid s))+\log\int\exp(Q_{soft}^{\pi}(s, a'))\,da'
\end{aligned}$$  (42) 
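The identity in Equation (42) can be verified numerically for a discrete action space. In the sketch below, the Q values and the policy are random toy inputs and the function names are illustrative assumptions. The left-hand side is the entropy plus the expected soft Q value, and the right-hand side is the negative KL divergence to the softmax policy $\tilde{\pi}$ plus the log-partition term.

```python
import numpy as np

def entropy_plus_expected_q(pi, q):
    """Left-hand side of Equation (42): H(pi) + E_{a ~ pi}[Q(s, a)]."""
    return -(pi * np.log(pi)).sum() + (pi * q).sum()

def kl_form(pi, q):
    """Right-hand side of Equation (42): -KL(pi || pi_tilde) + log sum_a exp(Q)."""
    log_z = np.log(np.exp(q).sum())
    pi_tilde = np.exp(q - log_z)            # softmax policy induced by Q
    kl = (pi * (np.log(pi) - np.log(pi_tilde))).sum()
    return -kl + log_z, log_z

rng = np.random.default_rng(0)
q = rng.normal(size=4)                      # toy soft Q values for one state
pi = rng.dirichlet(np.ones(4))              # arbitrary strictly positive policy
lhs = entropy_plus_expected_q(pi, q)
rhs, log_z = kl_form(pi, q)
```

Since the KL term is non-negative, both sides are bounded above by `log_z`, and the bound is attained when `pi` equals the softmax policy. This is why setting the policy as in Equation (34) cannot decrease the objective.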
V Variational Inference on the Environment
Another direction of using Variational Inference in Reinforcement Learning is to learn an environmental model, either on the dynamics or the latent state space posterior, instead of approximating the maximum entropy policy posterior in [15], explained in Section IV.
VA Variational inference on transition model
In Variational Information Maximizing Exploration (VIME) [11], the dynamics model $p_\theta(s_{t+1} \mid s_t, a_t)$ of the agent's interaction with the environment is modeled using a Bayesian Neural Network [5]. The random variable for $\theta$ is denoted by $\Theta$ and is treated in a Bayesian way by modeling the weight uncertainty. We represent this model with the graphical model in Figure 8, which is not given in [11]. The belief about the environment is captured by the entropy of the posterior distribution over the neural network weights, $H(\Theta \mid \xi_t)$, based on trajectory observations $\xi_t=\{s_{1:t}, a_{1:t-1}\}$. The method encourages exploratory actions by maximizing the average information gain of the agent's belief about the environment after observing a new state $s_{t+1}$, which is $E_{p(s_{t+1} \mid \xi_t, a_t)} D_{KL}(p(\theta \mid \xi_{t+1})\,\|\,p(\theta \mid \xi_t))$. This is equivalent to the entropy minus the conditional entropy, $H(\Theta \mid \xi_t, a_t)-H(\Theta \mid \xi_t, a_t, s_{t+1})=H(\Theta \mid \xi_t, a_t)-H(\Theta \mid \xi_{t+1})$.
We derive in Equations (43) and (44) that the entropy difference is actually the average information gain, which is equal to the mutual information $I(\Theta, S_{t+1} \mid \xi_t, a_t)$ between the environmental parameter $\Theta$ and the new state $S_{t+1}$. Such a derivation is not given in [11].
$$\begin{aligned}
& H(\Theta \mid \xi_t, a_t)-H(\Theta \mid \xi_t, a_t, s_{t+1}) \\
&= -\int_\Theta p(\theta \mid \xi_t, a_t)\log p(\theta \mid \xi_t, a_t)\,d\theta+\int_\Theta\int_{\mathcal{S}} p(s_{t+1} \mid \xi_t, a_t)\,p(\theta \mid \xi_t, a_t, s_{t+1})\log p(\theta \mid \xi_t, a_t, s_{t+1})\,d\theta\,ds_{t+1} \\
&= -\int_\Theta\int_{\mathcal{S}} p(\theta, s_{t+1} \mid \xi_t, a_t)\log p(\theta \mid \xi_t)\,d\theta\,ds_{t+1}+\int_\Theta\int_{\mathcal{S}} p(s_{t+1} \mid \xi_t, a_t)\,p(\theta \mid \xi_t, a_t, s_{t+1})\log p(\theta \mid \xi_t, a_t, s_{t+1})\,d\theta\,ds_{t+1} \\
&= -\int_\Theta\int_{\mathcal{S}} p(s_{t+1} \mid \xi_t, a_t)\,p(\theta \mid \xi_t, a_t, s_{t+1})\log p(\theta \mid \xi_t)\,d\theta\,ds_{t+1}+\int_\Theta\int_{\mathcal{S}} p(s_{t+1} \mid \xi_t, a_t)\,p(\theta \mid \xi_{t+1})\log p(\theta \mid \xi_{t+1})\,d\theta\,ds_{t+1} \\
&= E_{p(s_{t+1} \mid \xi_t, a_t)} D_{KL}(p(\theta \mid \xi_{t+1})\,\|\,p(\theta \mid \xi_t)) \qquad (43) \\
&= \int_\Theta\int_{\mathcal{S}} p(s_{t+1} \mid \xi_t, a_t)\,p(\theta \mid \xi_{t+1})\log\left[\frac{p(\theta \mid \xi_{t+1})}{p(\theta \mid \xi_t)}\right]d\theta\,ds_{t+1} \\
&= \int_\Theta\int_{\mathcal{S}} p(s_{t+1}, \theta \mid \xi_t, a_t)\log\left[\frac{p(\theta \mid \xi_{t+1})}{p(\theta \mid \xi_t)}\,\frac{p(s_{t+1} \mid \xi_t, a_t)}{p(s_{t+1} \mid \xi_t, a_t)}\right]d\theta\,ds_{t+1} \\
&= \int_\Theta\int_{\mathcal{S}} p(s_{t+1}, \theta \mid \xi_t, a_t)\log\left[\frac{p(s_{t+1}, \theta \mid \xi_t, a_t)}{p(\theta \mid \xi_t)\,p(s_{t+1} \mid \xi_t, a_t)}\right]d\theta\,ds_{t+1} \\
&= I(\Theta, S_{t+1} \mid \xi_t, a_t)
\end{aligned}$$  (44) 
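The chain of equalities in Equations (43) and (44) can be checked numerically when both the parameter and the next state are discrete. The sketch below assumes a finite set of candidate parameters with a prior `prior` (standing in for $p(\theta \mid \xi_t)$) and a likelihood table `likelihood` (standing in for $p(s_{t+1} \mid \theta, \xi_t, a_t)$). These arrays and the function names are toy assumptions, not quantities from [11].

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a strictly positive discrete distribution."""
    return -(p * np.log(p)).sum()

def information_gain_terms(prior, likelihood):
    """Return (entropy difference, expected KL, mutual information).

    prior:      array [K], belief over K candidate parameters
    likelihood: array [K, S], probability of each next state under each parameter
    """
    joint = prior[:, None] * likelihood          # p(theta, s' | xi_t, a_t)
    predictive = joint.sum(axis=0)               # p(s' | xi_t, a_t)
    posterior = joint / predictive               # p(theta | xi_{t+1}), one column per s'
    n_next = likelihood.shape[1]
    entropy_diff = entropy(prior) - sum(
        predictive[s] * entropy(posterior[:, s]) for s in range(n_next))
    expected_kl = (joint * np.log(posterior / prior[:, None])).sum()
    mutual_info = (joint * np.log(joint / (prior[:, None] * predictive[None, :]))).sum()
    return entropy_diff, expected_kl, mutual_info

prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4]])
entropy_diff, expected_kl, mutual_info = information_gain_terms(prior, likelihood)
```

All three returned numbers agree up to floating-point error, which mirrors the first line, Equation (43), and Equation (44) of the derivation.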
Based on Equation (43), an intrinsic reward can be added to the environmental reward function, so the method can be combined with any existing reinforcement learning algorithm for exploration, for example TRPO [23]. Upon additional observation of the action $a_t$ and state $s_{t+1}$ pair on top of the trajectory history $\xi_t$, the posterior over the environmental parameter $\theta$, $p(\theta \mid \xi_t)$, can be updated to $p(\theta \mid \xi_{t+1})$ in a Bayesian way as derived in Equation (45), which was first proposed in [29].
$$\begin{aligned}
p(\theta \mid \xi_{t+1}) &= \frac{p(\theta, \xi_t, a_t, s_{t+1})}{p(\xi_t, a_t, s_{t+1})} \\
&= \frac{p(s_{t+1} \mid \theta, \xi_t, a_t)\,p(\theta, \xi_t, a_t)}{p(\xi_t, a_t, s_{t+1})} \\
&= \frac{p(s_{t+1} \mid \theta, \xi_t, a_t)\,p(\theta, \xi_t, a_t)}{p(a_t, \xi_t)\,p(s_{t+1} \mid a_t, \xi_t)} \\
&= \frac{p(s_{t+1} \mid \theta, \xi_t, a_t)\,p(\theta \mid \xi_t, a_t)}{p(s_{t+1} \mid a_t, \xi_t)} \\
&= \frac{p(s_{t+1} \mid \theta, \xi_t, a_t)\,p(\theta \mid \xi_t)}{p(s_{t+1} \mid a_t, \xi_t)}
\end{aligned}$$  (45) 
In Equation (45), the denominator can be written as Equation (46), so that the dynamics of the environment modeled by the neural network weights $\theta$, $p(s_{t+1} \mid \theta, a_t, \xi_t)$, can be used.
$$\begin{aligned}
p(s_{t+1} \mid a_t, \xi_t) &= \int_\Theta p(s_{t+1}, \theta \mid a_t, \xi_t)\,d\theta \\
&= \int_\Theta \frac{p(s_{t+1}, \theta, a_t, \xi_t)}{p(a_t, \xi_t)}\,d\theta \\
&= \int_\Theta \frac{p(s_{t+1} \mid \theta, a_t, \xi_t)\,p(\theta, a_t, \xi_t)}{p(a_t, \xi_t)}\,d\theta \\
&= \int_\Theta p(s_{t+1} \mid \theta, a_t, \xi_t)\,p(\theta \mid \xi_t)\,d\theta
\end{aligned}$$  (46) 
The last step of Equation (46) makes use of $p(\theta \mid \xi_t, a_t)=p(\theta \mid \xi_t)$.
Since the integral in Equation (46) is not tractable, a variational treatment of the posterior distribution over the neural network weights, $p(\theta \mid \xi_t)$, is used, characterized by the variational parameter $\varphi$, as shown by the dotted line in Figure 8. The variational posterior over the model parameter $\theta$, updated at each step, can then be used to calculate the intrinsic reward in Equation (43).
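When the parameter takes only finitely many values, the update in Equations (45) and (46) is tractable and can be written in a few lines. The sketch below is such a simplified illustration: `models` is a hypothetical table of two candidate next-state distributions for a single state-action pair, and `bayes_update` is an illustrative name. It also records the KL divergence between consecutive beliefs, which plays the role of the per-step information gain.

```python
import numpy as np

def bayes_update(prior, likelihoods):
    """One Bayesian update step over a finite set of candidate parameters.

    prior:       array [K], current belief over the candidates
    likelihoods: array [K], probability of the observed next state under each candidate
    """
    predictive = float(np.dot(likelihoods, prior))   # Equation (46) as a finite sum
    posterior = likelihoods * prior / predictive     # Equation (45)
    return posterior, predictive

# Two hypothetical candidate models of the next-state distribution.
models = np.array([[0.9, 0.1],
                   [0.5, 0.5]])
observed_next_states = [0, 0, 1, 0]

belief = np.array([0.5, 0.5])
info_gains = []
for s_next in observed_next_states:
    new_belief, _ = bayes_update(belief, models[:, s_next])
    # KL(new belief || old belief): information gained from this observation.
    info_gains.append(float((new_belief * np.log(new_belief / belief)).sum()))
    belief = new_belief
```

Updating one observation at a time gives the same belief as applying Bayes' rule once to the product of all likelihoods, which is a convenient sanity check for the sequential form.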
VB Variational Inference on hidden state posterior
In Variational State Tabulation (VaST) [7], the authors assume that the high-dimensional observed state is represented by the observable $O$, while transitions happen in the latent state space represented by $S$, which is finite and discrete. The authors assume a factorized form of the joint probability of observations and latent states, which we state explicitly in Equation (47).
$$p(O, S)=\pi_{\theta_0}(s_0)\prod_{t=0}^{T} p_{\theta^R}(o_t \mid s_t)\prod_{t=1}^{T} p_{\theta^T}(s_t \mid s_{t-1}, a_{t-1})$$  (47) 
Additionally, we characterize Equation (47) with the probabilistic graphical model in Figure 9, which does not exist in [7]. Compared to Figure 7, here the latent state $S$ lives in a discrete space instead of a high-dimensional one, and the observation is a high-dimensional image instead of a binary variable indicating optimal action. The variational posterior is assumed to take the factorized form in Equation (48).
$$q(S_{0:T} \mid O_{0:T})=\prod_{t=0}^{T} q_\varphi(S_t \mid O_{t-k:t})$$  (48) 
The authors assume the episode length to be $T$ and take the observations before the first frame to be blank frames by default. The Evidence Lower Bound (ELBO) of the observed trajectory under Equation (47) can be represented by a Variational AutoEncoder [8] like architecture, where the encoder $q_\varphi$, together with the reparametrization trick [8], maps the observed state $O$ into parameters of the Concrete distribution [17]. Backpropagation can then be used on deterministic variables to update the weights of the network based on the ELBO, which is decomposed into the different reconstruction losses of the variational autoencoder like architecture. Like VIME [11], VaST can be combined with other reinforcement learning algorithms. Here, prioritized sweeping [30] is carried out directly on the Heaviside activation of the encoder output, by counting transition frequencies, instead of waiting for the slowly learned environmental transition model $p_{\theta^T}(s_t \mid s_{t-1}, a_{t-1})$ in Equation (47). A potential problem of doing so is aliasing between the latent state $s$ and the observed state $o$. To alleviate this problem, the authors of [7] actively relabel the transition history in the replay memory once an observable is found to have been assigned a different discrete latent state.
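To make the role of the Concrete distribution and the Heaviside activation more tangible, the sketch below samples from a binary Concrete (relaxed Bernoulli) distribution [17] with NumPy. It is only an illustration under assumptions: the latent state is taken to be a vector of binary bits, `logits` stands in for a hypothetical encoder output, and none of the function names come from [7]. The relaxed sample is a deterministic, differentiable function of the logits and of parameter-free noise, which is what allows backpropagation, while thresholding the logits yields a discrete code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_binary_concrete(logits, temperature, rng):
    """Relaxed Bernoulli sample: sigmoid((logits + logistic noise) / temperature)."""
    u = rng.uniform(1e-8, 1.0 - 1e-8, size=logits.shape)
    logistic_noise = np.log(u) - np.log1p(-u)
    return sigmoid((logits + logistic_noise) / temperature)

def heaviside_code(logits):
    """Discrete bits obtained by thresholding the encoder logits at zero."""
    return (logits > 0).astype(int)

rng = np.random.default_rng(0)
logits = np.full(100_000, 0.8)                 # hypothetical encoder output, repeated
relaxed = sample_binary_concrete(logits, temperature=0.5, rng=rng)
fraction_above_half = float((relaxed > 0.5).mean())
bits = heaviside_code(np.array([-1.2, 0.3, 2.0]))
```

The fraction of relaxed samples above one half is close to `sigmoid(0.8)`, so the relaxation keeps the Bernoulli probability implied by the logits, and lower temperatures push the samples closer to 0 or 1.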
VI Conclusion
As a tutorial survey, this paper introduces the application of Probabilistic Graphical Models and Variational Inference in Deep Reinforcement Learning. We reformulate some key concepts in Reinforcement Learning with Probabilistic Graphical Models, summarize recent advances in Deep Reinforcement Learning, and compare some representative methods from different aspects. We offer detailed derivations and Probabilistic Graphical Models for those methods that use variational inference, which are not included in the original contributions.
References
 [1] (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058.
 [2] (2017) A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866.
 [3] (2006) Pattern recognition and machine learning. Springer.
 [4] (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877.
 [5] (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
 [6] (2018) Self-consistent trajectory autoencoder: hierarchical reinforcement learning with trajectory embeddings. arXiv preprint arXiv:1806.02813.
 [7] (2018) Efficient model-based deep reinforcement learning with variational state tabulation. arXiv preprint arXiv:1802.04325.
 [8] (2016) Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
 [9] (2017) Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1352–1361.
 [10] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
 [11] (2016) VIME: variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117.
 [12] (1998) Planning and acting in partially observable stochastic domains. Artificial Intelligence 101 (1-2), pp. 99–134.
 [13] (2002) Approximately optimal approximate reinforcement learning. In ICML, Vol. 2, pp. 267–274.
 [14] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 [15] (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909.
 [16] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
 [17] (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
 [18] (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
 [19] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
 [20] (2004) Reinforcement learning with factored states and actions. Journal of Machine Learning Research 5 (Aug), pp. 1063–1088.
 [21] (2015) Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320.
 [22] (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952.
 [23] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
 [24] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
 [25] (2014) Deterministic policy gradient algorithms.
 [26] (2019) High dimensional restrictive federated model selection with multi-objective Bayesian optimization over shifted distributions. arXiv preprint arXiv:1902.08999.
 [27] (2019) Variational resampling based assessment of deep neural networks under distribution shift.
 [28] (2019) ReinBo: machine learning pipeline search and configuration with Bayesian optimization embedded reinforcement learning.
 [29] (2011) Planning to be surprised: optimal Bayesian exploration in dynamic environments. In International Conference on Artificial General Intelligence, pp. 41–51.
 [30] (1998) Introduction to reinforcement learning. Vol. 2, MIT Press, Cambridge.
 [31] (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
 [32] (2016) Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
 [33] (2019) Maximum entropy-regularized multi-goal reinforcement learning. arXiv preprint arXiv:1905.08786.