Abstract
The information bottleneck principle is an elegant and useful approach torepresentation learning. In this paper, we investigate the problem ofrepresentation learning in the context of reinforcement learning using theinformation bottleneck framework, aiming at improving the sample efficiency ofthe learning algorithms. %by accelerating the process of discarding irrelevantinformation when the %input states are extremely highdimensional. Weanalytically derive the optimal conditional distribution of the representation,and provide a variational lower bound. Then, we maximize this lower bound withthe Stein variational (SV) gradient method. We incorporate this framework inthe advantageous actor critic algorithm (A2C) and the proximal policyoptimization algorithm (PPO). Our experimental results show that our frameworkcan improve the sample efficiency of vanilla A2C and PPO significantly.Finally, we study the information bottleneck (IB) perspective in deep RL withthe algorithm called mutual information neural estimation(MINE) . Weexperimentally verify that the information extractioncompression process alsoexists in deep RL and our framework is capable of accelerating this process. Wealso analyze the relationship between MINE and our method, through thisrelationship, we theoretically derive an algorithm to optimize our IB frameworkwithout constructing the lower bound.
Quick Read (beta)
Learning Representations in Reinforcement Learning: An Information Bottleneck Approach
Abstract
The information bottleneck principle in [25] is an elegant and useful approach to representation learning. In this paper, we investigate the problem of representation learning in the context of reinforcement learning using the information bottleneck framework, aiming at improving the sample efficiency of the learning algorithms. We analytically derive the optimal conditional distribution of the representation, and provide a variational lower bound. Then, we maximize this lower bound with the Stein variational (SV) gradient method (originally developed in [14, 15]). We incorporate this framework in the advantageous actor critic algorithm (A2C)[16] and the proximal policy optimization algorithm (PPO) [21]. Our experimental results show that our framework can improve the sample efficiency of vanilla A2C and PPO significantly. Finally, we study the information bottleneck (IB) perspective in deep RL with the algorithm called mutual information neural estimation(MINE) [3]. We experimentally verify that the information extractioncompression process also exists in deep RL and our framework is capable of accelerating this process. We also analyze the relationship between MINE and our method, through this relationship, we theoretically derive an algorithm to optimize our IB framework without constructing the lower bound.
1 Introduction
In training a reinforcement learning algorithm, an agent interacts with the environment, explores the (possibly unknown) state space, and learns a policy from the exploration sample data. In many cases, such samples are quite expensive to obtain (e.g., requires interactions with the physical environment). Hence, improving the sample efficiency of the learning algorithm is a key problem in RL and has been studied extensively in the literature. Popular techniques include experience reuse/replay, which leads to powerful offpolicy algorithms (e.g., [17, 23, 26, 18, 7]), and modelbased algorithms (e.g., [11, 12]). Moreover, it is known that effective representations can greatly reduce the sample complexity in RL. This can be seen from the following motivating example: In the environment of a classical Atari game: Seaquest, it may take dozens of millions samples to converge to an optimal policy when the input states are raw images (more than 28,000 dimensions), while it takes less samples when the inputs are 128dimension predefined RAM data[24]. Clearly, the RAM data contain much less redundant information irrelevant to the learning process than the raw images. Thus, we argue that an efficient representation is extremely crucial to the sample efficiency.
In this paper, we try to improve the sample efficiency in RL from the perspective of representation learning using the celebrated information bottleneck framework [25]. In standard deep learning, the experiments in [22] show that during the training process, the neural network first ”remembers” the inputs by increasing the mutual information between the inputs and the representation variables, then compresses the inputs to efficient representation related to the learning task by discarding redundant information from inputs (decreasing the mutual information between inputs and representation variables). We call this phenomena ”information extractioncompression process”(information EC process). Our experiments shows that, similar to the results shown in [22], we first (to the best of our knowledge) observe the information extractioncompression phenomena in the context of deep RL (we need to use MINE[3] for estimating the mutual information). This observation motivates us to adopt the information bottleneck (IB) framework in reinforcement learning, in order to accelerate the extractioncompression process. The IB framework is intended to explicitly enforce RL agents to learn an efficient representation, hence improving the sample efficiency, by discarding irrelevant information from raw input data. Our technical contributions can be summarized as follows:

1.
We observe that the ”information extractioncompression process” also exists in the context of deep RL (using MINE[3] to estimate the mutual information).

2.
We derive the optimization problem of our information bottleneck framework in RL. In order to solve the optimization problem, we construct a lower bound and use the Stein variational gradient method developed in [15] to optimize the lower bound.

3.
We show that our framework can accelerate the information extractioncompression process. Our experimental results also show that combining actorcritic algorithms (such as A2C, PPO) with our framework is more sampleefficient than their original versions.

4.
We analyze the relationship between our framework and MINE, through this relationship, we theoretically derive an algorithm to optimize our IB framework without constructing the lower bound.
Finally, we note that our IB method is orthogonal to other methods for improving the sample efficiency, and it is an interesting future work to incorporate it in other offpolicy and modelbased algorithms.
2 Related Work
Information bottleneck framework was first introduced in [25]. They solve the framework by iterative Blahut Arimoto algorithm, which is infeasible to apply to deep neural networks. [22] tries to open the black box of deep learning from the perspective of information bottleneck, though the method they use to compute the mutual information is not precise. [2] derives a variational information bottleneck framework, yet apart from adding prior target distribution of the representation distribution $P(ZX)$, they also assume that $P(ZX)$ itself must be a Gaussian distribution, which limits the capabilities of the representation function. [20] extends this framework to variational discriminator bottleneck to improve GANs[9], imitation learning and inverse RL.
As for improving sampleefficiency, [17, 26, 18] mainly utilize the experiencereuse. Besides experiencereuse, [23, 8] tries to learn a deterministic policy, [7] seeks to mitigate the delay of offpolicy. [11, 12] learn the environment model. Some other powerful techniques can be found in [5].
State representation learning has been studied extensively, readers can find some classic works in the overview [13]. Apart from this overview, [19] shows a theoretical foundation of maintaining the optimality of representation space. [4] proposes a new perspective on representation learning in RL based on geometric properties of the space of value function. [1] learns representation via information bottleneck(IB) in imitation/apprenticeship learning. To the best of our knowledge, there is no work that intends to directly use IB in basic RL algorithms.
3 Preliminaries
A Markov decision process(MDP) is a tuple, $(\mathcal{X},\mathcal{A},\mathcal{R},\mathcal{P},\mu )$, where $\mathcal{X}$ is the set of states, $\mathcal{A}$ is the set of actions, $\mathcal{R}:\mathcal{X}\times \mathcal{A}\times \mathcal{X}\to \mathbb{R}$ is the reward function, $\mathcal{P}:\mathcal{X}\times \mathcal{A}\times \mathcal{X}\to [0,1]$ is the transition probability function(where $P({X}^{{}^{\prime}}X,a)$ is the probability of transitioning to state ${X}^{{}^{\prime}}$ given that the previous state is $X$ and the agent took action $a$ in $X$), and $\mu :\mathcal{X}\to [0,1]$ is the starting state distribution. A policy $\pi :\mathcal{X}\to \mathcal{P}(\mathcal{A})$ is a map from states to probability distributions over actions, with $\pi (aX)$ denoting the probability of choosing action $a$ in state $X$.
In reinforcement learning, we aim to select a policy $\pi $ which maximizes $K(\pi )={\mathbb{E}}_{\tau \sim \pi}[{\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}\mathcal{R}({X}_{t},{a}_{t},{X}_{t+1})]$, with a slight abuse of notation we denote $\mathcal{R}({X}_{t},{a}_{t},{X}_{t+1})={r}_{t}$. Here $\gamma \in [0,1)$ is a discount factor, $\tau $ denotes a trajectory $({X}_{0},{a}_{0},{X}_{1},{a}_{1},\mathrm{\dots})$. Define the state value function as ${V}^{\pi}(X)={\mathbb{E}}_{\tau \sim \pi}[{\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{r}_{t}{X}_{0}=X]$, which is the expected return by policy $\pi $ in state $X$. And the stateaction value function ${Q}^{\pi}(X,a)={\mathbb{E}}_{\tau \sim \pi}[{\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{r}_{t}{X}_{0}=X,{a}_{0}=a]$ is the expected return by policy $\pi $ after taking action $a$ in state $X$.
Actorcritic algorithms take the advantage of both policy gradient methods and valuefunctionbased methods such as the wellknown A2C[16]. Specifically, in the case that policy $\pi (aX;\theta )$ is parameterized by $\theta $, A2C uses the following equation to approximate the real policy gradient ${\nabla}_{\theta}K(\pi )={\nabla}_{\theta}\widehat{J}(\theta )$:
$${\nabla}_{\theta}\widehat{J}(\theta )\approx \sum _{t=0}^{\mathrm{\infty}}{\nabla}_{\theta}[\mathrm{log}\pi ({a}_{t}{X}_{t};\theta )({R}_{t}b({X}_{t}))+{\alpha}_{2}H(\pi (\cdot {X}_{t}))]=\sum _{t=0}^{\mathrm{\infty}}{\nabla}_{\theta}\widehat{J}({X}_{t};\theta )$$  (1) 
where ${R}_{t}={\sum}_{i=0}^{\mathrm{\infty}}{\gamma}^{i}{r}_{t+i}$ is the accumulated return from time step $t$, $H(p)$ is the entropy of distribution $p$ and $b({X}_{t})$ is a baseline function, which is commonly replaced by ${V}^{\pi}({X}_{t})$.
A2C also includes the minimization of the mean square error between ${R}_{t}$ and value function ${V}^{\pi}({X}_{t})$. Thus in practice, the total objective function in A2C can be written as:
$J(\theta )\approx {\displaystyle \sum _{t=0}^{\mathrm{\infty}}}\mathrm{log}\pi ({a}_{t}{X}_{t};\theta )({R}_{t}{V}^{\pi}({X}_{t})){\alpha}_{1}\parallel {R}_{t}{V}^{\pi}({X}_{t}){\parallel}^{2}+{\alpha}_{2}H(\pi (\cdot {X}_{t}))={\displaystyle \sum _{t=0}^{\mathrm{\infty}}}J({X}_{t};\theta )$  (2) 
where ${\alpha}_{1},{\alpha}_{2}$ are two coefficients.
In the context of representation learning in RL, $J({X}_{t};\theta )$(including ${V}^{\pi}({X}_{t})$ and ${Q}^{\pi}({X}_{t},{a}_{t})$) can be replaced by $J({Z}_{t};\theta )$ where ${Z}_{t}$ is a learnable lowdimensional representation of state ${X}_{t}$. For example, given a representation function $Z\sim {P}_{\varphi}(\cdot X)$ with parameter $\varphi $, define ${{V}^{\pi}({Z}_{t};{X}_{t},\varphi )\mid}_{{Z}_{t}\sim {P}_{\varphi}(\cdot {X}_{t})}={V}^{\pi}({X}_{t})$. For simplicity, we write ${{V}^{\pi}({Z}_{t};{X}_{t},\varphi )\mid}_{{Z}_{t}\sim {P}_{\varphi}(\cdot {X}_{t})}$ as ${V}^{\pi}({Z}_{t})$.
4 Framework
4.1 Information Bottleneck in Reinforcement Learning
The information bottleneck framework is an information theoretical framework for extracting relevant information, or yielding a representation, that an input $X\in \mathcal{X}$ contains about an output $Y\in \mathcal{Y}$. An optimal representation of $X$ would capture the relevant factors and compress $X$ by diminishing the irrelevant parts which do not contribute to the prediction of $Y$. In a Markovian structure $X\to Z\to Y$ where $X$ is the input, $Z$ is representation of $X$ and $Y$ is the label of $X$, IB seeks an embedding distribution ${P}^{\star}(ZX)$ such that:
${P}^{\star}(ZX)$  $=\mathrm{arg}\underset{P(ZX)}{\mathrm{max}}I(Y,Z)\beta I(X,Z)=\mathrm{arg}\underset{P(ZX)}{\mathrm{max}}H(Y)H(YZ)\beta I(X,Z)$  
$=\mathrm{arg}\underset{P(ZX)}{\mathrm{max}}H(YZ)\beta I(X,Z)$  (3) 
for every $X\in \mathcal{X}$, which appears as the standard crossentropy loss^{1}^{1} 1 Mutual information $I(X,Y)$ is defined as $\int \mathit{d}X\mathit{d}ZP(X,Z)\mathrm{log}\frac{P(X,Z)}{P(X)P(Z)}$, conditional entropy $H(YZ)$ is defined as $\int \mathit{d}Y\mathit{d}ZP(Y,Z)\mathrm{log}P(YZ)$. In a binaryclassification problem, $\mathrm{log}P(YZ)=(1Y)\mathrm{log}(1\widehat{Y}(Z))Y\mathrm{log}(\widehat{Y}(Z))$. in supervised learning with a MIregularizer, $\beta $ is a coefficient that controls the magnitude of the regularizer.
Next we derive an information bottleneck framework in reinforcement learning. Just like the label $Y$ in the context of supervised learning as showed in (3), we assume the supervising signal $Y$ in RL to be the accurate value ${R}_{t}$ of a specific state ${X}_{t}$ for a fixed policy $\pi $, which can be approximated by an nstep bootstrapping function ${Y}_{t}={R}_{t}={\sum}_{i=0}^{n2}{\gamma}^{i}{r}_{t+i}+{\gamma}^{n1}{V}^{\pi}({Z}_{t+n1})$ in practice. Let $P(YZ)$ be the following distribution:
$$P({Y}_{t}{Z}_{t})\propto \mathrm{exp}(\alpha {({R}_{t}{V}^{\pi}({Z}_{t}))}^{2})$$  (4) 
.This assumption is heuristic but reasonable: If we have an input ${X}_{t}$ and its relative label ${Y}_{t}={R}_{t}$, we now have ${X}_{t}$’s representation ${Z}_{t}$, naturally we want to train our decision function ${V}^{\pi}({Z}_{t})$ to approximate the true label ${Y}_{t}$. If we set our target distribution to be $C\cdot \mathrm{exp}(\alpha {({R}_{t}{V}^{\pi}({Z}_{t}))}^{2})$, the probability decreases as ${V}^{\pi}({Z}_{t})$ gets far from ${Y}_{t}$ while increases as ${V}^{\pi}({Z}_{t})$ gets close to ${Y}_{t}$.
For simplicity, we just write $P(RZ)$ instead of $P({Y}_{t}{Z}_{t})$ in the following context.
With this assumption, equation (3) can be written as:
${P}^{\star}(ZX)$  $=\mathrm{arg}\underset{P(ZX)}{\mathrm{max}}{\mathbb{E}}_{X,R,Z\sim P(X,R,Z)}[\mathrm{log}P(RZ)]\beta I(X,Z)$  
$=\mathrm{arg}\underset{P(ZX)}{\mathrm{max}}{\mathbb{E}}_{X\sim P(X),Z\sim P(ZX),R\sim P(RZ)}[\alpha {(R{V}^{\pi}(Z))}^{2}]\beta I(X,Z)$  (5) 
The first term looks familiar with classic mean squared error in supervisd learning. In a network with representation parameter $\varphi $ and policyvalue parameter $\theta $, policy loss $\widehat{J}(Z;\theta )$ in equation(1) and IB loss in (5) can be jointly written as:
$$L(\theta ,\varphi )={\mathbb{E}}_{X\sim P(X),Z\sim {P}_{\varphi}(ZX)}[\underset{J(Z;\theta )}{\underset{\u23df}{\widehat{J}(Z;\theta )+{\mathbb{E}}_{R}[\alpha {(R{V}^{\pi}(Z;\theta ))}^{2}]}}]\beta I(X,Z;\varphi )$$  (6) 
where $I(X,Z;\varphi )$ denotes the MI between $X$ and $Z\sim {P}_{\varphi}(\cdot X)$. Notice that $J(Z;\theta )$ itself is a standard loss function in RL as showed in (2). Finally we get the ultimate formalization of IB framework in reinforcement learning:
$${P}_{{\varphi}^{*}}(ZX)=\mathrm{arg}\underset{{P}_{\varphi}(ZX)}{\mathrm{max}}{\mathbb{E}}_{X\sim P(X),Z\sim {P}_{\varphi}(ZX)}[J(Z;\theta )]\beta I(X,Z;\varphi )$$  (7) 
The following theorem shows that if the mutual information $I(X,Z)$ of our framework and common RL framework are close, then our framework is nearoptimality.
Theorem 1 (Nearoptimality theorem).
Policy ${\pi}^{r}\mathrm{=}{\pi}_{{\theta}^{r}}$, parameter ${\varphi}^{r}$, optimal policy ${\pi}^{\mathrm{\star}}\mathrm{=}{\pi}_{{\theta}^{\mathrm{\star}}}$ and its relevant representation parameter ${\varphi}^{\mathrm{\star}}$ are defined as following:
${\theta}^{r},{\varphi}^{r}=\mathrm{arg}\underset{\theta ,\varphi}{\mathrm{min}}{\mathbb{E}}_{{P}_{\varphi}(X,Z)}[\mathrm{log}{\displaystyle \frac{{P}_{\varphi}(ZX)}{{P}_{\varphi}(Z)}}{\displaystyle \frac{1}{\beta}}J(Z;\theta )]$  (8)  
${\theta}^{\star},{\varphi}^{\star}=\mathrm{arg}\underset{\theta ,\varphi}{\mathrm{min}}{\mathbb{E}}_{{P}_{\varphi}(X,Z)}[{\displaystyle \frac{1}{\beta}}J(Z;\theta )]$  (9) 
Define ${J}^{{\pi}^{r}}$ as ${\mathrm{E}}_{{P}_{{\varphi}^{r}}\mathit{}\mathrm{(}X\mathrm{,}Z\mathrm{)}}\mathit{}\mathrm{[}J\mathit{}\mathrm{(}Z\mathrm{;}{\theta}^{r}\mathrm{)}\mathrm{]}$ and ${J}^{{\pi}^{\mathrm{\star}}}$ as ${\mathrm{E}}_{{P}_{{\varphi}^{\mathrm{\star}}}\mathit{}\mathrm{(}X\mathrm{,}Z\mathrm{)}}\mathit{}\mathrm{[}J\mathit{}\mathrm{(}Z\mathrm{;}{\theta}^{\mathrm{\star}}\mathrm{)}\mathrm{]}$. Assume that for any $\u03f5\mathrm{>}\mathrm{0}$, $$, we have $$.
4.2 Target Distribution Derivation and Variational Lower Bound Construction
In this section we first derive the target distribution in (7) and then seek to optimize it by constructing a variational lower bound.
We would like to solve the optimization problem in (7):
$$\underset{{P}_{\varphi}(ZX)}{\mathrm{max}}{\mathbb{E}}_{X\sim P(X),Z\sim {P}_{\varphi}(ZX)}[\underset{{L}_{1}(\theta ,\varphi )}{\underset{\u23df}{J(Z;\theta )\beta \mathrm{log}{P}_{\varphi}(ZX)}}+\underset{{L}_{2}(\theta ,\varphi )}{\underset{\u23df}{\beta \mathrm{log}{P}_{\varphi}(Z)}}]$$  (10) 
Combining the derivative of ${L}_{1}$ and ${L}_{2}$ and setting their summation to 0, we can get that
$${P}_{\varphi}(ZX)\propto {P}_{\varphi}(Z)\mathrm{exp}(\frac{1}{\beta}J(Z;\theta ))$$  (11) 
We provide a rigorous derivation of (11) in the appendix(0.A.2). We note that though our derivation is over the representation space instead of the whole network parameter space, the optimization problem (10) and the resulting distribution (11) are quite similar to the one studied in [15] in the context of Bayesian inference. However, we stress that our formulation follows from the information bottleneck framework, and is mathematically different from that in [15]. In particular, the difference lies in the term ${L}_{2}$, which depends on the the distribution ${P}_{\varphi}(Z\mid X)$ we want to optimize (while in [15], the corresponding term is a fixed prior).
The following theorem shows that the distribution in (11) is an optimal target distribution (with respect to the IB objective $L$). The proof can be found in the appendix(0.A.3).
Theorem 2.
(Representation Improvement Theorem) Consider the objective function $L\mathit{}\mathrm{(}\theta \mathrm{,}\varphi \mathrm{)}\mathrm{=}{\mathrm{E}}_{X\mathrm{\sim}P\mathit{}\mathrm{(}X\mathrm{)}\mathrm{,}Z\mathrm{\sim}{P}_{\varphi}\mathit{}\mathrm{(}Z\mathrm{}X\mathrm{)}}\mathit{}\mathrm{[}J\mathit{}\mathrm{(}Z\mathrm{;}\theta \mathrm{)}\mathrm{]}\mathrm{}\beta \mathit{}I\mathit{}\mathrm{(}X\mathrm{,}Z\mathrm{;}\varphi \mathrm{)}$, given a fixed policyvalue parameter $\theta $, representation distribution ${P}_{\varphi}\mathit{}\mathrm{(}Z\mathrm{}X\mathrm{)}$ and state distribution $P\mathit{}\mathrm{(}X\mathrm{)}$. Define a new representation distribution: ${P}_{\widehat{\varphi}}\mathit{}\mathrm{(}Z\mathrm{}X\mathrm{)}\mathrm{\propto}{P}_{\varphi}\mathit{}\mathrm{(}Z\mathrm{)}\mathit{}\mathrm{exp}\mathit{}\mathrm{(}\frac{\mathrm{1}}{\beta}\mathit{}J\mathit{}\mathrm{(}Z\mathrm{;}\theta \mathrm{)}\mathrm{)}$. We have $L\mathit{}\mathrm{(}\theta \mathrm{,}\widehat{\varphi}\mathrm{)}\mathrm{\ge}L\mathit{}\mathrm{(}\theta \mathrm{,}\varphi \mathrm{)}$.
Though we have derived the optimal target distribution, it is still difficult to compute ${P}_{\varphi}(Z)$. In order to resolve this problem, we construct a variational lower bound with a distribution $U(Z)$ which is independent of $\varphi $. Notice that $\int \mathit{d}Z{P}_{\varphi}(Z)\mathrm{log}{P}_{\varphi}(Z)\ge \int \mathit{d}Z{P}_{\varphi}(Z)\mathrm{log}U(Z)$. Now, we can derive a lower bound of $L(\theta ,\varphi )$ in (6) as follows:
$L(\theta ,\varphi )$  $={\mathbb{E}}_{X,Z}[J(Z;\theta )\beta \mathrm{log}{P}_{\varphi}(ZX)]+\beta {\displaystyle \int \mathit{d}Z{P}_{\varphi}(Z)\mathrm{log}{P}_{\varphi}(Z)}$  
$\ge {\mathbb{E}}_{X,Z}[J(Z;\theta )\beta \mathrm{log}{P}_{\varphi}(ZX)]+\beta {\displaystyle \int \mathit{d}Z{P}_{\varphi}(Z)\mathrm{log}U(Z)}$  
$={\mathbb{E}}_{X\sim P(X),Z\sim {P}_{\varphi}(ZX)}[J(Z;\theta )\beta \mathrm{log}{P}_{\varphi}(ZX)+\beta \mathrm{log}U(Z)]=\widehat{L}(\theta ,\varphi )$  (12) 
Naturally the target distribution of maximizing the lower bound is:
$${P}_{\varphi}(ZX)\propto U(Z)\mathrm{exp}(\frac{1}{\beta}J(Z;\theta ))$$  (13) 
4.3 Optimization by Stein Variational Gradient Descent
Stein variational gradient descent(SVGD) is a nonparametric variational inference algorithm that leverages efficient deterministic dynamics to transport a set of particles ${\{{Z}_{i}\}}_{i=1}^{n}$ to approximate given target distributions $Q(Z)$. We choose SVGD to optimize the lower bound because of its ability to handle unnormalized target distributions such as (13).
Briefly, SVGD iteratively updates the “particles” ${\{{Z}_{i}\}}_{i=1}^{n}$ via a direction function ${\mathrm{\Phi}}^{\star}(\cdot )$ in the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathscr{H}$:
$${Z}_{i}\leftarrow {Z}_{i}+\u03f5{\mathrm{\Phi}}^{\star}({Z}_{i})$$  (14) 
where ${\mathrm{\Phi}}^{*}(\cdot )$ is chosen as a direction to maximally decrease^{2}^{2} 2 In fact, ${\mathrm{\Phi}}^{*}$ is chosen to maximize the directional derivative of $F(P)={\mathbb{D}}_{KL}(PQ)$, which appears to be the ”gradient” of $F$ the KL divergence between the particles’ distribution $P(Z)$ and the target distribution $Q(Z)=\frac{\widehat{Q}(Z)}{C}$($\widehat{Q}$ is unnormalized distribution, $C$ is normalized coefficient) in the sense that
$${\mathrm{\Phi}}^{\star}\leftarrow \mathrm{arg}\underset{\varphi \in \mathscr{H}}{\mathrm{max}}\{\frac{d}{d\u03f5}{\mathbb{D}}_{KL}({P}_{[\u03f5\varphi ]}Q)\mathit{\hspace{1em}}s.t.\mathit{\hspace{1em}}\parallel \mathrm{\Phi}{\parallel}_{\mathscr{H}}\le 1\}$$  (15) 
where ${P}_{[\u03f5\mathrm{\Phi}]}$ is the distribution of $Z+\u03f5\mathrm{\Phi}(Z)$ and $P$ is the distribution of $Z$. [14] showed a closed form of this direction:
$$\mathrm{\Phi}({Z}_{i})={\mathbb{E}}_{{Z}_{j}\sim P}[{\mathcal{K}({Z}_{j},{Z}_{i}){\nabla}_{\widehat{Z}}\mathrm{log}\widehat{Q}(\widehat{Z})\mid}_{\widehat{Z}={Z}_{j}}+{{\nabla}_{\widehat{Z}}\mathcal{K}(\widehat{Z},{Z}_{i})\mid}_{\widehat{Z}={Z}_{j}}]$$  (16) 
where $\mathcal{K}$ is a kernel function(typically an RBF kernel function). Notice that $C$ has been omitted.
In our case, we seek to minimize ${\mathbb{D}}_{KL}({P}_{\varphi}(\cdot X)\frac{U(\cdot )\mathrm{exp}(\frac{1}{\beta}J(\cdot ;\theta ))}{C}){\mid}_{C={\scriptscriptstyle \int \mathit{d}ZU(Z)\mathrm{exp}({\scriptscriptstyle \frac{1}{\beta}}J(Z;\theta ))}}$ , which is equivalent to maximize $\widehat{L}(\theta ,\varphi )$, the greedy direction yields:
$$\mathrm{\Phi}({Z}_{i})={\mathbb{E}}_{{Z}_{j}\sim {P}_{\varphi}(\cdot X)}[{\mathcal{K}({Z}_{j},{Z}_{i}){\nabla}_{\widehat{Z}}(\frac{1}{\beta}J(\widehat{Z};\theta )+\mathrm{log}U(\widehat{Z}))\mid}_{\widehat{Z}={Z}_{j}}+{{\nabla}_{\widehat{Z}}\mathcal{K}(\widehat{Z},{Z}_{i})\mid}_{\widehat{Z}={Z}_{j}}]$$  (17) 
In practice we replace $\mathrm{log}U(\widehat{Z})$ with $\zeta \mathrm{log}U(\widehat{Z})$ where $\zeta $ is a coefficient that controls the magnitude of ${\nabla}_{\widehat{Z}}\mathrm{log}U(\widehat{Z})$. Notice that $\mathrm{\Phi}({Z}_{i})$ is the greedy direction that ${Z}_{i}$ moves towards $\widehat{L}(\theta ,\varphi )$’s target distribution as showed in (13)(distribution that maximizes $\widehat{L}(\theta ,\varphi )$). This means $\mathrm{\Phi}({Z}_{i})$ is the gradient of $\widehat{L}({Z}_{i},\theta ,\varphi )$: $\frac{\partial \widehat{L}({Z}_{i},\theta ,\varphi )}{\partial {Z}_{i}}\propto \mathrm{\Phi}({Z}_{i})$.
Since our ultimate purpose is to update $\varphi $, by the chain rule, $\frac{\partial \widehat{L}({Z}_{i},\theta ,\varphi )}{\partial \varphi}\propto \mathrm{\Phi}({Z}_{i})\frac{\partial {Z}_{i}}{\partial \varphi}$. Then for $\widehat{L}(\theta ,\varphi )={\mathbb{E}}_{{P}_{\varphi}(X,Z)}[\widehat{L}(Z,\theta ,\varphi )]$:
$$\frac{\partial \widehat{L}(\theta ,\varphi )}{\partial \varphi}\propto {\mathbb{E}}_{X\sim P(X),{Z}_{i}\sim {P}_{\varphi}(\cdot X)}[\mathrm{\Phi}({Z}_{i})\frac{\partial {Z}_{i}}{\partial \varphi}]$$  (18) 
$\mathrm{\Phi}({Z}_{i})$ is given in equation(17). In practice we update the policyvalue parameter $\theta $ by common policy gradient algorithm since:
$$\frac{\partial \widehat{L}(\theta ,\varphi )}{\partial \theta}={\mathbb{E}}_{{P}_{\varphi}(X,Z)}[\frac{\partial J(Z;\theta )}{\partial \theta}]$$  (19) 
and update representation parameter $\varphi $ by (18).
4.4 Verify the information EC process with MINE
This section we verify that the information EC process exists in deep RL with MINE and our framework accelerates this process.
Mutual information neural estimation(MINE) is an algorithm that can compute mutual information(MI) between two high dimensional random variables more accurately and efficiently. Specifically, for random variables X and Z, assume $T$ to be a function of $X$ and $Z$, the calculation of $I(X,Z)$ can be transformed to the following optimization problem:
$$I(X,Z)=\underset{T}{\mathrm{max}}{\mathbb{E}}_{P(X,Z)}[T]\mathrm{log}({\mathbb{E}}_{P(X)\otimes P(Z)}[{\mathrm{exp}}^{T}])$$  (20) 
The optimal function ${T}^{\star}(X,Z)$ can be approximated by updating a neural network $T(X,Z;\eta )$.
With the aid of this powerful tool, we would like to visualize the mutual information between input state $X$ and its relative representation $Z$: Every a few update steps, we sample a batch of inputs and their relevant representations ${\{{X}_{i},{Z}_{i}\}}_{i=1}^{n}$ and compute their MI with MINE, every time we train MINE(update $\eta $) we just shuffle ${\{{Z}_{i}\}}_{i=1}^{n}$ and roughly assume the shuffled representations ${\{{Z}_{i}^{\text{shuffled}}\}}_{i=1}^{n}$ to be independent with ${\{{X}_{i}\}}_{i=1}^{n}$:
$$I(X,Z)\approx \underset{\eta}{\mathrm{max}}\frac{1}{n}\sum _{i=1}^{n}[T({X}_{i},{Z}_{i};\eta )]\mathrm{log}(\frac{1}{n}\sum _{i=1}^{n}[{\mathrm{exp}}^{T({X}_{i},{Z}_{i}^{\text{shuffled}};\eta )}])$$  (21) 
Figure(1) is the tensorboard graph of mutual information estimation between $X$ and $Z$ in Atari game Pong, xaxis is update steps and yaxis is MI estimation. More details and results can be found in appendix(0.A.6) and (0.A.7). As we can see, in both A2C with our framework and common A2C, the MI first increases to encode more information from inputs(”remember” the inputs), then decreases to drop irrelevant information from inputs(”forget” the useless information). And clearly, our framework extracts faster and compresses faster than common A2C as showed in figure(1)(b).
After completing the visualization of MI with MINE, we analyze the relationship between our framework and MINE. According to [3], the optimal function ${T}^{*}$ in (20) goes as follows:
$${\mathrm{exp}}^{{T}^{*}(X,Z;\eta )}=C\frac{{P}_{\varphi}(X,Z)}{P(X){P}_{\varphi}(Z)}\mathit{\hspace{1em}}s.t.\mathit{\hspace{1em}}C={\mathbb{E}}_{P(X)\otimes {P}_{\varphi}(Z)}[{\mathrm{exp}}^{{T}^{*}}]$$  (22) 
Combining the result with Theorem(2), we get:
${\mathrm{exp}}^{{T}^{*}(X,Z;\eta )}=C{\displaystyle \frac{{P}_{\varphi}(ZX)}{{P}_{\varphi}(Z)}}\propto \mathrm{exp}({\displaystyle \frac{1}{\beta}}J(Z;\theta ))$  (23) 
Through this relationship, we theoretically derive an algorithm that can directly optimize our framework without constructing the lower bound, we put this derivation in the appendix(0.A.5).
5 Experiments
In the experiments we show that our framework can improve the sample efficiency of basic RL algorithms(typically A2C and PPO). Other results can be found in last two appendices.
In A2C with our framework, we sample $Z$ by a network $\varphi (X,\u03f5)$ where $\u03f5\sim \mathcal{N}(\cdot ;0,0.1)$ and the number of samples from each state $X$ is $32$, readers are encouraged to take more samples if the computation resources are sufficient. We set the IB coefficient as $\beta =0.001$. We choose two prior distributions $U(Z)$ of our framework, the first one is uniform distribution, apparently when $U(Z)$ is the uniform distribution, ${{\nabla}_{\widehat{Z}}\mathrm{log}U(\widehat{Z})\mid}_{\widehat{Z}=Z}$ can be omitted. The second one is a Gaussian distribution, which is defined as follows: for a given state ${X}_{i}$, sample a batch of ${\{{Z}_{j}^{i}\}}_{j=1}^{n=32}$, then: $U(Z)=\mathcal{N}(Z;\mu =\frac{1}{n}{\sum}_{j=1}^{n}{Z}_{j}^{i},{\sigma}^{2}=\frac{1}{n}{\sum}_{j=1}^{n}{({Z}_{j}^{i}\mu )}^{2})$.
We also set $\zeta $ as ${0.005\parallel {\nabla}_{\widehat{Z}}\frac{1}{\beta}J(\widehat{Z};\theta )/{\nabla}_{\widehat{Z}}\mathrm{log}U(\widehat{Z})\parallel \mid}_{\widehat{Z}=Z}$ to control the magnitude of ${{\nabla}_{\widehat{Z}}\mathrm{log}U(\widehat{Z})\mid}_{\widehat{Z}=Z}$. Following [15], the kernel function in (17) we used is the Gaussian RBF kernel $\mathcal{K}({Z}_{i},{Z}_{j})=\mathrm{exp}({\parallel {Z}_{i}{Z}_{j}\parallel}^{2}/h)$ where $h=me{d}^{2}/2\mathrm{log}(n+1)$, $med$ denotes the median of pairwise distances between the particles ${\{{Z}_{j}^{i}\}}_{i=1}^{n}$. As for the hyperparameters in RL, we simply choose the default parameters in A2C of Openaibaselines. In summary, we implement the following four algorithms:
A2C with uniform SVIB: Use $\varphi (X,\u03f5)$ as the embedding function, optimize by our framework(algorithm(0.A.4)) with $U(Z)$ being uniform distribution.
A2C with Gaussian SVIB: Use $\varphi (X,\u03f5)$ as the embedding function, optimize by our framework(algorithm(0.A.4)) with $U(Z)$ being Gaussian distribution.
A2C:Regular A2C in Openaibaselines with $\varphi (X)$ as the embedding function.
A2C with noise(For fairness):A2C with the same embedding function $\varphi (X,\u03f5)$ as A2C with our framework.
Figure(2)(a)(e) show the performance of four A2Cbased algorithms in $5$ gym Atari games. We can see that A2C with our framework is more sampleefficient than both A2C and A2C with noise in nearly all 5 games.
Notice that in SpaceInvaders, A2C with Gaussian SVIB is worse. We suspect that this is because the agent excessively drops information from inputs that it misses some information related to the learning process. There is a more detailed experimental discussion about this phenomena in appendix(0.A.7) . We also implement four PPObased algorithms whose experimental settings are same as A2C except that we set the number of samples as $26$ for the sake of computation efficiency. Results can be found in the in figure(2)(f)(h).
6 Conclusion
We study the information bottleneck principle in RL: We propose an optimization problem for learning the representation in RL based on the informationbottleneck framework and derive the optimal form of the target distribution. We construct a lower bound and utilize Stein Variational gradient method to optimize it. Finally, we verify that the information extraction and compression process also exists in deep RL, and our framework can accelerate this process. We also theoretically derive an algorithm based on MINE that can directly optimize our framework and we plan to study it experimentally in the future work.
7 Acknowledgement
We thank professor Jian Li for paper writing correction.
References
 [1] (2019) State abstraction as compression in apprenticeship learning. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI Press, pp. 3134–3142. Cited by: §2.
 [2] (2016) Deep variational information bottleneck. In Proceedings of the International Conference on Learning Representations. Cited by: §2.
 [3] (2018) MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062. Cited by: Learning Representations in Reinforcement Learning: An Information Bottleneck Approach, item 1, §1, §4.4.
 [4] (2019) A geometric perspective on optimal representations for reinforcement learning. International Conference on Learning Representations. Cited by: §2.
 [5] (2019) Reinforcement learning, fast and slow. Trends in cognitive sciences 23 (5), pp. 408–422. Cited by: §2.
 [6] (2018) Reward constrained policy optimization. arXiv preprint arXiv:1805.11074. Cited by: Proof.
 [7] (2018) IMPALA: scalable distributed deeprl with importance weighted actorlearner architectures. arXiv preprint arXiv:1802.01561. Cited by: §1, §2.
 [8] (2018) Addressing function approximation error in actorcritic methods. arXiv preprint arXiv:1802.09477. Cited by: §2.
 [9] (2014) Generative adversarial nets. In International Conference on Neural Information Processing Systems, pp. 2672–2680. Cited by: §2.
 [10] (2017) Reinforcement learning with deep energybased policies. Proceedings of the 34th International Conference on Machine Learning 70, pp. 1352–1361. Cited by: §4.3.
 [11] (2018) Learning latent dynamics for planning from pixels. International Conference on Machine Learning, pp. 2555–2565. Cited by: §1, §2.
 [12] (2019) Modelbased reinforcement learning for atari. arXiv preprint arXiv:1903.00374. Cited by: §1, §2.
 [13] (2018) State representation learning for control: an overview. Neural Networks 108, pp. S0893608018302053–. Cited by: §2.
 [14] (2016) Stein variational gradient descent: a general purpose bayesian inference algorithm. Advances in Neural Information Processing Systems 29, pp. 2378–2386. Cited by: Learning Representations in Reinforcement Learning: An Information Bottleneck Approach, §4.3, §4.3.
 [15] (2017) Stein variational policy gradient. arXiv preprint arXiv:1704.02399. Cited by: Learning Representations in Reinforcement Learning: An Information Bottleneck Approach, item 2, §4.2, §4.3, §5.
 [16] (2016) Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning, pp. 1928–1937. Cited by: Learning Representations in Reinforcement Learning: An Information Bottleneck Approach, §3.
 [17] (2013) Playing atari with deep reinforcement learning. Computer Science. Cited by: §1, §2.
 [18] (2018) Dataefficient hierarchical reinforcement learning. Neural Information Processing Systems, pp. 3307–3317. Cited by: §1, §2.
 [19] (2018) Nearoptimal representation learning for hierarchical reinforcement learning. International Conference on Learning Representations. Cited by: §2.
 [20] (2018) Variational discriminator bottleneck: improving imitation learning, inverse rl, and gans by constraining information flow. International Conference on Learning Representations. Cited by: §2.
 [21] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: Learning Representations in Reinforcement Learning: An Information Bottleneck Approach.
 [22] (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810. Cited by: §0.A.6, §1, §2.
 [23] (2014) Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395. Cited by: §1, §2.
 [24] (2016) Learning from the memory of atari 2600. arXiv preprint arXiv:1605.01335. Cited by: §1.
 [25] (2000) The information bottleneck method. University of Illinois 411 (2930), pp. 368–377. Cited by: Learning Representations in Reinforcement Learning: An Information Bottleneck Approach, §1, §2.
 [26] (2015) Deep reinforcement learning with double qlearning. Computer ScienceInternational Conference on Machine Learning, pp. 2094–2100. Cited by: §1, §2.
Appendix 0.A Appendix
0.A.1 Proof of Theorem 1
Theorem.
(Theorem 1 restated)Policy ${\pi}^{r}\mathrm{=}{\pi}_{{\theta}^{r}}$, parameter ${\varphi}^{r}$, optimal policy ${\pi}^{\mathrm{\star}}\mathrm{=}{\pi}_{{\theta}^{\mathrm{\star}}}$ and its relevant representation parameter ${\varphi}^{\mathrm{\star}}$ are defined as following:
${\theta}^{r},{\varphi}^{r}=\mathrm{arg}\underset{\theta ,\varphi}{\mathrm{min}}{\mathbb{E}}_{{P}_{\varphi}(X,Z)}[\mathrm{log}{\displaystyle \frac{{P}_{\varphi}(ZX)}{{P}_{\varphi}(Z)}}{\displaystyle \frac{1}{\beta}}J(Z;\theta )]$  (24)  
${\theta}^{\star},{\varphi}^{\star}=\mathrm{arg}\underset{\theta ,\varphi}{\mathrm{min}}{\mathbb{E}}_{{P}_{\varphi}(X,Z)}[{\displaystyle \frac{1}{\beta}}J(Z;\theta )]$  (25) 
Define ${J}^{{\pi}^{r}}$ as ${\mathrm{E}}_{{P}_{{\varphi}^{r}}\mathit{}\mathrm{(}X\mathrm{,}Z\mathrm{)}}\mathit{}\mathrm{[}J\mathit{}\mathrm{(}Z\mathrm{;}{\theta}^{r}\mathrm{)}\mathrm{]}$ and ${J}^{{\pi}^{\mathrm{\star}}}$ as ${\mathrm{E}}_{{P}_{{\varphi}^{\mathrm{\star}}}\mathit{}\mathrm{(}X\mathrm{,}Z\mathrm{)}}\mathit{}\mathrm{[}J\mathit{}\mathrm{(}Z\mathrm{;}{\theta}^{\mathrm{\star}}\mathrm{)}\mathrm{]}$. Assume that for any $\u03f5\mathrm{>}\mathrm{0}$, $$, we have $$. Specifically, in valuebased algorithm, this theorem also holds between expectation of two value functions.
Proof.
From equation(24) we can get:
$$I(X,Z;{\varphi}^{\star})\frac{1}{\beta}{J}^{{\pi}^{\star}}\ge I(X,Z;{\varphi}^{r})\frac{1}{\beta}{J}^{{\pi}^{r}}$$  (26) 
From equation(25) we can get:
$$\frac{1}{\beta}{J}^{{\pi}^{r}}\ge \frac{1}{\beta}{J}^{{\pi}^{\star}}$$  (27) 
These two equations give us the following inequality:
$$\beta (I(X,Z;{\varphi}^{\star})I(X,Z;{\varphi}^{r}))\ge {J}^{{\pi}^{\star}}{J}^{{\pi}^{r}}\ge 0$$  (28) 
According to the assumption, naturally we have:
$$  (29) 
Notice that if we use our IB framework in valuebased algorithm, then the objective function ${J}^{\pi}$ can be defined as:
${J}^{\pi}={V}^{\pi}$  $={(1\gamma )}^{1}{\displaystyle {\int}_{X}}\mathit{d}X{d}^{\pi}(X){R}^{\pi}(X)$  
$={(1\gamma )}^{1}{\displaystyle {\int}_{X}}\mathit{d}X{d}^{\pi}(X)[{\displaystyle {\int}_{Z}}\mathit{d}Z{P}_{\varphi}(ZX){R}^{\pi}(Z)]$  (30) 
where ${R}^{\pi}\mathit{}\mathrm{(}Z\mathrm{)}\mathrm{=}{\mathrm{\int}}_{X\mathrm{\in}\mathrm{\{}{X}^{{}^{\mathrm{\prime}}}\mathrm{:}\varphi \mathit{}\mathrm{(}{X}^{{}^{\mathrm{\prime}}}\mathrm{)}\mathrm{=}Z\mathrm{\}}}\mathit{d}X\mathit{}{R}^{\pi}\mathit{}\mathrm{(}X\mathrm{)}$ and ${d}^{\pi}$ is the discounted future state distribution, readers can find detailed definition of ${d}^{\pi}$ in the appendix of [6]. We can get:
$$  (31) 
0.A.2 Target Distribution Derivation
We show the rigorous derivation of the target distribution in (11).
Denote $P$ as the distribution of $X$, ${P}_{\varphi}^{Z}(Z)={P}_{\varphi}(Z)$ as the distribution of $Z$. We use ${P}_{\varphi}$ as the short hand notation for the conditional distribution ${P}_{\varphi}(ZX)$. Moreover, we write $L(\theta ,\varphi )=L(\theta ,{P}_{\varphi})$ and ${\u27e8p,q\u27e9}_{X}$ = $\int \mathit{d}Xp(X)q(X)$. Notice that ${P}_{\varphi}^{Z}(Z)={\u27e8P(\cdot ),{P}_{\varphi}(Z\cdot )\u27e9}_{X}$. Take the functional derivative with respect to ${P}_{\varphi}$ of the first term ${L}_{1}$:
${\u27e8{\displaystyle \frac{\delta {L}_{1}(\theta ,{P}_{\varphi})}{\delta {P}_{\varphi}}},\mathrm{\Phi}\u27e9}_{XZ}$  $={\displaystyle \int \mathit{d}Z\mathit{d}X\frac{\delta {L}_{1}(\theta ,{P}_{\varphi}(ZX))}{\delta {P}_{\varphi}(ZX)}\mathrm{\Phi}(Z,X)}={\left[{\displaystyle \frac{d}{d\u03f5}}{L}_{1}(\theta ,{P}_{\varphi}+\u03f5\mathrm{\Phi})\right]}_{\u03f5=0}$  
$={\left[{\displaystyle \frac{d}{d\u03f5}}{\displaystyle \int}dXP(X){\u27e8{P}_{\varphi}(\cdot X)+\u03f5\mathrm{\Phi}(\cdot ,X),J(\cdot ;\theta )\beta \mathrm{log}({P}_{\varphi}(\cdot X)+\u03f5\mathrm{\Phi}(\cdot ,X))\u27e9}_{Z}\right]}_{\u03f5=0}$  
$={\displaystyle \int}dXP(X)[\u27e8\mathrm{\Phi}(\cdot ,X),J(\cdot ;\theta )\beta \mathrm{log}{P}_{\varphi}(\cdot X)\u27e9+{\u27e8{P}_{\varphi}(\cdot X),\beta {\displaystyle \frac{\mathrm{\Phi}(\cdot ,X)}{{P}_{\varphi}(\cdot X)}}\u27e9}_{Z}]$  
$={\u27e8P(\cdot )[J(\cdot ;\theta )\beta \mathrm{log}{P}_{\varphi}(\cdot \cdot )\beta ],\mathrm{\Phi}(\cdot ,\cdot )\u27e9}_{XZ}$ 
Hence, we can see that
$$\frac{\delta {L}_{1}(\theta ,{P}_{\varphi})}{\delta {P}_{\varphi}(ZX)}=P(X)[J(Z;\theta )\beta \mathrm{log}{P}_{\varphi}(ZX)\beta ].$$ 
Then we consider the second term. By the chain rule of functional derivative, we have that
$\frac{\delta {L}_{2}(\theta ,{P}_{\varphi})}{\delta {P}_{\varphi}(ZX)}$  $={\u27e8{\displaystyle \frac{\delta {L}_{2}(\theta ,{P}_{\varphi})}{\delta {P}_{\varphi}^{Z}(\cdot )}},{\displaystyle \frac{\delta {P}_{\varphi}^{Z}(\cdot )}{\delta {P}_{\varphi}(ZX)}}\u27e9}_{\widehat{Z}}=\beta {\u27e81+\mathrm{log}{P}_{\varphi}^{Z}(\cdot ),{\displaystyle \frac{\delta {P}_{\varphi}^{Z}(\cdot )}{\delta {P}_{\varphi}(ZX)}}\u27e9}_{\widehat{Z}}$  
$=\beta {\displaystyle \int \mathit{d}\widehat{Z}(1+\mathrm{log}{P}_{\varphi}^{Z}(\widehat{Z}))\delta (\widehat{Z}Z)P(X)}=\beta P(X)(1+\mathrm{log}{P}_{\varphi}^{Z}(Z))$  (32) 
Combining the derivative of ${L}_{1}$ and ${L}_{2}$ and setting their summation to 0, we can get that
$${P}_{\varphi}(ZX)\propto {P}_{\varphi}(Z)\mathrm{exp}(\frac{1}{\beta}J(Z;\theta ))$$  (33) 
0.A.3 Proof of Theorem 2
Theorem.
(Theorem 2 restated)For $L\mathit{}\mathrm{(}\theta \mathrm{,}\varphi \mathrm{)}\mathrm{=}{\mathrm{E}}_{X\mathrm{\sim}P\mathit{}\mathrm{(}X\mathrm{)}\mathrm{,}Z\mathrm{\sim}{P}_{\varphi}\mathit{}\mathrm{(}Z\mathrm{}X\mathrm{)}}\mathit{}\mathrm{[}J\mathit{}\mathrm{(}Z\mathrm{;}\theta \mathrm{)}\mathrm{]}\mathrm{}\beta \mathit{}I\mathit{}\mathrm{(}X\mathrm{,}Z\mathrm{;}\varphi \mathrm{)}$, given a fixed policyvalue parameter $\theta $, representation distribution ${P}_{\varphi}\mathit{}\mathrm{(}Z\mathrm{}X\mathrm{)}$ and state distribution $P\mathit{}\mathrm{(}X\mathrm{)}$, define a new representation distribution:${P}_{\widehat{\varphi}}\mathit{}\mathrm{(}Z\mathrm{}X\mathrm{)}\mathrm{\propto}{P}_{\varphi}\mathit{}\mathrm{(}Z\mathrm{)}\mathit{}\mathrm{exp}\mathit{}\mathrm{(}\frac{\mathrm{1}}{\beta}\mathit{}J\mathit{}\mathrm{(}Z\mathrm{;}\theta \mathrm{)}\mathrm{)}$, we have $L\mathit{}\mathrm{(}\theta \mathrm{,}\widehat{\varphi}\mathrm{)}\mathrm{\ge}L\mathit{}\mathrm{(}\theta \mathrm{,}\varphi \mathrm{)}$.
Proof.
Define $I\mathit{}\mathrm{(}X\mathrm{)}$ as:
$$I(X)={\int}_{Z}\mathit{d}Z{P}_{\widehat{\varphi}}(ZX)={\int}_{Z}\mathit{d}Z{P}_{\varphi}(Z)\mathrm{exp}(\frac{1}{\beta}J(Z;\theta ))$$  (34) 
$L(\theta ,\widehat{\varphi})$  $={\mathbb{E}}_{X}\{{\mathbb{E}}_{Z\sim {P}_{\widehat{\varphi}}(ZX)}[J(Z;\theta )]\beta {\mathbb{E}}_{Z\sim {P}_{\widehat{\varphi}}(ZX)}[\mathrm{log}{\displaystyle \frac{{P}_{\varphi}(Z)\mathrm{exp}(\frac{1}{\beta}J(Z;\theta ))}{I(X){P}_{\widehat{\varphi}}(Z)}}]\}$  
$={\mathbb{E}}_{X}\{\beta {\mathbb{E}}_{Z\sim {P}_{\widehat{\varphi}}(ZX)}[\mathrm{log}I(X)]\beta {\mathbb{E}}_{Z\sim {P}_{\widehat{\varphi}}(ZX)}[\mathrm{log}{\displaystyle \frac{{P}_{\varphi}(Z)}{{P}_{\widehat{\varphi}}(Z)}}]\}$  
$=\beta {\mathbb{E}}_{X}[\mathrm{log}I(X)]\beta {\mathbb{E}}_{X,Z\sim {P}_{\widehat{\varphi}}(X,Z)}[\mathrm{log}{\displaystyle \frac{{P}_{\varphi}(Z)}{{P}_{\widehat{\varphi}}(Z)}}]$  
$=\beta {\mathbb{E}}_{X}[\mathrm{log}I(X)]\beta {\mathbb{E}}_{Z\sim {P}_{\widehat{\varphi}}(Z)}[\mathrm{log}{\displaystyle \frac{{P}_{\varphi}(Z)}{{P}_{\widehat{\varphi}}(Z)}}]$  
$=\beta {\mathbb{E}}_{X\sim P(X)}[\mathrm{log}I(X)]+\beta {\mathbb{D}}_{KL}({P}_{\widehat{\varphi}}(Z){P}_{\varphi}(Z))$  (35)  
$L(\theta ,\varphi )$  $={\mathbb{E}}_{X}\{\beta {\mathbb{E}}_{Z\sim {P}_{\varphi}(ZX)}[\mathrm{log}\mathrm{exp}({\displaystyle \frac{1}{\beta}}J(Z;\theta ))]+\beta {\mathbb{E}}_{Z\sim {P}_{\varphi}(ZX)}\mathrm{log}{\displaystyle \frac{{P}_{\varphi}(Z)}{{P}_{\varphi}(ZX)}}\}$  
$={\mathbb{E}}_{X}\{\beta {\mathbb{E}}_{Z\sim {P}_{\varphi}(ZX)}[\mathrm{log}{\displaystyle \frac{{P}_{\varphi}(Z)\mathrm{exp}(\frac{1}{\beta}J(Z;\theta ))}{{P}_{\varphi}(ZX)I(X)}}]+\beta \mathrm{log}I(X)\}$  
$=\beta {\mathbb{E}}_{X}[\mathrm{log}I(X)]+\beta {\mathbb{E}}_{X\sim P(X),Z\sim {P}_{\varphi}(ZX)}[\mathrm{log}{\displaystyle \frac{{P}_{\widehat{\varphi}}(ZX)}{{P}_{\varphi}(ZX)}}]$  
$=\beta {\mathbb{E}}_{X\sim P(X)}[\mathrm{log}I(X)]\beta {\mathbb{E}}_{X\sim P(X)}[{\mathbb{D}}_{KL}({P}_{\varphi}(ZX){P}_{\widehat{\varphi}}(ZX))]$  (36)  
$L(\theta ,\widehat{\varphi})$  $L(\theta ,\varphi )=\beta {\mathbb{D}}_{KL}({P}_{\widehat{\varphi}}(Z){P}_{\varphi}(Z))+\beta {\mathbb{E}}_{X\sim P(X)}[{\mathbb{D}}_{KL}({P}_{\varphi}(ZX){P}_{\widehat{\varphi}}(ZX))]$  (37) 
According to the positivity of the KLdivergence, we have $L\mathit{}\mathrm{(}\theta \mathrm{,}\widehat{\varphi}\mathrm{)}\mathrm{\ge}L\mathit{}\mathrm{(}\theta \mathrm{,}\varphi \mathrm{)}$.
0.A.4 Algorithm
{algorithm}\[email protected]@algorithmic \STATE$\theta ,\varphi \leftarrow $ initialize network parameters \STATE$\beta ,\zeta \leftarrow $ initialize hyperparameters in (17) \STATE$\u03f5\leftarrow $ learning rate \STATE$M\leftarrow $ number of samples from ${P}_{\varphi}(\cdot X)$ \REPEAT\STATEDraw a batch of data ${\{{X}_{t},{a}_{t},{R}_{t},{X}_{t+1}\}}_{t=1}^{n}$ from environment \FOReach ${X}_{t}\in {\{{X}_{t}\}}_{t=1}^{n}$ \STATEDraw M samples ${\{{Z}_{i}^{t}\}}_{i=1}^{M}$ from ${P}_{\varphi}(\cdot {X}_{t})$ \ENDFOR\STATEGet the batch of data $\mathcal{D}={\{{X}_{t},{\{{Z}_{i}^{t}\}}_{i=1}^{M},{a}_{t},{R}_{t},{X}_{t+1}\}}_{t=1}^{n}$ \STATECompute the representation gradients ${\nabla}_{\varphi}L(\theta ,\varphi )$ in $\mathcal{D}$ according to (18) \STATECompute the RL gradients ${\nabla}_{\theta}L(\theta ,\varphi )$ in $\mathcal{D}$ according to (19) \STATEUpdate $\varphi $: $\varphi \leftarrow \varphi +\u03f5{\nabla}_{\varphi}L(\theta ,\varphi )$ \STATEUpdate $\theta $: $\theta \leftarrow \theta +\u03f5{\nabla}_{\theta}L(\theta ,\varphi )$ \UNTILConvergence
0.A.5 Integrate MINE to our framework
MINE can also be applied to the problem of minimizing the MI between Z and X where Z is generated by a neural network ${P}_{\varphi}(\cdot X)$:
$${I}^{\star}(X,Z)=\underset{\varphi}{\mathrm{min}}\underset{\eta}{\mathrm{max}}{\mathbb{E}}_{{P}_{\varphi}(X,Z)}[T(X,Z;\eta )]\mathrm{log}({\mathbb{E}}_{P(X)\otimes {P}_{\varphi}(Z)}[{\mathrm{exp}}^{T(X,Z;\eta )}])$$  (38) 
Apparently ${I}^{\star}(X,Z)$ is $0$ without any constraints, yet if we use MINE to optimize our IB framework in (6), ${I}^{\star}(X,Z)$ might not be $0$. With the help of MINE, like what people did in supervised learning, the objective function in (6) can be written as:
$L(\theta ,\varphi ,\eta )=$  $\underset{\eta}{\mathrm{min}}\{\underset{\text{RL loss term}}{\underset{\u23df}{{\mathbb{E}}_{{P}_{\varphi}(X,Z)}[J(Z;\theta )]}}\beta {\mathbb{E}}_{{P}_{\varphi}(X,Z)}[T(X,Z;\eta )]$  
$+\beta \mathrm{log}({\mathbb{E}}_{P(X)\otimes {P}_{\varphi}(Z)}[{\mathrm{exp}}^{T(X,Z;\eta )}])\}$  (39) 
The key steps to optimize $L(\theta ,\varphi ,\eta )$ is to update $\theta ,\varphi ,\eta $ iteratively as follows:
${\eta}_{t+1}\leftarrow \mathrm{arg}\underset{{\eta}_{t}}{\mathrm{min}}\underset{\text{mutual information term}}{\underset{\u23df}{{\mathbb{E}}_{{P}_{{\varphi}_{t}}(X,Z)}[T(X,Z;{\eta}_{t})]+\mathrm{log}({\mathbb{E}}_{P(X)\otimes {P}_{{\varphi}_{t}}(Z)}[{\mathrm{exp}}^{T(X,Z;{\eta}_{t})}])}}$  (40)  
${\theta}_{t+1},{\varphi}_{t+1}\leftarrow \mathrm{arg}\underset{{\theta}_{t},{\varphi}_{t}}{\mathrm{max}}L({\theta}_{t},{\varphi}_{t},{\eta}_{t+1})$  (41) 
Yet in our experiment, these updates of $\theta ,\varphi ,\eta $ did not work at all. We suspect it might be caused by the unstability of $T(X,Z;\eta )$. We have also found that the algorithm always tends to push parameter $\varphi $ to optimize the mutual information term to 0 regardless of the essential RL loss $J(Z;\theta )$: In the early stage of training process, policy $\pi $ is so bad that the agent is unable to get high reward, which means that the RL loss is extremely hard to optimize. Yet the mutual information term is relatively easier to optimize. Thus the consequence is that these updates tend to push parameter $\varphi $ to optimize the mutual information term to 0 in the early training stage.
However, MINE’s connection with our framework makes it possible to optimize $T(X,Z;\eta )$ in a more deterministic way and put the RL loss into mutual information term.
According to equation(23), $T(X,Z;\eta )=\frac{1}{\beta}J(Z;\theta )+{C}^{{}^{\prime}}$, this implies that we could introduce another function $\widehat{T}$ in place of $T(X,Z;\eta )$ for the sake of variance reduction:
$$T(X,Z;\eta )\leftarrow \frac{1}{\beta}J(Z;\theta )+\widehat{T}(X,Z;\eta )$$  (42) 
in the sense that parameter $\eta $ only needs to approximate the constant ${C}^{{}^{\prime}}(T)$, the optimization steps turn out to be:
${\theta}_{t+1},{\varphi}_{t+1},{\eta}_{t+1}\leftarrow $  $\mathrm{arg}\underset{{\theta}_{t},{\varphi}_{t}}{\mathrm{max}}\mathrm{arg}\underset{{\eta}_{t}}{\mathrm{min}}\{\beta {\mathbb{E}}_{{P}_{{\varphi}_{t}}(X,Z)}[\widehat{T}(X,Z;{\eta}_{t})]+$  
$\beta \mathrm{log}({\mathbb{E}}_{P(X)\otimes {P}_{{\varphi}_{t}}(Z)}[{\mathrm{exp}}^{\widehat{T}(X,Z;{\eta}_{t})+\frac{1}{\beta}J(Z;{\theta}_{t})}])\}$  (43) 
This algorithm has the following three potential advantages in summary:

1.
We’re able to optimize $T(X,Z;\eta )$ in a more deterministic way instead of the form like equation(40), which is hard to converge in reinforcement learning.

2.
We prevent excessive optimization in mutual information term by putting RL loss $J(Z;\theta )$ into this term.

3.
We’re able to directly optimize the IB framework without constructing a variational lower bound.
Here we just analyze and derive this algorithm theoretically, implementation and experiments will be left as the future work.
Notice that we say if we directly optimize our IB framework with MINE, one of the problems is that the function $T$ might be unstable in RL, yet in section(4.4) and experiments(0.A.6), we directly use MINE to visualize the MI. This is because when we optimize our framework, we need to start to update $T$ every training step. While when we visualize the MI, we start to update $T$ every 2000 training steps. Considering the computation efficiency, every time we start to update $T$ when we use MINE to optimize our framework, we must update $T$ in one step or a few steps. While when visualizing the MI, we update $T$ in 256 steps. Besides, we reset parameter $\eta $ every time we begin to update $T$ when we visualize the MI, clearly we can’t do this when optimizing our framework since it’s a minmax optimization problem.
0.A.6 Study the informationbottleneck perspective in RL
Now we introduce the experimental settings of MI visualization. And we show that the agent in RL usually tends to follow the information EC process.
We compare the MI($I(X,Z)$) between A2C and A2C with our framework. Every 2000 update steps($2560$ frames each step), we reinitialize the parameter $\eta $, then sample a batch of inputs and their relevant representations ${\{{X}_{i},{Z}_{i}=\varphi ({X}_{i},\u03f5)\}}_{i=1}^{n},n=64,$ and compute the MI with MINE. The learning rate of updating $\eta $ is same as openaibaselines’ A2C: $0.0007$, training steps is $256$ and the network architecture can be found in our code file ”policy.py”.
Figure(3) is the MI visualization in game Qbert. Note that there is a certain degree of fluctuations in the curve. This is because that unlike supervised learning, the distribution of datasets and learning signals ${R}^{\pi}(X)$ keep changing in reinforcement learning: ${R}^{\pi}(X)$ changes with policy $\pi $ and when $\pi $ gets better, the agent might find new states, in this case, $I(X,Z)$ might increase again because the agent needs to encode information from new states in order to learn a better policy. Yet finally, the MI always tends to decrease. Thus we can say that the agent in RL usually tends to follow the information EC process.
We argue that it’s unnecessary to compute $I(Z,Y)$ like [22]: According to (3), if the training loss continually decreases in supervised learning(Reward continually increases as showed in figure(2)(a) in reinforcement learning), $I(Z,Y)$ must increase gradually.
We also add some additional experimental results of MI visualization in the appendix(0.A.7).
0.A.7 Additional experimental results of performance and MI visualization
This section we add some additional experimental results about our framework.
Notice that in game MsPacman, performance of A2C with our framework is worse than regular A2C. According to the MI visualization of MsPacman in figure(5)(b), we suspect that this is because A2C with our framework drops the information from inputs so excessively that it misses some information relative to the learning process. To see it accurately, in figure(5)(b), the orange curve, which denotes A2C with our framework, from step(xaxis) 80 to 100, suddenly drops plenty of information. Meanwhile, in figure(4)(b), from step(xaxis) 80 to 100, the rewards of orange curve start to decrease.
As showed in figure(6), unlike Pong, Breakout, Qbert and some other shooting games, the frame of MsPacman contains much more information related to the reward: The walls, the ghosts and the tiny beans everywhere. Thus if the agent drops information too fast, it may hurt the performance.