A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment

  • 2019-10-08 15:25:58
  • Felix Leibfried, Sergio Pascual-Diaz, Jordi Grau-Moya
  • 0

Abstract

Empowerment is an information-theoretic method that can be used tointrinsically motivate learning agents. It attempts to maximize an agent'scontrol over the environment by encouraging visiting states with a large numberof reachable next states. Empowered learning has been shown to lead to complexbehaviors, without requiring an explicit reward signal. In this paper, weinvestigate the use of empowerment in the presence of an extrinsic rewardsignal. We hypothesize that empowerment can guide reinforcement learning (RL)agents to find good early behavioral solutions by encouraging highly empoweredstates. We propose a unified Bellman optimality principle for empowered rewardmaximization. Our empowered reward maximization approach generalizes bothBellman's optimality principle as well as recent information-theoreticalextensions to it. We prove uniqueness of the empowered values and showconvergence to the optimal solution. We then apply this idea to developoff-policy actor-critic RL algorithms which we validate in high-dimensionalcontinuous robotics domains (MuJoCo). Our methods demonstrate improved initialand competitive final performance compared to model-free state-of-the-arttechniques.

 

Quick Read (beta)

A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment

Felix Leibfried,   Sergio Pascual-Díaz,   Jordi Grau-Moya
PROWLER.io
Cambridge, UK
{felix,sergio.diaz,jordi}@prowler.io
Abstract

Empowerment is an information-theoretic method that can be used to intrinsically motivate learning agents. It attempts to maximize an agent’s control over the environment by encouraging visiting states with a large number of reachable next states. Empowered learning has been shown to lead to complex behaviors, without requiring an explicit reward signal. In this paper, we investigate the use of empowerment in the presence of an extrinsic reward signal. We hypothesize that empowerment can guide reinforcement learning (RL) agents to find good early behavioral solutions by encouraging highly empowered states. We propose a unified Bellman optimality principle for empowered reward maximization. Our empowered reward maximization approach generalizes both Bellman’s optimality principle as well as recent information-theoretical extensions to it. We prove uniqueness of the empowered values and show convergence to the optimal solution. We then apply this idea to develop off-policy actor-critic RL algorithms which we validate in high-dimensional continuous robotics domains (MuJoCo). Our methods demonstrate improved initial and competitive final performance compared to model-free state-of-the-art techniques.

 

A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment


  Felix Leibfried,   Sergio Pascual-Díaz,   Jordi Grau-Moya PROWLER.io Cambridge, UK {felix,sergio.diaz,jordi}@prowler.io

\@float

noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]

1 Introduction

In reinforcement learning [61] (RL), agents identify policies to collect as much reward as possible in a given environment. Recently, leveraging parametric function approximators has led to tremendous success in applying RL to high-dimensional domains such as Atari games [39] or robotics [55]. In such domains, inspired by the policy gradient theorem [62, 13], actor-critic approaches [35, 40] attain state-of-the-art results by learning both a parametric policy and a value function.

Empowerment is an information-theoretic framework where agents maximize the mutual information between an action sequence and the state that is obtained after executing this action sequence from some given initial state [26, 27, 52]. It turns out that the mutual information is highest for such initial states where the number of reachable next states is largest. Policies that aim for high empowerment can lead to complex behavior, e.g. balancing a pole in the absence of any explicit reward signal [23].

Despite progress on learning empowerment values with function approximators [41, 12, 48], there has been little attempt in the combination with reward maximization, let alone in utilizing empowerment for RL in the high-dimensional domains it has become applicable just recently. We therefore propose a unified principle for reward maximization and empowerment, and demonstrate that empowered signals can boost RL in large-scale domains such as robotics. In short, our contributions are:

  • a generalized Bellman optimality principle for joint reward maximization and empowerment,

  • a proof for unique values and convergence to the optimal solution for our novel principle,

  • empowered actor-critic methods boosting RL in MuJoCo compared to model-free baselines.

2 Background

2.1 Reinforcement Learning

In the discrete RL setting, an agent, being in state 𝒔𝒮, executes an action 𝒂𝒜 according to a behavioral policy πbehave(𝒂|𝒔) that is a conditional probability distribution πbehave:𝒮×𝒜[0,1]. The environment, in response, transitions to a successor state 𝒔𝒮 according to a (probabilistic) state-transition function 𝒫(𝒔|𝒔,𝒂), where 𝒫:𝒮×𝒜×𝒮[0,1]. Furthermore, the environment generates a reward signal r=(𝒔,𝒂) according to a reward function :𝒮×𝒜. The agent’s aim is to maximize its expected future cumulative reward with respect to the behavioral policy maxπbehave𝔼πbehave,𝒫[t=0γtrt], with t being a time index and γ(0,1) a discount factor. Optimal expected future cumulative reward values for a given state 𝒔 obey then the following recursion:

V(𝒔)=max𝒂((𝒔,𝒂)+γ𝔼𝒫(𝒔|𝒔,𝒂)[V(𝒔)])=:max𝒂Q(𝒔,𝒂), (1)

referred to as Bellman’s optimality principle [4], where V and Q are the optimal value functions.

2.2 Empowerment

Empowerment is an information-theoretic method where an agent executes a sequence of k actions 𝒂𝒜k when in state 𝒔𝒮 according to a policy πempower(𝒂|𝒔) which is a conditional probability distribution πempower:𝒮×𝒜k[0,1]. This is slightly more general than in the RL setting where only a single action is taken upon observing a certain state. The agent’s aim is to identify an optimal policy πempower that maximizes the mutual information I[𝑨,𝑺|𝒔] between the action sequence 𝒂 and the state 𝒔 to which the environment transitions after executing 𝒂 in 𝒔, formulated as:

E(𝒔)=maxπempowerI[𝑨,𝑺|𝒔]=maxπempower𝔼πempower(𝒂|𝒔)𝒫(k)(𝒔|𝒔,𝒂)[logp(𝒂|𝒔,𝒔)πempower(𝒂|𝒔)]. (2)

Here, E(𝒔) refers to the optimal empowerment value and 𝒫(k)(𝒔|𝒔,𝒂) to the probability of transitioning to 𝒔 after executing the sequence 𝒂 in state 𝒔, where 𝒫(k):𝒮×𝒜k×𝒮[0,1]. Importantly, p(𝒂|𝒔,𝒔)=𝒫(k)(𝒔|𝒔,𝒂)πempower(𝒂|𝒔)𝒂𝒫(k)(𝒔|𝒔,𝒂)πempower(𝒂|𝒔) is the inverse dynamics model of πempower. The implicit dependency of p on the optimization argument πempower renders the problem non-trivial.

From an information-theoretic perspective, optimizing for empowerment is equivalent to maximizing the capacity [57] of an information channel 𝒫(k)(𝒔|𝒔,𝒂) with input 𝒂 and output 𝒔 w.r.t. the input distribution πempower(𝒂|𝒔), as outlined in the following [11, 10]. Define the functional If(πempower,𝒫(k),q):=𝔼πempower(𝒂|𝒔)𝒫(k)(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)πempower(𝒂|𝒔)], where q is a conditional probability q:𝒮×𝒮×𝒜k[0,1]. Then the mutual information is recovered as a special case of If with I[𝑨,𝑺|𝒔]=maxqIf(πempower,𝒫(k),q) for a given πempower. The maximum argument

q(𝒂|𝒔,𝒔)=𝒫(k)(𝒔|𝒔,𝒂)πempower(𝒂|𝒔)𝒂𝒫(k)(𝒔|𝒔,𝒂)πempower(𝒂|𝒔) (3)

is the true Bayesian posterior p(𝒂|𝒔,𝒔)—see [10] Lemma 10.8.1 for details. Similarly, maximizing If(πempower,𝒫(k),q) with respect to πempower for a given q leads to:

πempower(𝒂|𝒔)=exp(𝔼𝒫(k)(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)])𝒂exp(𝔼𝒫(k)(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)]). (4)

As explained e.g. in [10] page 335 similar to [45]. The above yields the subsequent proposition.

Proposition 1

Maximum Channel Capacity. Iterating through Equations (3) and (4) by computing q given πempower and vice versa in an alternating fashion converges to an optimal pair (q,πempower) that maximizes the mutual information maxπempowerI[𝐀,𝐒|𝐬]=If(πempower,P(k),q). The convergence rate is O(1/N), where N is the number of iterations, for any initial πempowerini with support in Ak𝐬—see [10] Chapter 10.8 and [11, 16]. This is known as Blahut-Arimoto algorithm [2, 7].

Remark. Empowerment is similar to curiosity concepts of predictive information that focus on the mutual information between the current and the subsequent state [6, 47, 68, 60, 42, 53].

3 Motivation: Combining Reward Maximization with Empowerment

The Blahut-Arimoto algorithm presented in the previous section solves empowerment for low-dimensional discrete settings but does not readily scale to high-dimensional or continuous state-action spaces. While there has been progress on learning empowerment values with parametric function approximators [41], how to combine it with reward maximization or RL remains open. In principle, there are two possibilities for utilizing empowerment. The first is to directly use the policy πempower obtained in the course of learning empowerment values E(𝒔). The second is to train a behavioral policy to take an action in each state such that the expected empowerment value of the next state is highest (requiring E-values as a prerequisite). Note that the two possibilities are conceptually different. The latter seeks states with a large number of reachable next states [23]. The first, on the other hand, aims for high mutual information between actions and the subsequent state, which is not necessarily the same as seeking highly empowered states [41].

We hypothesize empowered signals to be beneficial for RL, especially in high-dimensional environments and at the beginning of the training process when the initial policy is poor. In this work, we therefore combine reward maximization with empowerment inspired by the two behavioral possibilities outlined in the previous paragraph. Hence, we focus on the cumulative RL setting rather than the non-cumulative setting that is typical for empowerment. We furthermore use one-step empowerment as a reference, i.e. k=1, because cumulative one-step empowerment learning leads to high values in such states where the number of possibly reachable next states is high, and preserves hence the original empowerment intuition without requiring a multi-step policy—see Section 4.3. The first idea is to train a policy that trades off reward maximization and learning cumulative empowerment:

maxπbehave𝔼πbehave,𝒫[t=0γt(α(𝒔t,𝒂t)+βlogp(𝒂t|𝒔t+1,𝒔t)πbehave(𝒂t|𝒔t))], (5)

where α0 and β0 are scaling factors, and p indicates the inverse dynamics model of πbehave in line with Equation (3). Note that p depends on the optimization argument πbehave, similar to ordinary empowerment, leading to a non-trivial Markov decision problem (MDP).

The second idea is to learn cumulative empowerment values a priori by solving Equation (5) with α=0 and β=1. The outcome of this is a policy πempower (and its inverse dynamics model p) that can be used to construct an intrinsic reward signal which is then added to the external reward:

maxπbehave𝔼πbehave,𝒫[t=0γt(α(𝒔t,𝒂t)+β𝔼πempower(𝒂|𝒔t)𝒫(𝒔|𝒔t,𝒂)[logp(𝒂|𝒔,𝒔t)πempower(𝒂|𝒔t)])]. (6)

Importantly, Equation (6) poses an ordinary MDP since the reward signal is merely extended by another stationary state-dependent signal.

Both proposed ideas require to solve the novel MDP as specified in Equation (5). In Section 4, we therefore prove the existence of unique values and convergence of the corresponding value iteration scheme (including a grid world example). We also show how our formulation generalizes existing formulations from the literature. In Section 5, we carry our ideas over to high-dimensional continuous state-action spaces by devising off-policy actor-critic-style algorithms inspired by the proposed MDP formulation. We evaluate our novel actor-critic-style algorithms in MuJoCo demonstrating better initial and competitive final performance compared to model-free state-of-the-art baselines.

4 Joint Reward Maximization and Empowerment Learning in MDPs

We state our main theoretical result in advance, proven in the remainder of this section (an intuition follows): the solution to the MDP from Equation (5) implies unique optimal values V obeying the Bellman recursion

V(𝒔)=maxπbehave𝔼πbehave,𝒫[t=0γt(α(𝒔t,𝒂t)+βlogp(𝒂t|𝒔t+1,𝒔t)πbehave(𝒂t|𝒔t))|𝒔0=𝒔]=maxπbehave,q𝔼πbehave(𝒂|𝒔)[α(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[βlogq(𝒂|𝒔,𝒔)πbehave(𝒂|𝒔)+γV(𝒔)]]=βlog𝒂exp(αβ(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γβV(𝒔)]), (7)

where

q(𝒂|𝒔,𝒔)=𝒫(𝒔|𝒔,𝒂)πbehave(𝒂|𝒔)𝒂𝒫(𝒔|𝒔,𝒂)πbehave(𝒂|𝒔)=p(𝒂|𝒔,𝒔) (8)

is the inverse dynamics model of the optimal behavioral policy πbehave that assumes the form:

πbehave(𝒂|𝒔)=exp(αβ(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γβV(𝒔)])𝒂exp(αβ(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γβV(𝒔)]), (9)

where the denominator is just exp((1/β)V(𝒔)). While the remainder of this section explains how Equations (7) to (9) are derived in detail, it can be insightful to understand at a high level what makes our formulation non-trivial. The difficulty is that the inverse dynamics model p=q depends on the optimal policy πbehavioral and vice versa leading to a non-standard optimal value identification problem. Proving the existence of V-values and how to compute them poses therefore our main theoretical contribution, and implies the existence of at least one (q,πbehave)-pair that satisfies the recursive relationship of Equations (8) and (9). This proof is given in Section 4.1 and leads naturally to a value iteration scheme to compute optimal values in practice. The convergence of this scheme is proven in Section 4.2 and we also demonstrate value learning in a grid world example—see Section 4.3. In Section 4.4, we elucidate how our formulation generalizes and relates to existing MDP formulations.

4.1 Existence of Unique Optimal Values

Following the second line from Equation (7), let’s define the Bellman operator B:|𝒮||𝒮| as

BV(𝒔):=maxπbehave,q𝔼πbehave(𝒂|𝒔)[α(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[βlogq(𝒂|𝒔,𝒔)πbehave(𝒂|𝒔)+γV(𝒔)]]. (10)
Theorem 1

Existence of Unique Optimal Values. Assuming a bounded reward function R, the optimal value vector V as given in Equation (7) exists and is a unique fixed point V=BV of the Bellman operator B from Equation (10).

Proof. The proof of Theorem 1 comprises three steps. First, we prove for a given (q,πbehave)-pair the existence of unique values V(q,πbehave) which obey the following recursion

V(q,πbehave)(𝒔)=𝔼πbehave(𝒂|𝒔)[α(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[βlogq(𝒂|𝒔,𝒔)πbehave(𝒂|𝒔)+γV(q,πbehave)(𝒔)]]. (11)

This result is obtained through Proposition 2 following [5, 50, 18] where we show that the value vector V(q,πbehave) is a unique fixed point of the operator Bq,πbehave:|𝒮||𝒮| given by

Bq,πbehaveV(𝒔):=𝔼πbehave(𝒂|𝒔)[α(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[βlogq(𝒂|𝒔,𝒔)πbehave(𝒂|𝒔)+γV(𝒔)]]. (12)

Second, we prove in Proposition 3 that solving the right hand side of Equation (10) for the pair (q,πbehave) can be achieved with a Blahut-Arimoto-style algorithm in line with [16]. Third, we complete the proof in Proposition 4 based on Proposition 2 and 3 by showing that V=maxπbehave,qV(q,πbehave), where the vector-valued max-operator is well-defined because both πbehave and q are conditioned on 𝒔. The proof completion follows again [5, 50, 18].

Proposition 2

Existence of Unique Values for a Given (q,πbehave)-Pair. Assuming a bounded reward function R, the value vector V(q,πbehave) as given in Equation (11) exists and is a unique fixed point V(q,πbehave)=Bq,πbehaveV(q,πbehave) of the Bellman operator Bq,πbehave from Equation (12).

As opposed to the Bellman operator B, the operator Bq,πbehave does not include a max-operation that incurs a non-trivial recursive relationship between optimal arguments. The proof for existence of unique values follows hence standard methodology [5, 50, 18] and is given in Appendix A.1.

Proposition 3

Blahut-Arimoto for One Value Iteration Step. Assuming that R is bounded, the maximization problem maxπbehave,q from Equation (10) in the Bellman operator B can be solved for (q,πbehave) by iterating through the following two equations in an alternating fashion:

q(m)(𝒂|𝒔,𝒔)=𝒫(𝒔|𝒔,𝒂)πbehave(m)(𝒂|𝒔)𝒂𝒫(𝒔|𝒔,𝒂)πbehave(m)(𝒂|𝒔), (13)
πbehave(m+1)(𝒂|𝒔)=exp(αβ(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(m)(𝒂|𝒔,𝒔)+γβV(𝒔)])𝒂exp(αβ(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(m)(𝒂|𝒔,𝒔)+γβV(𝒔)]), (14)

where m is the iteration index. The convergence rate is O(1/M) for arbitrary initial πbehave(0) with support in A𝐬. M is the total number of iterations. The complexity for a single 𝐬 is O(M|S||A|).

Proof Outline. The problem in Proposition 3 is mathematically similar to the maximum channel capacity problem [57] from Proposition 1 and proving convergence follows similar steps that we outline here—details can be found in Appendix A.2. First, we prove that optimizing the right-hand side of Equation (10) w.r.t. q for a given πbehave results in Equation (13) according to [10] Lemma 10.8.1. Second, we prove that optimizing w.r.t. πbehave for a given q results in Equation (14) following standard techniques from variational calculus and Lagrange multipliers. Third, we prove convergence to a global maximum when iterating alternately through Equations (13) and (14) following [16].

Proposition 4

Completing the Proof of Theorem 1. The optimal value vector is given by V=maxπbehave,qV(q,πbehave) and is a unique fixed point V=BV of the Bellman operator B.

Completing the proof of Theorem 1 requires two ingredients: the existence of unique V(q,πbehave)-values for any (q,πbehave)-pair as proven in Proposition 2, and the fact that the optimal Bellman operator can be expressed as B=maxπbehave,qBq,πbehave where maxπbehave,q is the max-operator from Proposition 3. The proof follows then standard methodology [5, 50, 18], see Appendix A.3.

4.2 Value Iteration and Convergence to Optimal Values

In the previous section, we have proven the existence of unique optimal values V that are a fixed point of the Bellman operator B. This section devises a value iteration scheme based on the operator B and proves its convergence. We commence by a corollary to express B more concisely.

Corollary 1

Optimal Bellman Operator. The operator B from Equation (10) can be written as

BV(𝒔)=βlog𝒂exp(αβ(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logqconverged(𝒂|𝒔,𝒔)+γβV(𝒔)]), (15)

where qconverged(𝐚|𝐬,𝐬) is the result of the converged Blahut-Arimoto scheme from Proposition 3.

This result is obtained by plugging the converged solution πbehaveconverged from Equation (14) into Equation (10) and leads naturally to a two-level value iteration algorithm that proceeds as follows: the outer loop updates the values V by applying Equation (15) repeatedly; the inner loop applies the Blahut-Arimoto algorithm from Proposition 3 to identify qconverged required for the outer value update.

Theorem 2

Convergence to Optimal Values. Assuming bounded R and let ϵR be a positive number such that ϵ<η1-γ where η=αmax𝐬,𝐚|R(𝐬,𝐚)|+βlog|A|. If the value iteration scheme with initial values of V(𝐬)=0𝐬 is run for ilogγϵ(1-γ)η iterations, then V-B(i)Vϵ, where the notation B(i)V means to apply B to V i-times consecutively.

Proof. Via a sequence of inequalities, one can show that the following holds true: V-B(i)VγV-B(i-1)VγiV-Vγi11-γη—see Appendix A.4 for a more detailed derivation. This implies that if ϵγi11-γη then ilogγϵ(1-γ)η presupposing ϵ<η1-γ.

Conclusion. Together, Theorems 1 and 2 prove that our proposed value iteration scheme convergences to optimal values V in combination with a corresponding optimal pair (q,πbehave) as described at the beginning of this section in the third line of Equation (7) and in Equations (8) and (9) respectively. The overal complexity is O(iM|S|2|A|) where i and M refer to outer and inner iterations.

Remark. Our value iteration is required for both objectives from Section 3 to combine reward maximization with empowerment. Equation (5) motivated our scheme in the first place, whereas Equation (6) requires cumulative empowerment values without reward maximization (α=0, β=1).

4.3 Practical Verification in a Grid World Example

In order to practically verify our value iteration scheme from the previous section, we conduct experiments on a grid world example. The outcome is shown in Figure 1 demonstrating how different configurations for α and β, that steer cumulative reward maximization versus empowerment learning, affect optimal values V. Importantly, the experiments show that our proposal to learn cumulative one-step empowerment values recovers the original intuition of empowerment in the sense that high values are assigned to states where many other states can be reached and low values to states where the number of reachable next states is low, but without the necessity to maintain a multi-step policy.

Figure 1: Value Iteration for a Grid World Example. The agent aims to arrive at the goal ’G’ in the lower left—detailed information regarding the setup can be found in Appendix C.1. The plots show optimal values for different α and β: α increases from left to right while β decreases. The leftmost values show raw cumulative empowerment learning (α=0.0, β=1.0). High values are assigned to states where many other states can be reached, i.e. the upper right; and low values to states where the number of reachable next states is low, i.e. close to corners and dead ends. The rightmost values recover ordinary cumulative reward maximization (α=1.0, β=0.0) assigning high values to states close to the goal and low values to states far away from the goal.

4.4 Generalization of and Relation to Existing MDP formulations

Our Bellman operator B from Equation (10) relates to prior work as follows (see also Appendix A.5).

  • Ordinary value iteration [51] is recovered as a special case for α=1 and β=0.

  • Cumulative one-step empowerment is recovered as a special case for α=0 and β=1, with non-cumulative one-step empowerment [29] as a further special case of the latter (γ0).

  • When setting q(𝒂|𝒔,𝒔)=q(𝒂|𝒔), using a distribution that is not conditioned on 𝒔 and omitting maximizing w.r.t. q, one recovers as a special case the soft Bellman operator presented e.g. in [50]. Note that this soft Bellman operator also occurred in numerous other work on MDP formulations and RL [3, 14, 44, 54, 32].

  • As a special case of the previous, when q(𝒂|𝒔,𝒔)=𝒰(𝒜) is the uniform distribution in action space, one recovers cumulative entropy regularization [69, 43, 33] that inspired algorithms such as soft Q-learning [20] and soft actor-critic [21, 22].

  • When dropping the conditioning on 𝒔 and 𝒔 by setting q(𝒂|𝒔,𝒔)=q(𝒂) but without omitting maximization w.r.t. q, one recovers a formulation similar to [64] based on mutual-information regularization [58, 59, 17, 31] that spurred RL algorithms such as [30, 19].

  • When replacing q(𝒂|𝒔,𝒔) with q(𝒂|𝒔,𝒂), where 𝒔 and 𝒂 refers to the state-action pair of the previous time step, one recovers a formulation similar to [63] based on the information-theoretic principle of directed information [37, 28, 38].

5 Scaling to High-Dimensional Environments

In the previous section, we presented a novel Bellman operator in combination with a value iteration scheme to combine reward maximization and empowerment. In this section, by leveraging parametric function approximators, we validate our ideas in high-dimensional state-action spaces and when there is no prior knowledge of the state-transition function. In Section 5.1, we devise novel actor-critic algorithms for RL based on our MDP formulation since they are naturally capable of handling both continuous state and action spaces. In Section 5.2, we practically confirm that empowerment can boost RL in the high-dimensional robotics simulator domain of MuJoCo using deep neural networks.

5.1 Empowered Off-Policy Actor-Critic Methods with Parametric Function Approximators

Contemporary off-policy actor-critic approaches for RL [35, 1, 15] follow the policy gradient theorem [62, 13] and learn two parametric function approximators: one for the behavioral policy πϕ(𝒂|𝒔) with parameters ϕ, and one for the state-action value function Qθ(𝒔,𝒂) of the parametric policy πϕ with parameters θ. The policy learning objective usually assumes the form: maxϕ𝔼𝒔𝒟[𝔼πϕ(𝒂|𝒔)[Qθ(𝒔,𝒂)]], where 𝒟 refers to a replay buffer [36] that stores collected state transitions from the environment. Following [21], Q-values are learned most efficiently by introducing another function approximator Vψ for state values of πϕ with parameters ψ using the objective:

minθ𝔼𝒔,𝒂,r,𝒔𝒟[(Qθ(𝒔,𝒂)-(αr+γVψ(𝒔)))2], (16)

where (𝒔,𝒂,r,𝒔) refers to an environment interaction sampled from the replay buffer (r stands for the observed reward signal). We multiply r by the scaling factor α from our formulation because Equation (16) can be directly used for the parametric methods we propose. Learning policy parameters ϕ and value parameters ψ requires however novel objectives with two additional approximators: one for the inverse dynamics model pχ(𝒂|𝒔,𝒔) of πϕ, and one for the transition function 𝒫ξ(𝒔|𝒔,𝒂) (with parameters χ and ξ respectively). While the necessity for pχ is clear, e.g. from inspecting Equation (5), the necessity for 𝒫ξ will fall into place shortly as we move forward.

In order to preserve a clear view, let’s define the quantity f(𝒔,𝒂):=𝔼𝒫ξ(𝒔|𝒔,𝒂)[logpχ(𝒂|𝒔,𝒔)]-logπϕ(𝒂|𝒔), which is short-hand notation for the empowerment-induced addition to the reward signal—compare to Equation (5). We then commence with the objective for value function learning:

minψ𝔼𝒔𝒟[(Vψ(𝒔)-𝔼πϕ(𝒂|𝒔)[Qθ(𝒔,𝒂)+βf(𝒔,𝒂)])2], (17)

which is similar to the standard value objective but with the added term βf(𝒔,𝒂) as a result of joint cumulative empowerment learning. At this point, the necessity for a transition model 𝒫ξ becomes apparent. In the above equation, new actions 𝒂 need to be sampled from the policy πϕ for a given 𝒔. However, the inverse dynamics model (inside f) depends on the subsequent state 𝒔 as well, requiring therefore a prediction for the next state. Note also that (𝒔,𝒂,r,𝒔)-tuples from the replay buffer as in Equation (16) can’t be used here, because the expectation over 𝒂 is w.r.t. to the current policy whereas tuples from the replay buffer come from a mixture of policies at an earlier stage of training.

Extending the ordinary actor-critic policy objective with the empowerment-induced term f yields:

maxϕ𝔼𝒔𝒟[𝔼πϕ(𝒂|𝒔)[Qθ(𝒔,𝒂)+βf(𝒔,𝒂)]]. (18)

The remaining parameters to be optimized are χ and ξ from the inverse dynamics model pχ and the transition model 𝒫ξ. Both problems are supervised learning problems that can be addressed by log-likelihood maximization using samples from the replay buffer, leading to maxχ𝔼𝒔𝒟[𝔼πϕ(𝒂|𝒔)𝒫ξ(𝒔|𝒔,𝒂)[logpχ(𝒂|𝒔,𝒔)]] and maxξ𝔼𝒔,𝒂,𝒔𝒟[log𝒫ξ(𝒔|𝒔,𝒂)].

Coming back to our motivation from Section 3, we propose two novel empowerment-inspired actor-critic approaches based on the optimization objectives specified in this section. The first combines cumulative reward maximization and empowerment learning following Equation (5) which we refer to as empowered actor-critic. The second learns cumulative empowerment values to construct intrinsic rewards following Equation (6) which we refer to as actor-critic with intrinsic empowerment.

Empowered Actor-Critic (EAC). In line with standard off-policy actor-critic methods [35, 15, 21], EAC interacts with the environment iteratively storing transition tuples (𝒔,𝒂,r,𝒔) in a replay buffer. After each interaction, a training batch {(𝒔,𝒂,r,𝒔)(b)}b=1B𝒟 of size B is sampled from the buffer to perform a single gradient update on the objectives from Equation (16) to (18) as well as the log likelihood objectives for the inverse dynamics and transition model—see Appendix B for pseudocode.

Actor-Critic with Intrinsic Empowerment (ACIE). By setting α=0 and β=1, EAC can train an agent merely focusing on cumulative empowerment learning. Since EAC is off-policy, it can learn with samples obtained from executing any policy in the real environment, e.g. the actor of any other reward-maximizing actor-critic algorithm. We can then extend external rewards rt at time t of this actor-critic algorithm with intrinsic rewards 𝔼πϕ(𝒂|𝒔t)𝒫ξ(𝒔|𝒔t,𝒂)[logpχ(𝒂|𝒔,𝒔t)πϕ(𝒂|𝒔t)] according to Equation (6), where (ϕ,ξ,χ) are the result of concurrent raw empowerment learning with EAC. This idea is similar to the preliminary work of [29] using non-cumulative empowerment as intrinsic motivation for deep value-based RL with discrete actions in the Atari game Montezuma’s Revenge.

5.2 Experiments with Deep Function Approximators in MuJoCo

We validate EAC and ACIE in the robotics simulator MuJoCo [65, 8] with deep neural nets under the same setup for each experiment following [66, 25, 49, 24, 55, 35, 67, 56, 1, 9, 15, 21]—see Appendix C.2 for details. While EAC is a standalone algorithm, ACIE can be combined with any RL algorithm (we use the model-free state of the art SAC [21]). We compare against DDPG [35] and PPO [56] from RLlib [34] as well as SAC on the MuJoCo v2-environments (ten seeds per run [46]).

The results in Figure 2 confirm that both EAC and ACIE can attain better initial performance compared to model-free baselines. While this holds true for both approaches on the pendulum benchmarks (balancing and swing up), our empowered methods can also boost RL in demanding environments like Hopper, Ant and Humanoid (the latter two being amongst the most difficult MuJoCo tasks). EAC significantly improves initial learning in Ant, whereas ACIE boosts SAC in Hopper and Humanoid. While EAC outperforms PPO and DDPG in almost all tasks, it is not consistently better then SAC. Similarly, the added intrinsic reward from ACIE to SAC does not always help. This is not unexpected as it cannot be in general ruled out that reward functions assign high (low) rewards to lowly (highly) empowered states, in which case the two learning signals may become partially conflicting.

Figure 2: MuJoCo Experiments. The plots show maximum episodic rewards (averaged over the last 100 episodes) achieved so far [9] versus steps—non-maximum episodic reward plots can be found in Figure 3. EAC and ACIE are compared to DDPG, PPO and SAC (DDPG did not work in Ant, see [21] and Appendix C.2 for an explanation). Shaded areas refer to the standard error. Both EAC and ACIE improve initial learning over baselines in the three pendulum tasks (upper row). In demanding problems like Hopper, Ant and Humanoid, our methods can boost RL. In terms of final performance, EAC is competitive with the baselines: it consistently outperforms DDPG and PPO on all tasks except Hopper, but is not always better than SAC. Similarly, the ACIE-signal does not always help SAC. This is not unexpected as extrinsic and empowered rewards may partially conflict.

For the sake of completeness, we report Figure 3 which is similar to Figure 2 but shows episodic rewards and not maximum episodic rewards obtained so far [9]. Also, limits of y-axes are preserved for the pendulum tasks. Note that our SAC baseline is comparable with the SAC from [22] on Hopper-v2, Walker2d-v2, Ant-v2 and Humanoid-v2 after 5105 steps (the SAC from [21] uses the earlier v1-versions of Mujoco and is hence not an optimal reference). However, there is a discrepancy on HalfCheetah-v2. This was earlier noted by others who tried to reproduce SAC results in HalfCheetah-v2 but failed to obtain episodic rewards as high as in [21, 22], leading to a GitHub issue https://github.com/rail-berkeley/softlearning/issues/75. The final conclusion of this issue was that differences in performance are caused by different seed settings and are therefore of statistical nature (comparing all algorithms under the same seed settings is hence valid).

Figure 3: Raw Results of MuJoCo Experiments. The plots are similar to the plots from Figure 2, but report episodic rewards (averaged over the last 100 episodes) versus steps—not maximum episodic rewards seen so far as in [9]. For the pendulum tasks, the limits of the y-axes are preserved.

6 Conclusion

This paper provides a theoretical contribution via a unified formulation for reward maximization and empowerment that generalizes Bellman’s optimality principle and recent information-theoretic extensions to it. We proved the existence of and convergence to unique optimal values, and practically validated our ideas by devising novel parametric actor-critic algorithms inspired by our formulation. These were evaluated on the high-dimensional MuJoCo benchmark demonstrating that empowerment can boost RL in challenging robotics tasks (e.g. Ant and Humanoid).

The most promising line of future research is to investigate scheduling schemes that dynamically trade off rewards vs. empowerment with the prospect of obtaining better asymptotic performance. Empowerment could also be particularly useful in a multi-task setting where task transfer could benefit from initially empowered agents.

Acknowledgments

We thank Haitham Bou-Ammar for pointing us in the direction of empowerment.

References

  • [1] A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation. In Proceedings of the International Conference on Learning Representations, 2018.
  • [2] S. Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20, 1972.
  • [3] M. G. Azar, V. Gomez, and H. J. Kappen. Dynamic policy programming with function approximation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2011.
  • [4] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
  • [5] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Springer, 1996.
  • [6] W. Bialek, I. Nemenman, and N. Tishby. Predictability, complexity, and learning. Neural Computation, 13(11):2409–2463, 2001.
  • [7] R. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.
  • [8] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI gym. arXiv, 2016.
  • [9] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.
  • [10] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley & Sons, 2006.
  • [11] I. Csiszar and G. Tusnady. Information geometry and alternating minimization procedures. Statistics and Decisions, Suppl.1:205–237, 1984.
  • [12] I. M. de Abril and R. Kanai. A unified strategy for implementing curiosity and empowerment driven reinforcement learning. arXiv, 2018.
  • [13] T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. In Proceedings of the International Conference on Machine Learning, 2012.
  • [14] R. Fox, A. Pakman, and N. Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2016.
  • [15] S. Fujimoto, H. van Hoof, and D. Meger. Adressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, 2018.
  • [16] R. G. Gallager. The Arimoto-Blahut algorithm for finding channel capacity. Technical report, Massachusetts Institute of Technology, USA, 1994.
  • [17] T. Genewein, F. Leibfried, J. Grau-Moya, and D. A. Braun. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2(27), 2015.
  • [18] J. Grau-Moya, F. Leibfried, T. Genewein, and D. A. Braun. Planning with information-processing constraints and model uncertainty in Markov decision processes. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016.
  • [19] J. Grau-Moya, F. Leibfried, and P. Vrancx. Soft Q-learning with mutual-information regularization. In Proceedings of the International Conference on Learning Representations, 2019.
  • [20] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. Proceedings of the International Conference on Machine Learning, 2017.
  • [21] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, 2018.
  • [22] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine. Soft actor-critic algorithms and applications. arXiv, 2019.
  • [23] T. Jung, D. Polani, and P. Stone. Empowerment for continuous agent-environment systems. Adaptive Behavior, 19(1):16–39, 2011.
  • [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.
  • [25] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.
  • [26] A. S. Klyubin, D. Polani, and C. L. Nehaniv. Empowerment: A universal agent-centric measure of control. In IEEE Congress on Evolutionary Computation, 2005.
  • [27] A. S. Klyubin, D. Polani, and C. L. Nehaniv. Keep your options open: An information-based driving principle for sensorimotor systems. PloS ONE, 3(12):p.e4018, 2008.
  • [28] G. Kramer. Directed information for channels with feedback. PhD thesis, University of Manitoba, Canada, 1998.
  • [29] N. M. Kumar. Empowerment-driven exploration using mutual information estimation. In NIPS Workshop, 2018.
  • [30] F. Leibfried and D. A. Braun. A reward-maximizing spiking neuron as a bounded rational decision maker. Neural Computation, 27(8):1686–1720, 2015.
  • [31] F. Leibfried and D. A. Braun. Bounded rational decision-making in feedforward neural networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2016.
  • [32] F. Leibfried, J. Grau-Moya, and H. Bou-Ammar. An information-theoretic optimality principle for deep reinforcement learning. In NIPS Workshop, 2018.
  • [33] S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv, 2018.
  • [34] E. Liang, R. Liaw, P. Moritz, R. Nishihara, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica. RLlib: Abstractions for distributed reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.
  • [35] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2016.
  • [36] L.-J. Lin. Reinforcement learning for robots using neural networks. PhD thesis, Carnegie Mellon University, USA, 1993.
  • [37] H. Marko. The bidirectional communication theory–a generalization of information theory. IEEE Transactions on Communications, 21(12):1345–1351, 1973.
  • [38] J. L. Massey and P. C. Massey. Conversion of mutual and directed information. In Proceedings of the International Symposium on Information Theory, 2005.
  • [39] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [40] V. Mnih, A. Puigdomenech Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.
  • [41] S. Mohamed and D. J. Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, 2015.
  • [42] G. Montufar, K. Ghazi-Zahedi, and N. Ay. Information theoretically aided reinforcement learning for embodied agents. arXiv, 2016.
  • [43] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and policy based reinforcement learning. Advances in Neural Information Processing Systems, 2017.
  • [44] G. Neu, V. Gomez, and A. Jonsson. A unified view of entropy-regularized Markov decision processes. arXiv, 2017.
  • [45] P. A. Ortega and D. A. Braun. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A, 469(2153), 2013.
  • [46] J. Pineau. Reproducible, reusable, and robust reinforcement learning. NIPS Invited Talk, 2018.
  • [47] M. Prokopenko, V. Gerasimov, and I. Tanev. Evolving spatiotemporal coordination in a modular robotic system. In Proceedings of the International Conference on the Simulation of Adaptive Behavior, 2006.
  • [48] A. H. Qureshi, B. Boots, and M. C. Yip. Adversarial imitation via variational inverse reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2019.
  • [49] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, 2014.
  • [50] J. Rubin, O. Shamir, and N. Tishby. Trading value and information in MDPs. In Decision Making with Imperfect Decision Makers, chapter 3. Springer, 2012.
  • [51] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.
  • [52] C. Salge, C. Glackin, and D. Polani. Empowerment–an introduction. In Guided Self-Organization: Inception, chapter 4. Springer, 2014.
  • [53] J. Schossau, C. Adami, and A. Hintze. Information-theoretic neuro-correlates boost evolution of cognitive systems. Entropy, 18(1):6, 2016.
  • [54] J. Schulman, P. Abbeel, and X. Chen. Equivalence between policy gradients and soft Q-learning. arXiv, 2017.
  • [55] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, 2015.
  • [56] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. In arXiv, 2017.
  • [57] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423,623–656, 1948.
  • [58] C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. Institute of Radio Engineers, International Convention Record, 7:142–163, 1959.
  • [59] C. A. Sims. Implications of rational inattention. Journal of Monetary Economics, 50(3):665–690, 2003.
  • [60] S. Still and D. Precup. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148, 2012.
  • [61] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • [62] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.
  • [63] S. Tiomkin and N. Tishby. A unified Bellman equation for causal information and value in Markov decision processes. In arXiv, 2018.
  • [64] N. Tishby and D. Polani. Information theory of decisions and actions. In Perception-Action Cycle, chapter 19. Springer, 2011.
  • [65] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
  • [66] H. van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, 2010.
  • [67] H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
  • [68] K. Zahedi, N. Ay, and R. Der. Higher coordination with less control-A result of information maximization in the sensorimotor loop. Adaptive Behavior, 18(3-4):338–355, 2010.
  • [69] B. D. Ziebart. Modeling purposeful adaptive behavior wih the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, USA, 2010.

Appendix A Theoretical Analysis

This section provides more details regarding the theoretical analysis of the main paper to prove the existence of unique optimal values as well as convergence of the value iteration scheme.

A.1 Proof of Proposition 2 from the Main Paper

Proof. Following [5, 50, 18], let’s start by defining Pπbehave:𝒮×𝒮[0,1] and gq,πbehave:𝒮:

Pπbehave(𝒔,𝒔):=𝔼πbehave(𝒂|𝒔)[𝒫(𝒔|𝒔,𝒂)],
gq,πbehave(𝒔):=𝔼πbehave(𝒂|𝒔)[α(𝒔,𝒂)+β𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)πbehave(𝒂|𝒔)]].

We can then express the Bellman operator Bq,πbehave in vectorized form yielding Bq,πbehaveV=gq,πbehave+γPπbehaveV. Defining Bq,πbehave(i) as short-hand notation for applying Bq,πbehave to a value vector V i-times consecutively (i=0 leaves V unaffected), we arrive at:

V(q,πbehave):=limiBq,πbehave(i)V=limit=0i-1γtPπbehavetgq,πbehave+γiPπbehaveiV0,

where Pπbehavet denotes the t-times multiplication of Pπbehave with itself (Pπbehave0 is the identity matrix). This means that the convergence of Bq,πbehave does not depend on the initial value vector V, therefore:

Bq,πbehaveV(q,πbehave)=gq,πbehave+γPπbehavelimit=0i-1γtPπbehavetgq,πbehave=γ0Pπbehave0gq,πbehave+limit=1iγtPπbehavetgq,πbehave=limit=0i-1γtPπbehavetgq,πbehave+γiPπbehaveigq,πbehave0=V(q,πbehave),

proving that V(q,πbehave) is a fixed point of Bq,πbehave. The uniqueness proof follows next. Assume there was another fixed point V of Bq,πbehave, then limiBq,πbehave(i)V=V(q,πbehave) because the convergence behavior of Bq,πbehave does not depend on the initial V, hence V=V(q,πbehave).

A.2 Proof of Proposition 3 from the Main Paper

Proof. Proving Proposition 3 from the main paper is similar to the maximum channel capacity problem from information theory [58, 10, 16]. The proof follows hence similar steps as the one for Proposition 1 from the background section on empowerment in the main paper, in the following accomplished via Lemma 12 and 3.

Lemma 1

Inverse Dynamics. Maximizing the right-hand side of the Bellman operator BV(𝐬)=maxπbehave,qBq,πbehaveV(𝐬) w.r.t. to q for a given πbehave yields:

argmaxqBq,πbehaveV(𝒔)=𝒫(𝒔|𝒔,𝒂)πbehave(𝒂|𝒔)𝒂𝒫(𝒔|𝒔,𝒂)πbehave(𝒂|𝒔).

Proof. It holds that argmaxqBq,πbehaveV(𝒔)=argmaxq𝔼πbehave(𝒂|𝒔)𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)πbehave(𝒂|𝒔)] because neither nor V depends on q. It then follows that

𝔼πbehave(𝒂|𝒔)𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)πbehave(𝒂|𝒔)]qI(𝑨,𝑺|𝒔)=𝔼πbehave(𝒂|𝒔)𝒫(𝒔|𝒔,𝒂)[logp(𝒂|𝒔,𝒔)πbehave(𝒂|𝒔)],

where p is the true Bayesian posterior—see [10] Lemma 10.8.1.

Lemma 2

Optimal Policy. Maximizing the right-hand side of the Bellman operator BV(𝐬)=maxπbehave,qBq,πbehaveV(𝐬) w.r.t. to πbehave for a given q yields:

argmaxπbehaveBq,πbehaveV(𝒔)=exp(αβ(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γβV(𝒔)])𝒂exp(αβ(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γβV(𝒔)]).

Proof. Maximizing Bq,πbehaveV(𝒔) w.r.t. πbehave subject to the constraint 𝒂πbehave(𝒂|𝒔)=1 yields the Lagrangian:

L(πbehave,λ)=Bq,πbehaveV(𝒔)-λ((𝒂πbehave(𝒂|𝒔))-1),

where λ is a Lagrange multiplier. The derivatives of the Lagrangian w.r.t. πbehave(𝒂~|𝒔), where 𝒂~ refers to a specific action, and λ are given by:

L(πbehave,λ)πbehave(𝒂~|𝒔)=α(𝒔,𝒂~)+𝔼𝒫(𝒔|𝒔,𝒂~)[βlogq(𝒂~|𝒔,𝒔)πbehave(𝒂~|𝒔)+γV(𝒔)]-β-λ,
L(πbehave,λ)λ=-((𝒂πbehave(𝒂|𝒔))-1).

Equating the first derivative with 0 and resolving w.r.t. πbehave(𝒂~|𝒔), one arrives at:

πbehave(𝒂~|𝒔)=exp(αβ(𝒔,𝒂~)+𝔼𝒫(𝒔|𝒔,𝒂~)[logq(𝒂~|𝒔,𝒔)+γβV(𝒔)]-β+λβ).

Plugging this result into the second derivative and equating with 0 yields:

exp(-β+λβ)=(𝒂exp(αβ(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γβV(𝒔)]))-1.

Plugging the latter back into the result for πbehave(𝒂~|𝒔) completes the proof.

Lemma 3

Blahut-Arimoto. Assuming bounded R, iterating through Equations (13) and (14) from the main paper converges to argmaxπbehave,qBq,πbehaveV(𝐬) at a rate of O(1/M) for arbitrary initial πbehave(0) having support in A𝐬, with M being the total number of iterations.

Proof. Evaluating the operator Bq,πbehaveV(𝒔) at the pair (q(m),πbehave(m+1)), we obtain:

Bq(m),πbehave(m+1)V(𝒔)=βlog𝒂exp(αβ(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(m)(𝒂|𝒔,𝒔)+γβV(𝒔)]).

Due to Lemma 4, we know that maxπbehave,qBq,πbehaveV(𝒔) is upper bounded:

maxπbehave,qBq,πbehaveV(𝒔)𝔼πbehave``(𝒂|𝒔)[α(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[βlogq(m)(𝒂|𝒔,𝒔)+γV(𝒔)]-βlogπbehave(m)(𝒂|𝒔)]=𝔼πbehave``(𝒂|𝒔)[βlog(exp(αβ(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(m)(𝒂|𝒔,𝒔)+γβV(𝒔)]))-βlogπbehave(m)(𝒂|𝒔)],

where the notation `` indicates optimality of a single value iteration step, as opposed to the notation (q,πbehave) from the main paper that refers to optimality after the entire value iteration scheme has converged—see Lemma 4.

By using the definition of πbehave(m+1)(𝒂|𝒔) from Equation (14), the upper two equations enable us to derive the following upper bound:

maxπbehave,qBq,πbehaveV(𝒔)-Bq(m),πbehave(m+1)V(𝒔)β𝔼πbehave``(𝒂|𝒔)[logπbehave(m+1)(𝒂|𝒔)πbehave(m)(𝒂|𝒔)].

From there it follows that for M steps of the Blahut-Arimoto scheme

1Mm=0M-1(maxπbehave,qBq,πbehaveV(𝒔)-Bq(m),πbehave(m+1)V(𝒔))1Mβ𝔼πbehave``(𝒂|𝒔)[logπbehave(M)(𝒂|𝒔)πbehave(0)(𝒂|𝒔)]1Mβ𝔼πbehave``(𝒂|𝒔)[log1πbehave(0)(𝒂|𝒔)]1Mβmax𝒂[log1πbehave(0)(𝒂|𝒔)].

However, since the upper term is lower-bounded by 0 and since Bq(0),πbehave(0)V(𝒔)Bq(0),πbehave(1)V(𝒔)Bq(1),πbehave(1)V(𝒔) because of the alternating optimization procedure, this implies convergence at a rate of 𝒪(1/M).

Lemma 4

Upper Value Bound for One Value Iteration Step. Let’s introduce the following notation (q``,πbehave``):=argmaxπbehave,qBq,πbehaveV(𝐬) where the symbol `` indicates optimality of a single value iteration step, as opposed to the notation (q,πbehave) from the main paper that refers to optimality after the entire value iteration scheme has converged. Let’s define κ(m)(𝐬,𝐚):=αR(𝐬,𝐚)+EP(𝐬|𝐬,𝐚)[βlogq(m)(𝐚|𝐬,𝐬)+γV(𝐬)]. It then holds that maxπbehave,qBq,πbehaveV(𝐬)Eπbehave``(𝐚|𝐬)[κ(m)(𝐬,𝐚)-βlogπbehave(m)(𝐚|𝐬)].

Proof. Let’s first note that (q``,πbehave``) exists because Bq,πbehaveV is bounded. Bq,πbehaveV is bounded because it is a sum of three weighted terms that are bounded—see Equation (12) of the main paper:

  • 𝔼πbehave(𝒂|𝒔)[(𝒔,𝒂)] is bounded because the reward is bounded by assumption,

  • 𝔼πbehave(𝒂|𝒔)𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)πbehave(𝒂|𝒔)] is a lower bound to the mutual information I(𝑨,𝑺|𝒔) (which is bounded) according to [10] Lemma 10.8.1,

  • and V(𝒔) is bounded when the value iteration schemes (both using B and Bq,πbehave) are initialized, and remains bounded in each value iteration step because Bq,πbehaveV(𝒔) is bounded due to the previous two points and initial bounded V(𝒔).

It then holds that

maxπbehave,qBq,πbehaveV(𝒔)=Bq``,πbehave``V(𝒔)=𝔼πbehave``(𝒂|𝒔)[α(𝒔,𝒂)+γ𝔼𝒫(𝒔|𝒔,𝒂)[V(𝒔)]+β𝔼𝒫(𝒔|𝒔,𝒂)[log𝒫(𝒔|𝒔,𝒂)𝒂𝒫(𝒔|𝒔,𝒂)πbehave``(𝒂|𝒔)]]𝔼πbehave``(𝒂|𝒔)[α(𝒔,𝒂)+γ𝔼𝒫(𝒔|𝒔,𝒂)[V(𝒔)]+β𝔼𝒫(𝒔|𝒔,𝒂)[log𝒫(𝒔|𝒔,𝒂)𝒂𝒫(𝒔|𝒔,𝒂)πbehave(m)(𝒂|𝒔)]],

where the equality is obtained by plugging in q`` using Equation (13), and where the inequality leverages one more time [10] Lemma 10.8.1.

At the same time, we can plug Equation (13) from the main paper into κ(m)(𝒔,𝒂), yielding:

κ(m)(𝒔,𝒂)=α(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[βlog𝒫(𝒔|𝒔,𝒂)πbehave(m)(𝒂|𝒔)𝒂𝒫(𝒔|𝒔,𝒂)πbehave(m)(𝒂|𝒔)+γV(𝒔)].

Rearranging the upper equation results in:

β𝔼𝒫(𝒔|𝒔,𝒂)[log𝒫(𝒔|𝒔,𝒂)𝒂𝒫(𝒔|𝒔,𝒂)πbehave(m)(𝒂|𝒔)]=κ(m)(𝒔,𝒂)-βlogπbehave(m)(𝒂|𝒔)-α(𝒔,𝒂)-γ𝔼𝒫(𝒔|𝒔,𝒂)[V(𝒔)].

Plugging the latter result into the earlier derived inequality completes the proof.

A.3 Proof of Proposition 4 from the Main Paper

Proof. The mechanics of the proof are in line with [5, 50, 18]. Let’s denote (q,πbehave)=argmaxπbehave,qV(q,πbehave) and V=V(q,πbehave). It then holds that

V=Bq,πbehaveVmaxπbehave,qBq,πbehaveV=:Bq,πbehaveVV(q,πbehave),

where the last inequality is because of the consistency of values as proven in Lemma 5. But by definition it holds that V=maxπbehave,qV(q,πbehave)V(q,πbehave). This implies that V=V(q,πbehave). The latter means that V=maxπbehave,qBq,πbehaveV=BV which proves that V is a fixed point of the operator B.

The uniqueness of values proof comes next. Assume there was another fixed point of the operator B denoted as V=V(q,πbehave), then

V=Bq,πbehaveV=maxπbehave,qBq,πbehaveVBq,πbehaveVV(q,πbehave)=V,

where the last inequality is again because of Lemma 5. One can show similarly that VV, which does hence imply that V=V.

Lemma 5

Value Consistency for the Evaluation Operator. If VBq,πbehaveV then Bq,πbehave(i)VV(q,πbehave)iN, and similarly if VBq,πbehaveV then Bq,πbehave(i)VV(q,πbehave)iN.

Proof. The proof follows via induction. The base case is V()Bq,πbehaveV. The inductive step is as follows. If Bq,πbehave(i-1)V()Bq,πbehave(i)V then

Bq,πbehave(i+1)V=gq,πbehave+γPπbehaveBq,πbehave(i)V()gq,πbehave+γPπbehaveBq,πbehave(i-1)V=Bq,πbehave(i)V,

which completes the induction with help of the concise notation from Appendix A.1.

A.4 Proof Details of Theorem 2 from the Main Paper

This section is to shed more light on the proof of Theorem 2 from the main paper to show that B is a contraction map via the subsequent proposition.

Proposition 5

Contraction Map. Assuming bounded R and let ηR+ be a positive constant η=αmax𝐬,𝐚|R(𝐬,𝐚)|+βlog|A|. Then V-B(i)Vγi11-γη with initial V(𝐬)=0𝐬.

Proof. The proposition is proven by the following sequence of inequalities:

V-B(i)V=:|V(𝒔)-B(i)V(𝒔)|=|maxπbehave,qBq,πbehaveV(𝒔)-maxπbehave,qBq,πbehaveB(i-1)V(𝒔)|maxπbehave,q|Bq,πbehaveV(𝒔)-Bq,πbehaveB(i-1)V(𝒔)|=maxπbehave|γ𝔼πbehave(𝒂|𝒔)𝒫(𝒔|𝒔,𝒂)[V(𝒔)]-γ𝔼πbehave(𝒂|𝒔)𝒫(𝒔|𝒔,𝒂)[B(i-1)V(𝒔)]|γV-B(i-1)VrecursionγiV-V=V is 0γiVγi11-γη,

where η is a positive constant to upper-bound V-values, see Corollary 2.

Corollary 2

Upper Value Bound for Optimal Values. Optimal values are upper-bounded according to |V(𝐬)|11-γ(αmax𝐬,𝐚|R(𝐬,𝐚)|+βlog|A|)𝐬.

This follows straightforwardly from worst-case assumptions and properties of the geometric series and the mutual information. The empowerment-induced addition to the reward signal is upper-bounded by a mutual information term, which is upper-bounded by the worst-case entropy in action space.

Remark. A contraction proof for B with any two initial value vectors V and V follows similar steps as outlined in Proposition 5 by replacing V accordingly.

A.5 Limit Cases of Equation (7)

In the following, we consider limit cases of Equation (7).

A.5.1 Value Iteration Recovered

Here, we consider α=1 and β0. While one can easily recover value iteration as a special case by inspecting Equation (5) from the main paper simply by setting α=1 and β=0, it can be insightful how to obtain Bellman’s classical optimality principle as a limit case from Equation (7):

limβ0V(𝒔)=limβ0βlog𝒂exp(1β(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γβV(𝒔)])=L’Hospital if max𝒂((𝒔,𝒂)+γ𝔼𝒫(𝒔|𝒔,𝒂)[V(𝒔)])>0limβ0𝒂exp(1β(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γβV(𝒔)])(-1β2)((𝒔,𝒂)+γ𝔼𝒫(𝒔|𝒔,𝒂)[V(𝒔)])(-1β2)𝒂exp(1β(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γβV(𝒔)])=max𝒂((𝒔,𝒂)+γ𝔼𝒫(𝒔|𝒔,𝒂)[V(𝒔)]).

The above is true if ((𝒔,𝒂)+γ𝔼𝒫(𝒔|𝒔,𝒂)[V(𝒔)])>0 for at least one action 𝒂 given the state 𝒔, because numerator and denominator are then dominated by the maximum sum element. If max𝒂((𝒔,𝒂)+γ𝔼𝒫(𝒔|𝒔,𝒂)[V(𝒔)])0 given 𝒔, then one needs to focus on the second line of the above expression because L’Hospital does not apply anymore. In this case, the maximum element will dominate the sum dwarfing the non-maximum elements. As a consequence log and exp cancel each other and β cancels with (1/β). β hence only multiplies with the intrinsic motivation term induced by empowerment. The latter is going to therefore vanish since β0, resulting in the same expression as in the last line above.

A.5.2 Cumulative One-Step Empowerment Recovered

Here we consider α0 and β=1. In line with the previous section, recovering cumulative one-step empowerment can be easily obtained from Equation (5) by setting α=0 and β=1. The limit case of Equation (7) is trivially given by:

limα0V(𝒔)=limα0log𝒂exp(α(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γV(𝒔)])=log𝒂exp(𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γV(𝒔)]).

A.5.3 Non-Cumulative One-Step Empowerment Recovered

In addition to α0 and β=1 from the former section, we consider here γ0 in the following:

limα0,γ0V(𝒔)=limα0,γ0log𝒂exp(α(𝒔,𝒂)+𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)+γV(𝒔)])=log𝒂exp(𝔼𝒫(𝒔|𝒔,𝒂)[logq(𝒂|𝒔,𝒔)]).

The latter can be also obtained by running one-step empowerment (k=1) according to the Blahut-Arimoto scheme from the main paper’s background section in Proposition 1 until convergence, and subsequently plugging the converged solution πempower from Equation (4) into Equation (2).

Appendix B Pseudocode for the Empowered Actor-Critic (EAC)

Let’s restate the optimization objectives from Section 5.1 as functions of the optimization parameters and a batch ={(𝒔,𝒂,r,𝒔)(b)}b=1B sampled from the replay buffer, where B is the batch size:

JQ(θ,)=1Bb=1B(Qθ(𝒔(b),𝒂(b))-(αr(b)+γVψ(𝒔(b))))2,
JV(ψ,)=1Bb=1B(Vψ(𝒔(b))-𝔼πϕ(𝒂|𝒔(b))[Qθ(𝒔(b),𝒂)+βf(𝒔(b),𝒂)])2,
Jπ(ϕ,)=-1Bb=1B𝔼πϕ(𝒂|𝒔(b))[Qθ(𝒔(b),𝒂)+βf(𝒔(b),𝒂)],
Jp(χ,)=-1Bb=1B𝔼πϕ(𝒂|𝒔(b))𝒫ξ(𝒔|𝒔(b),𝒂)[logpχ(𝒂|𝒔,𝒔(b))],
J𝒫(ξ,)=-1Bb=1Blog𝒫ξ(𝒔(b)|𝒔(b),𝒂(b)).

Denoting the corresponding learning rates as δθ,δψ,δϕ,δχ and δξ, we can phrase pseudocode for the empowered actor-critic conveniently. {algorithm} Empowered Actor-Critic (EAC) {algorithmic} \Stateinitialize θ, ψ, ϕ, χ and ξ \Foreach episode \State𝒔0 reset environment \Foreach environment step t \State# environment interaction \State𝒂tπϕ(𝒂t|𝒔t) \Commentsample an action from the policy \Statert(𝒔t,𝒂t) \Commentevaluate the action \State𝒔t+1𝒫(𝒔t+1|𝒔t,𝒂t) \Commentexecute the action \State𝒟𝒟{(𝒔t,𝒂t,rt,𝒔t+1)} \Commentadd the transition to the replay buffer \State# gradient updates \State𝒟 \Commentdraw a transition batch from the replay buffer \Stateθθ-δθθJQ(θ,) \Commentupdate the Q-critic \Stateψψ-δψψJV(ψ,) \Commentupdate the V-critic \Stateϕϕ-δϕϕJπ(ϕ,) \Commentupdate the policy \Stateχχ-δχχJp(χ,) \Commentupdate the inverse dynamics \Stateξξ-δξξJ𝒫(ξ,) \Commentupdate the transition model \EndFor\EndFor

Note that practically when updating the Q-value parameters θ, we recommend replacing the value target Vψ with an exponentially averaged value target Vψ¯ instead where ψ¯(1-τ)ψ¯+τψ with horizon parameter τ —see [21].

Note also that our second proposed method, actor-critic with intrinsic empowerment (ACIE), can use the same algorithm for learning parametric function approximators by setting α=0 and β=1. Since Algorithm B is an off-policy method that uses a replay buffer, it can be combined with any other actor-critic algorithm whose actor is collecting samples from the environment. An ACIE-agent can hence be trained concurrently and used to generate intrinsic rewards according to Equation (6) from the main paper. These intrinsic rewards are then added to the extrinsic rewards of the agent that collects samples from the environment to encourage visiting states with high cumulative empowerment.

Appendix C Experiments

The following subsections provide a detailed description of the setups that we used for the grid world and MuJoCo experiments.

C.1 Grid World

In the grid world setting from the main paper (Section 4.3), the agent has to reach a goal in the lower left of a 16×16 grid, which is rewarded with +2. The agent can execute nine actions in each grid cell: left, right, up, down, as well as diagonally or stay in place. The transition function is deterministic. The discount factor γ was set to γ=0.95 in the experiments. The stopping criterion for the value iteration scheme was when the infinity norm of two consecutive value vectors dropped below ϵouter=510-4. The stopping criterion for the inner Blahut-Arimoto scheme for each value iteration step was when the maximum absolute difference between the probability values in consecutive q and πbehave dropped below the threshold ϵinner=510-4.

Below is another grid world example similar to the one from the main paper, where the agent has to reach a goal in the upper right of a 16×16 grid. Reaching the goal is rewarded with +1 and terminates the episode whereas every step is penalized with -1. The transition function is probabilistic. Whenever the agent takes a step, the agent ends up at the intended next grid cell with only a 20%-chance. There is either a 30%-chance of a horizontal perturbation by one step, or a 30%-chance of a vertical perturbation by one step, or a 20%-chance of a diagonal perturbation by one step. The discount factor γ was set to γ=0.6 (leading to more myopic policies).

Figure 4: Value Iteration for another Grid World Example. The figure is similar to Figure 1 from the main paper. The agent aims to arrive at the goal ’G’ in the upper right. The plots show optimal values for different α and β ranging from raw cumulative empowerment learning to reward maximization. Raw cumulative empowerment learning (α=0.0, β=1.0, see second plot) assigns high values to states where many other states can be reached, i.e. the middle of the upper and lower room as well as the door connecting them; and low values to states where the number of reachable next states is low, i.e. close to walls and corners as well as in the bottom right dead end and the goal (because it terminates the episode). Ordinary cumulative reward maximization (α=1.0, β=0.0, see rightmost plot) assigns high values to states close to the goal and low values to states that are far away.

C.2 MuJoCo

For all our MuJoCo experiments, we followed standard literature regarding hyperparameter settings [21]. We used Adam [24] as optimizer for all parametric functions with a learning rate δ=310-4. The discount factor γ was set to γ=0.99, the replay buffer size was 5105 and the batch size for training was 256. All neural networks were implemented in PyTorch. The critic and policy networks had two hidden layers whereas the transition and inverse dynamics model networks had three hidden layers. The number of units per hidden layer was 256 using ReLU activations. In line with [21], we used an exponentially averaged V-value target for updating Q-value parameters with a horizon parameter τ=0.01—explained at the end of Appendix B. Our specific trade-off parameters α and β were set to α=10 and β=0.1 respectively (both for EAC and ACIE experiments) as determined through initial experiments on InvertedDoublePendulum-v2 and HalfCheetah-v2. ACIE-generated intrinsic rewards were furthermore clipped to not exceed an absolute value of 20.

Both policy and inverse dynamics model assume that actions are distributed according to a multivariate Gaussian with diagonal covariance. They receive as input the (concatenated) vectors of 𝒔 and (𝒔,𝒔) respectively. They output the mean and the log standard deviation vectors from which real-valued actions can be sampled. The real-valued actions are subsequently squashed through a sigmoid function because MuJoCo has bounded action spaces. We used tanh [21] scaled by the environment-specific bounds. The transition network assumes that states are distributed according to a multivariate isotropic Gaussian with a given standard deviation of 10-5. It receives as input the concatenated vectors of (𝒔,𝒂) and outputs the mean of 𝒔. The value networks merely output a single real number for cumulative reward prediction given the input. The input to the Q-value network are the concatenated vectors of (𝒔,𝒂) whereas the input to the V-value network is 𝒔.

Following [66, 67, 15, 21], we used a twin Q-critic rather than a single Q-critic. This means that two Q-critic networks Qθ1(,) and Qθ2(,) are trained. When updating the V-critic and the policy, Qθ(,) is replaced with min{Qθ1(,),Qθ2(,)} to prevent value overestimation. To train the policy parameters, we applied the reparameterization trick on the actions [25, 49]—see [21] Appendix C. We also found it helpful to bound the log standard deviation of the policy and inverse dynamics networks according to [9] Appendix A.1 to make our implementation more stable.

We compare against an SAC baseline with hyperparameters chosen according to the original paper [21], except using a reward scale of 10 to ensure comparability with our methods EAC and ACIE. We furthermore compare against the DDPG and PPO baselines from RLlib [34] using hyperparameters settings following [15] and [56], but with the same neural network architectures as used in EAC, ACIE and SAC to ensure a fair comparison.

Note that in neither Figure 2 nor Figure 3 from the main paper do we report results from DDPG on Ant because the RLlib baseline implementation of that algorithm was not able to learn with our experimental protocol in that specific environment. In initial trials, we observed that DDPG in Ant leads to a rapid drop in performance to large negative values after the very first few episodes and never recovers from there within the next 5105 environment steps. This performance pattern is in line with the experiments conducted in previous literature and can be seen by carefully inspecting Figure 1(d) from the SAC-paper [21].