Hierarchical Expert Networks for Meta-Learning

  • 2019-11-14 17:03:56
  • Heinke Hihn, Daniel A. Braun
  • 0


The goal of meta-learning is to train a model on a variety of learning tasks,such that it can adapt to new problems within only a few iterations. Here wepropose a principled information-theoretic model that optimally partitions theunderlying problem space such that the resulting partitions are processed byspecialized expert decision-makers. To drive this specialization we impose thesame kind of information processing constraints both on the partitioning andthe expert decision-makers. We argue that this specialization leads toefficient adaptation to new tasks. To demonstrate the generality of ourapproach we evaluate on three meta-learning domains: image classification,regression, and reinforcement learning.


Quick Read (beta)


The goal of meta-learning is to train a model on a variety of learning tasks, such that it can adapt to new problems within only a few iterations. Here we propose a principled information-theoretic model that optimally partitions the underlying problem space such that the resulting partitions are processed by specialized expert decision-makers. To drive this specialization we impose the same kind of information processing constraints both on the partitioning and the expert decision-makers. We argue that this specialization leads to efficient adaptation to new tasks. To demonstrate the generality of our approach we evaluate on three meta-learning domains: image classification, regression, and reinforcement learning.




Hierarchical Expert Networks for Meta-Learning


Heinke Hihn                        Daniel A. Braun

Institute for Neural Information Processing Ulm University Ulm, Germany                        Institute for Neural Information Processing Ulm University Ulm, Germany

1 Introduction

Recent machine learning research has shown impressive results on incredibly diverse tasks from problem classes such as pattern recognition, reinforcement learning, and generative model learning [Devlin et al., 2018, Mnih et al., 2015, Schmidhuber, 2015]. These success stories typically have two computational luxuries in common: a large data base with thousands or even millions of training samples and a very long and extensive training period. However, applying these pre-trained models to new tasks naïvely usually leads to very poor performance, as with each new incoming batch of data, expensive and slow re-learning is required. In contrast to this, humans are able to learn from very few examples and excel at adapting quickly [Jankowski et al., 2011], for example in motor tasks [Braun et al., 2009] or at learning new visual concepts [Lake et al., 2015].

Sample-efficient adaptation to new tasks can be regarded as a form of meta-learning or “learning to learn” [Thrun and Pratt, 2012, Schmidhuber et al., 1997, Caruana, 1997] and is an ongoing and active field of research–see e.g. [Koch et al., 2015, Vinyals et al., 2016, Finn et al., 2017, Ravi and Larochelle, 2017, Ortega et al., 2019, Botvinick et al., 2019, Yao et al., 2019]. Meta-learning can be defined in different ways, but a common point is that the system learns on two levels, each with different time scales: slow learning across different tasks on a meta-level, and fast learning to adapt to each task individually.

Here, we propose a novel learning paradigm for hierarchical meta learning systems. Our method finds an optimal soft partitioning of the problem space by imposing information-theoretic constraints on both the process of expert selection and on the expert specialization. We argue that these constraints drive an efficient division of labor in systems that are bounded in their respective information processing power, where we make use of information-theoretic bounded rationality [Ortega and Braun, 2013]. When the model is presented with previously unseen tasks it assigns them to experts specialized on similar tasks – see Figure 1. Additionally, expert networks specializing on only a subset of the problem space allows for smaller neural network architectures with only few units per layer. In order to split the problem space and to assign the partitions to experts, we learn to represent tasks through a common latent embedding, that is then used by a selector network to distribute the tasks to the experts.

The outline of this paper is as follows: first we introduce bounded rationality and meta learning, next we introduce our novel approach and derive applications to classification, regression, and reinforcement learning. Finally, we conclude.

Figure 1: The selector assigns the new input encoding ω to one of the three experts θ0, θ1 or θ2, depending on the similarity of the input to previous inputs seen by the experts.

2 Background

2.1 Bounded Rational Decision Making

An important concept in decision making is the notion of utility [Von Neumann and Morgenstern, 2007], where an agent picks an action as*𝒜 such that it maximizes their utility in some context s𝒮, i.e. as*=argmaxa𝐔(s,a), where the utility is given by a function 𝐔(s,a) and the states distribution p(s) is known and fixed. Trying to solve this optimization problem naïvely leads to an exhaustive search over all possible (a,s) pairs, which is in general a prohibitive strategy. Instead of finding an optimal strategy, a bounded-rational decision-maker optimally trades off expected utility and the processing costs required to adapt. In this study we consider the information-theoretic free-energy principle [Ortega and Braun, 2013] of bounded rationality, where the decision-maker’s resources are modeled by an upper bound on the Kullback-Leibler divergence DKL(p(a|s)||p(a))=ap(a|s)logp(a|s)p(a) between the agent’s prior distribution p(a) and the posterior policy p(a|s), resulting in the following constrained optimization problem:

maxp(a|s)s,ap(s)p(a|s)𝐔(s,a) (1)
s.t. 𝔼p(s)[DKL(p(a|s)||p(a))]B. (2)

This constraint can also be interpreted as a regularization on p(a|s). We can transform this into an unconstrained variational problem by introducing a Lagrange multiplier β+:

maxp(a|s)𝔼p(s|a)[𝐔(s,a)]-1β𝔼p(s)[DKL(p(a|s)||p(a))]. (3)

For β we recover the maximum utility solution and for β0 the agent can only act according to the prior. The optimal prior in this case is given by the marginal p(a)=sSp(s)p(a|s) [Ortega and Braun, 2013].

2.1.1 Hierarchical Decision Making

Aggregating several bounded-rational agents by a selection policy allows for solving optimization problems that exceed the capabilities of the individual decision-makers [Genewein et al., 2015]. To achieve this, the search space is split into partitions such that each partition can be solved by a decision-maker. A two stage mechanism is introduced: The first stage is an expert selection policy p(x|s) that chooses an expert x given a state s and the second stage chooses an action according to the expert’s posterior policy p(a|s,x). The optimization problem given by (3) can be extended to incorporate a trade-off between computational costs and utility in both stages:

maxp(a|s,x),p(x|s)𝔼[𝐔(s,a)]-1β1I(S;X)-1β2I(S;A|X) (4)

where β1 is the resource parameter for the expert selection stage and β2 for the experts. I(;) is the mutual information between the two random variables. The solution can be found by iterating the following set of equations [Genewein et al., 2015]:

{p(x|s)=1Z(s)p(x)exp(β1ΔFpar(s,x))p(x)=sp(s)p(x|s)p(a|s,x)=1Z(s,x)p(a|x)exp(β2𝐔(s,a))p(a|x)=sp(s|x)p(a|s,x) (5)

where Z(s) and Z(s,x) are normalization factors and ΔFpar(s,x)=𝔼p(a|s,x)[𝐔(s,a)]-1β2DKL(p(a|s,x)||p(a|x)) is the free energy of the action selection stage. Thus the marginal distribution p(a|x) defines a mixture-of-experts policy given by the posterior distributions p(a|s,x) weighted by the responsibilities determined by the Bayesian posterior p(s|x). Note that p(s|x) is not determined by a given likelihood model, but is the result of the optimization process (4).

Figure 2: Our proposed method consists of three main stages. First, the training dataset Dtrain, is passed through a convolutional autoencoder to find a latent representation z(di) for each diDtrain, which we get by flattening the preceding convolutional layer (labeled as flattening layer in the figure). This image embedding is then pooled and fed forward selection network.

2.2 Meta Learning

Meta-learning algorithms can be divided roughly into Metric-Learning [Koch et al., 2015, Vinyals et al., 2016, Snell et al., 2017], Optimizer Learning [Ravi and Larochelle, 2017, Finn et al., 2017, Zintgraf et al., 2018], and Task Decomposition Models [Lan et al., 2019, Vezhnevets et al., 2019]. Our approach depicted in Figure 2 can be seen as a member of the latter group.

2.2.1 Meta Supervised Learning

In a supervised learning task we are usually interested in a dataset consisting of multiple input and output pairs 𝒟={(xi,yi)}i=0N and the learner is tasked with finding a function f(x) that maps from input to output, for example through a deep neural network. To do this, we split the dataset into training and test sets and fit a set of parameters θ on the training data and evaluate on test data using the learned function fθ(x). In meta-learning, we are instead working with meta-datasets D𝒟, each containing regular datasets split into training and test sets. We thus have different meta-sets for meta-training, meta-validation, and meta-test (Dmeta-train,Dmeta-val, and Dmeta-test, respectively). On Dmeta-train, we are interested in training a learning procedure (the meta-learner) that can take as input one of its training sets Dtrain and produce a classifier (the learner) that achieves low prediction error on its corresponding test set Dtest.

A special case of meta-learning for classification are K-Shot N-way tasks. In this setting, we are given for each dataset D a training set consisting of K labeled examples of each of the N classes (KN examples per dataset) and corresponding test sets. In our study, we focus on the following variation of K-Shot 2-Way tasks: the meta-learner is presented with 2K samples (K positive and K negative examples) and must assign this dataset to an expert learner. Note that the negative examples may be drawn from any of the remaining classes.

2.2.2 Meta Reinforcement Learning

We model sequential decision problems by defining a Markov Decision Process as a tuple (𝒮,𝒜,P,r), where 𝒮 is the set of states, 𝒜 the set of actions, P:𝒮×𝒜×𝒮[0,1] is the transition probability, and r:𝒮×𝒜 is a reward function. The aim is to find the parameter θ of a policy πθ that maximizes the expected reward:

θ*=argmaxθ𝔼πθ[t=0r(st,at)]J(πθ). (6)

We define r(τ)=t=0r(st,at) as the cumulative reward of trajectory τ={(st,at)}i=0, which is sampled by acting according to the policy π, i.e. (s,a)π(|s), and st+1P(|st,at). Learning in this environment can then be modeled by reinforcement learning [Sutton and Barto, 2018], where an agent interacts with an environment over a number of (discrete) time steps t. At each time step t, the agent finds itself in a state st and selects an action at according to the policy π(at|st). In return, the environment transitions to the next state st+1 and generates a scalar reward rt. This process continues until the agent reaches a terminal state after which the process restarts. The goal of the agent is to maximize the expected return from each state st, which is typically defined as the infinite horizon discounted sum of the rewards. A common choice to achieving this is Q-Learning [Watkins and Dayan, 1992], where we make use of an action-value function that is defined as the discounted sum of rewards Q(τ)=t=0γtr(st,at), where γ(0,1] is a discount factor. Learning the optimal policy can be achieved in many ways. Here, we consider Policy gradient methods [Sutton et al., 2000] which are a popular choice to tackle continuous reinforcement learning problems. The main idea is to directly manipulate the parameters θ of the policy in order to maximize the objective J(πθ) by taking steps in the direction of the gradient θJ(πθ).

In meta reinforcement learning the problem is given by a set of tasks tiT, where each task ti is defined by an MDP ti=(𝒮,𝒜,Pi,ri) as described earlier. We are now interested in finding a set of policies Θ that maximizes the average cumulative reward across all tasks in T and generalizes well to new tasks sampled from a different set of tasks T.


[t!] Expert Networks for Supervised Meta-Learning {algorithmic}[1] \StateInput: Data Distribution p(𝒟), number of samples K, batch-size M, training episodes N \StateHyper-parameters: resource parameters β1, β2, learning rates ηs, ηx for selector and experts \StateInitialize parameters θ,ϑ \Fori = 0, 1, 2, …, N \StateSample batch of M datasets Dip(𝒟), each consisting of a training dataset Dmeta-train and a meta-validation dataset Dmeta-val with 2K samples each   \ForDDi \StateFind Latent Embedding z(Dmeta-train)\StateSelect expert xpθ(x|z(Dmeta-train))\StateCompute ΔFD,x of x on Dmeta-val\EndFor \StateUpdate selection parameters θ with ΔFD,x\StateUpdate Autoencoder with pos. samples in Di\StateUpdate experts x with assigned Dmeta-train\EndFor \Statereturn θ, ϑ

3 Expert Networks for Meta-Learning

Information-theoretic bounded rationality postulates that hierarchies and abstractions emerge when agents have only limited access computational resources [Genewein et al., 2015, Gottwald and Braun, 2019b, Gottwald and Braun, 2019a], e.g. limited sampling complexity [Hihn et al., 2018] or limited representational power [Hihn et al., 2019]. We will show that forming such abstractions equips an agent with the ability of learning the underlying problem structure and thus enables learning of unseen but similar concepts. The method we propose comes out of a unified optimization principle and has the following important features:

  1. 1.

    A regularization mechanism to enforce the emergence of expert policies.

  2. 2.

    A task compression mechanism to extract relevant task information.

  3. 3.

    A selection mechanism to find the most efficient expert for a given task.

  4. 4.

    A regularization mechanism to improve generalization capabilities.

3.1 Latent Task Embeddings

Note that the selector assigns a complete dataset to an expert and that this can be seen as a meta-learning task, as described in [Ravi and Larochelle, 2017]. To do so, we must find a feature vector z(d) of the dataset d. This feature vector must fulfill the following desiderata: 1) invariance against permutation of data points in d, 2) high representational capacity, 3) efficient computability, and 4) constant dimensionality regardless of sample size K. In the following we propose such features for image classification, regression, and reinforcement learning problems.

For image classification we propose to pass the positive images in the dataset through a convolutional autoencoder and use the respective outputs of the bottleneck layer. Convolutional Autoencoders are generative models that learn to reconstruct their inputs by minimizing the Mean-Squared-Error between the input and the reconstructed image (see e.g. [Chen et al., 2019]). In this way we get similar embeddings z(d) for similar inputs belonging to the same class. The latent representation is computed for each positive sample in d and then passed through a pooling function h(z(d)) to find a single embedding for the complete dataset–see figure 2 for an overview of our proposed model. While in principle functions such as mean, max, and min can be used, we found that max pooling yields the best results. The authors of [Yao et al., 2019] propose a similar feature set.

For regression we define a similar feature vector. The K training data points are transformed into a feature vector z(d) by binning the points into N bins according to their respective x value and collecting the respective y value. If more than one point falls into the same bin the y values are averaged, thus providing invariance against the order of the data points in Dmeta-train. We use this feature vector to assign each data set to an expert according to pθ(x|z(d)).

In the reinforcement learning setting we use a dynamic recurrent neural network (RNN) with LSTM units [Hochreiter and Schmidhuber, 1997] to classify trajectories. We feed the RNN with (st,at,rt,t) tuples to describe the underlying Markov Decision Process describing the task. At t=0 we sample the expert x according to the learned prior distribution p(x), as there is no information available so far. The authors of [Lan et al., 2019] propose a similar feature set.

Omniglot Few-Shot Classification Results
Number of Experts
K 2 4 8 16
% Acc I(X;W) % Acc I(X;W) % Acc I(X;W) % Acc I(X;W)
1 76.2 (± 0.02) 0.99 (± 0.01) 86.7 (± 0.02) 1.96 (± 0.01) 90.1 (± 0.01) 2.5 (± 0.20) 92.9 (± 0.01) 3.2 (± 0.3)
5 67.3 (± 0.01) 0.93 (± 0.01) 75.5 (± 0.01) 1.95 (± 0.10) 78.4 (± 0.01) 2.7 (± 0.10) 81.2 (± 0.01) 3.3 (± 0.2)
10 66.4 (± 0.04) 0.95 (± 0.30) 75.8 (± 0.01) 1.90 (± 0.03) 77.3 (± 0.01) 2.8 (± 0.15) 77.8 (± 0.01) 3.1 (± 0.2)
Table 1: Classification results for the omniglot data set [Lake et al., 2011]. We evaluate our system by splitting the dataset into training and validation data (80% - 20%) and train the system as described in Algorithm 2.2.2 and report the classification accuracy on the validation, i.e. classes and samples that are novel to the model. We trained for 50.000 episodes each with a batch of 32 datasets and set β1=15 and β2=5.

3.2 Hierarchical On-line Meta-Learning

As discussed in section 2.1, the aim of the selection network is to find an optimal partition of the experts pθ(x|s), such that the selector’s expected utility xpθ(x|s)ΔFpar(s,x) is maximized under an information-theoretic constraint DKL(pθ(x|s)||p(x)), where θ are the selector’s parameters (e.g. weights in a neural network), x the expert and s is an input. Each expert x follows a policy pϑ(a|s,x) that maximizes their expected utility p(s|x)pθ(a|s,x)𝐔(s,a). We introduce our gradient based on-line learning algorithm to find the optimal partitioning and the expert parameters in the following. Rewriting the optimization problem (4) as

maxpϑ(a|s,x),pθ(x|s)s,x,ap(s)pθ(x|s)pϑ(a|s,x)J(s,x,a) (7)

where the objective J(s,x,a) is given by

J(s,x,a)=𝐔(s,a)-1β1logpθ(x|s)p(x)-1β2logpϑ(a|s,x)p(a|x), (8)

and θ,ϑ are the parameters of the selection policy and the expert policies, respectively. Note that each expert policy has a distinct set of parameters ϑx, i.e. ϑ={ϑx}x, but we drop the x index for readability. In the following we will show how we can apply this formulation to classification, regression and reinforcement learning.

3.2.1 Application to Supervised Learning

Combining multiple experts can often be beneficial [Kuncheva, 2004], e.g. in Mixture-of-Experts [Yuksel et al., 2012] or Multiple Classifier Systems [Bellmann et al., 2018]. Our method can be interpreted as a member of this family of algorithms.

In accordance with Section 2.1 we define the utility as the negative prediction loss, i.e. U(fx(d),y)=-(fx(d),y), where fx(d) is the prediction of the expert x given the input data point d (in the following we will use the shorthand y^x) and y is the ground truth. We define the cross-entropy loss (y^x,y,)=-iyilogyi^x as a performance measure for classification and the mean squared error (y^x,y)=i(y^i-yi)2 for regression. The objective for expert selection thus is given by

maxθ𝔼pθ(x|d)[f^-1β1logpθ(x|d)p(x)], (9)

where f^=𝔼pϑ(y^x|x,s)[-(y^x,y)-1β2logpϑ(y^x|s,x)p(y^x|x)], i.e. the free energy of the expert and θ,ϑ are the parameters of the selection policy and the expert policies, respectively. Analogously, the action selection objective for each expert x is defined by

maxϑ𝔼pϑ(y^x|x,s)[-(y^x,y)-1β2logpϑ(y^x|s,x)p(y^x|x)]. (10)
Figure 3: Here we show the soft-partition found by the selection policy for the sine prediction problem y=asin(x+b), where a,b are chosen uniformly at each trial. To generate these plots we train a system on K=1,5 or 10 respectively, sample a,b and K points and feed the data set to the selection policy. Each color represents a different expert. We can see that the selection policy becomes increasingly more precise as we provide more points per data set (denoted by K) to the system. We set β1=100 and β2=10.

3.2.2 Application to Reinforcement Learning

In the reinforcement learning setup the utility 𝐔(s,a) is given by the reward function r(s,a). In maximum entropy RL the regularization penalizes deviation from a fixed uniform prior, but in a more general setting we can discourage deviation from an arbitrary prior policy by determining the optimal policy p*(a|s) as

argmaxp𝔼p[t=0γt(r(st,at)-1βlogp(at|st)p(at))]. (11)

As discussed in Section 2.1, the optimal prior is the marginal of the posterior policy given by p(a)=sp(s)p(a|s). We approximate the prior distributions p(x) and p(a|x) by exponential running mean averages of the posterior policies.

To optimize the objective we define two separate value functions: one to estimate the discounted sum of rewards and one to estimate the free energy of the expert policies. The discounted reward for the experts is Rt=l=0Tγlr(st+l,at+l), which we learn by parameterizing the value function with a neural network. Similar to the discounted reward Rt we can now define the discounted free energy Ft as Ft=l=0Tγlf(st+l,xt+l,at+l), where f(s,x,a)=r(s,a)-1β2logpϑ(a|s,x)p(a|x). The value function Ft is learned by parameterizing it with a neural network and performing regression on the mean-squared-error.

3.2.3 Expert Selection

The selector network learns a policy pθ(x|s) that assigns states s to expert policies x optimally. The resource parameter β1 constrains the information-processing in this step. For β10 the selection assigns each state completely randomly to an expert, while for β1 the selection becomes deterministic, always choosing the most promising expert x. The selector optimizes the following objective:

maxθ𝔼pθ(x|s)[f^(s,x)-1β1logpθ(x|s)p(x)], (12)

where f^(s,x)=𝔼pϑ(a|s,x)[f(s,x,a)], which is the free energy of the expert. The gradient of J(θ) is then given (up to an additive constant) by

𝔼[θlogpθ(x|s)(f^(s,x)-1β1logpθ(x|s)p(x))]. (13)
Figure 4: The single expert system is not able to learn the underlying structure of the sine wave, where the two expert system is already able to capture the periodic structure. Adding more experts improves adaption further, as the results show. We trained for 10.000 episodes each with a batch of 32 data sets.
Figure 5: Analogously to the rate-distortion curve in rate-distortion theory [Blahut, 1972, Arimoto, 1972], we can interpret this curve as the rate-utility showing the trade-off between information processing and expected utility (transparent area represents the standard deviation). Increasing the processing power of the selection stage I(W;X) (i.e. adding more experts) improves adaption.

The double expectation can be replaced by Monte Carlo estimates, where in practice we use a single (s,x,a) tuple for f^(s,x). This formulation is known as the policy gradient method [Sutton et al., 2000] and is prone to producing high variance gradients, but can be reduced by using an advantage function instead of the reward [Schulman et al., 2015]. The advantage function A(at,st) is a measure of how well a certain action a performs in a state s compared to the average performance in that state, i.e. A(a,s)=Q(s,a)-V(s). Here, V(s) is called the value function and captures the expected cumulative reward when in state s, and Q(s,a) is an estimate of the expected cumulative reward achieved in state s when choosing a particular action a. Thus the advantage is an estimate of how advantageous it is to pick a in state s in relation to a baseline performance V(s). Instead of learning V and Q, we can approximate the advantage function

A^tF=f(st,xt,at)+γVϕ(st+1)Q(st,at)-Vϕ(st), (14)

such that we can get away with just learning a single value function Vϕ(st). Both the selection network and the selector value network are implemented as recurrent neural networks with LSTM cells [Hochreiter and Schmidhuber, 1997]. Both networks share the recurrent cell followed by independent feed forward layers.

Figure 6: In each Meta-Update Step we sample N tasks from the training task set T and update the agents. After training is completed we evaluate their respective performance on a tasks from the meta test set T. Rewards are normalized to [0,1] and the episode horizon is 500 time steps. Results are averaged over 10 random seeds and trained for 1000 episodes each with a batch of 64 environments.

3.2.4 Action Selection

The actions a is sampled from the posterior action distribution of the experts. Each expert x maintains a policy for each of the world states s and updates those according to the utility/cost trade-off. The advantage function for each expert is given as

A^t=r(st,at)+γVφ(st+1)-Vφ(st). (15)

The objective of this stage is then to maximize the expected advantage 𝔼pϑ(a|s,x)[A^t].

4 Empirical Results

4.1 Sinusoid Regression

We adopt this task from [Finn et al., 2017]. In this K-shot problem, each task consists of learning to predict a function of the form y=asin(x+b), with both a[0.1,5] and b[0,2π] chosen uniformly, and the goal of the learner is to find y given x based on only K pairs of (x,y). Given that the underlying function changes in each iteration it is impossible to solve this problem with a single learner. Our results show that by combing expert networks, we are able to reduce the generalization error iteratively as we add more experts to our system–see Figures 5 for K=5 and K=10 settings. In Figure 4 we show how the system is able to capture the underlying problem structure as we add more experts and in Figure 3 we visualize how the selector’s partition of the problem space looks like.

4.2 Few-Shot Classification

The Omniglot dataset [Lake et al., 2011] consists of over 1600 characters from 50 alphabets. As each character has merely 20 samples each drawn by a different person, this forms a difficult learning task and is thus often referred to as the ”transposed MNIST” dataset. The Omniglot dataset is regarded as a standard meta learning benchmark, see e.g. [Finn et al., 2017, Vinyals et al., 2016, Ravi and Larochelle, 2017].

We train the learner on a subset of the dataset (80%, i.e. 1300 classes) and evaluate on the remaining 300 classes, thus investigating the ability to generalize to new data. In each round we build the datasets Dmeta-train and Dmeta-test by selecting a target class ct and sample K positive and K negative samples. To generate negative samples we draw K images randomly out of the remaining N-1 classes. We present the selection network with the feature presentation of the K positive training samples (see Figure 2), but evaluate the experts’ performance on the 2K test samples in Dmeta-test. In this way the free energy of the experts becomes a measure of how well the expert is able to generalize to new samples of the target class and distinguish them from negative examples. Using this optimization scheme, we train the expert networks to become experts in recognizing a subset of classes. After a suitable expert is selected we train that expert using the 2K samples from the training dataset—see Figure 5 and Table 1 for results. To generate this figure, we ran a 10-fold cross-validation on the whole dataset and show the averaged performance metric and the respective standard-deviation across the folds. In both settings ”0 bits” corresponds to a single expert, i.e. a single neural network trained on the task.

4.3 Meta Reinforcement Learning

Task Distribution
Paramater T T
Distance Penalty [10-3,10-1] [10-3,10-2]
Goal Position [0.3, 0.4] [0, 3]
Start Position [-0.15, 0.15] [-0.25, 0.25]
Motor Torques [0,5] [0,3]
Motor Actuation [185, 215] [175, 225]
Inverted Control p=0.5 p=0.5
Gravity [0.01, 4.9] [4.9, 9.8]
Table 2: All parameters are sampled uniformly from the specified range for each environment. T is used for training and T for meta evaluation.

We create a set of RL tasks by sampling the parameters for the Inverted Double Pendulum problem [Sutton, 1996] implemented in OpenAI Gym [Brockman et al., 2016]. The task is to balance a two-link pendulum in an upward position. We modify inertia, motor torques, reward function, goal position and invert the control signal – see Table 2 for details. The control signal a is continuous in the interval [-1,1] is generated by neural network that outputs μ and log(σ) of a gaussian. The action is sampled by re-parameterizing the distribution to p(a)=μ+exp(σ)ϵ, where ϵ𝒩(0,1), so that the distribution is differentiable w.r.t to the network outputs.

The meta task set T is based on the same environment, but the parameter distribution and range is different, providing new but similar reinforcement learning problems. In each episode M environments are sampled and the system is updated accordingly. After training is concluded the system is evaluated on tasks sampled from T. We trained the system for 1000 Episodes with 64 tasks from T and evaluate for 100 system updates on tasks from T. We report the results in Figure 6, where we can see improving performance as more experts are added and the mutual information in the selection stage indicates that the tasks can be assigned to their respective expert policy.

5 Discussion

We have introduced and evaluated a novel information-theoretic approach to meta learning. In particular we leveraged an information-theoretic approach to bounded rationality [Leibfried et al., 2017, Grau-Moya et al., 2019, Hihn et al., 2019, Schach et al., 2018, Gottwald and Braun, 2019b, Lindig-Leon et al., 2019]. Our results show that our method is able to identify sub-regions of the problem set with expert networks. In effect, this equips the system with several initializations covering the problem space and thus enables it to adapt quickly to new but similar tasks. To reliably identify such tasks, we have proposed feature extraction methods for classification, regression and reinforcement learning, that could be simply be replaced and improved in the future. The strength of our model is that it follows from simple principles that can be applied to a large range of problems. Moreover, the system performance can be interpreted in terms of the information processing of the selection stage and the expert decision-makers.

Most other methods for meta learning such as [Finn et al., 2017] and [Ravi and Larochelle, 2017] try to find a initial parametrization of a single learner, such that it is able to adapt quickly to new problems. This initialization can be interpreted as compression of the most common task properties over all tasks. Our method however learns to identify task properties over a subset of tasks and provide several initializations. Task specific information is thus directly available instead of a delayed availability after several iterations as in [Finn et al., 2017] and [Ravi and Larochelle, 2017]. In principle, this can help to adapt within fewer iterations. Thus our method can be seen as the general case of such monolithic meta-learning algorithms.

Another hierarchical approach to meta-learning is the work of [Yao et al., 2019], where the focus is on learning similarities between completely different problems (e.g. different classification datasets). In this way the portioning is largely governed by the different tasks. Our study however focuses on discovering meta-information within the same task family, where the meta-partitioning is determined solely by the optimization process and can thus potentially discover unknown dynamics and relations within a task family.

Although our method is widely applicable, it suffers from low sample efficiency in the RL domain. An interesting research direction would be to combine our system with model-based RL which is known improve sample efficiency. Another research direction would be to investigate our systems performance in continual adaption tasks, such as in [Yao et al., 2019]. There the system is continuously provided with data sets (e.g. additional classes and samples). Another limitation is the restriction to binary meta classification tasks, which we leave for feature work.


This work was supported by the European Research Council Starting Grant BRISC, ERC-STG-2015, Project ID 678082.


  • [Arimoto, 1972] Arimoto, S. (1972). An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20.
  • [Bellmann et al., 2018] Bellmann, P., Thiam, P., and Schwenker, F. (2018). Multi-classifier-systems: Architectures, algorithms and applications. In Computational Intelligence for Pattern Recognition, pages 83–113. Springer.
  • [Blahut, 1972] Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473.
  • [Botvinick et al., 2019] Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in cognitive sciences.
  • [Braun et al., 2009] Braun, D. A., Aertsen, A., Wolpert, D. M., and Mehring, C. (2009). Motor task variation induces structural learning. Current Biology, 19(4):352–357.
  • [Brockman et al., 2016] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
  • [Caruana, 1997] Caruana, R. (1997). Multitask learning. Machine learning, 28(1):41–75.
  • [Chen et al., 2019] Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., and Huang, J.-B. (2019). A closer look at few-shot classification. International Conference on Representation Learning.
  • [Devlin et al., 2018] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [Finn et al., 2017] Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org.
  • [Genewein et al., 2015] Genewein, T., Leibfried, F., Grau-Moya, J., and Braun, D. A. (2015). Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2:27.
  • [Gottwald and Braun, 2019a] Gottwald, S. and Braun, D. A. (2019a). Bounded rational decision-making from elementary computations that reduce uncertainty. Entropy, 21(4).
  • [Gottwald and Braun, 2019b] Gottwald, S. and Braun, D. A. (2019b). Systems of bounded rational agents with information-theoretic constraints. Neural computation, 31(2):440–476.
  • [Grau-Moya et al., 2019] Grau-Moya, J., Leibfried, F., and Vrancx, P. (2019). Soft q-learning with mutual-information regularization. International Conference on Learning Representations.
  • [Hihn et al., 2018] Hihn, H., Gottwald, S., and Braun, D. A. (2018). Bounded rational decision-making with adaptive neural network priors. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition, pages 213–225. Springer.
  • [Hihn et al., 2019] Hihn, H., Gottwald, S., and Braun, D. A. (2019). An information-theoretic on-line learning principle for specialization in hierarchical decision-making systems. arXiv preprint arXiv:1907.11452.
  • [Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
  • [Jankowski et al., 2011] Jankowski, N., Duch, W., and Grkabczewski, K. (2011). Meta-learning in computational intelligence, volume 358. Springer Science & Business Media.
  • [Koch et al., 2015] Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2.
  • [Kuncheva, 2004] Kuncheva, L. I. (2004). Combining pattern classifiers: methods and algorithms. John Wiley & Sons.
  • [Lake et al., 2011] Lake, B., Salakhutdinov, R., Gross, J., and Tenenbaum, J. (2011). One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33.
  • [Lake et al., 2015] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
  • [Lan et al., 2019] Lan, L., Li, Z., Guan, X., and Wang, P. (2019). Meta reinforcement learning with task embedding and shared policy. International Joint Conference on Artificial Intelligence.
  • [Leibfried et al., 2017] Leibfried, F., Grau-Moya, J., and Ammar, H. B. (2017). An information-theoretic optimality principle for deep reinforcement learning. Deep Reinforcement Learning Workshop NIPS 2018.
  • [Lindig-Leon et al., 2019] Lindig-Leon, C., Gottwald, S., and Braun, D. A. (2019). Analyzing abstraction and hierarchical decision-making in absolute identification by information-theoretic bounded rationality. Frontiers in Neuroscience, 13:1230.
  • [Mnih et al., 2015] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.
  • [Ortega and Braun, 2013] Ortega, P. A. and Braun, D. A. (2013). Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 469(2153).
  • [Ortega et al., 2019] Ortega, P. A., Wang, J. X., Rowland, M., Genewein, T., Kurth-Nelson, Z., Pascanu, R., Heess, N., Veness, J., Pritzel, A., Sprechmann, P., et al. (2019). Meta-learning of sequential strategies. arXiv preprint arXiv:1905.03030.
  • [Ravi and Larochelle, 2017] Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. International Conference on Learning Representations.
  • [Schach et al., 2018] Schach, S., Gottwald, S., and Braun, D. A. (2018). Quantifying motor task performance by bounded rational decision theory. Frontiers in neuroscience, 12.
  • [Schmidhuber, 2015] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural networks, 61:85–117.
  • [Schmidhuber et al., 1997] Schmidhuber, J., Zhao, J., and Wiering, M. (1997). Shifting inductive bias with success-story algorithm, adaptive levin search, and incremental self-improvement. Machine Learning, 28(1):105–130.
  • [Schulman et al., 2015] Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations.
  • [Snell et al., 2017] Snell, J., Swersky, K., and Zemel, R. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087.
  • [Sutton, 1996] Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in neural information processing systems, pages 1038–1044.
  • [Sutton and Barto, 2018] Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • [Sutton et al., 2000] Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
  • [Thrun and Pratt, 2012] Thrun, S. and Pratt, L. (2012). Learning to learn. Springer Science & Business Media.
  • [Vezhnevets et al., 2019] Vezhnevets, A. S., Wu, Y., Leblond, R., and Leibo, J. (2019). Options as responses: Grounding behavioural hierarchies in multi-agent rl. arXiv preprint arXiv:1906.01470.
  • [Vinyals et al., 2016] Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016). Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638.
  • [Von Neumann and Morgenstern, 2007] Von Neumann, J. and Morgenstern, O. (2007). Theory of games and economic behavior (commemorative edition). Princeton university press.
  • [Watkins and Dayan, 1992] Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine learning, 8(3-4):279–292.
  • [Yao et al., 2019] Yao, H., Wei, Y., Huang, J., and Li, Z. (2019). Hierarchically structured meta-learning. In International Conference on Machine Learning, pages 7045–7054.
  • [Yuksel et al., 2012] Yuksel, S. E., Wilson, J. N., and Gader, P. D. (2012). Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8):1177–1193.
  • [Zintgraf et al., 2018] Zintgraf, L. M., Shiarlis, K., Kurin, V., Hofmann, K., and Whiteson, S. (2018). Caml: Fast context adaptation via meta-learning. Internationcal Conference on Learning Representations.