Abstract
The goal of metalearning is to train a model on a variety of learning tasks,such that it can adapt to new problems within only a few iterations. Here wepropose a principled informationtheoretic model that optimally partitions theunderlying problem space such that the resulting partitions are processed byspecialized expert decisionmakers. To drive this specialization we impose thesame kind of information processing constraints both on the partitioning andthe expert decisionmakers. We argue that this specialization leads toefficient adaptation to new tasks. To demonstrate the generality of ourapproach we evaluate on three metalearning domains: image classification,regression, and reinforcement learning.
Quick Read (beta)
Abstract
The goal of metalearning is to train a model on a variety of learning tasks, such that it can adapt to new problems within only a few iterations. Here we propose a principled informationtheoretic model that optimally partitions the underlying problem space such that the resulting partitions are processed by specialized expert decisionmakers. To drive this specialization we impose the same kind of information processing constraints both on the partitioning and the expert decisionmakers. We argue that this specialization leads to efficient adaptation to new tasks. To demonstrate the generality of our approach we evaluate on three metalearning domains: image classification, regression, and reinforcement learning.
fit,positioning,quotes,arrows.meta
Hierarchical Expert Networks for MetaLearning
Heinke Hihn Daniel A. Braun
Institute for Neural Information Processing Ulm University Ulm, Germany Institute for Neural Information Processing Ulm University Ulm, Germany
1 Introduction
Recent machine learning research has shown impressive results on incredibly diverse tasks from problem classes such as pattern recognition, reinforcement learning, and generative model learning [Devlin et al., 2018, Mnih et al., 2015, Schmidhuber, 2015]. These success stories typically have two computational luxuries in common: a large data base with thousands or even millions of training samples and a very long and extensive training period. However, applying these pretrained models to new tasks naïvely usually leads to very poor performance, as with each new incoming batch of data, expensive and slow relearning is required. In contrast to this, humans are able to learn from very few examples and excel at adapting quickly [Jankowski et al., 2011], for example in motor tasks [Braun et al., 2009] or at learning new visual concepts [Lake et al., 2015].
Sampleefficient adaptation to new tasks can be regarded as a form of metalearning or “learning to learn” [Thrun and Pratt, 2012, Schmidhuber et al., 1997, Caruana, 1997] and is an ongoing and active field of research–see e.g. [Koch et al., 2015, Vinyals et al., 2016, Finn et al., 2017, Ravi and Larochelle, 2017, Ortega et al., 2019, Botvinick et al., 2019, Yao et al., 2019]. Metalearning can be defined in different ways, but a common point is that the system learns on two levels, each with different time scales: slow learning across different tasks on a metalevel, and fast learning to adapt to each task individually.
Here, we propose a novel learning paradigm for hierarchical meta learning systems. Our method finds an optimal soft partitioning of the problem space by imposing informationtheoretic constraints on both the process of expert selection and on the expert specialization. We argue that these constraints drive an efficient division of labor in systems that are bounded in their respective information processing power, where we make use of informationtheoretic bounded rationality [Ortega and Braun, 2013]. When the model is presented with previously unseen tasks it assigns them to experts specialized on similar tasks – see Figure 1. Additionally, expert networks specializing on only a subset of the problem space allows for smaller neural network architectures with only few units per layer. In order to split the problem space and to assign the partitions to experts, we learn to represent tasks through a common latent embedding, that is then used by a selector network to distribute the tasks to the experts.
The outline of this paper is as follows: first we introduce bounded rationality and meta learning, next we introduce our novel approach and derive applications to classification, regression, and reinforcement learning. Finally, we conclude.
2 Background
2.1 Bounded Rational Decision Making
An important concept in decision making is the notion of utility [Von Neumann and Morgenstern, 2007], where an agent picks an action ${a}_{s}^{*}\in \mathcal{A}$ such that it maximizes their utility in some context $s\in \mathcal{S}$, i.e. ${a}_{s}^{*}={\mathrm{arg}\mathrm{max}}_{a}\mathbf{U}(s,a)$, where the utility is given by a function $\mathbf{U}(s,a)$ and the states distribution $p(s)$ is known and fixed. Trying to solve this optimization problem naïvely leads to an exhaustive search over all possible $(a,s)$ pairs, which is in general a prohibitive strategy. Instead of finding an optimal strategy, a boundedrational decisionmaker optimally trades off expected utility and the processing costs required to adapt. In this study we consider the informationtheoretic freeenergy principle [Ortega and Braun, 2013] of bounded rationality, where the decisionmaker’s resources are modeled by an upper bound on the KullbackLeibler divergence ${\text{D}}_{\text{KL}}(p(as)p(a))={\sum}_{a}p(as)\mathrm{log}\frac{p(as)}{p(a)}$ between the agent’s prior distribution $p(a)$ and the posterior policy $p(as)$, resulting in the following constrained optimization problem:
$\underset{p(as)}{\mathrm{max}}{\displaystyle \sum _{s,a}}p(s)p(as)\mathbf{U}(s,a)$  (1)  
$\text{s.t.}{\mathbb{E}}_{p(s)}\left[{\text{D}}_{\text{KL}}(p(as)p(a))\right]\le \text{B}.$  (2) 
This constraint can also be interpreted as a regularization on $p(as)$. We can transform this into an unconstrained variational problem by introducing a Lagrange multiplier $\beta \in {\mathbb{R}}^{+}$:
$$\underset{p(as)}{\mathrm{max}}{\mathbb{E}}_{p(sa)}\left[\mathbf{U}(s,a)\right]\frac{1}{\beta}{\mathbb{E}}_{p(s)}\left[{\text{D}}_{\text{KL}}(p(as)p(a))\right].$$  (3) 
For $\beta \to \mathrm{\infty}$ we recover the maximum utility solution and for $\beta \to 0$ the agent can only act according to the prior. The optimal prior in this case is given by the marginal $p(a)={\sum}_{s\in S}p(s)p(as)$ [Ortega and Braun, 2013].
2.1.1 Hierarchical Decision Making
Aggregating several boundedrational agents by a selection policy allows for solving optimization problems that exceed the capabilities of the individual decisionmakers [Genewein et al., 2015]. To achieve this, the search space is split into partitions such that each partition can be solved by a decisionmaker. A two stage mechanism is introduced: The first stage is an expert selection policy $p(xs)$ that chooses an expert $x$ given a state $s$ and the second stage chooses an action according to the expert’s posterior policy $p(as,x)$. The optimization problem given by (3) can be extended to incorporate a tradeoff between computational costs and utility in both stages:
$$\underset{p(as,x),p(xs)}{\mathrm{max}}\mathbb{E}[\mathbf{U}(s,a)]\frac{1}{{\beta}_{1}}I(S;X)\frac{1}{{\beta}_{2}}I(S;AX)$$  (4) 
where ${\beta}_{1}$ is the resource parameter for the expert selection stage and ${\beta}_{2}$ for the experts. $I(\cdot ;\cdot )$ is the mutual information between the two random variables. The solution can be found by iterating the following set of equations [Genewein et al., 2015]:
$$\{\begin{array}{cc}\begin{array}{ccc}\hfill p(xs)& \hfill =\hfill & \frac{1}{Z(s)}p(x)\mathrm{exp}({\beta}_{1}\mathrm{\Delta}{\text{F}}_{\text{par}}(s,x))\hfill \\ \hfill p(x)& \hfill =\hfill & {\sum}_{s}p(s)p(xs)\hfill \\ \hfill p(as,x)& \hfill =\hfill & \frac{1}{Z(s,x)}p(ax)\mathrm{exp}({\beta}_{2}\mathbf{U}(s,a))\hfill \\ \hfill p(ax)& \hfill =\hfill & {\sum}_{s}p(sx)p(as,x)\hfill \end{array}\hfill & \end{array}$$  (5) 
where $Z(s)$ and $Z(s,x)$ are normalization factors and $\mathrm{\Delta}{\text{F}}_{\text{par}}(s,x)={\mathbb{E}}_{p(as,x)}[\mathbf{U}(s,a)]\frac{1}{{\beta}_{2}}{\text{D}}_{\text{KL}}(p(as,x)p(ax))$ is the free energy of the action selection stage. Thus the marginal distribution $p(ax)$ defines a mixtureofexperts policy given by the posterior distributions $p(as,x)$ weighted by the responsibilities determined by the Bayesian posterior $p(sx)$. Note that $p(sx)$ is not determined by a given likelihood model, but is the result of the optimization process (4).
2.2 Meta Learning
Metalearning algorithms can be divided roughly into MetricLearning [Koch et al., 2015, Vinyals et al., 2016, Snell et al., 2017], Optimizer Learning [Ravi and Larochelle, 2017, Finn et al., 2017, Zintgraf et al., 2018], and Task Decomposition Models [Lan et al., 2019, Vezhnevets et al., 2019]. Our approach depicted in Figure 2 can be seen as a member of the latter group.
2.2.1 Meta Supervised Learning
In a supervised learning task we are usually interested in a dataset consisting of multiple input and output pairs $\mathcal{D}={\{({x}_{i},{y}_{i})\}}_{i=0}^{N}$ and the learner is tasked with finding a function $f(x)$ that maps from input to output, for example through a deep neural network. To do this, we split the dataset into training and test sets and fit a set of parameters $\theta $ on the training data and evaluate on test data using the learned function ${f}_{\theta}(x)$. In metalearning, we are instead working with metadatasets $D\in \mathcal{D}$, each containing regular datasets split into training and test sets. We thus have different metasets for metatraining, metavalidation, and metatest (${D}_{\text{metatrain}},{D}_{\text{metaval}}$, and ${D}_{\text{metatest}}$, respectively). On ${D}_{\text{metatrain}}$, we are interested in training a learning procedure (the metalearner) that can take as input one of its training sets ${D}_{\text{train}}$ and produce a classifier (the learner) that achieves low prediction error on its corresponding test set ${D}_{\text{test}}$.
A special case of metalearning for classification are $K$Shot $N$way tasks. In this setting, we are given for each dataset $D$ a training set consisting of $K$ labeled examples of each of the $N$ classes ($K\cdot N$ examples per dataset) and corresponding test sets. In our study, we focus on the following variation of $K$Shot 2Way tasks: the metalearner is presented with $2K$ samples ($K$ positive and $K$ negative examples) and must assign this dataset to an expert learner. Note that the negative examples may be drawn from any of the remaining classes.
2.2.2 Meta Reinforcement Learning
We model sequential decision problems by defining a Markov Decision Process as a tuple $(\mathcal{S},\mathcal{A},P,r)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\to [0,1]$ is the transition probability, and $r:\mathcal{S}\times \mathcal{A}\to \mathbb{R}$ is a reward function. The aim is to find the parameter $\theta $ of a policy ${\pi}_{\theta}$ that maximizes the expected reward:
$${\theta}^{*}=\underset{\theta}{\mathrm{arg}\mathrm{max}}\underset{J({\pi}_{\theta})}{\underset{\u23df}{{\mathbb{E}}_{{\pi}_{\theta}}\left[\sum _{t=0}^{\mathrm{\infty}}r({s}_{t},{a}_{t})\right]}}.$$  (6) 
We define $r(\tau )={\sum}_{t=0}^{\mathrm{\infty}}r({s}_{t},{a}_{t})$ as the cumulative reward of trajectory $\tau ={\{({s}_{t},{a}_{t})\}}_{i=0}^{\mathrm{\infty}}$, which is sampled by acting according to the policy $\pi $, i.e. $(s,a)\sim \pi (\cdot s),$ and ${s}_{t+1}\sim P(\cdot {s}_{t},{a}_{t})$. Learning in this environment can then be modeled by reinforcement learning [Sutton and Barto, 2018], where an agent interacts with an environment over a number of (discrete) time steps $t$. At each time step $t$, the agent finds itself in a state ${s}_{t}$ and selects an action ${a}_{t}$ according to the policy $\pi ({a}_{t}{s}_{t})$. In return, the environment transitions to the next state ${s}_{t+1}$ and generates a scalar reward ${r}_{t}$. This process continues until the agent reaches a terminal state after which the process restarts. The goal of the agent is to maximize the expected return from each state ${s}_{t}$, which is typically defined as the infinite horizon discounted sum of the rewards. A common choice to achieving this is QLearning [Watkins and Dayan, 1992], where we make use of an actionvalue function that is defined as the discounted sum of rewards $Q(\tau )={\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}r({s}_{t},{a}_{t})$, where $\gamma \in (0,1]$ is a discount factor. Learning the optimal policy can be achieved in many ways. Here, we consider Policy gradient methods [Sutton et al., 2000] which are a popular choice to tackle continuous reinforcement learning problems. The main idea is to directly manipulate the parameters $\theta $ of the policy in order to maximize the objective $J({\pi}_{\theta})$ by taking steps in the direction of the gradient ${\nabla}_{\theta}J({\pi}_{\theta})$.
In meta reinforcement learning the problem is given by a set of tasks ${t}_{i}\in T$, where each task ${t}_{i}$ is defined by an MDP ${t}_{i}=(\mathcal{S},\mathcal{A},{P}_{i},{r}_{i})$ as described earlier. We are now interested in finding a set of policies $\mathrm{\Theta}$ that maximizes the average cumulative reward across all tasks in $T$ and generalizes well to new tasks sampled from a different set of tasks ${T}^{\prime}$.
[t!] {algorithmic}[1] \StateInput: Data Distribution $p(\mathcal{D})$, number of samples $K$, batchsize $M$, training episodes $N$ \StateHyperparameters: resource parameters ${\beta}_{1}$, ${\beta}_{2}$, learning rates ${\eta}_{s}$, ${\eta}_{x}$ for selector and experts \StateInitialize parameters $\theta ,\vartheta $ \For$i$ = 0, 1, 2, …, $N$ \StateSample batch of $M$ datasets ${D}_{i}\sim p(\mathcal{D})$, each consisting of a training dataset ${D}_{\text{metatrain}}$ and a metavalidation dataset ${D}_{\text{metaval}}$ with $2K$ samples each \For$D\in {D}_{i}$ \StateFind Latent Embedding $z({D}_{\text{metatrain}})$ \StateSelect expert $x\sim {p}_{\theta}(xz({D}_{\text{metatrain}}))$ \StateCompute $\mathrm{\Delta}{F}_{D,x}$ of $x$ on ${D}_{\text{metaval}}$ \EndFor \StateUpdate selection parameters $\theta $ with $\mathrm{\Delta}{F}_{D,x}$ \StateUpdate Autoencoder with pos. samples in ${D}_{i}$ \StateUpdate experts $x$ with assigned ${D}_{\text{metatrain}}$ \EndFor \Statereturn $\theta $, $\vartheta $
3 Expert Networks for MetaLearning
Informationtheoretic bounded rationality postulates that hierarchies and abstractions emerge when agents have only limited access computational resources [Genewein et al., 2015, Gottwald and Braun, 2019b, Gottwald and Braun, 2019a], e.g. limited sampling complexity [Hihn et al., 2018] or limited representational power [Hihn et al., 2019]. We will show that forming such abstractions equips an agent with the ability of learning the underlying problem structure and thus enables learning of unseen but similar concepts. The method we propose comes out of a unified optimization principle and has the following important features:

1.
A regularization mechanism to enforce the emergence of expert policies.

2.
A task compression mechanism to extract relevant task information.

3.
A selection mechanism to find the most efficient expert for a given task.

4.
A regularization mechanism to improve generalization capabilities.
3.1 Latent Task Embeddings
Note that the selector assigns a complete dataset to an expert and that this can be seen as a metalearning task, as described in [Ravi and Larochelle, 2017]. To do so, we must find a feature vector $z(d)$ of the dataset $d$. This feature vector must fulfill the following desiderata: 1) invariance against permutation of data points in $d$, 2) high representational capacity, 3) efficient computability, and 4) constant dimensionality regardless of sample size $K$. In the following we propose such features for image classification, regression, and reinforcement learning problems.
For image classification we propose to pass the positive images in the dataset through a convolutional autoencoder and use the respective outputs of the bottleneck layer. Convolutional Autoencoders are generative models that learn to reconstruct their inputs by minimizing the MeanSquaredError between the input and the reconstructed image (see e.g. [Chen et al., 2019]). In this way we get similar embeddings $z(d)$ for similar inputs belonging to the same class. The latent representation is computed for each positive sample in $d$ and then passed through a pooling function $h(z(d))$ to find a single embedding for the complete dataset–see figure 2 for an overview of our proposed model. While in principle functions such as mean, max, and min can be used, we found that max pooling yields the best results. The authors of [Yao et al., 2019] propose a similar feature set.
For regression we define a similar feature vector. The $K$ training data points are transformed into a feature vector $z(d)$ by binning the points into $N$ bins according to their respective $x$ value and collecting the respective $y$ value. If more than one point falls into the same bin the $y$ values are averaged, thus providing invariance against the order of the data points in ${D}_{\text{metatrain}}$. We use this feature vector to assign each data set to an expert according to ${p}_{\theta}(xz(d))$.
In the reinforcement learning setting we use a dynamic recurrent neural network (RNN) with LSTM units [Hochreiter and Schmidhuber, 1997] to classify trajectories. We feed the RNN with $({s}_{t},{a}_{t},{r}_{t},t)$ tuples to describe the underlying Markov Decision Process describing the task. At $t=0$ we sample the expert $x$ according to the learned prior distribution $p(x)$, as there is no information available so far. The authors of [Lan et al., 2019] propose a similar feature set.
Omniglot FewShot Classification Results  
Number of Experts  
K  2  4  8  16  
% Acc  I(X;W)  % Acc  I(X;W)  % Acc  I(X;W)  % Acc  I(X;W)  
1  76.2 ($\pm $ 0.02)  0.99 ($\pm $ 0.01)  86.7 ($\pm $ 0.02)  1.96 ($\pm $ 0.01)  90.1 ($\pm $ 0.01)  2.5 ($\pm $ 0.20)  92.9 ($\pm $ 0.01)  3.2 ($\pm $ 0.3) 
5  67.3 ($\pm $ 0.01)  0.93 ($\pm $ 0.01)  75.5 ($\pm $ 0.01)  1.95 ($\pm $ 0.10)  78.4 ($\pm $ 0.01)  2.7 ($\pm $ 0.10)  81.2 ($\pm $ 0.01)  3.3 ($\pm $ 0.2) 
10  66.4 ($\pm $ 0.04)  0.95 ($\pm $ 0.30)  75.8 ($\pm $ 0.01)  1.90 ($\pm $ 0.03)  77.3 ($\pm $ 0.01)  2.8 ($\pm $ 0.15)  77.8 ($\pm $ 0.01)  3.1 ($\pm $ 0.2) 
3.2 Hierarchical Online MetaLearning
As discussed in section 2.1, the aim of the selection network is to find an optimal partition of the experts ${p}_{\theta}(xs)$, such that the selector’s expected utility ${\sum}_{x}{p}_{\theta}(xs)\mathrm{\Delta}{\text{F}}_{\text{par}}(s,x)$ is maximized under an informationtheoretic constraint ${\text{D}}_{\text{KL}}({p}_{\theta}(xs)p(x))$, where $\theta $ are the selector’s parameters (e.g. weights in a neural network), $x$ the expert and $s$ is an input. Each expert $x$ follows a policy ${p}_{\vartheta}(as,x)$ that maximizes their expected utility $\sum p(sx)\sum {p}_{\theta}(as,x)\mathbf{U}(s,a)$. We introduce our gradient based online learning algorithm to find the optimal partitioning and the expert parameters in the following. Rewriting the optimization problem (4) as
$$\underset{{p}_{\vartheta}(as,x),{p}_{\theta}(xs)}{\mathrm{max}}\sum _{s,x,a}p(s){p}_{\theta}(xs){p}_{\vartheta}(as,x)J(s,x,a)$$  (7) 
where the objective $J(s,x,a)$ is given by
$$J(s,x,a)=\mathbf{U}(s,a)\frac{1}{{\beta}_{1}}\mathrm{log}\frac{{p}_{\theta}(xs)}{p(x)}\frac{1}{{\beta}_{2}}\mathrm{log}\frac{{p}_{\vartheta}(as,x)}{p(ax)},$$  (8) 
and $\theta ,\vartheta $ are the parameters of the selection policy and the expert policies, respectively. Note that each expert policy has a distinct set of parameters ${\vartheta}_{x}$, i.e. $\vartheta ={\{{\vartheta}_{x}\}}_{x}$, but we drop the $x$ index for readability. In the following we will show how we can apply this formulation to classification, regression and reinforcement learning.
3.2.1 Application to Supervised Learning
Combining multiple experts can often be beneficial [Kuncheva, 2004], e.g. in MixtureofExperts [Yuksel et al., 2012] or Multiple Classifier Systems [Bellmann et al., 2018]. Our method can be interpreted as a member of this family of algorithms.
In accordance with Section 2.1 we define the utility as the negative prediction loss, i.e. $U({f}_{x}(d),y)=\mathcal{L}({f}_{x}(d),y)$, where ${f}_{x}(d)$ is the prediction of the expert $x$ given the input data point $d$ (in the following we will use the shorthand ${\widehat{y}}_{x}$) and $y$ is the ground truth. We define the crossentropy loss $\mathcal{L}({\widehat{y}}_{x},y,)={\sum}_{i}{y}_{i}\mathrm{log}{\widehat{{y}_{i}}}_{x}$ as a performance measure for classification and the mean squared error $\mathcal{L}({\widehat{y}}_{x},y)={\sum}_{i}{({\widehat{y}}_{i}{y}_{i})}^{2}$ for regression. The objective for expert selection thus is given by
$$\underset{\theta}{\mathrm{max}}{\mathbb{E}}_{{p}_{\theta}(xd)}\left[\widehat{f}\frac{1}{{\beta}_{1}}\mathrm{log}\frac{{p}_{\theta}(xd)}{p(x)}\right],$$  (9) 
where $\widehat{f}={\mathbb{E}}_{{p}_{\vartheta}({\widehat{y}}_{x}x,s)}\left[\mathcal{L}({\widehat{y}}_{x},y)\frac{1}{{\beta}_{2}}\mathrm{log}\frac{{p}_{\vartheta}({\widehat{y}}_{x}s,x)}{p({\widehat{y}}_{x}x)}\right]$, i.e. the free energy of the expert and $\theta ,\vartheta $ are the parameters of the selection policy and the expert policies, respectively. Analogously, the action selection objective for each expert $x$ is defined by
$$\underset{\vartheta}{\mathrm{max}}{\mathbb{E}}_{{p}_{\vartheta}({\widehat{y}}_{x}x,s)}\left[\mathcal{L}({\widehat{y}}_{x},y)\frac{1}{{\beta}_{2}}\mathrm{log}\frac{{p}_{\vartheta}({\widehat{y}}_{x}s,x)}{{p}_{(}{\widehat{y}}_{x}x)}\right].$$  (10) 
3.2.2 Application to Reinforcement Learning
In the reinforcement learning setup the utility $\mathbf{U}(s,a)$ is given by the reward function $r(s,a)$. In maximum entropy RL the regularization penalizes deviation from a fixed uniform prior, but in a more general setting we can discourage deviation from an arbitrary prior policy by determining the optimal policy ${p}^{*}(as)$ as
$$\underset{p}{\mathrm{arg}\mathrm{max}}{\mathbb{E}}_{p}\left[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}\left(r({s}_{t},{a}_{t})\frac{1}{\beta}\mathrm{log}\frac{p({a}_{t}{s}_{t})}{p({a}_{t})}\right)\right].$$  (11) 
As discussed in Section 2.1, the optimal prior is the marginal of the posterior policy given by $p(a)={\sum}_{s}p(s)p(as)$. We approximate the prior distributions $p(x)$ and $p(ax)$ by exponential running mean averages of the posterior policies.
To optimize the objective we define two separate value functions: one to estimate the discounted sum of rewards and one to estimate the free energy of the expert policies. The discounted reward for the experts is ${R}_{t}={\sum}_{l=0}^{T}{\gamma}^{l}r({s}_{t+l},{a}_{t+l}),$ which we learn by parameterizing the value function with a neural network. Similar to the discounted reward ${R}_{t}$ we can now define the discounted free energy ${F}_{t}$ as ${F}_{t}={\sum}_{l=0}^{T}{\gamma}^{l}f({s}_{t+l},{x}_{t+l},{a}_{t+l}),$ where $f(s,x,a)=r(s,a)\frac{1}{{\beta}_{2}}\mathrm{log}\frac{{p}_{\vartheta}(as,x)}{p(ax)}$. The value function ${F}_{t}$ is learned by parameterizing it with a neural network and performing regression on the meansquarederror.
3.2.3 Expert Selection
The selector network learns a policy ${p}_{\theta}(xs)$ that assigns states $s$ to expert policies $x$ optimally. The resource parameter ${\beta}_{1}$ constrains the informationprocessing in this step. For ${\beta}_{1}\to 0$ the selection assigns each state completely randomly to an expert, while for ${\beta}_{1}\to \mathrm{\infty}$ the selection becomes deterministic, always choosing the most promising expert $x$. The selector optimizes the following objective:
$$\underset{\theta}{\mathrm{max}}{\mathbb{E}}_{{p}_{\theta}(xs)}\left[\widehat{f}(s,x)\frac{1}{{\beta}_{1}}\mathrm{log}\frac{{p}_{\theta}(xs)}{p(x)}\right],$$  (12) 
where $\widehat{f}(s,x)={\mathbb{E}}_{{p}_{\vartheta}(as,x)}[f(s,x,a)]$, which is the free energy of the expert. The gradient of $J(\theta )$ is then given (up to an additive constant) by
$$\mathbb{E}\left[{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(xs)\left(\widehat{f}(s,x)\frac{1}{{\beta}_{1}}\mathrm{log}\frac{{p}_{\theta}(xs)}{p(x)}\right)\right].$$  (13) 
The double expectation can be replaced by Monte Carlo estimates, where in practice we use a single $(s,x,a)$ tuple for $\widehat{f}(s,x)$. This formulation is known as the policy gradient method [Sutton et al., 2000] and is prone to producing high variance gradients, but can be reduced by using an advantage function instead of the reward [Schulman et al., 2015]. The advantage function $A({a}_{t},{s}_{t})$ is a measure of how well a certain action $a$ performs in a state $s$ compared to the average performance in that state, i.e. $A(a,s)=Q(s,a)V(s)$. Here, $V(s)$ is called the value function and captures the expected cumulative reward when in state $s$, and $Q(s,a)$ is an estimate of the expected cumulative reward achieved in state $s$ when choosing a particular action $a$. Thus the advantage is an estimate of how advantageous it is to pick $a$ in state $s$ in relation to a baseline performance $V(s)$. Instead of learning $V$ and $Q$, we can approximate the advantage function
$${\widehat{A}}_{t}^{F}=\underset{\approx Q({s}_{t},{a}_{t})}{\underset{\u23df}{f({s}_{t},{x}_{t},{a}_{t})+\gamma {V}_{\varphi}({s}_{t+1})}}{V}_{\varphi}({s}_{t}),$$  (14) 
such that we can get away with just learning a single value function ${V}_{\varphi}({s}_{t})$. Both the selection network and the selector value network are implemented as recurrent neural networks with LSTM cells [Hochreiter and Schmidhuber, 1997]. Both networks share the recurrent cell followed by independent feed forward layers.
3.2.4 Action Selection
The actions $a$ is sampled from the posterior action distribution of the experts. Each expert $x$ maintains a policy for each of the world states $s$ and updates those according to the utility/cost tradeoff. The advantage function for each expert is given as
$${\widehat{A}}_{t}=r({s}_{t},{a}_{t})+\gamma {V}_{\phi}({s}_{t+1}){V}_{\phi}({s}_{t}).$$  (15) 
The objective of this stage is then to maximize the expected advantage ${\mathbb{E}}_{{p}_{\vartheta}(as,x)}\left[{\widehat{A}}_{t}\right]$.
4 Empirical Results
4.1 Sinusoid Regression
We adopt this task from [Finn et al., 2017]. In this $K$shot problem, each task consists of learning to predict a function of the form $y=a\cdot \mathrm{sin}(x+b)$, with both $a\in [0.1,5]$ and $b\in [0,2\pi ]$ chosen uniformly, and the goal of the learner is to find $y$ given $x$ based on only $K$ pairs of $(x,y)$. Given that the underlying function changes in each iteration it is impossible to solve this problem with a single learner. Our results show that by combing expert networks, we are able to reduce the generalization error iteratively as we add more experts to our system–see Figures 5 for $K=5$ and $K=10$ settings. In Figure 4 we show how the system is able to capture the underlying problem structure as we add more experts and in Figure 3 we visualize how the selector’s partition of the problem space looks like.
4.2 FewShot Classification
The Omniglot dataset [Lake et al., 2011] consists of over 1600 characters from 50 alphabets. As each character has merely 20 samples each drawn by a different person, this forms a difficult learning task and is thus often referred to as the ”transposed MNIST” dataset. The Omniglot dataset is regarded as a standard meta learning benchmark, see e.g. [Finn et al., 2017, Vinyals et al., 2016, Ravi and Larochelle, 2017].
We train the learner on a subset of the dataset ($\approx 80\%$, i.e. $\approx $ 1300 classes) and evaluate on the remaining $\approx 300$ classes, thus investigating the ability to generalize to new data. In each round we build the datasets ${D}_{\text{metatrain}}$ and ${D}_{\text{metatest}}$ by selecting a target class ${c}_{t}$ and sample $K$ positive and $K$ negative samples. To generate negative samples we draw $K$ images randomly out of the remaining $N1$ classes. We present the selection network with the feature presentation of the $K$ positive training samples (see Figure 2), but evaluate the experts’ performance on the $2K$ test samples in ${D}_{\text{metatest}}$. In this way the free energy of the experts becomes a measure of how well the expert is able to generalize to new samples of the target class and distinguish them from negative examples. Using this optimization scheme, we train the expert networks to become experts in recognizing a subset of classes. After a suitable expert is selected we train that expert using the $2K$ samples from the training dataset—see Figure 5 and Table 1 for results. To generate this figure, we ran a 10fold crossvalidation on the whole dataset and show the averaged performance metric and the respective standarddeviation across the folds. In both settings ”0 bits” corresponds to a single expert, i.e. a single neural network trained on the task.
4.3 Meta Reinforcement Learning
Task Distribution  
Paramater  $T$  ${T}^{\prime}$ 
Distance Penalty  [${10}^{3},{10}^{1}$]  [${10}^{3},{10}^{2}$] 
Goal Position  [0.3, 0.4]  [0, 3] 
Start Position  [0.15, 0.15]  [0.25, 0.25] 
Motor Torques  $[0,5]$  $[0,3]$ 
Motor Actuation  [185, 215]  [175, 225] 
Inverted Control  $p=0.5$  $p=0.5$ 
Gravity  [0.01, 4.9]  [4.9, 9.8] 
We create a set of RL tasks by sampling the parameters for the Inverted Double Pendulum problem [Sutton, 1996] implemented in OpenAI Gym [Brockman et al., 2016]. The task is to balance a twolink pendulum in an upward position. We modify inertia, motor torques, reward function, goal position and invert the control signal – see Table 2 for details. The control signal $a$ is continuous in the interval [1,1] is generated by neural network that outputs $\mu $ and $\mathrm{log}(\sigma )$ of a gaussian. The action is sampled by reparameterizing the distribution to $p(a)=\mu +\mathrm{exp}(\sigma )\u03f5$, where $\u03f5\sim \mathcal{N}(0,1)$, so that the distribution is differentiable w.r.t to the network outputs.
The meta task set ${T}^{\prime}$ is based on the same environment, but the parameter distribution and range is different, providing new but similar reinforcement learning problems. In each episode $M$ environments are sampled and the system is updated accordingly. After training is concluded the system is evaluated on tasks sampled from ${T}^{\prime}$. We trained the system for 1000 Episodes with 64 tasks from $T$ and evaluate for 100 system updates on tasks from ${T}^{\prime}$. We report the results in Figure 6, where we can see improving performance as more experts are added and the mutual information in the selection stage indicates that the tasks can be assigned to their respective expert policy.
5 Discussion
We have introduced and evaluated a novel informationtheoretic approach to meta learning. In particular we leveraged an informationtheoretic approach to bounded rationality [Leibfried et al., 2017, GrauMoya et al., 2019, Hihn et al., 2019, Schach et al., 2018, Gottwald and Braun, 2019b, LindigLeon et al., 2019]. Our results show that our method is able to identify subregions of the problem set with expert networks. In effect, this equips the system with several initializations covering the problem space and thus enables it to adapt quickly to new but similar tasks. To reliably identify such tasks, we have proposed feature extraction methods for classification, regression and reinforcement learning, that could be simply be replaced and improved in the future. The strength of our model is that it follows from simple principles that can be applied to a large range of problems. Moreover, the system performance can be interpreted in terms of the information processing of the selection stage and the expert decisionmakers.
Most other methods for meta learning such as [Finn et al., 2017] and [Ravi and Larochelle, 2017] try to find a initial parametrization of a single learner, such that it is able to adapt quickly to new problems. This initialization can be interpreted as compression of the most common task properties over all tasks. Our method however learns to identify task properties over a subset of tasks and provide several initializations. Task specific information is thus directly available instead of a delayed availability after several iterations as in [Finn et al., 2017] and [Ravi and Larochelle, 2017]. In principle, this can help to adapt within fewer iterations. Thus our method can be seen as the general case of such monolithic metalearning algorithms.
Another hierarchical approach to metalearning is the work of [Yao et al., 2019], where the focus is on learning similarities between completely different problems (e.g. different classification datasets). In this way the portioning is largely governed by the different tasks. Our study however focuses on discovering metainformation within the same task family, where the metapartitioning is determined solely by the optimization process and can thus potentially discover unknown dynamics and relations within a task family.
Although our method is widely applicable, it suffers from low sample efficiency in the RL domain. An interesting research direction would be to combine our system with modelbased RL which is known improve sample efficiency. Another research direction would be to investigate our systems performance in continual adaption tasks, such as in [Yao et al., 2019]. There the system is continuously provided with data sets (e.g. additional classes and samples). Another limitation is the restriction to binary meta classification tasks, which we leave for feature work.
Acknowledgments
This work was supported by the European Research Council Starting Grant BRISC, ERCSTG2015, Project ID 678082.
References
 [Arimoto, 1972] Arimoto, S. (1972). An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20.
 [Bellmann et al., 2018] Bellmann, P., Thiam, P., and Schwenker, F. (2018). Multiclassifiersystems: Architectures, algorithms and applications. In Computational Intelligence for Pattern Recognition, pages 83–113. Springer.
 [Blahut, 1972] Blahut, R. E. (1972). Computation of channel capacity and ratedistortion functions. IEEE Transactions on Information Theory, 18(4):460–473.
 [Botvinick et al., 2019] Botvinick, M., Ritter, S., Wang, J. X., KurthNelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in cognitive sciences.
 [Braun et al., 2009] Braun, D. A., Aertsen, A., Wolpert, D. M., and Mehring, C. (2009). Motor task variation induces structural learning. Current Biology, 19(4):352–357.
 [Brockman et al., 2016] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
 [Caruana, 1997] Caruana, R. (1997). Multitask learning. Machine learning, 28(1):41–75.
 [Chen et al., 2019] Chen, W.Y., Liu, Y.C., Kira, Z., Wang, Y.C. F., and Huang, J.B. (2019). A closer look at fewshot classification. International Conference on Representation Learning.
 [Devlin et al., 2018] Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
 [Finn et al., 2017] Finn, C., Abbeel, P., and Levine, S. (2017). Modelagnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1126–1135. JMLR. org.
 [Genewein et al., 2015] Genewein, T., Leibfried, F., GrauMoya, J., and Braun, D. A. (2015). Bounded rationality, abstraction, and hierarchical decisionmaking: An informationtheoretic optimality principle. Frontiers in Robotics and AI, 2:27.
 [Gottwald and Braun, 2019a] Gottwald, S. and Braun, D. A. (2019a). Bounded rational decisionmaking from elementary computations that reduce uncertainty. Entropy, 21(4).
 [Gottwald and Braun, 2019b] Gottwald, S. and Braun, D. A. (2019b). Systems of bounded rational agents with informationtheoretic constraints. Neural computation, 31(2):440–476.
 [GrauMoya et al., 2019] GrauMoya, J., Leibfried, F., and Vrancx, P. (2019). Soft qlearning with mutualinformation regularization. International Conference on Learning Representations.
 [Hihn et al., 2018] Hihn, H., Gottwald, S., and Braun, D. A. (2018). Bounded rational decisionmaking with adaptive neural network priors. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition, pages 213–225. Springer.
 [Hihn et al., 2019] Hihn, H., Gottwald, S., and Braun, D. A. (2019). An informationtheoretic online learning principle for specialization in hierarchical decisionmaking systems. arXiv preprint arXiv:1907.11452.
 [Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long shortterm memory. Neural computation, 9(8):1735–1780.
 [Jankowski et al., 2011] Jankowski, N., Duch, W., and Grkabczewski, K. (2011). Metalearning in computational intelligence, volume 358. Springer Science & Business Media.
 [Koch et al., 2015] Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for oneshot image recognition. In ICML deep learning workshop, volume 2.
 [Kuncheva, 2004] Kuncheva, L. I. (2004). Combining pattern classifiers: methods and algorithms. John Wiley & Sons.
 [Lake et al., 2011] Lake, B., Salakhutdinov, R., Gross, J., and Tenenbaum, J. (2011). One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33.
 [Lake et al., 2015] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
 [Lan et al., 2019] Lan, L., Li, Z., Guan, X., and Wang, P. (2019). Meta reinforcement learning with task embedding and shared policy. International Joint Conference on Artificial Intelligence.
 [Leibfried et al., 2017] Leibfried, F., GrauMoya, J., and Ammar, H. B. (2017). An informationtheoretic optimality principle for deep reinforcement learning. Deep Reinforcement Learning Workshop NIPS 2018.
 [LindigLeon et al., 2019] LindigLeon, C., Gottwald, S., and Braun, D. A. (2019). Analyzing abstraction and hierarchical decisionmaking in absolute identification by informationtheoretic bounded rationality. Frontiers in Neuroscience, 13:1230.
 [Mnih et al., 2015] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Humanlevel control through deep reinforcement learning. Nature, 518(7540):529.
 [Ortega and Braun, 2013] Ortega, P. A. and Braun, D. A. (2013). Thermodynamics as a theory of decisionmaking with informationprocessing costs. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 469(2153).
 [Ortega et al., 2019] Ortega, P. A., Wang, J. X., Rowland, M., Genewein, T., KurthNelson, Z., Pascanu, R., Heess, N., Veness, J., Pritzel, A., Sprechmann, P., et al. (2019). Metalearning of sequential strategies. arXiv preprint arXiv:1905.03030.
 [Ravi and Larochelle, 2017] Ravi, S. and Larochelle, H. (2017). Optimization as a model for fewshot learning. International Conference on Learning Representations.
 [Schach et al., 2018] Schach, S., Gottwald, S., and Braun, D. A. (2018). Quantifying motor task performance by bounded rational decision theory. Frontiers in neuroscience, 12.
 [Schmidhuber, 2015] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural networks, 61:85–117.
 [Schmidhuber et al., 1997] Schmidhuber, J., Zhao, J., and Wiering, M. (1997). Shifting inductive bias with successstory algorithm, adaptive levin search, and incremental selfimprovement. Machine Learning, 28(1):105–130.
 [Schulman et al., 2015] Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015). Highdimensional continuous control using generalized advantage estimation. International Conference on Learning Representations.
 [Snell et al., 2017] Snell, J., Swersky, K., and Zemel, R. (2017). Prototypical networks for fewshot learning. In Advances in Neural Information Processing Systems, pages 4077–4087.
 [Sutton, 1996] Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in neural information processing systems, pages 1038–1044.
 [Sutton and Barto, 2018] Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
 [Sutton et al., 2000] Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
 [Thrun and Pratt, 2012] Thrun, S. and Pratt, L. (2012). Learning to learn. Springer Science & Business Media.
 [Vezhnevets et al., 2019] Vezhnevets, A. S., Wu, Y., Leblond, R., and Leibo, J. (2019). Options as responses: Grounding behavioural hierarchies in multiagent rl. arXiv preprint arXiv:1906.01470.
 [Vinyals et al., 2016] Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016). Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638.
 [Von Neumann and Morgenstern, 2007] Von Neumann, J. and Morgenstern, O. (2007). Theory of games and economic behavior (commemorative edition). Princeton university press.
 [Watkins and Dayan, 1992] Watkins, C. J. and Dayan, P. (1992). Qlearning. Machine learning, 8(34):279–292.
 [Yao et al., 2019] Yao, H., Wei, Y., Huang, J., and Li, Z. (2019). Hierarchically structured metalearning. In International Conference on Machine Learning, pages 7045–7054.
 [Yuksel et al., 2012] Yuksel, S. E., Wilson, J. N., and Gader, P. D. (2012). Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8):1177–1193.
 [Zintgraf et al., 2018] Zintgraf, L. M., Shiarlis, K., Kurin, V., Hofmann, K., and Whiteson, S. (2018). Caml: Fast context adaptation via metalearning. Internationcal Conference on Learning Representations.