Abstract
I consider two problems in machine learning and statistics: the problem ofestimating the joint probability density of a collection of random variables,known as density estimation, and the problem of inferring model parameters whentheir likelihood is intractable, known as likelihoodfree inference. Thecontribution of the thesis is a set of new methods for addressing theseproblems that are based on recent advances in neural networks and deeplearning.
Quick Read (beta)
Abstract
I consider two problems in machine learning and statistics: the problem of estimating the joint probability density of a collection of random variables, known as density estimation, and the problem of inferring model parameters when their likelihood is intractable, known as likelihoodfree inference. The contribution of the thesis is a set of new methods for addressing these problems that are based on recent advances in neural networks and deep learning.
The first part of the thesis is about density estimation. The joint probability density of a collection of random variables is a useful mathematical description of their statistical properties, but can be hard to estimate from data, especially when the number of random variables is large. Traditional densityestimation methods such as histograms or kernel density estimators are effective for a small number of random variables, but scale badly as the number increases. In contrast, models for density estimation based on neural networks scale better with the number of random variables, and can incorporate domain knowledge in their design. My main contribution is Masked Autoregressive Flow, a new model for density estimation based on a bijective neural network that transforms random noise to data. At the time of its introduction, Masked Autoregressive Flow achieved stateoftheart results in generalpurpose density estimation. Since its publication, Masked Autoregressive Flow has contributed to the broader understanding of neural density estimation, and has influenced subsequent developments in the field.
The second part of the thesis is about likelihoodfree inference. Typically, a statistical model can be specified either as a likelihood function that describes the statistical relationship between model parameters and data, or as a simulator that can be run forward to generate data. Specifying a statistical model as a simulator can offer greater modelling flexibility and can produce more interpretable models, but can also make inference of model parameters harder, as the likelihood of the parameters may no longer be tractable. Traditional techniques for likelihoodfree inference such as approximate Bayesian computation rely on simulating data from the model, but often require a large number of simulations to produce accurate results. In this thesis, I cast the problem of likelihoodfree inference as a densityestimation problem, and address it with neural density models. My main contribution is the introduction of two new methods for likelihoodfree inference: Sequential Neural Posterior Estimation (Type A), which estimates the posterior, and Sequential Neural Likelihood, which estimates the likelihood. Both methods use a neural density model to estimate the posterior/likelihood, and a sequential training procedure to guide simulations. My experiments show that the proposed methods produce accurate results, and are often orders of magnitude faster than alternative methods based on approximate Bayesian computation.
Acknowledgements
I’m grateful to Iain Murray for supervising me during my PhD. Iain has had a massive impact on my research, my thinking, and my professional development; his contributions extend far beyond this thesis. I’m also grateful to John Winn and Chris Williams, who served on my PhD committee and gave me useful feedback in our annual meetings.
I’m grateful to my coauthors Theo Pavlakou and David Sterratt for their help and contribution. Conor Durkan, Maria Gorinova and Amos Storkey generously read parts of this thesis and gave me useful feedback. I’m grateful to Michael Gutmann and Kyle Cranmer for our discussions, which shaped part of the thesis, and to Johann Brehmer for fixing an important error in my code.
I’d like to extend a special thank you to the developers of Theano (AlRfou et al., 2016), which I used in all my experiments. Their contribution to machinelearning research as a whole has been monumental, and they deserve significant credit for that.
I’m grateful to the Centre for Doctoral Training in Data Science, the Engineering and Physical Sciences Research Council, the University of Edinburgh and Microsoft Research, whose generous funding and support enabled me to pursue a PhD.
Declaration
I declare that this thesis was composed by me, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.
(George Papamakarios)
University of Edinburgh
School of Informatics
Doctor of Philosophy
Neural Density Estimation and Likelihoodfree Inference
by
George Papamakarios
April 2019
Contents
 1 Introduction
 I Neural density estimation
 II Likelihoodfree inference
Chapter 1 Introduction
Density estimation and likelihoodfree inference are two fundamental problems of interest in machine learning and statistics; they lie at the core of probabilistic modelling and reasoning under uncertainty, and as such they play a significant role in scientific discovery and artificial intelligence. The goal of this thesis is to develop a new set of methods for addressing these two problems, based on recent advances in neural networks and deep learning.
The first part of the thesis is about density estimation, the problem of estimating the joint probability density of a collection of random variables from samples. In a sense, density estimation is the reverse of sampling: in density estimation, we are given samples and we want to retrieve the density function from which the samples were generated; in sampling, we are given a density function and we want to generate samples from it.
Density estimation addresses one of the most fundamental problems in machine learning, the problem of discovering structure from data in an unsupervised manner. A density function is a complete description of the joint statistical properties of the data, and in that sense a model of the density function can be viewed as a model of data structure. As such, a model of the density function can be used in a variety of downstream tasks that involve knowledge of data structure, such as inference, prediction, data completion and data generation.
The second part of the thesis is about statistical inference, the problem of inferring parameters of interest from observations given a model of their statistical relationship. In this work, I adopt a Bayesian framework for statistical inference: beliefs over unknown quantities are represented by density functions, and Bayes’ rule is used to update prior beliefs given new observations. Bayesian inference is one of the two main approaches to statistical inference (the other one being frequentist inference), and is widely used in science and engineering. From now on, if not made specific, the term ‘inference’ will refer to ‘Bayesian inference’.
Likelihoodfree inference refers to the situation where the likelihood of the model is too expensive to evaluate, which is typically the case when the model is specified as a simulator that stochastically generates observations given parameters. Such simulators are used ubiquitously in science and engineering for modelling complex mechanistic processes of the real world; as a result, several scientific and engineering problems can be framed as likelihoodfree inference of a simulator’s parameters. The goal of likelihoodfree inference is to compute posterior beliefs over parameters using simulations from the model rather than likelihood evaluations.
In this thesis, I approach both density estimation and likelihoodfree inference as machinelearning problems; in either case, the task is to estimate a density model from data. The methods developed in this thesis are heavily based on neural networks and deep learning. The use of deep learning is motivated by two reasons. First, neural networks have demonstrated stateoftheart performance in a wide variety of machinelearning tasks, such as computer vision (Krizhevsky et al., 2012), naturallanguage processing (Devlin et al., 2018), generative modelling (Radford et al., 2016), and reinforcement learning (Silver et al., 2016) — as we will see in this thesis, neural networks advance the state of the art in density estimation and likelihoodfree inference too. Second, deep learning is actively supported by software frameworks such as Theano (AlRfou et al., 2016), TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2017), which provide access to powerful hardware (such as graphicsprocessing units) and facilitate designing and training neural networks in practice. This thesis focuses on the application of neural networks to density estimation and likelihoodfree inference, and not on deep learning itself; knowledge of deep learning is generally assumed, but not absolutely required. A comprehensive review of neural networks and deep learning is provided by Goodfellow et al. (2016).
1.1 List of contributions
The main contribution of this thesis is a set of new methods for density estimation and likelihoodfree inference, which are based on techniques from neural networks and deep learning. In particular, the new methods contributed by this thesis are the following:

1.
Masked Autoregressive Flow, an expressive neural density model whose density is tractable to evaluate. MAF can be trained on data to estimate their underlying density function. At the time of its introduction, MAF achieved stateoftheart performance in density estimation, and has been influential ever since.

2.
Sequential Neural Posterior Estimation (Type A), a method for likelihoodfree inference of simulator models that is based on neural density estimation. SNPEA trains a neural density model on simulated data to approximate the posterior density of the parameters given observations. The designation ‘Type A’ is in order to distinguish SNPEA from its ‘TypeB’ variant, which was proposed later. SNPEA was shown to both improve accuracy and dramatically reduce the required number of simulations compared to traditional methods for likelihoodfree inference.

3.
Sequential Neural Likelihood, an alternative method for likelihoodfree inference that uses a neural density model to approximate the likelihood instead of the posterior. SNL can be as fast and accurate as SNPEA, but is more generally applicable and significantly more robust. Unlike SNPEA, SNL can be used with Masked Autoregressive Flow.
1.2 Structure of the thesis
The thesis is divided into two parts: part I is about neural density estimation, whereas part II is about likelihoodfree inference. Each part has a separate introduction and a separate conclusion. The two parts are intended to be standalone, and can be read independently. However, the part on likelihoodfree inference makes heavy use of densityestimation techniques, hence, although not absolutely necessary, I would recommend that the part on density estimation be read first.
The thesis consists of three chapters, each of which is centred on a different published paper. Each chapter includes the paper as published; in addition, it provides extra background in order to motivate the paper, and evaluates the contribution and impact of the paper since its publication. The three chapters are listed below:
Chapter 2 is about Masked Autoregressive Flow, and is based on the following paper:
Chapter 3 is about Sequential Neural Posterior Estimation (Type A), and is based on the following paper:
Chapter 4 is about Sequential Neural Likelihood, and is based on the following paper:
Part I Neural density estimation
Chapter 2 Masked Autoregressive Flow for Density Estimation
This chapter is devoted to density estimation, and how we can use neural networks to estimate densities. We start by explaining what density estimation is, why it is useful, and what challenges it presents (section 2.1). We then review some standard methods for density estimation, such as mixture models, histograms and kernel density estimators, and introduce the idea of neural density estimation (section 2.2). The main contribution of this chapter is the paper Masked Autoregressive Flow of Density Estimation, which presents Masked Autoregressive Flow, a new model for density estimation (sections 2.3 and 2.4). We conclude the chapter by reviewing advances in neural density estimation since the publication of the paper (section 2.5), and by discussing how neural density estimators fit more broadly in the space of generative models (section 2.6).
2.1 Density estimation: what and why?
Suppose we have a stationary process that generates data. The process could be of any kind, as long as it doesn’t change over time: it could be a blackbox simulator, a computer program, an agent acting, the physical world, or a thought experiment. Suppose that each time the process is run forward it independently generates a $D$dimensional real vector ${\mathbf{x}}^{\prime}$. We can think of the probability density at $\mathbf{x}\in {\mathbb{R}}^{D}$ as a measure of how often the process generates data near $\mathbf{x}$ per unit volume. In particular, let ${B}_{\u03f5}\left(\mathbf{x}\right)$ be a ball centred at $\mathbf{x}$ with radius $\u03f5>0$, and let $\left{B}_{\u03f5}\left(\mathbf{x}\right)\right$ be its volume. Informally speaking, the probability density at $\mathbf{x}$ is given by:
$$p\left(\mathbf{x}\right)=\frac{\mathrm{Pr}({\mathbf{x}}^{\prime}\in {B}_{\u03f5}\left(\mathbf{x}\right))}{\left{B}_{\u03f5}\left(\mathbf{x}\right)\right}\mathit{\hspace{1em}}\text{for}\u03f5\to 0,$$  (2.1) 
where $\mathrm{Pr}({\mathbf{x}}^{\prime}\in {B}_{\u03f5}\left(\mathbf{x}\right))$ is the probability that the process generates data in the ball. For brevity, I will often just say density and mean probability density.
A function $p(\cdot )$ that takes an arbitrary vector $\mathbf{x}$ and outputs the density at $\mathbf{x}$ is called a probabilitydensity function. To formally define the density function, suppose that $\mathrm{Pr}({\mathbf{x}}^{\prime}\in \mathcal{X})$ is a probability measure defined on all Lebesguemeasurable subsets of ${\mathbb{R}}^{D}$. We require $\mathrm{Pr}({\mathbf{x}}^{\prime}\in \mathcal{X})$ to be absolutely continuous with respect to the Lebesgue measure, which means $\mathrm{Pr}({\mathbf{x}}^{\prime}\in \mathcal{X})$ is zero if $\mathcal{X}$ has zero volume. A realvalued function $p(\cdot )$ is called a probabilitydensity function if it has the following two properties:
$p\left(\mathbf{x}\right)$  $\ge 0\mathit{\hspace{1em}}\text{for all}\mathbf{x}\in {\mathbb{R}}^{D},$  (2.2)  
${\int}_{\mathcal{X}}}p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x$  $=\mathrm{Pr}({\mathbf{x}}^{\prime}\in \mathcal{X})\mathit{\hspace{1em}}\text{for all Lebesguemeasurable}\mathcal{X}\subseteq {\mathbb{R}}^{D}.$  (2.3) 
The second property implies that a density function must integrate to $1$, that is:
$${\int}_{{\mathbb{R}}^{D}}p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x}=1.$$  (2.4) 
The Radon–Nikodym theorem (Billingsley, 1995, theorem 32.2) guarantees that a density function exists and is unique almost everywhere, in the sense that any two density functions can only differ in a set of zero volume.
The density function is not defined if $\mathrm{Pr}({\mathbf{x}}^{\prime}\in \mathcal{X})$ is not absolutely continuous. For example, this is the case if $\mathrm{Pr}({\mathbf{x}}^{\prime}\in \mathcal{X})$ is discrete, i.e. concentrated on a countable subset of ${\mathbb{R}}^{D}$. From now on we will assume that the density function always exists, but many of the techniques discussed in this thesis can be adapted for e.g. discrete probability measures.
In practice, we often don’t have access to the density function of the process we are interested in. Rather, we have a set of datapoints generated by the process (or the ability to generate such a dataset) and we would like to estimate the density function from the dataset. Hence, the problem of density estimation can be stated as follows:
Given a set $\mathrm{\{}{\mathrm{x}}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\mathrm{x}}_{N}\mathrm{\}}$ of independently and identically generated datapoints, how can we estimate the probability density at an arbitrary location $\mathrm{x}$?
2.1.1 Why estimate densities?
Before I address the problem of how to estimate densities, I will discuss the issue of why. The question I will attempt to answer is the following:
Why is the density function useful, and why should we expend resources trying to estimate it?
To begin with, a model of the density function is a complete statistical model of the generative process. With the density function, we can (up to computational limitations) do the following:

1.
Calculate the probability of any Lebesguemeasurable subset of ${\mathbb{R}}^{D}$ under the process, by integration using property (2.3).
 2.

3.
Calculate expectations under the process using integration:
$$\mathbb{E}\left(f\left(\mathbf{x}\right)\right)={\int}_{{\mathbb{R}}^{D}}f\left(\mathbf{x}\right)p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x}.$$ (2.5)  4.
Nevertheless, although the above are useful applications of the density function, I would argue that they don’t necessitate modelling the density function. Indeed, one could estimate a sampling model from the data, i.e. a model that generates data in way that is statistically similar to the process. A sampling model can be used (again up to computational limitations) instead of a density model in all the above applications as follows:

1.
The probability of a Lebesguemeasurable subset $\mathcal{X}\subset {\mathbb{R}}^{D}$ can be estimated by the fraction of samples falling in $\mathcal{X}$.

2.
Sampling new data is trivial (and in practice usually much more efficient) with the sampling model.

3.
Calculating expectations can be estimated by MonteCarlo integration.

4.
Testing the model against the actual process can be done with a twosample test (Gretton et al., 2012).
So why invest in density models if there are other ways to model the desirable statistical properties of a generative process? In the following, I discuss some applications for which knowing the value of the density is important.
Bayesian inference
In the context of Bayesian inference, density functions encode degrees of belief. Bayes’ rule describes how beliefs should change in light of new evidence as a operation over densities. In particular, if we express our beliefs about a quantity $\bm{\theta}$ using a density model $p\left(\bm{\theta}\right)$ and the statistical relationship between $\mathbf{x}$ and $\bm{\theta}$ using a conditional density model $p\left(\mathbf{x}\bm{\theta}\right)$, we can calculate how our beliefs about $\bm{\theta}$ should change in light of observing $\mathbf{x}$ by:
$$p\left(\bm{\theta}\mathbf{x}\right)\propto p\left(\mathbf{x}\bm{\theta}\right)p\left(\bm{\theta}\right).$$  (2.6) 
Being able to estimate densities such as the prior $p\left(\bm{\theta}\right)$, the likelihood $p\left(\mathbf{x}\bm{\theta}\right)$ and the posterior $p\left(\bm{\theta}\mathbf{x}\right)$ from data can be valuable for Bayesian inference. For example, density estimation on large numbers of unlabelled data can be used for constructing effective priors (Zoran and Weiss, 2011). In cases where the likelihood is unavailable, density estimation on joint examples of $\mathbf{x}$ and $\bm{\theta}$ can be used to model the posterior or the likelihood (Papamakarios and Murray, 2016; Lueckmann et al., 2017; Papamakarios et al., 2019; Lueckmann et al., 2018). A big part of this thesis is dedicated to using density estimation for likelihoodfree inference, and this is a topic that I will examine thoroughly in the following chapters.
Data compression
There is a close relationship between density modelling and data compression. Suppose we want to encode a message $\mathbf{x}$ up to a level of precision defined by a small neighbourhood $B\left(\mathbf{x}\right)$ around $\mathbf{x}$. Assuming the message was generated from a distribution with density $p\left(\mathbf{x}\right)$, the information content associated with $\mathbf{x}$ at this level of precision is:
$$I\left(\mathbf{x}\right)=\mathrm{log}{\int}_{B\left(\mathbf{x}\right)}p\left({\mathbf{x}}^{\prime}\right)\mathrm{d}{\mathbf{x}}^{\prime}\approx \mathrm{log}p\left(\mathbf{x}\right)\mathrm{log}\leftB\left(\mathbf{x}\right)\right.$$  (2.7) 
The above means that the density at $\mathbf{x}$ tells us how many bits an optimal compressor should use to encode $\mathbf{x}$ at a given level of precision (assuming base$2$ logarithms). Conversely, a data compressor implicitly defines a density model. Given a perfect model of the density function, data compressors such as arithmetic coding (MacKay, 2002, section 6.2) can achieve almost perfect compression. On the other hand, if the data compressor uses (explicitly or implicitly) a density model $q\left(\mathbf{x}\right)$ that is not the same as the true model $p\left(\mathbf{x}\right)$, the average number of wasted bits is:
$$\underset{p\left(\mathbf{x}\right)}{\mathbb{E}}\left(\mathrm{log}q\left(\mathbf{x}\right)\mathrm{log}\leftB\left(\mathbf{x}\right)\right\right)\underset{p\left(\mathbf{x}\right)}{\mathbb{E}}\left(\mathrm{log}p\left(\mathbf{x}\right)\mathrm{log}\leftB\left(\mathbf{x}\right)\right\right)={D}_{\mathrm{KL}}\left(p\left(\mathbf{x}\right)\parallel q\left(\mathbf{x}\right)\right)>0.$$  (2.8) 
Hence, a more accurate density model implies a more efficient data compressor, and vice versa.
Model training and evaluation
Even if we are not interested in estimating the density function per se, the density of the training data under the model is useful as an objective for training the model. For example, suppose we wish to fit a density model $q\left(\mathbf{x}\right)$ to training data $\{{\mathbf{x}}_{1},\mathrm{\dots},{\mathbf{x}}_{N}\}$ using maximumlikelihood estimation. The objective to be maximized at training time is:
$$L\left(q\right)=\frac{1}{N}\sum _{n}\mathrm{log}q\left({\mathbf{x}}_{n}\right).$$  (2.9) 
After training, being able to calculate $L\left(q\right)$ on a heldout test set is useful for ranking and comparing models. Hence, the usefulness of the density function as a training and evaluation objective motivates endowing our models with densityestimation capabilities even if we don’t intend to use the model as a density estimator per se.
It is possible to formulate training and evaluation objectives that don’t explicitly require densities, for example based on likelihood ratios (Goodfellow et al., 2014), kernelspace discrepancies (Dziugaite et al., 2015), or Wasserstein distances (Arjovsky et al., 2017). One argument in favour of the maximumlikelihood objective is its good asymptotic properties: a maximumlikelihood density estimator is consistent, i.e. it converges in probability to the density being estimated (assuming it’s sufficiently flexible to represent it), and efficient, i.e. among all consistent estimators it attains the lowest mean squared error (Wasserman, 2010, section 9.4). Moreover, in the limit of infinite data, maximizing $\frac{1}{N}{\sum}_{n}\mathrm{log}q\left({\mathbf{x}}_{n}\right)$ is equivalent to minimizing ${D}_{\mathrm{KL}}\left(p\left(\mathbf{x}\right)\parallel q\left(\mathbf{x}\right)\right)$. In addition to its interpretation as a measure of efficiency of a data compressor which I already discussed, the KL divergence is the only divergence between density functions that possesses a certain set of properties, namely locality, coordinate invariance and subsystem independence, as defined by Caticha (2004). Rezende (2018) has argued that these properties of the KL divergence justify its use as a training and evaluation objective, and explain its popularity in machine learning.
Density models as components of other algorithms
Density models are often found as components of other algorithms in machine learning and statistics. For instance, sampling algorithms such as importance sampling and sequential Monte Carlo rely on an auxiliary model, known as the proposal, which is required to provide the density of its samples (Papamakarios and Murray, 2015; Gu et al., 2015; Paige and Wood, 2016; Müller et al., 2018). Variational autoencoders (which will be discussed in more detail in section 2.6.2) require two density models as subcomponents, known as the encoder and the prior (Rezende et al., 2014; Kingma and Welling, 2014). Finally, in variational inference of continuous parameters, the approximate posterior is required to be a density model over the parameters of interest (Ranganath et al., 2014; Kucukelbir et al., 2015).
2.1.2 Density estimation in high dimensions
Estimating densities in high dimensions is a hard problem. Naive methods that work well in low dimensions often break down as the dimensionality increases (as I will discuss in more detail in section 2.2). The observation that density estimation, as well as other machinelearning tasks, becomes dramatically harder as the dimensionality increases is often referred to as the curse of dimensionality (Hastie et al., 2001, section 2.5; Bishop, 2006, section 1.4).
I will illustrate the curse of dimensionality for density estimation with a simple example. Consider a process that generates data uniformly in the $D$dimensional unit cube ${[0,1]}^{D}$. Clearly, the density $p\left(\mathbf{x}\right)$ is equal to $1$ for all $\mathbf{x}$ inside the cube. Suppose that we try to estimate $p\left(\mathbf{x}\right)$ for some $\mathbf{x}$ inside the cube using the fraction of training datapoints that fall into a small ball around $\mathbf{x}$. The expected fraction of training datapoints that fall into a ball ${B}_{\u03f5}\left(\mathbf{x}\right)$ centred at $\mathbf{x}$ with radius $\u03f5$ is no greater than the volume of the ball, which is given by:
$$\left{B}_{\u03f5}\left(\mathbf{x}\right)\right=\frac{{\left({\pi}^{1/2}\u03f5\right)}^{D}}{\mathrm{\Gamma}\left(\frac{D}{2}+1\right)},$$  (2.10) 
where $\mathrm{\Gamma}(\cdot )$ is the Gamma function. In practice, we would need to make the ball large enough to contain at least one datapoint, otherwise the estimated density will be zero. However, no matter how large we make the radius $\u03f5$, the volume of the ball approaches zero as $D$ grows larger. In other words, in a dimension high enough, almost all balls will be empty, even if we make their radius larger than the side of the cube!
Figure (a)a illustrates the shrinkage in volume of a $D$dimensional ball as $D$ increases. Figure (b)b shows the expected number of datapoints we would need to generate from the process until one of them falls into the ball, which is no less than the inverse of the ball’s volume. As we can see, for a ball of radius $\u03f5=0.01$ and in $D=8$ dimensions, we would need a dataset of size at least a thousand trillion datapoints!
In practice, in order to scale density estimation to high dimensions without requiring astronomical amounts of data, we make assumptions about the densities we wish to estimate and encode these assumptions into our models. With careful model design, it has been possible to train good density models on highdimensional data such as highresolution images (Menick and Kalchbrenner, 2019; Kingma and Dhariwal, 2018; Dinh et al., 2017; Salimans et al., 2017) and audio streams (van den Oord et al., 2016a; Kim et al., 2018; Prenger et al., 2018). Fortunately, we can often make assumptions that are generic enough to apply to broad domains, while still making density estimation practical. In the following, I will examine a few generic assumptions that often guide model design.
Smoothness
We often assume that the density function varies smoothly, that is, if $\parallel {\mathbf{x}}_{A}{\mathbf{x}}_{B}\parallel $ is small, then $\leftp\left({\mathbf{x}}_{A}\right)p\left({\mathbf{x}}_{B}\right)\right$ is also small. The assumption of smoothness encourages the model to interpolate over small regions that happen to have no training data due to sampling noise, rather than assign zero density to them. In practice, we enforce smoothness either by limiting the flexibility of the density model, or by regularizing it.
Low intrinsic dimensionality
We often assume that the world has fewer degrees of freedom than the measurements we make to describe it. For example, an image of a natural scene could only vary in certain semantically meaningful ways (e.g. location of objects in the scene, direction of lighting, etc.), whereas each pixel of the image can’t vary arbitrarily. This means that such natural images would approximately lie on a manifold of low intrinsic dimensionality embedded in the $D$dimensional space of pixel intensities — see also discussions by Basri and Jacobs (2003) and Hinton et al. (1997) on the lowdimensional manifold structure of simple images. In practice, lowdimensional manifold structure can be modelled by limiting the degrees of freedom that the model can represent (e.g. by introducing information bottlenecks in the model’s structure).
Symmetries and invariances
Real data often have symmetries, some examples of which are listed below.

1.
Translation symmetry. An image of a car moved a few pixels to the right is still an image of the same car.

2.
Scale symmetry. An audio stream of music played twice as fast may still be a plausible piece of music.

3.
Mirror symmetry. An image flipped horizontally may still be a plausible image.

4.
Order symmetry. A dataset whose datapoints have been reordered is still the same dataset.
Such symmetries in the data are often reflected in model design. For example, topperforming image models employ convolutions and multiscale architectures (Menick and Kalchbrenner, 2019; Kingma and Dhariwal, 2018; Dinh et al., 2017), which equip the model with a degree of translation and scale symmetry. Additionally, symmetries in the data can be used for data augmentation: for instance, an image model can be trained with all training images flipped horizontally as additional training data (Papamakarios et al., 2017; Dinh et al., 2017).
Independencies and loose dependencies
Sometimes we know or suspect that certain measurements are either independent of or at least loosely dependent on other measurements. For example, in an audio stream, one could assume that the current audio intensity is only loosely dependent on audio intensities of more than a few seconds before it. This could be encoded in a model by limiting the range (or the receptive field) of each variable in the data, for example in the case of audio (van den Oord et al., 2016a) or pixel intensities (van den Oord et al., 2016b).
2.2 Methods for density estimation
Methods for density estimation can broadly be classified as either parametric or nonparametric. Parametric methods model the density function as a specified functional form with a fixed number of tunable parameters. Nonparametric methods are those that don’t fit the above description: typically, they specify a model whose complexity grows with the number of training datapoints. In this section I will discuss and compare some standard densityestimation methods from both categories, and I will introduce the idea of neural density estimation.
2.2.1 Simple parametric models and mixture models
In parametric density estimation, we first specify a density model ${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ with a fixed number of tunable parameters $\mathit{\varphi}$, and then we try to find a setting of $\mathit{\varphi}$ that makes ${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ as similar as possible to the true density $p\left(\mathbf{x}\right)$. A straightforward approach is to choose ${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ to be in a simple parametric family, for example the Gaussian family:
$${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)=\frac{1}{{\leftdet\left(2\pi \mathbf{\Sigma}\right)\right}^{1/2}}\mathrm{exp}\left(\frac{1}{2}{\left(\mathbf{x}\bm{\mu}\right)}^{T}{\mathbf{\Sigma}}^{1}\left(\mathbf{x}\bm{\mu}\right)\right)\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}\mathit{\varphi}=\{\bm{\mu},\mathbf{\Sigma}\}.$$  (2.11) 
The parameters of the Gaussian family are a real $D$dimensional vector $\bm{\mu}$ and a $D\times D$ symmetric positivedefinite matrix $\mathbf{\Sigma}$. There are also special cases of the Gaussian family that further restrict the form of $\mathbf{\Sigma}$, such as the probabilistic versions of principalcomponents analysis (Tipping and Bishop, 1999), minorcomponents analysis (Williams and Agakov, 2002), and extremecomponents analysis (Welling et al., 2004).
The problem with simple parametric families such as the above is that the set of density functions they can represent is limited. For example, the Gaussian family can’t represent density functions with more than one mode. One way to increase the expressivity of parametric models is to combine a number of models into one mixture model. Let ${q}_{{\mathit{\varphi}}_{1}}^{\left(1\right)}\left(\mathbf{x}\right),\mathrm{\dots},{q}_{{\mathit{\varphi}}_{K}}^{\left(K\right)}\left(\mathbf{x}\right)$ be $K$ parametric models from the same or different families. A mixture model is a parametric model defined as:
$${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)=\sum _{k}{\alpha}_{k}{q}_{{\mathit{\varphi}}_{k}}^{\left(k\right)}\left(\mathbf{x}\right)\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}\sum _{k}{\alpha}_{k}=1\mathit{\hspace{1em}}\text{and}\mathit{\hspace{1em}}{\alpha}_{k}\ge 0\text{for all}k.$$  (2.12) 
The parameters of a mixture model are $\mathit{\varphi}=\{{\alpha}_{1},{\mathit{\varphi}}_{1},\mathrm{\dots},{\alpha}_{K},{\mathit{\varphi}}_{K}\}$. Mixture models where all ${q}_{{\mathit{\varphi}}_{k}}^{\left(k\right)}\left(\mathbf{x}\right)$ are Gaussian are usually referred to as Gaussian mixture models. Gaussian mixture models are a strong densityestimation baseline: with sufficiently many components they can approximate any density arbitrarily well (McLachlan and Basford, 1988). However, they may require a large number of components to approximate density functions that can be expressed compactly in a different form (e.g. a uniform density in the unit cube would require a large number of narrow Gaussians to approximate its steep boundary).
Parametric density models are typically estimated by maximum likelihood. Given a set of training datapoints $\{{\mathbf{x}}_{1},\mathrm{\dots},{\mathbf{x}}_{N}\}$ that have been independently and identically generated by a process with density $p\left(\mathbf{x}\right)$, we seek a setting of the model’s parameters $\mathit{\varphi}$ that maximize the average log likelihood on the training data:
$$L\left(\mathit{\varphi}\right)=\frac{1}{N}\sum _{n}\mathrm{log}{q}_{\mathit{\varphi}}\left({\mathbf{x}}_{n}\right).$$  (2.13) 
From the strong law of large numbers, as $N\to \mathrm{\infty}$ we have that $L\left(\mathit{\varphi}\right)$ converges almost surely to ${\mathbb{E}}_{p\left(\mathbf{x}\right)}\left(\mathrm{log}{q}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)$. Hence, for a large enough training set, maximizing $L\left(\mathit{\varphi}\right)$ is equivalent to minimizing ${D}_{\mathrm{KL}}\left(p\left(\mathbf{x}\right)\parallel {q}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)$, since:
$${D}_{\mathrm{KL}}\left(p\left(\mathbf{x}\right)\parallel {q}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)=\underset{p\left(\mathbf{x}\right)}{\mathbb{E}}\left(\mathrm{log}{q}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)+\mathrm{const}.$$  (2.14) 
Some of the merits of maximumlikelihood estimation and of KLdivergence minimization have already been discussed in section 2.1.1.
For certain simple models, the optimizer of $L\left(\mathit{\varphi}\right)$ has a closedform solution. For example, the maximumlikelihood parameters of a Gaussian model are the empirical mean and covariance of the training data:
$${\bm{\mu}}^{*}=\frac{1}{N}\sum _{n}{\mathbf{x}}_{n}\mathit{\hspace{1em}}\text{and}\mathit{\hspace{1em}}{\mathbf{\Sigma}}^{*}=\frac{1}{N}\sum _{n}\left({\mathbf{x}}_{n}{\bm{\mu}}^{*}\right){\left({\mathbf{x}}_{n}{\bm{\mu}}^{*}\right)}^{T}.$$  (2.15) 
For mixture models based on simple parametric families, such as Gaussian mixture models, $L\left(\mathit{\varphi}\right)$ can be (locally) maximized using the expectationmaximization algorithm (Dempster et al., 1977) or its online variant (Cappé and Moulines, 2008). More generally, if $L\left(\mathit{\varphi}\right)$ is differentiable with respect to $\mathit{\varphi}$, it can be (locally) maximized with gradientbased methods, as I will discuss in section 2.2.4.
2.2.2 Histograms
The histogram is one of the simplest and most widely used methods for density estimation. The idea of the histogram is to partition the data space into a set of nonoverlapping bins $\{{B}_{1},\mathrm{\dots},{B}_{K}\}$, and estimate $\mathrm{Pr}({\mathbf{x}}^{\prime}\in {B}_{k})$ by the fraction of training datapoints in ${B}_{k}$. Then, the density in ${B}_{k}$ is approximated by the estimate of $\mathrm{Pr}({\mathbf{x}}^{\prime}\in {B}_{k})$ divided by the volume of ${B}_{k}$.
Histograms are often described as nonparametric models (e.g. Bishop, 2006, section 2.5). However, given the definition of a parametric model I gave earlier, I would argue that histograms are better described as parametric models that are trained with maximum likelihood. Given a partition of the data space into $K$ nonoverlapping bins, a histogram is the following parametric model:
$${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)=\prod _{k}{\pi}_{k}^{I(\mathbf{x}\in {B}_{k})}\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}\sum _{k}{\pi}_{k}\left{B}_{k}\right=1\mathit{\hspace{1em}}\text{and}\mathit{\hspace{1em}}{\pi}_{k}\ge 0\text{for all}k.$$  (2.16) 
In the above, $I(\cdot )$ is the indicator function, which takes a logical statement and outputs $1$ if the statement is true and $0$ otherwise. Each ${\pi}_{k}$ represents the density in bin ${B}_{k}$. The parameters of the histogram are $\mathit{\varphi}=\{{\pi}_{1},\mathrm{\dots},{\pi}_{K}\}$.
The average log likelihood of the histogram on training data $\{{\mathbf{x}}_{1},\mathrm{\dots},{\mathbf{x}}_{N}\}$ is:
$$L\left(\mathit{\varphi}\right)=\frac{1}{N}\sum _{k}{N}_{k}\mathrm{log}{\pi}_{k},$$  (2.17) 
where ${N}_{k}={\sum}_{n}I({\mathbf{x}}_{n}\in {B}_{k})$ is the number of training datapoints in ${B}_{k}$. Taking into account the equality constraint ${\sum}_{k}{\pi}_{k}\left{B}_{k}\right=1$, we can write the Lagrangian of the maximization problem as:
$$\mathcal{L}(\mathit{\varphi},\lambda )=L\left(\mathit{\varphi}\right)\lambda \left(\sum _{k}{\pi}_{k}\left{B}_{k}\right1\right),$$  (2.18) 
where $\lambda $ is a Lagrange multiplier enforcing the equality constraint. Taking derivatives of the Lagrangian with respect to ${\pi}_{k}$ and $\lambda $ and jointly solving for zero, we find that the maximumlikelihood optimizer is:
$${\pi}_{k}^{*}=\frac{{N}_{k}}{N\left{B}_{k}\right},$$  (2.19) 
which also satisfies ${\pi}_{k}\ge 0$. As expected, the maximumlikelihood density in ${B}_{k}$ is the fraction of the training data in ${B}_{k}$ divided by the volume of ${B}_{k}$.
In practice, to construct the bins we typically grid up each axis between two extremes, and take the $D$dimensional hyperrectangles formed this way to be the bins. How fine or coarse we grid up the space determines the volume of the bins and the granularity of the histogram. There is a biasvariance tradeoff controlled by bin volume: a histogram with too many bins of small volume may overfit, whereas a histogram with too few bins may underfit.
A drawback of histograms is that they suffer from the curse of dimensionality. To illustrate why, suppose we are trying to estimate a uniform density in the $D$dimensional unit cube ${[0,1]}^{D}$, and that we’d like a granularity of $K$ equally sized bins per axis. The total number of bins will be ${K}^{D}$, which is also the expected number of datapoints until one of them falls in a given bin. Hence, the amount of training data we’d need to populate the histogram scales exponentially with dimensionality. In practice, histograms are often used in low dimensions if there are enough datapoints (e.g. for visualization purposes) but rarely in more than two or three dimensions.
2.2.3 Kernel density estimation
Kernel density estimation is a nonparametric method for estimating densities. A kernel density estimator can be thought of as a smoothed version of the empirical distribution of the training data. Given training data $\{{\mathbf{x}}_{1},\mathrm{\dots},{\mathbf{x}}_{N}\}$, their empirical distribution ${q}_{0}\left(\mathbf{x}\right)$ is an equally weighted mixture of $N$ delta distributions located at training datapoints:
$${q}_{0}\left(\mathbf{x}\right)=\frac{1}{N}\sum _{n}\delta \left(\mathbf{x}{\mathbf{x}}_{n}\right).$$  (2.20) 
We can smooth out the empirical distribution and turn it into a density by replacing each delta distribution with a smoothing kernel. A smoothing kernel ${k}_{\u03f5}\left(\mathbf{u}\right)$ is a density function defined by:
$${k}_{\u03f5}\left(\mathbf{u}\right)=\frac{1}{{\u03f5}^{D}}{k}_{1}\left(\frac{\mathbf{u}}{\u03f5}\right),$$  (2.21) 
where $\u03f5>0$, and ${k}_{1}\left(\mathbf{u}\right)$ is a density function bounded from above. The parameter $\u03f5$ controls the “width” of the kernel; as $\u03f5\to 0$, ${k}_{\u03f5}\left(\mathbf{u}\right)$ approaches $\delta \left(\mathbf{u}\right)$. Given a smoothing kernel ${k}_{\u03f5}\left(\mathbf{u}\right)$, the kernel density estimator is defined as:
$${q}_{\u03f5}\left(\mathbf{x}\right)=\frac{1}{N}\sum _{n}{k}_{\u03f5}\left(\mathbf{x}{\mathbf{x}}_{n}\right).$$  (2.22) 
In practice, common choices of kernel include the Gaussian kernel:
$${k}_{1}\left(\mathbf{u}\right)=\frac{1}{{\left(2\pi \right)}^{\mathrm{D}/2}}\mathrm{exp}\left(\frac{1}{2}{\parallel \mathbf{u}\parallel}^{2}\right),$$  (2.23) 
or the multiplicative Epanechnikov kernel:
$${k}_{1}\left(\mathbf{u}\right)=\{\begin{array}{cc}{\left(\frac{3}{4}\right)}^{D}{\prod}_{d}\left(1{u}_{d}^{2}\right)\hfill & \left{u}_{d}\right\le 1\text{for all}d\hfill \\ 0\hfill & \text{otherwise.}\hfill \end{array}$$  (2.24) 
The multiplicative Epanechnikov kernel is the most efficient among decomposable kernels, in the sense that asymptotically it achieves the lowest mean squared error (Epanechnikov, 1969).
In the limit $\u03f5\to 0$, the kernel density estimator is unbiased: it is equal to the true density in expectation. This is because as $\u03f5\to 0$ we have ${q}_{\u03f5}\left(\mathbf{x}\right)\to {q}_{0}\left(\mathbf{x}\right)$, and
$$\mathbb{E}\left({q}_{0}\left(\mathbf{x}\right)\right)=\frac{1}{N}\sum _{n}\underset{p\left({\mathbf{x}}_{n}\right)}{\mathbb{E}}\left(\delta \left(\mathbf{x}{\mathbf{x}}_{n}\right)\right)=\underset{p\left({\mathbf{x}}^{\prime}\right)}{\mathbb{E}}\left(\delta \left(\mathbf{x}{\mathbf{x}}^{\prime}\right)\right)=p\left(\mathbf{x}\right).$$  (2.25) 
Moreover, the kernel density estimator is consistent: it approaches the true density for small $\u03f5$ and large $N$, provided $\u03f5$ doesn’t shrink too fast with $N$. To show this, we first upperbound the variance of the estimator:
$\mathbb{V}\left({q}_{\u03f5}\left(\mathbf{x}\right)\right)$  $={\displaystyle \frac{1}{{N}^{2}}}{\displaystyle \sum _{n}}\underset{p\left({\mathbf{x}}_{n}\right)}{\mathbb{V}}\left({k}_{\u03f5}\left(\mathbf{x}{\mathbf{x}}_{n}\right)\right)={\displaystyle \frac{1}{N}}\underset{p\left({\mathbf{x}}^{\prime}\right)}{\mathbb{V}}\left({k}_{\u03f5}\left(\mathbf{x}{\mathbf{x}}^{\prime}\right)\right)$  (2.26)  
$\le {\displaystyle \frac{1}{N}}\underset{p\left({\mathbf{x}}^{\prime}\right)}{\mathbb{E}}\left({k}_{\u03f5}^{2}\left(\mathbf{x}{\mathbf{x}}^{\prime}\right)\right)={\displaystyle \frac{1}{N{\u03f5}^{2D}}}{\displaystyle {\int}_{{\mathbb{R}}^{D}}}{k}_{1}^{2}\left({\displaystyle \frac{\mathbf{x}{\mathbf{x}}^{\prime}}{\u03f5}}\right)p\left({\mathbf{x}}^{\prime}\right)\mathrm{d}{\mathbf{x}}^{\prime}$  (2.27)  
$\le {\displaystyle \frac{{sup}_{\mathbf{u}}{k}_{1}^{2}\left(\mathbf{u}\right)}{N{\u03f5}^{2D}}}.$  (2.28) 
We can see that the variance approaches zero as $N{\u03f5}^{2D}$ approaches infinity. Hence, ${q}_{\u03f5}\left(\mathbf{x}\right)$ converges in probability to $p\left(\mathbf{x}\right)$ as $N$ approaches infinity, provided that $\u03f5$ approaches zero at a rate less than ${N}^{1/2\mathrm{D}}$.
In practice, the width parameter $\u03f5$ controls the degree of smoothness, and trades off bias for variance: if $\u03f5$ is too low the model may overfit, whereas if $\u03f5$ is too high the model may underfit. In general, we want $\u03f5$ to be smaller the more data we have and larger the higher the dimension is; there are rules of thumb for setting $\u03f5$ based on $N$ and $D$ such as Scott’s rule (Scott, 1992) or Silverman’s rule (Silverman, 1986).
Sometimes it is not possible to find a value for $\u03f5$ that works equally well everywhere. For instance, a lower value may be more appropriate in regions with high concentration of training data than in regions with low concentration. One possible solution, known as the method of nearest neighbours, is to choose a different $\u03f5$ for each location $\mathbf{x}$, such that the effective number of training datapoints contributing to the density at $\mathbf{x}$ is constant. However, the method of nearest neighbours doesn’t always result in a normalizable density (Bishop, 2006, section 2.5.2).
The kernel density estimator is widely used and a strong baseline in low dimensions due to its flexibility and good asymptotic properties. However it suffers from the curse of dimensionality in high dimensions. To illustrate why, consider estimating the uniform density in the unit cube ${[0,1]}^{D}$ using the multiplicative Epanechnikov kernel, whose support is a $D$dimensional hyperrectangle of side $2\u03f5$. The volume of space covered by kernels is at most $N{\left(2\u03f5\right)}^{D}$, which approaches zero as $D$ grows large for any $$. Hence, to avoid covering only a vanishing amount of space, we must either make the support of the kernel at least as large as the support of the entire density, or have the number of training datapoints grow at least exponentially with dimensionality.
Compared to parametric methods, kernel density estimation and nonparametric methods in general have the advantage that they don’t require training: there is no need to search for a model because the training data is the model. However, the memory cost of storing the model and the computational cost of evaluating the model grow linearly with $N$, which can be significant for large datasets. In contrast, parametric models have fixed memory and evaluation costs.
2.2.4 Neural density estimation
Neural density estimation is a parametric method for density estimation that uses neural networks to parameterize a density model. A neural density estimator is a neural network with parameters $\mathit{\varphi}$ that takes as input a datapoint $\mathbf{x}$ and returns a real number ${f}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ such that:
$${\int}_{{\mathbb{R}}^{D}}\mathrm{exp}\left({f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)\mathrm{d}\mathbf{x}=1.$$  (2.29) 
The above constraint is enforced by construction, that is, the architecture of the neural network is such that $\mathrm{exp}\left({f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)$ integrates to $1$ for all settings of $\mathit{\varphi}$ (I will discuss how this can be achieved in section 2.3). Since $\mathrm{exp}\left({f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)$ meets all the requirements of a density function, the neural network can be used as density model ${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)=\mathrm{exp}\left({f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)$.
Given training data $\{{\mathbf{x}}_{1},\mathrm{\dots},{\mathbf{x}}_{N}\}$, neural density estimators are typically trained by maximizing the average log likelihood:
$$L\left(\mathit{\varphi}\right)=\frac{1}{N}\sum _{n}\mathrm{log}{q}_{\mathit{\varphi}}\left({\mathbf{x}}_{n}\right)=\frac{1}{N}\sum _{n}{f}_{\mathit{\varphi}}\left({\mathbf{x}}_{n}\right).$$  (2.30) 
The maximization of $L\left(\mathit{\varphi}\right)$ is typically done using a variant of stochasticgradient ascent (Bottou, 2012). First, the parameters $\mathit{\varphi}$ are initialized to some arbitrary value. The algorithm proceeds in a number of iterations, in each of which $\mathit{\varphi}$ is updated. In each iteration, a subset of $M$ training datapoints $\{{\mathbf{x}}_{{n}_{1}},\mathrm{\dots},{\mathbf{x}}_{{n}_{M}}\}$, known as a minibatch, is selected at random. The selection is usually done without replacement; if no more datapoints are left, all datapoints are put back in. Then, the gradient with respect to $\mathit{\varphi}$ of the average log likelihood on the minibatch is computed:
$${\nabla}_{\mathit{\varphi}}\widehat{L}\left(\mathit{\varphi}\right)=\frac{1}{M}\sum _{m}{\nabla}_{\mathit{\varphi}}{f}_{\mathit{\varphi}}\left({\mathbf{x}}_{{n}_{m}}\right).$$  (2.31) 
Each ${\nabla}_{\mathit{\varphi}}{f}_{\mathit{\varphi}}\left({\mathbf{x}}_{{n}_{m}}\right)$ can be computed in parallel using reversemode automatic differentiation, also known as backpropagation in the context of neural networks (Goodfellow et al., 2016, section 6.5). The gradient ${\nabla}_{\mathit{\varphi}}\widehat{L}\left(\mathit{\varphi}\right)$ is an unbiased estimator of ${\nabla}_{\mathit{\varphi}}L\left(\mathit{\varphi}\right)$, and is known as a stochastic gradient. Finally, an ascent direction $\mathbf{d}$ is computed based on its previous value, the total number of iterations so far, the current stochastic gradient, and possibly a window (or running aggregate) of previous stochastic gradients, and the parameters are updated by $\mathit{\varphi}\leftarrow \mathit{\varphi}+\mathbf{d}$. There are various strategies for computing $\mathbf{d}$, such as momentum (Qian, 1999), AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), Adam (Kingma and Ba, 2015), and AMSGrad (Reddi et al., 2018).
Stochasticgradient ascent is a general algorithm for optimizing differentiable functions that can be written as averages of multiple terms. Due to its generality, it has the advantage that it decouples the task of modelling the density from the task of optimizing the training objective. Due to its use of stochastic gradients instead of full gradients, it scales well to large datasets, and it can be used with training datasets of infinite size (such as data produced by a generative process on the fly). Finally, there is some preliminary evidence that the stochasticity of the gradients may contribute in finding parameter settings that generalize well (Keskar et al., 2017).
The question that remains to be answered is how we can design neural networks such that their exponentiated output integrates to $1$ by construction. This is one of the main contributions of this thesis, and it will be the topic of section 2.3. As we shall see, it is possible to design neural density estimators that, although parametric, are flexible enough to approximate complex densities in thousands of dimensions.
2.3 The paper
This section presents the paper Masked Autoregressive Flow for Density Estimation, which is the main contribution of this chapter. The paper discusses stateoftheart methods for constructing neural density estimators, and proposes a new method which we term Masked Autoregressive Flow. We show how Masked Autoregressive Flow can increase the flexibility of previously proposed neural density estimators, and demonstrate MAF’s performance in highdimensional density estimation.
The paper was initially published as a preprint on arXiv in May 2017. Then, it was accepted for publication at the conference Advances in Neural Information Processing Systems (NeurIPS) in December 2017. It was featured as an oral presentation at the conference; of the $3,240$ papers submitted to NeurIPS in 2017, $678$ were accepted for publication, of which $40$ were featured as oral presentations.
Author contributions
The paper is coauthored by me, Theo Pavlakou and Iain Murray. As the leading author, I conceived and developed Masked Autoregressive Flow, performed the experiments, and wrote the paper. Theo Pavlakou prepared the UCI datasets used in section 4.2 of the paper; his earlier work on density estimation using the UCI datasets served as a guide and point of reference for the experiments in the paper. Iain Murray supervised the project, offered suggestions, and helped revise the final version.
Differences in notation
The previous sections used $p\left(\mathbf{x}\right)$ to mean the true density of the generative process and ${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ to mean the density represented by a parametric model with parameters $\mathit{\varphi}$. The paper uses $p\left(\mathbf{x}\right)$ both for the true density and for the model density depending on context. In section 3.2 and appendix A of the paper, where disambiguation between the two densities is required, we use ${\pi}_{x}\left(\mathbf{x}\right)$ for the true density and ${p}_{x}\left(\mathbf{x}\right)$ for the model density.
Corrections from original version
About a year after the initial publication of the paper, we discovered that the experimental results on conditional density estimation with Masked Autoregressive Flow were incorrect due to an error in the code. Upon discovery, we corrected the results and issued a replacement of the paper on arXiv. The version included in this chapter is the corrected version of the paper, which was published on arXiv in June 2018. The particular results that were updated from earlier versions are indicated with a footnote in this version.
See pages  of papers/maf.pdf
2.4 Contribution and impact
The paper Masked Autoregressive Flow for Density Estimation showed that we can increase the flexibility of simple neural density estimators by explicitly modelling their internal randomness. By reparameterizing a density model in terms of its internal randomness, we obtain an invertible transformation that can be composed into a normalizing flow. The resulting model can be more expressive than the original, while it remains tractable to train, evaluate and sample from. Masked Autoregressive Flow is a specific implementation of this idea that uses masked autoregressive models with Gaussian conditionals as building blocks.
In addition to introducing a new density estimator, the paper contributed to the understanding of the relationship between MADE, MAF, IAF and Real NVP, and clarified the computational tradeoffs involved with using these models for density estimation and variational inference. Specifically, the paper explained the following relationships:

1.
A MAF with one layer is a MADE with Gaussian conditionals.

2.
An IAF is a MAF with its transformation inverted (and vice versa).

3.
A Real NVP is a MAF/IAF with one autoregressive step instead of $D$ autoregressive steps.

4.
Fitting a MAF to the training data by maximum likelihood can be viewed as fitting an implicit IAF to the base density by stochastic variational inference.
Finally, the paper clarified the following computational tradeoffs:

1.
MAF is fast to evaluate but slow to sample from.

2.
IAF is slow to evaluate but fast to sample from.

3.
Real NVP is fast to both evaluate and sample from, at the cost of decreased flexibility compared to MAF/IAF.
According to Google Scholar, the paper has received $66$ citations as of April 2019. MAF has been used as a prior and/or decoder for variational autoencoders (Alemi et al., 2018; Dillon et al., 2017; Choi et al., 2019; Bauer and Mnih, 2019; Tran et al., 2018; Vikram et al., 2019), for modelling stateaction pairs in imitation learning (Schroecker et al., 2019), and for estimating likelihoods or likelihood ratios in likelihoodfree inference (Brehmer et al., 2018c; Papamakarios et al., 2019). The five datasets we used for our experiments on unconditional density estimation (namely POWER, GAS, HEPMASS, MINIBOONE and BSDS300) have been made available online (Papamakarios, 2018) and have been used by other researchers as densityestimation benchmarks (Huang et al., 2018; Grathwohl et al., 2018; Oliva et al., 2018; Li and Grathwohl, 2018; De Cao et al., 2019; Nash and Durkan, 2019). Finally, MAF has been implemented as part of TensorFlow Probability (Dillon et al., 2017), a software library for probabilistic modelling and inference which forms part of TensorFlow (Abadi et al., 2015).
There are two ways in which MAF is limited. First, MAF can be slow to sample from in high dimensions, as the computational cost of generating $D$dimensional data from MAF scales linearly with $D$. This is true in general for autoregressive models, such as WaveNet (van den Oord et al., 2016a) or PixelCNN (van den Oord et al., 2016b; Salimans et al., 2017). Slow sampling limits the applicability of MAF as a neural sampler or as a variational posterior. Second, it is still unknown whether MAF is a universal density approximator, i.e. whether it can model any wellbehaved density arbitrarily well given enough layers and hidden units. In the next section, I will review advances in normalizing flows since the publication of the paper, and I will discuss further the tradeoffs between efficiency and expressivity in existing models.
2.5 Further advances in normalizing flows
Since the publication of the paper Masked Autoregressive Flow for Density Estimation, there has been a lot of research interest in normalizing flows. In this section, I review advances in normalizing flows after the publication of the paper, and I show how the various approaches are related to each other. Some of the advances are based on Masked Autoregressive Flow, and others are independent threads of research.
2.5.1 Nonaffine autoregressive layers
MAF, IAF and Real NVP are all composed of affine autoregressive layers, i.e. autoregressive layers where each variable is scaled and shifted as a function of previous variables. (Since the coupling layers used by Real NVP are special cases of autoregressive layers, I won’t make a distinction between the two from now on.) As we discussed in section 2.3, an affine autoregressive layer transforms each noise variable ${u}_{i}$ into a data variable ${x}_{i}$ as follows:
$${x}_{i}={\alpha}_{i}{u}_{i}+{\beta}_{i},$$  (2.32) 
where ${\alpha}_{i}$ and ${\beta}_{i}$ are functions of ${\mathbf{x}}_{1:i1}$ or ${\mathbf{u}}_{1:i1}$. Restricting the transformation from ${u}_{i}$ to ${x}_{i}$ to be affine allowed us to invert it and compute the determinant of its Jacobian efficiently.
However, one could increase the expressivity of an autoregressive layer by allowing more general transformations from ${u}_{i}$ to ${x}_{i}$ of the form:
$${x}_{i}={g}_{{\bm{\psi}}_{i}}\left({u}_{i}\right),$$  (2.33) 
where ${\bm{\psi}}_{i}$ is a function of ${\mathbf{x}}_{1:i1}$ or ${\mathbf{u}}_{1:i1}$ that parameterizes the transformation. As long as ${g}_{{\bm{\psi}}_{i}}$ is taken to be smooth and invertible, the resulting flow is a nonaffine autoregressive flow. The absolute determinant of the Jacobian of such a flow is:
$$\leftdet\left(\frac{\partial {f}^{1}}{\partial \mathbf{x}}\right)\right=\prod _{i}{\left\frac{\partial {g}_{{\bm{\psi}}_{i}}}{\partial {u}_{i}}\right}^{1}.$$  (2.34) 
Neural autoregressive flows
Neural autoregressive flows (Huang et al., 2018) are nonaffine autoregressive flows that use monotonicallyincreasing neural networks to parameterize ${g}_{{\bm{\psi}}_{i}}$. A feedforward neural network with one input ${u}_{i}$ and one output ${x}_{i}$ can be made monotonically increasing if (a) all its activation functions are monotonically increasing (sigmoid or leakyReLU activation functions have this property), and (b) all its weights are strictly positive. Huang et al. (2018) propose neural architectures that follow this principle, termed deep sigmoidal flows and deep dense sigmoidal flows. The derivative of ${g}_{{\bm{\psi}}_{i}}$ with respect to its input (needed for the computation of the Jacobian determinant above) can be obtained by automatic differentiation. A special case of a neural autoregressive flow is Flow++ (Ho et al., 2019), which parameterizes ${g}_{{\bm{\psi}}_{i}}$ as a mixture of logistic CDFs, and is equivalent to a neural network with positive weights and one hidden layer of logisticsigmoid units.
The advantage of neural autoregressive flows is their expressivity. Huang et al. (2018) show that with a sufficiently flexible transformation ${g}_{{\bm{\psi}}_{i}}$, a single neural autoregressive layer can approximate any wellbehaved density arbitrarily well. This is because if ${g}_{{\bm{\psi}}_{i}}$ becomes equal to the inverse CDF of the conditional $p\left({x}_{i}{\mathbf{x}}_{1:i1}\right)$, the neural autoregressive layer transforms the joint density $p\left(\mathbf{x}\right)$ into a uniform density in the unit cube (Hyvärinen and Pajunen, 1999).
The disadvantage of neural autoregressive flows is that in general they are not analytically invertible. That is, even though the inverse of ${g}_{{\bm{\psi}}_{i}}$ exists, it’s not always available in closed form. In order to invert the flow, one would have to resort to numerical methods. A neural autoregressive flow can still be used to estimate densities if it is taken to parameterize the transformation from $\mathbf{x}$ to $\mathbf{u}$, but if the transformation from $\mathbf{u}$ to $\mathbf{x}$ is not available analytically, it would not be possible to sample from the trained model efficiently.
Nonlinear squared flow
If we take ${g}_{{\bm{\psi}}_{i}}$ to be nonaffine but restrict it to have an analytic inverse, we would have a nonaffine autoregressive flow that we could sample from. An example is the nonlinear squared flow (Ziegler and Rush, 2019), which adds an inversequadratic perturbation to the affine transformation as follows:
$${g}_{{\bm{\psi}}_{i}}\left({u}_{i}\right)={\alpha}_{i}{u}_{i}+{\beta}_{i}+\frac{{\gamma}_{i}}{1+{\left({\delta}_{i}{u}_{i}+{\u03f5}_{i}\right)}^{2}}\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}{\bm{\psi}}_{i}=\{{\alpha}_{i},{\beta}_{i},{\gamma}_{i},{\delta}_{i},{\u03f5}_{i}\}.$$  (2.35) 
The above transformation is not generally invertible, but it can be made monotonically increasing if we restrict ${\alpha}_{i}>\frac{9}{8\sqrt{3}}\left{\gamma}_{i}\right{\delta}_{i}$ and ${\delta}_{i}>0$. For ${\gamma}_{i}=0$, the nonlinear squared flow reduces to an affine autoregressive flow. Given ${x}_{i}$, the equation ${x}_{i}={g}_{{\bm{\psi}}_{i}}\left({u}_{i}\right)$ is a cubic polynomial with respect to ${u}_{i}$, so it can be solved analytically. The above transformation is more expressive than an affine transformation, but not as expressive as a general neural autoregressive flow.
Piecewisepolynomial autoregressive flows
Another approach to creating nonaffine but analytically invertible autoregressive flows is to parameterize ${g}_{{\bm{\psi}}_{i}}$ as a piecewiselinear or piecewisequadratic monotonicallyincreasing function (Müller et al., 2018). In this case, the parameters ${\bm{\psi}}_{i}$ correspond to the locations of the segments and their shape (i.e. slope and curvature). In order to invert ${g}_{{\bm{\psi}}_{i}}$ for a given ${x}_{i}$, one would need to first identify which segment this ${x}_{i}$ corresponds to (which can be done by binary search since the segments are sorted), and then invert that segment (which is easy for a linear or quadratic segment). The more segments we have, the more flexible the transformation becomes.
2.5.2 Invertible convolutional layers
If we are interested in modelling image data, we may want to design a flow that contains invertible convolutional layers. For this discussion, we will assume that the data is an image of shape $H\times W\times C$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels. A convolutional layer transforms noise of shape $H\times W\times C$ into data via a convolution with filter $k$. Let $\mathbf{x}$ and $\mathbf{u}$ represent the vectorized image and noise respectively. Since convolution is a linear operation, we can write it as the following matrix multiplication:
$$\mathbf{x}={\mathbf{W}}_{k}\mathbf{u},$$  (2.36) 
where ${\mathbf{W}}_{k}$ is a matrix of shape $HWC\times HWC$ whose entries depend on the filter $k$. If ${\mathbf{W}}_{k}$ is invertible then the convolution is invertible, and its Jacobian has absolute determinant:
$$\leftdet\left(\frac{\partial {f}^{1}}{\partial \mathbf{x}}\right)\right={\leftdet\left({\mathbf{W}}_{k}\right)\right}^{1}.$$  (2.37) 
Nonetheless, naively inverting ${\mathbf{W}}_{k}$ or calculating its determinant has a cost of $\mathcal{O}\left({\left(HWC\right)}^{3}\right)$, so more scalable solutions have to be found in practice.
Invertible $1\times 1$ convolutions and Glow
Kingma and Dhariwal (2018) introduced invertible $\mathrm{1}\mathrm{\times}\mathrm{1}$ convolutions in their model Glow. An invertible $1\times 1$ convolution is essentially a linear transformation where each pixel of size $C$ (with one value for each channel) is multiplied by the same matrix $\mathbf{V}$ of shape $C\times C$. The equivalent matrix ${\mathbf{W}}_{k}$ can be obtained by:
$${\mathbf{W}}_{k}=\mathbf{V}\otimes \mathbf{I},$$  (2.38) 
where $\mathbf{I}$ is the identity matrix of shape $HW\times HW$ and $\otimes $ is the Kronecker product. The inverse and determinant of ${\mathbf{W}}_{k}$ have a cost of $\mathcal{O}\left({C}^{3}\right)$, which may not be prohibitive for moderate $C$. To further reduce the cost, Kingma and Dhariwal (2018) suggest parameterizing $\mathbf{V}$ as follows:
$$\mathbf{V}=\mathrm{\mathbf{P}\mathbf{L}\mathbf{U}},$$  (2.39) 
where $\mathbf{P}$ is a fixed permutation matrix, $\mathbf{L}$ is a lower triangular matrix with ones in its diagonal, and $\mathbf{U}$ is an upper triangular matrix. In that case, the absolute determinant of ${\mathbf{W}}_{k}$ becomes:
$$\leftdet\left({\mathbf{W}}_{k}\right)\right=HW\prod _{c}\left{U}_{cc}\right,$$  (2.40) 
where ${U}_{cc}$ is the $c$th element of $\mathbf{U}$’s diagonal. If we further restrict every ${U}_{cc}$ to be positive, we guarantee that the transformation is always invertible. In addition to modelling images, invertible $1\times 1$ convolutions have been used in modelling audio by WaveGlow (Prenger et al., 2018) and FloWaveNet (Kim et al., 2018).
Autoregressive and emerging convolutions
One way of obtaining scalable invertible convolutions without restricting the receptive field to be $1\times 1$ is via autoregressive convolutions (Hoogeboom et al., 2019). In an autoregressive convolution, pixels are assumed to be ordered, and part of the filter is zeroed out so that output pixel $i$ only depends on input pixels $1$ to $i1$. An autoregressive convolution corresponds to a triangular matrix ${\mathbf{W}}_{k}$, and hence its determinant can be calculated at a cost of $\mathcal{O}\left(HWC\right)$. A convolution that is not restricted to be autoregressive can be obtained by composing an autoregressive convolution whose matrix ${\mathbf{W}}_{{k}_{1}}$ is upper triangular with an autoregressive convolution whose matrix ${\mathbf{W}}_{{k}_{2}}$ is lower triangular; the result is equivalent to a nonautoregressive convolution with matrix ${\mathbf{W}}_{{k}_{2}}{\mathbf{W}}_{{k}_{1}}$. This is analogous to parameterizing an LU decomposition of the convolution matrix, and is termed emerging convolution (Hoogeboom et al., 2019).
Periodic convolutions in the Fourier domain
Finally, another way of scaling up invertible convolutions is via the Fourier domain. According to the convolution theorem, the convolution between a filter $k$ and a signal $u$ is equal to:
$$k*u={\mathcal{F}}^{1}\left(\mathcal{F}\left(k\right)\mathcal{F}\left(u\right)\right),$$  (2.41) 
where $\mathcal{F}$ is the Fourier transform and ${\mathcal{F}}^{1}$ is its inverse. Since the Fourier transform is a unitary linear operator, its discrete version corresponds to multiplication with a particular unitary matrix $\mathbf{F}$. Hence, the convolution of a discrete signal $\mathbf{u}$ can be written in vectorized form as:
$${\mathbf{W}}_{k}\mathbf{u}={\mathbf{F}}^{T}\left(\mathrm{\mathbf{F}\mathbf{k}}\odot \mathrm{\mathbf{F}\mathbf{u}}\right)=\left({\mathbf{F}}^{T}{\mathbf{D}}_{k}\mathbf{F}\right)\mathbf{u}$$  (2.42) 
where $\odot $ is elementwise multiplication, and ${\mathbf{D}}_{k}$ is a diagonal matrix whose diagonal is $\mathrm{\mathbf{F}\mathbf{k}}$. Since the absolute determinant of $\mathbf{F}$ is $1$, the absolute determinant of ${\mathbf{W}}_{k}$ is:
$$\leftdet\left({\mathbf{W}}_{k}\right)\right=\prod _{i}\left{D}_{k,ii}\right,$$  (2.43) 
where ${D}_{k,ii}$ is the $i$th element of the filter $k$ expressed in the Fourier domain.
In a typical convolution layer with $C$ filters and $C$ input channels, we perform a total of ${C}^{2}$ convolutions (one for each combination of filter and inputchannel), and the resulting ${C}^{2}$ output maps are summed across input channels to obtain $C$ output channels. If we express the entire convolution layer in the Fourier domain as we did above, we obtain a blockdiagonal matrix ${\mathbf{D}}_{k}$ of shape $HWC\times HWC$ whose diagonal contains $HW$ matrices of shape $C\times C$ (Hoogeboom et al., 2019). Hence, calculating the determinant of a convolution layer using its Fourier representation has a cost of $\mathcal{O}\left(HW{C}^{3}\right)$, which may be acceptable for moderate $C$. This type of convolution layer is termed periodic convolution (Hoogeboom et al., 2019).
2.5.3 Invertible residual layers
A residual layer (He et al., 2016) is a transformation of the following form:
$$\mathbf{x}=f\left(\mathbf{u}\right)=\mathbf{u}+g\left(\mathbf{u}\right).$$  (2.44) 
Residual layers are designed to avoid vanishing gradients in deep neural networks. Since the Jacobian of a residual layer is:
$$\frac{\partial f}{\partial \mathbf{u}}=\mathbf{I}+\frac{\partial g}{\partial \mathbf{u}},$$  (2.45) 
the propagated gradient doesn’t vanish even if the Jacobian of $g$ does.
Sylvester flow
Residual networks are not generally invertible, but can be made to be if $g$ is restricted accordingly. One such example is the Sylvester flow (van den Berg et al., 2018), where $g$ is taken to be:
$$g\left(\mathbf{u}\right)=\mathrm{\mathbf{Q}\mathbf{R}}h\left(\stackrel{~}{\mathbf{R}}{\mathbf{Q}}^{T}\mathbf{u}+\mathbf{b}\right).$$  (2.46) 
In the above, $\mathbf{Q}$ is a $D\times M$ matrix whose columns form an orthonormal basis, $\mathbf{R}$ and $\stackrel{~}{\mathbf{R}}$ are $M\times M$ uppertriangular matrices, $\mathbf{b}$ is an $M$dimensional bias, $h(\cdot )$ is a smooth activation function applied elementwise, and $M\le D$. The above transformation can be thought of as a feedforward neural network with one hidden layer of $M$ units, whose weight matrices have been parameterized in a particular way.
Theorem 2 of van den Berg et al. (2018) gives sufficient conditions for the invertibility of the Sylvester flow. However, their proof requires $\stackrel{~}{\mathbf{R}}$ to be invertible, which as I show here is not necessary. In theorem 2.1 below, I give a more succinct proof of the invertibility of the Sylvester flow, which doesn’t require the invertibility of $\stackrel{~}{\mathbf{R}}$.
Theorem 2.1 (Invertibility of the Sylvester flow).
The Sylvester flow is invertible if $h(\cdot )$ is monotonically increasing with bounded derivative (e.g. a sigmoid activation function has this property), and if for all $m$ we have:
$${\stackrel{~}{R}}_{mm}{R}_{mm}>\frac{1}{{sup}_{z}{h}^{\prime}\left(z\right)}.$$  (2.47) 
Proof.
Using the matrixdeterminant lemma, the determinant of the Jacobian of $f$ can be written as:
$$det\left(\frac{\partial f}{\partial \mathbf{u}}\right)=det\left(\mathbf{I}+\mathbf{A}\stackrel{~}{\mathbf{R}}\mathbf{R}\right),$$  (2.48) 
where $\mathbf{A}$ is an $M\times M$ diagonal matrix whose diagonal is ${h}^{\prime}\left(\stackrel{~}{\mathbf{R}}{\mathbf{Q}}^{T}\mathbf{u}+\mathbf{b}\right)$. Since $\mathbf{I}+\mathbf{A}\stackrel{~}{\mathbf{R}}\mathbf{R}$ is an $M\times M$ uppertriangular matrix, its determinant is the product of its diagonal elements, hence:
$$det\left(\frac{\partial f}{\partial \mathbf{u}}\right)=\prod _{m}\left(1+{A}_{mm}{\stackrel{~}{R}}_{mm}{R}_{mm}\right).$$  (2.49) 
Condition (2.47) ensures that the Jacobian determinant is positive everywhere. Hence, from the inversefunction theorem it follows that $f$ is invertible. ∎
Sylvester flows are related to other normalizing flows. For $M=1$, the Sylvester flow becomes a special case of the planar flow (Rezende and Mohamed, 2015). Furthermore, if $M=D$, $\mathbf{Q}$ is taken to be a reversepermutation matrix and $\mathbf{R}$ is strictly upper triangular, $g$ becomes a MADE (Germain et al., 2015) with one hidden layer of $D$ units. In that case, the Sylvester flow becomes a special case of the Inverse Autoregressive Flow (Kingma et al., 2016), where the scaling factor is $1$ and the shifting factor is $g\left(\mathbf{u}\right)$.
Even though the Sylvester flow is invertible and has a tractable Jacobian, its inverse is not available analytically. This means that the Sylvester flow can calculate efficiently only the density of its own samples, so it can be used as a variational posterior. Alternatively, if we take it to parameterize the transformation from $\mathbf{x}$ to $\mathbf{u}$, we can use it as a density estimator but we won’t be able to sample from it efficiently.
Contractive residual layers and iResNet
In general, a residual layer is invertible if $g$ is a contraction, i.e. its Lipschitz constant is less than $1$. A residual layer with this property is termed an iResNet (Behrmann et al., 2018). To see why an iResNet is invertible, fix a value for $\mathbf{x}$ and consider the sequence ${\mathbf{u}}_{1},{\mathbf{u}}_{2},\mathrm{\dots}$ obtained by:
$${\mathbf{u}}_{k+1}=\mathbf{x}g\left({\mathbf{u}}_{k}\right).$$  (2.50) 
The map ${\mathbf{u}}_{k}\mapsto {\mathbf{u}}_{k+1}$ is a contraction, so by the Banach fixedpoint theorem it follows that the sequence converges for any choice of ${\mathbf{u}}_{1}$ to the same fixed point ${\mathbf{u}}_{\mathrm{\infty}}$. That fixed point is the unique value that satisfies $\mathbf{x}=f\left({\mathbf{u}}_{\mathrm{\infty}}\right)$, which proves the invertibility of $f$.
One way to construct an iResNet is to parameterize $g$ to be a feedforward neural network with contractive activation functions (such as sigmoids or ReLUs), and with weight matrices of spectral norm less than $1$. Even though its inverse is not analytically available in general, an iResNet can be numerically inverted using the iterative procedure of equation (2.50).
Directly calculating the Jacobian determinant of an iResNet costs $\mathcal{O}\left({D}^{3}\right)$. Alternatively, following Behrmann et al. (2018), the log absolute determinant of the Jacobian can be first expanded into a power series as follows:
$$\mathrm{log}\leftdet\left(\frac{\partial f}{\partial \mathbf{u}}\right)\right=\sum _{k=1}^{\mathrm{\infty}}\frac{{\left(1\right)}^{k+1}}{k}\mathrm{tr}\left({\mathbf{J}}_{g}^{k}\right)\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}{\mathbf{J}}_{g}=\frac{\partial g}{\partial \mathbf{u}},$$  (2.51) 
and then it can be approximated by truncating the power series at a desired accuracy. Further, $\mathrm{tr}\left({\mathbf{J}}_{g}^{k}\right)$ can be approximated by the following unbiased stochastic estimator, known as the Hutchinson estimator (Hutchinson, 1990):
$$\mathrm{tr}\left({\mathbf{J}}_{g}^{k}\right)\approx {\mathbf{v}}^{T}{\mathbf{J}}_{g}^{k}\mathbf{v}\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}\mathbf{v}\sim \mathcal{N}(\mathrm{\U0001d7ce},\mathbf{I}).$$  (2.52) 
Calculating ${\mathbf{v}}^{T}{\mathbf{J}}_{g}^{k}\mathbf{v}$ doesn’t require explicitly computing the Jacobian, as it can be done by backpropagating through $g$ a total of $k$ times, once for each Jacobianvector product.
The main advantage of iResNets over other types of invertible residual networks is the flexibility in constructing $g$. Unlike Sylvester flows where $g$ is restricted to one hidden layer and to no more than $D$ hidden units, an iResNet can have any number of layers and hidden units. However, unlike other normalizing flows, it is expensive both to sample and to calculate exact densities under an iResNet, which limits its applicability in practice.
2.5.4 Infinitesimal flows
So far I have discussed normalizing flows consisting of a fixed number of layers. We can imagine a flow where the number of layers grows larger and larger, but at the same time the effect of each layer becomes smaller and smaller. In the limit of infinitely many layers each of which has an infinitesimal effect, we obtain an flow where $\mathbf{u}$ is transformed continuously, rather than in discrete steps. We call such flows infinitesimal flows.
Deep diffeomorphic flow
One type of infinitesimal flow is the deep diffeomorphic flow (Salman et al., 2018). We start with a residual layer mapping ${\mathbf{u}}_{t}$ to ${\mathbf{u}}_{t+1}$:
$${\mathbf{u}}_{t+1}={\mathbf{u}}_{t}+{g}_{t}\left({\mathbf{u}}_{t}\right),$$  (2.53) 
and then extend it to a variablesized step as follows:
$${\mathbf{u}}_{t+dt}={\mathbf{u}}_{t}+dt{g}_{t}\left({\mathbf{u}}_{t}\right).$$  (2.54) 
In the limit $dt\to 0$, we obtain the following ordinary differential equation:
$$\frac{d{\mathbf{u}}_{t}}{dt}={g}_{t}\left({\mathbf{u}}_{t}\right).$$  (2.55) 
In the above ODE, we can interpret ${g}_{t}$ as a timevarying velocity field. For small enough $dt$, we can interpret a residual layer of the form of equation (2.54) as the Euler integrator of this ODE. In that case, we can approximately invert the flow by running the integrator backwards:
$${\mathbf{u}}_{t}\approx {\mathbf{u}}_{t+dt}dt{g}_{t}\left({\mathbf{u}}_{t+dt}\right),$$  (2.56) 
which becomes exact for $dt\to 0$. We can also think of a diffeomorphic flow as a special case of an iResNet, where $dt$ is small enough such that the transformation is contractive. In that case, each step of running the integrator backwards corresponds to a single iteration of the iResNetinversion algorithm, where ${\mathbf{u}}_{t}$ is initialized with ${\mathbf{u}}_{t+dt}$, the value to be inverted at each step.
Since the deep diffeomorphic flow is a special case of an iResNet, calculating its Jacobian determinant can be done similarly. Using a small (but not infinitesimal) $dt$, we begin by writing the log absolute determinant of each step of the Euler integrator as a power series:
$$\mathrm{log}\leftdet\left(\frac{\partial {\mathbf{u}}_{t+dt}}{\partial {\mathbf{u}}_{t}}\right)\right=\sum _{k=1}^{\mathrm{\infty}}\frac{{\left(1\right)}^{k+1}}{k}{\left(dt\right)}^{k}\mathrm{tr}\left({\mathbf{J}}_{{g}_{t}}^{k}\right)\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}{\mathbf{J}}_{{g}_{t}}=\frac{\partial {g}_{t}}{\partial {\mathbf{u}}_{t}},$$  (2.57) 
and then approximate it by truncating the power series at a desired level of accuracy. The smaller we take $dt$ the more we can afford to truncate. Finally, we can estimate $\mathrm{tr}\left({\mathbf{J}}_{{g}_{t}}^{k}\right)$ using the Hutchinson estimator in equation (2.52).
The main drawback of the deep diffeomorphic flow is that inverting it and computing its Jacobian determinant are approximate operations that can introduce approximation error. To make the approximation more accurate, one needs to reduce $dt$, which in turn can make the flow prohibitively deep, since the number of layers scales as $\mathcal{O}\left(1/\mathrm{dt}\right)$.
Neural ODEs and FFJORD
The drawbacks of the deep diffeomorphic flow stem from the fact that (a) it backpropagates through the integrator, (b) the Euler integrator it uses is not exactly invertible, and (c) the calculation of the Jacobian determinant is approximate. A different approach that avoids these issues is Neural ODEs (Chen et al., 2018). Instead of using the Euler integrator and backpropagating through it, Neural ODEs defines an additional ODE that describes the evolution of the gradient of a loss as it backpropagates through the flow. Hence, instead of backpropagating through the integrator in order to train the flow, the gradients with respect to its parameters can be obtained as the solution to this additional ODE. Both the ODE for the forward pass and the ODE for the backward pass can be solved with any integrator, which allows one to use an integrator that is exactly invertible. Furthermore, the integrator can choose the number of integration steps (or layers of the flow) adaptively.
Instead of approximating the Jacobian determinant by truncating its power series, we can define directly how the log density evolves through the flow via a third ODE. We start by rewriting the power series of the log absolute determinant as follows:
$$\mathrm{log}\leftdet\left(\frac{\partial {\mathbf{u}}_{t+dt}}{\partial {\mathbf{u}}_{t}}\right)\right=dt\mathrm{tr}\left({\mathbf{J}}_{{g}_{t}}\right)+\mathcal{O}\left(d{t}^{2}\right).$$  (2.58) 
Using the above, we can express the evolution of the log density from $t$ to $t+dt$ as:
$$\mathrm{log}{q}_{t+dt}\left({\mathbf{u}}_{t+dt}\right)=\mathrm{log}{q}_{t}\left({\mathbf{u}}_{t}\right)dt\mathrm{tr}\left({\mathbf{J}}_{{g}_{t}}\right)+\mathcal{O}\left(d{t}^{2}\right).$$  (2.59) 
Taking the limit $dt\to 0$, we obtain:
$$\frac{d}{dt}\mathrm{log}{q}_{t}\left({\mathbf{u}}_{t}\right)=\mathrm{tr}\left({\mathbf{J}}_{{g}_{t}}\right)=\nabla \cdot {g}_{t}\left({\mathbf{u}}_{t}\right).$$  (2.60) 
The above ODE is a special case of the Fokker–Planck equation where the diffusion is zero. As before, $\mathrm{tr}\left({\mathbf{J}}_{{g}_{t}}\right)$ can be approximated using the Hutchinson estimator. The above ODE can be solved together with the ODEs for forward and backward propagation through the flow with the integrator of our choice. The resulting flow is termed FFJORD (Grathwohl et al., 2018). A special case of FFJORD is the Monge–Ampère flow (Zhang et al., 2018), which takes ${g}_{t}$ to be the gradient of a scalar field (in which case ${g}_{t}$ has zero rotation everywhere).
The advantage of Neural ODEs and FFJORD is their flexibility. Unlike other normalizing flows where the architecture has to be restricted in some way or another, ${g}_{t}$ can be parameterized by any neural network, which can depend on ${\mathbf{u}}_{t}$ and $t$ in arbitrary ways. On the other hand, their disadvantage is that they necessitate using an ODE solver for sampling and calculating the density under the model, which can be slow compared to a single pass through a neural network. Additionally, unlike a neural network which has a fixed evaluation time, the evaluation time of an ODE solver may vary depending on the value of its input (i.e. its initial state).
2.6 Generative models without tractable densities
So far in this chapter I have focused on neural density estimators, that is, models whose density can be calculated efficiently. However, neural density estimation is not the only approach to generative modelling. Before I conclude the chapter, I will take a step back and briefly review generative models without tractable densities. I will discuss how such models relate to neural density estimators, how they differ from them, and what alternative capabilities they may offer.
2.6.1 Energybased models
In section 2.2.4, we defined a neural density estimator to be a neural network that takes a vector $\mathbf{x}$ and returns a real number ${f}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ such that for any parameter setting $\mathit{\varphi}$:
$${\int}_{{\mathbb{R}}^{D}}\mathrm{exp}\left({f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)\mathrm{d}\mathbf{x}=1.$$  (2.61) 
The above property allows us to interpret ${q}_{\varphi}\left(\mathbf{x}\right)=\mathrm{exp}\left({f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)$ as a density function. In order to enforce this property, we had to restrict the architecture of the neural network, which as we saw can hurt the flexibility of the model.
We can relax this restriction by requiring only that the integral of $\mathrm{exp}\left({f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)$ be finite. That is, for every parameter setting $\mathit{\varphi}$:
$$  (2.62) 
If that’s the case, we can still define a valid density function as follows:
$${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)=\frac{1}{{Z}_{\mathit{\varphi}}}\mathrm{exp}\left({f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right).$$  (2.63) 
Models defined this way are called energybased models. The quantity ${f}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ is known as the energy, and ${Z}_{\mathit{\varphi}}$ is known as the normalizing constant. Under the above definitions, a neural density estimator is an energybased model whose normalizing constant is always $1$.
The main advantage of energybased modelling is the increased flexibility in specifying ${f}_{\mathit{\varphi}}\left(\mathbf{x}\right)$. However, evaluating the density under an energybased model is intractable, since calculating the normalizing constant involves a highdimensional integral. There are ways to estimate the normalizing constant, e.g. via annealed importance sampling (Salakhutdinov and Murray, 2008) or model distillation (Papamakarios and Murray, 2015), but they are approximate and often expensive.
In addition to their density being intractable, energybased models typically don’t provide a mechanism of generating samples. Instead, one must resort to approximate methods such as Markovchain Monte Carlo (Murray, 2007; Neal, 1993) to generate samples. In contrast, some neural density estimators such Masked Autoregressive Flow or Real NVP are capable of generating exact independent samples directly.
Finally, the intractability of ${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ makes likelihoodbased training of energybased models less straightforward than of neural density estimators. In particular, the gradient of $\mathrm{log}{q}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ with respect to $\mathit{\varphi}$ is:
$$\frac{\partial}{\partial \mathit{\varphi}}\mathrm{log}{q}_{\mathit{\varphi}}\left(\mathbf{x}\right)=\frac{\partial}{\partial \mathit{\varphi}}{f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\frac{\partial}{\partial \mathit{\varphi}}\mathrm{log}{Z}_{\mathit{\varphi}}.$$  (2.64) 
By substituting in the definition of the normalizing constant and by exchanging the order of differentiation and integration, we can write the gradient of $\mathrm{log}{Z}_{\mathit{\varphi}}$ as follows:
$\frac{\partial}{\partial \mathit{\varphi}}}\mathrm{log}{Z}_{\mathit{\varphi}$  $={\displaystyle \frac{1}{{Z}_{\mathit{\varphi}}}}{\displaystyle \frac{\partial}{\partial \mathit{\varphi}}}{\displaystyle {\int}_{{\mathbb{R}}^{D}}}\mathrm{exp}\left({f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right)\mathrm{d}\mathbf{x}$  (2.65)  
$={\displaystyle {\int}_{{\mathbb{R}}^{D}}}{\displaystyle \frac{1}{{Z}_{\mathit{\varphi}}}}\mathrm{exp}\left({f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right){\displaystyle \frac{\partial}{\partial \mathit{\varphi}}}{f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\mathrm{d}\mathbf{x}$  (2.66)  
$={\displaystyle {\int}_{{\mathbb{R}}^{D}}}{q}_{\mathit{\varphi}}\left(\mathbf{x}\right){\displaystyle \frac{\partial}{\partial \mathit{\varphi}}}{f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\mathrm{d}\mathbf{x}$  (2.67)  
$=\underset{{q}_{\mathit{\varphi}}\left(\mathbf{x}\right)}{\mathbb{E}}\left({\displaystyle \frac{\partial}{\partial \mathit{\varphi}}}{f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right).$  (2.68) 
The above expectation is intractable, hence direct gradientbased optimization of the average log likelihood is impractical. There are ways to either approximate the above expectation or avoid computing it entirely, such as contrastive divergence (Hinton, 2002), score matching (Hyvärinen, 2005), or noisecontrastive estimation (Gutmann and Hyvärinen, 2012). In contrast, for neural density estimators the above expectation is always zero by construction, which makes likelihoodbased training by backpropagation readily applicable.
2.6.2 Latentvariable models and variational autoencoders
One way of increasing the expressivity of a simple generative model is by introducing latent variables. Latent variables are auxiliary variables $\mathbf{z}$ that are used to augment the data $\mathbf{x}$. For the purposes of this discussion we will assume that $\mathbf{z}$ is continuous (i.e. $\mathbf{z}\in {\mathbb{R}}^{K}$), but discrete latent variables are also possible.
Having decided on how many latent variables to introduce, one would typically define a joint density model ${q}_{\mathit{\varphi}}(\mathbf{x},\mathbf{z})$. This is often done by separately defining a prior model ${q}_{\mathit{\varphi}}\left(\mathbf{z}\right)$ and a conditional model ${q}_{\mathit{\varphi}}\left(\mathbf{x}\mathbf{z}\right)$. Then, the density of $\mathbf{x}$ is obtained by:
$${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)={\int}_{{\mathbb{R}}^{K}}{q}_{\mathit{\varphi}}(\mathbf{x},\mathbf{z})\mathrm{d}\mathbf{z}={\int}_{{\mathbb{R}}^{K}}{q}_{\mathit{\varphi}}\left(\mathbf{x}\mathbf{z}\right){q}_{\mathit{\varphi}}\left(\mathbf{z}\right)\mathrm{d}\mathbf{z}.$$  (2.69) 
The motivation for introducing latent variables and then integrating them out is that the marginal ${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ can be complex even if the prior ${q}_{\mathit{\varphi}}\left(\mathbf{z}\right)$ and the conditional ${q}_{\mathit{\varphi}}\left(\mathbf{x}\mathbf{z}\right)$ are not. Hence, one can obtain a complex model of the data despite using simple modelling components. Another way to view ${q}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ is as a mixture of infinitely many components ${q}_{\mathit{\varphi}}\left(\mathbf{x}\mathbf{z}\right)$ indexed by $\mathbf{z}$, weighted by ${q}_{\mathit{\varphi}}\left(\mathbf{z}\right)$. In fact, mixture models of finitely many components such as those discussed in section 2.2.1 can be thought of as latentvariable models with discrete latent variables.
Like energybased models, the density of a latentvariable model is intractable to calculate since it involves a highdimensional integral. If that integral is tractable, the model is essentially a neural density estimator and can be treated as such. So, for the purposes of this discussion, we’ll assume that the integral is always intractable by definition.
Even though the model density is intractable, it’s possible to lowerbound $\mathrm{log}{q}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ by introducing an auxiliary density model ${r}_{\bm{\psi}}\left(\mathbf{z}\mathbf{x}\right)$ with parameters $\bm{\psi}$ as follows:
$\mathrm{log}{q}_{\mathit{\varphi}}\left(\mathbf{x}\right)$  $=\mathrm{log}{\displaystyle {\int}_{{\mathbb{R}}^{K}}}{q}_{\mathit{\varphi}}\left(\mathbf{x}\mathbf{z}\right){q}_{\mathit{\varphi}}\left(\mathbf{z}\right)\mathrm{d}\mathbf{z}$  (2.70)  
$=\mathrm{log}\underset{{r}_{\bm{\psi}}\left(\mathbf{z}\mathbf{x}\right)}{\mathbb{E}}\left({\displaystyle \frac{{q}_{\mathit{\varphi}}\left(\mathbf{x}\mathbf{z}\right){q}_{\mathit{\varphi}}\left(\mathbf{z}\right)}{{r}_{\bm{\psi}}\left(\mathbf{z}\mathbf{x}\right)}}\right)$  (2.71)  
$\ge \underset{{r}_{\bm{\psi}}\left(\mathbf{z}\mathbf{x}\right)}{\mathbb{E}}\left(\mathrm{log}{q}_{\mathit{\varphi}}\left(\mathbf{x}\mathbf{z}\right)+\mathrm{log}{q}_{\mathit{\varphi}}\left(\mathbf{z}\right)\mathrm{log}{r}_{\bm{\psi}}\left(\mathbf{z}\mathbf{x}\right)\right),$  (2.72) 
where we used Jensen’s inequality to exchange the logarithm with the expectation. The above lower bound is often called evidence lower bound or ELBO. The ELBO is indeed a lower bound given any choice of conditional density ${r}_{\bm{\psi}}\left(\mathbf{z}\mathbf{x}\right)$, and it becomes equal to $\mathrm{log}{q}_{\mathit{\varphi}}\left(\mathbf{x}\right)$ if:
$${r}_{\bm{\psi}}\left(\mathbf{z}\mathbf{x}\right)={q}_{\mathit{\varphi}}\left(\mathbf{z}\mathbf{x}\right)=\frac{{q}_{\mathit{\varphi}}\left(\mathbf{x}\mathbf{z}\right){q}_{\mathit{\varphi}}\left(\mathbf{z}\right)}{{q}_{\mathit{\varphi}}\left(\mathbf{x}\right)}.$$  (2.73) 
In practice, the ELBO can be used as a tractable objective to train the model. Maximizing the average ELBO over training data with respect to $\mathit{\varphi}$ is equivalent to maximizing a lower bound to the average log likelihood. Moreover, maximizing the average ELBO with respect to $\bm{\psi}$ makes the ELBO a tighter lower bound. Stochastic gradients of the average ELBO with respect to $\mathit{\varphi}$ and $\bm{\psi}$ can be obtained by reparameterizing ${r}_{\bm{\psi}}\left(\mathbf{z}\mathbf{x}\right)$ with respect to its internal random variables (similarly to how we reparameterized MADE in section 2.3), and by estimating the expectation over the reparameterized ${r}_{\bm{\psi}}\left(\mathbf{z}\mathbf{x}\right)$ via Monte Carlo. A latentvariable model trained this way is known as a variational autoencoder (Kingma and Welling, 2014; Rezende et al., 2014). In the context of variational autoencoders, ${r}_{\bm{\psi}}\left(\mathbf{z}\mathbf{x}\right)$ is often referred to as the encoder and ${q}_{\mathit{\varphi}}\left(\mathbf{x}\mathbf{z}\right)$ as the decoder.
Apart from increasing the flexibility of the model, another motivation for introducing latent variables is representation learning. Consider for example a variational autoencoder that is used to model images of handwritten characters. Assume that the decoder ${q}_{\mathit{\varphi}}\left(\mathbf{x}\mathbf{z}\right)$ is deliberately chosen to factorize over the elements of $\mathbf{x}$, i.e. the pixels in an image. Clearly, a factorized distribution over pixels is unable to model global structure, which in our case corresponds to factors of variation such as the identity of a character, its shape, the style of handwriting, and so on. Therefore, information about global factors of variation such as the above must be encoded into the latent variables $\mathbf{z}$ if the model is to represent the data well. If the dimensionality of $\mathbf{z}$ is deliberately chosen to be smaller than that of $\mathbf{x}$ (the number of pixels), the latent variables correspond to a compressed encoding of the global factors of variation in an image. Given an image $\mathbf{x}$, plausible such encodings can be directly sampled from ${r}_{\bm{\psi}}\left(\mathbf{z}\mathbf{x}\right)$.
Finally, we can further encourage the latent variables to acquire semantic meaning by engineering the structure of the model appropriately. Examples include: hierarchicallystructured latent variables that learn to represent datasets (Edwards and Storkey, 2017), exchangeable groups of latent variables that learn to represent objects in a scene (Nash et al., 2017; Eslami et al., 2016), and latent variables with a Markovian structure that learn to represent environment states (Gregor et al., 2019; Buesing et al., 2018).
2.6.3 Neural samplers and generative adversarial networks
A neural sampler is a neural network ${f}_{\mathit{\varphi}}$ with parameters $\mathit{\varphi}$ that takes random unstructured noise $\mathbf{u}\in {\mathbb{R}}^{K}$ and transforms it into data $\mathbf{x}\in {\mathbb{R}}^{D}$:
$$\mathbf{x}={f}_{\mathit{\varphi}}\left(\mathbf{u}\right)\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}\mathbf{u}\sim {\pi}_{u}\left(\mathbf{u}\right).$$  (2.74) 
In the above, ${\pi}_{u}\left(\mathbf{u}\right)$ is a fixed source of noise (not necessarily a density model). Under this definition, normalizing flows with tractable sampling, such as Masked Autoregressive Flow or Real NVP, are also neural samplers. A typical variational autoencoder is also a neural sampler, in which case $\mathbf{u}$ are the internal random numbers needed to sample $\mathbf{z}$ from the prior and $\mathbf{x}$ from the decoder.
Typically, the main purpose of a neural sampler is to be used as a data generator and not as a density estimator. Hence, a neural sampler doesn’t need to have the restrictions of a normalizing flow: ${f}_{\mathit{\varphi}}$ doesn’t have to be bijective and $K$ (the dimensionality of the noise) doesn’t have to match $D$ (the dimensionality of the data). As a result, a neural sampler offers increased flexibility in specifying the neural network ${f}_{\mathit{\varphi}}$ compared to a normalizing flow.
A consequence of the freedom of specifying ${f}_{\mathit{\varphi}}$ and $K$ is that a neural sampler may no longer correspond to a valid density model of the data. For example, if ${f}_{\mathit{\varphi}}$ is not injective, a whole subset of ${\mathbb{R}}^{K}$ of nonzero measure under ${\pi}_{u}\left(\mathbf{u}\right)$ can be mapped to a single point in ${\mathbb{R}}^{D}$, where the model density will be infinite. Moreover, if $$, the neural sampler can only generate data on a $K$dimensional manifold embedded in ${\mathbb{R}}^{D}$. Hence, the model density will be zero almost everywhere, except for a manifold of zero volume where the model density will be infinite.
Training a neural sampler is typically done by generating a set of samples from the model, selecting a random subset of the training data, and calculating a measure of discrepancy between the two sets. If that discrepancy measure is differentiable, its gradient with respect to the model parameters $\mathit{\varphi}$ can be obtained by backpropagation. Then, the model can be trained by stochasticgradient descent, with stochastic gradients obtained by repeating the above procedure.
One way to obtain a differentiable discrepancy measure is via an additional neural network, known as the discriminator, whose task is to discriminate generated samples from training data. The classification performance of the discriminator (as measured by e.g. cross entropy) can be used as a differentiable discrepancy measure. In practice, the discriminator is trained at the same time as the neural sampler. Neural samplers trained this way are known as generative adversarial networks (Goodfellow et al., 2014). Other possible discrepancy measures between training data and generated samples include the maximum mean discrepancy (Dziugaite et al., 2015), or the Wasserstein distance (Arjovsky et al., 2017).
Generative adversarial networks often excel at datageneration performance. When trained as image models, they are able to generate highresolution images that are visually indistinguishable from real images (Brock et al., 2019; Karras et al., 2018a, b). However, their good performance as data generators doesn’t necessarily imply good performance in terms of average log likelihood. Grover et al. (2018) and Danihelka et al. (2017) have observed that a Real NVP trained as a generative adversarial network can generate realisticlooking samples, but it may perform poorly as a density estimator.
2.7 Summary and conclusions
The probability density of a generative process at a location $\mathbf{x}$ is how often the process generates data near $\mathbf{x}$ per unit volume. Density estimation is the task of modelling the density of a process given training data generated by the process. In this chapter, I discussed how to perform density estimation with neural networks, and why this may be a good idea.
Instead of densities, we could directly estimate other statistical quantities associated with the process, such as expectations of various functions under the process, or we could model the way data is generated by the process. Why estimate densities then? In section 2.1.1, I identified the following reasons as to why estimating densities can be useful:

1.
Bayesian inference is a calculus over densities.

2.
Accurate densities imply good data compression.

3.
The density of a model is a principled training and evaluation objective.

4.
Density models are useful components in other algorithms.
That said, I would argue that in practice we should first identify what task we are interested in, and then use the tool that is more appropriate for the task. If the task calls for a density model, then a density model should be used. Otherwise, a different solution may be more appropriate.
Estimating densities is hard, and it becomes dramatically harder as the dimensionality of the data increases. As I discussed in section 2.1.2, the reason is the curse of dimensionality: a dataset becomes sparser in higher dimensions, and naive density estimation doesn’t scale. In section 2.2, I argued that standard density estimation methods based on histograms or smoothing kernels are particularly susceptible to the curse of dimensionality.
Is highdimensional density estimation hopeless? In section 2.1.2 I argued that it’s not, as long as we are prepared to make assumptions about the densities we want to estimate, and encode these assumptions into our models. I identified the following assumptions that are often appropriate to make for realworld densities:

1.
They vary smoothly.

2.
Their intrinsic dimensionality is smaller than that of the data.

3.
They have symmetries.

4.
They involve variables that are loosely dependent.
Neural density estimators scale to high dimensions partly because their architecture can encode the above assumptions. In contrast, it’s not clear how these assumptions can be encoded into models such as parametric mixtures, histograms, or smoothing kernels.
The main contribution of this chapter is Masked Autoregressive Flow, a normalizing flow that consists of a stack of reparameterized autoregressive models. In section 2.4, I discussed the impact Masked Autoregressive Flow has had since its publication, how it contributed to further research, and what its main limitations have been.
Since the publication of Masked Autoregressive Flow in December 2017, the field of normalizing flows has experienced considerable research interest. In section 2.5, I reviewed advances in the field since the publication of the paper, and I examined their individual strengths and weaknesses. Even though normalizing flows have improved significantly, no model yet exists that has all the following characteristics:

1.
The density under the model can be exactly evaluated in one neuralnetwork pass (like Masked Autoregressive Flow).

2.
Exact independent samples can be generated in one neuralnetwork pass (like Inverse Autoregressive Flow).

3.
The modelling performance is on par with the best autoregressive models.
Of all the models that satisfy both (i) and (ii), the bestperforming ones are Real NVP and its successor Glow. However, as we saw in section 2.3, the coupling layers used by Real NVP and Glow are not as expressive as the autoregressive layers used by MAF and IAF. On the other hand, satisfying (iii) often requires sacrificing either (i) or (ii). Nonetheless, there is a lot of research activity currently in normalizing flows, and several new ideas such as Neural ODEs and FFJORD are beginning to emerge, so it’s reasonable to expect further advances in the near future.
Neural density estimation is only one approach to generative modelling, with variational autoencoders and generative adversarial networks being the main alternatives when exact density evaluations are not required. As I discussed in section 2.6, the advantage of variational autoencoders is their ability to learn structured representations, whereas the strength of generative adversarial networks is their performance in generating realistic data. As I argued earlier, choosing among a neural density estimator, a variational autoencoder and a generative adversarial network should primarily depend on the task we are interested in.
That said, normalizing flows, variational autoencoders and generative adversarial networks are all essentially the same model: a neural network that takes unstructured noise and turns it into data. The difference is in the requirements the different models must satisfy (a normalizing flow must be invertible), in the capabilities they offer (a normalizing flow provides explicit densities in addition to sampling), and in the way they are trained. The fact that generative adversarial networks can take white noise and turn it into convincingly realistic images should be seen as proof that (a) this task can be achieved by a smooth function, and (b) such a function is learnable from data. Since this task is demonstrably possible, it should serve as a performance target for neural density estimators too.
Part II Likelihoodfree inference
Chapter 3 Fast $\u03f5$free Inference of Simulator Models with Bayesian Conditional Density Estimation
This chapter is about likelihoodfree inference, i.e. Bayesian inference when the likelihood is intractable, and how neural density estimation can be used to address this challenge. We start by explaining what likelihoodfree inference is, in what situations the likelihood may be intractable, and why it is important to address this challenge (section 3.1). We then review approximate Bayesian computation, a family of methods for likelihoodfree inference based on simulations (section 3.2). The main contribution of this chapter is the paper Fast $\u03f5$free Inference of Simulator Models with Bayesian Conditional Density Estimation, which introduces a new method for likelihoodfree inference based on neural density estimation (section 3.3). We conclude the chapter with a discussion of the contributions, impact and limitations of the paper, and a comparison of the proposed method with previous and following work (section 3.4).
3.1 Likelihoodfree inference: what and why?
Suppose we have a stationary process that generates data and is controlled by parameters $\bm{\theta}$. Every time the process is run forward, it stochastically generates a datapoint $\mathbf{x}$ whose distribution depends on $\bm{\theta}$. We will assume that $\mathbf{x}$ and $\bm{\theta}$ are continuous vectors, but many of the techniques discussed in this thesis can be adapted for discrete $\mathbf{x}$ and/or $\bm{\theta}$. We will further assume that for every setting of $\bm{\theta}$, the corresponding distribution over $\mathbf{x}$ admits a finite density everywhere; in other words, the process defines a conditional density function $p\left(\mathbf{x}\bm{\theta}\right)$.
Given a datapoint ${\mathbf{x}}_{o}$ known to have been generated by the process, we are interested in inferring plausible parameter settings that could have generated ${\mathbf{x}}_{o}$. In particular, given prior beliefs over parameters described by a density $p\left(\bm{\theta}\right)$, we are interested in computing the posterior density $p(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$ obtained by Bayes’ rule:
$$p(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})=\frac{p\left({\mathbf{x}}_{o}\bm{\theta}\right)p\left(\bm{\theta}\right)}{Z\left({\mathbf{x}}_{o}\right)}\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}Z\left({\mathbf{x}}_{o}\right)=\int p({\mathbf{x}}_{o}{\bm{\theta}}^{\prime})p\left({\bm{\theta}}^{\prime}\right)\mathrm{d}{\bm{\theta}}^{\prime}.$$  (3.1) 
We assume that the normalizing constant $Z\left({\mathbf{x}}_{o}\right)$ is finite for every ${\mathbf{x}}_{o}$, so that the posterior density is always welldefined.
3.1.1 Density models and simulator models
The choice of inference algorithm depends primarily on how the generative process is modelled. In this section, I discuss and compare two types of models: density models and simulator models.
A density model, also known as an explicit model, describes the conditional density function of the process. Given values for $\mathbf{x}$ and $\bm{\theta}$, a density model returns the value of the conditional density $p\left(\mathbf{x}\bm{\theta}\right)$ (or an approximation of it if the model is not exact). With a density model, we can easily evaluate the posterior density $p(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$ up to a normalizing constant using Bayes’ rule (equation 3.1). Even though the normalizing constant $Z\left({\mathbf{x}}_{o}\right)$ is typically intractable, we can sample from the posterior using e.g. Markovchain Monte Carlo (Murray, 2007; Neal, 1993), or approximate the posterior with a more convenient distribution using e.g. variational inference (Ranganath et al., 2014; Kucukelbir et al., 2015; Blei et al., 2017) or knowledge distillation (Papamakarios and Murray, 2015; Papamakarios, 2015). We refer to such methods as likelihoodbased inference methods, as they explicitly evaluate the likelihood $p\left({\mathbf{x}}_{o}\bm{\theta}\right)$.
On the other hand, a simulator model, also known as an implicit model, describes how the process generates data. For any parameter setting $\bm{\theta}$, a simulator model can be run forward to generate exact independent samples from $p\left(\mathbf{x}\bm{\theta}\right)$ (or approximate independent samples if the model is not perfect). Unlike a density model, we can’t typically evaluate the density under a simulator model. In order to perform inference in a simulator model we need methods that make use of simulations from the model but don’t require density evaluations. We refer to such methods as likelihoodfree inference methods.
In general, likelihoodfree methods are less efficient than likelihoodbased methods. As we will see in sections 3.2 and 3.3, likelihoodfree methods can require a large number of simulations from the model to produce accurate results, even for fairly simple models. One of the main topics of this thesis, and of research in likelihoodfree inference in general, is how to reduce the required number of simulations without sacrificing inference quality.
Since likelihoodfree methods are less efficient than likelihoodbased methods, why should we bother with simulator models at all, and not just require all models to be density models instead? The answer is that simulators can often be more natural and more interpretable modelling tools than density functions. A simulator model is a direct specification of how a process of interest generates data, which can be closer to our understanding of the process as a working mechanism. In contrast, a density function is a mathematical abstraction that is often hard to interpret, and as a result it can be an unintuitive description of realworld phenomena.
Simulator models are particularly wellsuited for scientific applications. In highenergy physics for instance, simulators are used to model collisions in particle accelerators and the interaction of produced particles with particle detectors (Agostinelli et al., 2003; Sjöstrand et al., 2008). Inference in such models can be used to infer parameters of physical theories and potentially discover new physics (Brehmer et al., 2018a, b). Other scientific applications of simulator models and likelihoodfree inference methods include astronomy and cosmology (Schafer and Freeman, 2012; Alsing et al., 2018a, 2019), systems biology (Wilkinson, 2011), evolution and ecology (Pritchard et al., 1999; Ratmann et al., 2007; Wood, 2010; Beaumont, 2010), population generics (Beaumont et al., 2002), and computational neuroscience (Pospischil et al., 2008; Markram et al., 2015; Lueckmann et al., 2017; Sterratt et al., 2011)
In addition to their scientific applications, simulator models can be useful in engineering and artificial intelligence. For instance, one can build an image model using a computergraphics engine: the engine takes a highlevel description of a scene (which corresponds to the model parameters $\bm{\theta}$) and renders an image of that scene (which corresponds to the generated data $\mathbf{x}$). Computer vision and scene understanding can be cast as inference of scene parameters given an observed image (Mansinghka et al., 2013; Kulkarni et al., 2015; Romaszko et al., 2017). Other examples of applications of simulator models in artificial intelligence include perceiving physical properties of objects using a physics engine (Wu et al., 2015), and inferring the goals of autonomous agents using probabilistic programs (CusumanoTowner et al., 2017).
3.1.2 When is the likelihood intractable?
In the previous section, I argued that simulator models are useful modelling tools. Nonetheless, this doesn’t necessitate the use of likelihoodfree inference methods: given a simulator model, we could try to calculate its likelihood based on the simulator’s internal workings, and then use likelihoodbased inference methods instead. In this section, I will discuss certain cases in which it may be too expensive or even impossible to calculate the likelihood of a simulator, and thus likelihoodfree methods may be preferable or necessary.
Integrating over the simulator’s latent state
Typically, the stochasticity of a simulator model comes from internal generation of random numbers, which form the simulator’s latent state. Consider for example a simulator whose latent state consists of variables ${\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{K}$, generated from distributions ${p}_{{\bm{\psi}}_{1}},\mathrm{\dots},{p}_{{\bm{\psi}}_{K}}$ whose parameters ${\bm{\psi}}_{1},\mathrm{\dots},{\bm{\psi}}_{K}$ can be arbitrary functions of all variables preceding them. For simplicity, assume that all variables are continuous, and that $K$ is fixed (and not random). Such a simulator can be expressed as the following sequence of statements:
${\bm{\psi}}_{1}$  $={f}_{1}\left(\bm{\theta}\right)$  (3.2)  
${\mathbf{z}}_{1}$  $\sim {p}_{{\bm{\psi}}_{1}}$  (3.3)  
$\mathrm{\dots}$  
${\bm{\psi}}_{K}$  $={f}_{K}({\mathbf{z}}_{1:K1},\bm{\theta})$  (3.4)  
${\mathbf{z}}_{K}$  $\sim {p}_{{\bm{\psi}}_{K}}$  (3.5)  
$\mathit{\varphi}$  $=g({\mathbf{z}}_{1:K},\bm{\theta})$  (3.6)  
$\mathbf{x}$  $\sim {p}_{\mathit{\varphi}}.$  (3.7) 
If the distributions ${p}_{{\bm{\psi}}_{1}},\mathrm{\dots},{p}_{{\bm{\psi}}_{K}}$ and ${p}_{\mathit{\varphi}}$ admit tractable densities, the joint density of data $\mathbf{x}$ and latent state ${\mathbf{z}}_{1:K}$ is also tractable, and can be calculated by:
$$p(\mathbf{x},{\mathbf{z}}_{1:K}\bm{\theta})={p}_{\mathit{\varphi}}\left(\mathbf{x}\right)\prod _{k}{p}_{{\bm{\psi}}_{k}}\left({\mathbf{z}}_{k}\right).$$  (3.8) 
Given observed data ${\mathbf{x}}_{o}$, the likelihood of the simulator is:
$$p\left({\mathbf{x}}_{o}\bm{\theta}\right)=\int p({\mathbf{x}}_{o},{\mathbf{z}}_{1:K}\bm{\theta})\mathrm{d}{\mathbf{z}}_{1:K}.$$  (3.9) 
Since the likelihood of the simulator involves integration over the latent state, it is generally intractable.
Even when integration over the latent state makes the likelihood intractable, the joint likelihood $p({\mathbf{x}}_{o},{\mathbf{z}}_{1:K}\bm{\theta})$ may still be tractable (as in the above example). We could then sample from the joint posterior $p({\mathbf{z}}_{1:K},\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$ using e.g. MCMC, and simply discard the latentstate samples. However, this may be infeasible when the latent state is large. Consider for example a simulator of a physical process involving Brownian motion of a large set of particles. Inferring the latent state of the simulator would entail sampling from the space of all possible particle trajectories that are consistent with the data. If both the observed data ${\mathbf{x}}_{o}$ and the parameters of interest $\bm{\theta}$ correspond to macroscopic variables associated with the process, likelihoodfree inference methods that don’t involve inferring the microscopic state of the process may be preferable.
Deterministic transformations
Sometimes, the likelihood of a simulator model is intractable because the output of the simulator is a deterministic transformation of the simulator’s latent state. Consider for example the following simulator model:
$\mathbf{z}$  $\sim p\left(\mathbf{z}\bm{\theta}\right)$  (3.10)  
$\mathbf{x}$  $=f(\mathbf{z},\bm{\theta}),$  (3.11) 
where $\mathbf{z}$ is continuous and the density $p\left(\mathbf{z}\bm{\theta}\right)$ is tractable. The joint distribution of $\mathbf{x}$ and $\mathbf{z}$ does not admit a proper density, but can be expressed using a delta distribution as follows:
$$p(\mathbf{x},\mathbf{z}\bm{\theta})=\delta \left(\mathbf{x}f(\mathbf{z},\bm{\theta})\right)p\left(\mathbf{z}\bm{\theta}\right).$$  (3.12) 
The likelihood of the simulator is:
$$p\left({\mathbf{x}}_{o}\bm{\theta}\right)=\int \delta \left({\mathbf{x}}_{o}f(\mathbf{z},\bm{\theta})\right)p\left(\mathbf{z}\bm{\theta}\right)\mathrm{d}\mathbf{z}.$$  (3.13) 
Calculating the above integral requires a change of variables from $\mathbf{z}$ to $\mathbf{x}$ which is not tractable in general. Moreover, the distribution $p\left(\mathbf{x}\bm{\theta}\right)$ may not admit a proper density even if $p\left(\mathbf{z}\bm{\theta}\right)$ does.
Sampling from the joint posterior of $\mathbf{z}$ and $\bm{\theta}$ in order to avoid calculating the likelihood also poses difficulties. Given observed data ${\mathbf{x}}_{o}$, the joint posterior $p(\mathbf{z},\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$ is constrained on a manifold in the $(\mathbf{z},\bm{\theta})$space defined by ${\mathbf{x}}_{o}=f(\mathbf{z},\bm{\theta})$. One approach to sampling from this posterior is to replace the joint distribution $p(\mathbf{x},\mathbf{z}\bm{\theta})$ with the following approximation:
$${p}_{\u03f5}(\mathbf{x},\mathbf{z}\bm{\theta})=\mathcal{N}\left(\mathbf{x}f(\mathbf{z},\bm{\theta}),{\u03f5}^{2}\mathbf{I}\right)p\left(\mathbf{z}\bm{\theta}\right),$$  (3.14) 
where $\u03f5$ is a small positive constant. This is equivalent to replacing the original simulator model with the following one:
$\mathbf{z}$  $\sim p\left(\mathbf{z}\bm{\theta}\right)$  (3.15)  
${\mathbf{z}}^{\prime}$  $\sim \mathcal{N}(\mathrm{\U0001d7ce},\mathbf{I})$  (3.16)  
$\mathbf{x}$  $=f(\mathbf{z},\bm{\theta})+\u03f5{\mathbf{z}}^{\prime}.$  (3.17) 
An issue with this approach is that for small $\u03f5$ methods such as MCMC may struggle, whereas for large $\u03f5$ the posterior no longer corresponds to the original model. MCMC methods on constrained manifolds exist and may be used instead, but make assumptions about the manifold which may not hold for a simulator model in general (Brubaker et al., 2012; Graham and Storkey, 2017).
Inaccessible internal workings
Even if it is theoretically possible to use a likelihoodbased method with a simulator model, this may be difficult in practice if the simulator is a black box whose internal workings we don’t have access to. For example, the simulator may be provided as an executable program, or it may be written in a lowlevel language (e.g. for running efficiency), or it may be an external library routine. I would argue that a generalpurpose inference engine ought to have inference algorithms that support such cases.
Finally, the “simulator” may not be a model written in computer code, but an actual process in the real world. For example, the “simulator“ could be an experiment in a lab, asking participants to perform a task, or a measurement of a physical phenomenon. If a model doesn’t exist, likelihoodfree inference methods could be used directly in the real world.
3.2 Approximate Bayesian computation
In this section, I will discuss a family of likelihoodfree inference methods broadly known as approximate Bayesian computation or ABC. In general, ABC methods draw approximate samples from the parameter posterior by repeatedly simulating the model and checking whether simulated data match the observed data. I will discuss three types of ABC methods: rejection ABC, Markovchain Monte Carlo ABC, and sequential Monte Carlo ABC.
3.2.1 Rejection ABC
Let ${B}_{\u03f5}\left({\mathbf{x}}_{o}\right)$ be a neighbourhood of ${\mathbf{x}}_{o}$, defined as the set of datapoints whose distance from ${\mathbf{x}}_{o}$ is no more than $\u03f5$:
$${B}_{\u03f5}\left({\mathbf{x}}_{o}\right)=\{\mathbf{x}:\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5\}.$$  (3.18) 
In the above, $\parallel \cdot \parallel $ can be the Euclidean norm, or any other norm in ${\mathbb{R}}^{D}$. For a small $\u03f5$, we can approximate the likelihood $p\left({\mathbf{x}}_{o}\bm{\theta}\right)$ by the average conditional density inside ${B}_{\u03f5}\left({\mathbf{x}}_{o}\right)$:
$$p\left({\mathbf{x}}_{o}\bm{\theta}\right)\approx \frac{\mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5\bm{\theta})}{\left{B}_{\u03f5}\left({\mathbf{x}}_{o}\right)\right},$$  (3.19) 
where $\left{B}_{\u03f5}\left({\mathbf{x}}_{o}\right)\right$ is the volume of ${B}_{\u03f5}\left({\mathbf{x}}_{o}\right)$. Substituting the above likelihood approximation into Bayes’ rule (equation 3.1), we obtain the following approximation to the posterior:
$$p(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})\approx \frac{\mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5\bm{\theta})p\left(\bm{\theta}\right)}{\int \mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5{\bm{\theta}}^{\prime})p\left({\bm{\theta}}^{\prime}\right)\mathrm{d}{\bm{\theta}}^{\prime}}=p(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5).$$  (3.20) 
The approximate posterior $p(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)$ can be thought of as the exact posterior under an alternative observation, namely that generated data $\mathbf{x}$ falls inside the neighbourhood ${B}_{\u03f5}\left({\mathbf{x}}_{o}\right)$. As $\u03f5$ approaches zero, ${B}_{\u03f5}\left({\mathbf{x}}_{o}\right)$ becomes infinitesimally small, and the approximate posterior approaches the exact posterior $p(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$ (for a formal proof under suitable regularity conditions, see e.g. Prangle, 2017, supplementary material, theorem 1). Conversely, as $\u03f5$ approaches infinity, ${B}_{\u03f5}\left({\mathbf{x}}_{o}\right)$ covers the whole space, and the approximate posterior approaches the prior $p\left(\bm{\theta}\right)$.
Rejection ABC (Pritchard et al., 1999) is a rejectionsampling method for obtaining independent samples from the approximate posterior $p(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)$. It works by first sampling parameters from the prior $p\left(\bm{\theta}\right)$, then simulating the model using the sampled parameters, and only accepting the sample if the simulated data is no further than a distance $\u03f5$ away from ${\mathbf{x}}_{o}$. The parameter $\u03f5$ controls the tradeoff between approximation quality and computational efficiency: as $\u03f5$ becomes smaller, the accepted samples follow more closely the exact posterior, but the algorithm accepts less often. Rejection ABC is shown in algorithm 1.
An issue with rejection ABC is that it can become prohibitively expensive for small $\u03f5$, especially when the data is highdimensional. As an illustration, consider the case where $\parallel \cdot \parallel $ is the maximum norm, and hence the acceptance region ${B}_{\u03f5}\left({\mathbf{x}}_{o}\right)$ is a cube of side $2\u03f5$ centred at ${\mathbf{x}}_{o}$. For small $\u03f5$, the acceptance probability can be approximated as follows:
$$\mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5\bm{\theta})\approx p({\mathbf{x}}_{o}\bm{\theta})\left{B}_{\u03f5}\left({\mathbf{x}}_{o}\right)\right=p({\mathbf{x}}_{o}\bm{\theta}){\left(2\u03f5\right)}^{D},$$  (3.21) 
where $D$ is the dimensionality of $\mathbf{x}$. As $\u03f5\to 0$, the acceptance probability approaches zero at a rate of $\mathcal{O}\left({\u03f5}^{D}\right)$. Moreover, as $D$ grows large, the acceptance probability approaches zero for any $$. To put that into perspective, if $p\left({\mathbf{x}}_{o}\bm{\theta}\right)=1$, $\u03f5=0.05$ and $D=15$, we would need on average about a thousand trillion simulations for each accepted sample!
In practice, in order to maintain the acceptance probability at manageable levels, we typically transform the data into a small number of features, also known as summary statistics, in order to reduce the dimensionality $D$. If the summary statistics are sufficient for inferring $\bm{\theta}$, then turning data into summary statistics incurs no loss of information about $\bm{\theta}$ (for a formal definition of sufficiency, see e.g. Wasserman, 2010, definition 9.32). However, it can be hard to find sufficient summary statistics, and so in practice summary statistics are often chosen based on domain knowledge about the inference task. A large section of the ABC literature is devoted to constructing good summary statistics (see e.g. Fearnhead and Prangle, 2012; Blum et al., 2013; Prangle et al., 2014; Charnock et al., 2018, and references therein).
So far I’ve described rejection ABC as a rejectionsampling algorithm for the approximate posterior obtained by replacing $p\left({\mathbf{x}}_{o}\bm{\theta}\right)$ by the average conditional density near ${\mathbf{x}}_{o}$. An alternative way to view rejection ABC is as a kernel density estimate of the joint density $p(\bm{\theta},\mathbf{x})$ that has been conditioned on ${\mathbf{x}}_{o}$. Let $\{({\bm{\theta}}_{1},{\mathbf{x}}_{1}),\mathrm{\dots},({\bm{\theta}}_{N},{\mathbf{x}}_{N})\}$ be a set of joint samples from $p(\bm{\theta},\mathbf{x})$, obtained as follows:
$${\bm{\theta}}_{n}\sim p\left(\bm{\theta}\right)\mathit{\hspace{1em}}\text{and}\mathit{\hspace{1em}}{\mathbf{x}}_{n}\sim p\left(\mathbf{x}{\bm{\theta}}_{n}\right).$$  (3.22) 
Let ${k}_{\u03f5}(\cdot )$ be a smoothing kernel, and let $\delta (\cdot )$ be the delta distribution. We form the following approximation to the joint density, which is a kernel density estimate in $\mathbf{x}$ and an empirical distribution in $\bm{\theta}$:
$$\widehat{p}(\bm{\theta},\mathbf{x})=\frac{1}{N}\sum _{n}{k}_{\u03f5}\left(\mathbf{x}{\mathbf{x}}_{n}\right)\delta \left(\bm{\theta}{\bm{\theta}}_{n}\right).$$  (3.23) 
By conditioning on ${\mathbf{x}}_{o}$ and normalizing, we obtain an approximate posterior in the form of a weighted empirical distribution:
$$\widehat{p}(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})=\frac{\widehat{p}(\bm{\theta},{\mathbf{x}}_{o})}{\int \widehat{p}(\bm{\theta},{\mathbf{x}}_{o})\mathrm{d}\bm{\theta}}=\frac{{\sum}_{n}{k}_{\u03f5}\left({\mathbf{x}}_{o}{\mathbf{x}}_{n}\right)\delta \left(\bm{\theta}{\bm{\theta}}_{n}\right)}{{\sum}_{n}{k}_{\u03f5}\left({\mathbf{x}}_{o}{\mathbf{x}}_{n}\right)}.$$  (3.24) 
If we use the following uniform smoothing kernel:
$${k}_{\u03f5}\left(\mathbf{u}\right)\propto I(\parallel \mathbf{u}\parallel \le \u03f5)=\{\begin{array}{cc}1\hfill & \parallel \mathbf{u}\parallel \le \u03f5\hfill \\ 0\hfill & \mathrm{otherwise},\hfill \end{array}$$  (3.25) 
then $\widehat{p}(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$ is simply the empirical distribution of samples returned by rejection ABC. The above can be seen as a way to generalize rejection ABC to smoothrejection ABC, where we weight each prior sample ${\bm{\theta}}_{n}$ by a smoothing kernel ${k}_{\u03f5}\left({\mathbf{x}}_{o}{\mathbf{x}}_{n}\right)$ instead of accepting or rejecting it (Beaumont et al., 2002).
Additionally, we can interpret $\widehat{p}(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$ as the empirical distribution of the exact posterior of a different model, as given by importance sampling. Consider an alternative simulator model, obtained by adding noise $\mathbf{u}\sim {k}_{\u03f5}\left(\mathbf{u}\right)$ to the output $\mathbf{x}$ of the original simulator:
$\mathbf{x}$  $\sim p\left(\mathbf{x}\bm{\theta}\right)$  (3.26)  
$\mathbf{u}$  $\sim {k}_{\u03f5}\left(\mathbf{u}\right)$  (3.27)  
${\mathbf{x}}^{\prime}$  $=\mathbf{x}+\mathbf{u}.$  (3.28) 
Suppose we observed ${\mathbf{x}}^{\prime}={\mathbf{x}}_{o}$, and want to perform inference on $\bm{\theta}$. The posterior $p(\bm{\theta},\mathbf{x}{\mathbf{x}}^{\prime}={\mathbf{x}}_{o})$ can be written as:
$$p(\bm{\theta},\mathbf{x}{\mathbf{x}}^{\prime}={\mathbf{x}}_{o})\propto {k}_{\u03f5}({\mathbf{x}}_{o}\mathbf{x})p(\bm{\theta},\mathbf{x}).$$  (3.29) 
We will form a weighted empirical distribution of the above posterior using importance sampling, with $p(\bm{\theta},\mathbf{x})$ as the proposal. First, we sample $\{({\bm{\theta}}_{1},{\mathbf{x}}_{1}),\mathrm{\dots},({\bm{\theta}}_{N},{\mathbf{x}}_{N})\}$ from $p(\bm{\theta},\mathbf{x})$, and then we form the following importanceweighted empirical distribution:
$$\widehat{p}(\bm{\theta},\mathbf{x}{\mathbf{x}}^{\prime}={\mathbf{x}}_{o})=\frac{{\sum}_{n}{k}_{\u03f5}\left({\mathbf{x}}_{o}{\mathbf{x}}_{n}\right)\delta \left(\bm{\theta}{\bm{\theta}}_{n}\right)\delta \left(\mathbf{x}{\mathbf{x}}_{n}\right)}{{\sum}_{n}{k}_{\u03f5}\left({\mathbf{x}}_{o}{\mathbf{x}}_{n}\right)}.$$  (3.30) 
By marginalizing out $\mathbf{x}$ (i.e. by discarding samples ${\mathbf{x}}_{1},\mathrm{\dots},{\mathbf{x}}_{N}$), we obtain the weighted empirical distribution $\widehat{p}(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$ as given by equation (3.24). This shows that smoothrejection ABC, and by extension rejection ABC, can be understood as sampling from the exact posterior of an alternative model. In this interpretation, $\u03f5$ controls the extent to which the alternative model is different to the original one. This interpretation of rejection ABC as sampling from a different model was proposed by Wilkinson (2013) using rejection sampling; here I provide an alternative perspective using importance sampling instead.
3.2.2 Markovchain Monte Carlo ABC
As we saw in the previous section, rejection ABC proposes parameters from the prior $p\left(\bm{\theta}\right)$, and only accepts parameters that are likely under the approximate posterior $p(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)$. If the approximate posterior is significantly narrower than the prior, as is often the case in practice, a large fraction of proposed parameters will be rejected.
An alternative approach that can lead to fewer rejections is to propose parameters using Markovchain Monte Carlo in the form of Metropolis–Hastings. Instead of proposing parameters independently from the prior, Metropolis–Hastings proposes new parameters by e.g. perturbing previously accepted parameters. If the perturbation is not too large, the proposed parameters are likely to be accepted too, which can lead to fewer rejections compared to proposing from the prior.
A likelihoodbased Metropolis–Hastings algorithm that targets $p(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)$ is shown in algorithm 2. The algorithm uses a proposal distribution $q\left({\bm{\theta}}^{\prime}\bm{\theta}\right)$ from which it proposes new parameters ${\bm{\theta}}^{\prime}$ based on previously accepted parameters $\bm{\theta}$. Then, the following quantity is calculated, known as the acceptance ratio:
$$\alpha =\frac{\mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5{\bm{\theta}}^{\prime})p\left({\bm{\theta}}^{\prime}\right)q(\bm{\theta}{\bm{\theta}}^{\prime})}{\mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5\bm{\theta})p\left(\bm{\theta}\right)q({\bm{\theta}}^{\prime}\bm{\theta})}.$$  (3.31) 
Finally, the algorithm outputs the proposed parameters ${\bm{\theta}}^{\prime}$ with probability $\mathrm{min}(1,\alpha )$, otherwise it outputs the previous parameters $\bm{\theta}$. It can be shown that this procedure leaves the approximate posterior invariant; that is, if $\bm{\theta}$ is a sample from $p(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)$, then the algorithm’s output is also a sample from $p(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)$ (for a proof, see Bishop, 2006, section 11.2.2).
In the likelihoodfree setting, we cannot evaluate the approximate likelihood $\mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5\bm{\theta})$, but we can estimate it as the fraction of simulated data that are at most distance $\u03f5$ away from the observed data ${\mathbf{x}}_{o}$:
$$\mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5\bm{\theta})\approx \frac{1}{N}\sum _{n}I(\parallel {\mathbf{x}}_{n}{\mathbf{x}}_{o}\parallel \le \u03f5)\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}{\mathbf{x}}_{n}\sim p(\mathbf{x}\bm{\theta}).$$  (3.32) 
The above estimate is unbiased for any number of simulations $N$, that is, its expectation is equal to $\mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5\bm{\theta})$. It can be shown that Metropolis–Hastings leaves the target distribution invariant even if we use an unbiased estimate of the target distribution instead of its exact value in the acceptance ratio, as long as the estimate itself is made part of the Markov chain (for a proof, see Andrieu and Roberts, 2009). This means that we can use the above unbiased estimate of the approximate likelihood in the acceptance ratio and obtain a valid likelihoodfree Metropolis–Hastings algorithm. This algorithm, which I refer to as pseudomarginal Metropolis–Hastings, is shown in algorithm 3.
The estimate in equation (3.32) is unbiased for any number of simulations $N$, hence pseudomarginal Metropolis–Hastings is valid for any $N$. For larger $N$, the approximatelikelihood estimate has lower variance, but the runtime of the algorithm increases as more simulations are required per step. Overall, as discussed by Bornn et al. (2017), the efficiency of the algorithm doesn’t necessarily increase with $N$, and so $N=1$ is usually good enough. In the case of $N=1$, i.e. where only one datapoint $\mathbf{x}$ is simulated from $p\left(\mathbf{x}{\bm{\theta}}^{\prime}\right)$ per step, and assuming $\bm{\theta}$ is a previously accepted sample, the acceptance ratio simplifies into the following:
$$\alpha =\{\begin{array}{cc}\frac{p\left({\bm{\theta}}^{\prime}\right)q\left(\bm{\theta}{\bm{\theta}}^{\prime}\right)}{p\left(\bm{\theta}\right)q\left({\bm{\theta}}^{\prime}\bm{\theta}\right)}\hfill & \text{if}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5,\hfill \\ 0\hfill & \text{otherwise.}\hfill \end{array}$$  (3.33) 
For $N=1$, pseudomarginal Metropolis–Hastings can be simplified into algorithm 4, which is known as Markovchain Monte Carlo ABC. MCMCABC was proposed by Marjoram et al. (2003) prior to the introduction of pseudomarginal MCMC by Andrieu and Roberts (2009). Here I provide an alternative perspective, where I describe MCMCABC as pseudomarginal MCMC.
Compared to rejection ABC, which proposes from the prior and thus can lead to most proposals being rejected, MCMCABC tends to accept more often. As discussed earlier, we can think of the proposal $q\left({\bm{\theta}}^{\prime}\bm{\theta}\right)$ as randomly ‘perturbing’ accepted parameters $\bm{\theta}$ in order to propose new parameters ${\bm{\theta}}^{\prime}$; if the perturbation is small, the proposed parameters are also likely to be accepted. However, unlike rejection ABC which produces independent samples, MCMCABC produces dependent samples, which can lead to low effective sample size.
Similarly to rejection ABC, the acceptance probability of MCMCABC vanishes as $\u03f5$ becomes small or as the data dimensionality becomes large. In practice, the data is typically transformed into summary statistics in order to reduce the data dimensionality. Another issue with MCMCABC is the initialization of the Markov chain: if the initial parameters $\bm{\theta}$ are unlikely to generate data near ${\mathbf{x}}_{o}$, the chain may be stuck at its initial state for a long time. In practice, one possibility is to run rejection ABC until one parameter sample is accepted, and initialize the Markov chain with that parameter sample.
3.2.3 Sequential Monte Carlo ABC
As we have seen, the low acceptance rate of rejection ABC is partly due to parameters being proposed from the prior $p\left(\bm{\theta}\right)$. Indeed, if the approximate posterior $p(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)$ is significantly narrower than the prior, it’s highly unlikely that a prior sample will be accepted. In the previous section, I discussed an approach for increasing the acceptance rate based on perturbing accepted parameters using Markovchain Monte Carlo. In this section, I will discuss an alternative approach based on importance sampling.
To begin with, consider a variant of rejection ABC where instead of proposing parameters from the prior $p\left(\bm{\theta}\right)$, we propose parameters from an alternative distribution $\stackrel{~}{p}\left(\bm{\theta}\right)$. Parameter samples accepted under this procedure will be distributed according to the following distribution:
$$\stackrel{~}{p}(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)\propto \mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5\bm{\theta})\stackrel{~}{p}\left(\bm{\theta}\right).$$  (3.34) 
Assuming $\stackrel{~}{p}\left(\bm{\theta}\right)$ is nonzero in the support of the prior, we can approximate $p(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)$ with a population of $N$ importanceweighted samples $\{({w}_{1},{\bm{\theta}}_{1}),\mathrm{\dots},({w}_{N},{\bm{\theta}}_{N})\}$, where each ${\bm{\theta}}_{n}$ is obtained using the above procedure, and $\{{w}_{1},\mathrm{\dots},{w}_{N}\}$ are importance weights that are normalized to sum to $1$ and are calculated by:
$${w}_{n}\propto \frac{p({\bm{\theta}}_{n}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)}{\stackrel{~}{p}({\bm{\theta}}_{n}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)}\propto \frac{\mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5{\bm{\theta}}_{n})p\left({\bm{\theta}}_{n}\right)}{\mathrm{Pr}(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5{\bm{\theta}}_{n})\stackrel{~}{p}\left({\bm{\theta}}_{n}\right)}=\frac{p\left({\bm{\theta}}_{n}\right)}{\stackrel{~}{p}\left({\bm{\theta}}_{n}\right)}.$$  (3.35) 
The above procedure, which I refer to as importancesampling ABC, is shown in algorithm 5. Importancesampling ABC reduces to rejection ABC if $\stackrel{~}{p}\left(\bm{\theta}\right)=p\left(\bm{\theta}\right)$.
The efficiency of importancesampling ABC depends heavily on the choice of the proposal $\stackrel{~}{p}\left(\bm{\theta}\right)$. If the proposal is too broad the acceptance rate will be low, whereas if the proposal is too narrow the importance weights will have high variance (see Alsing et al., 2018b, for a discussion on the optimality of the proposal). In practice, one approach for constructing a proposal is to perform a preliminary run of importancesampling ABC with ${\u03f5}^{\prime}>\u03f5$, obtain a population of importanceweighted parameter samples $\{({w}_{1},{\bm{\theta}}_{1}),\mathrm{\dots},({w}_{N},{\bm{\theta}}_{N})\}$, and take $\stackrel{~}{p}\left(\bm{\theta}\right)$ to be a mixture distribution defined by:
$$\stackrel{~}{p}\left(\bm{\theta}\right)=\sum _{n}{w}_{n}q\left(\bm{\theta}{\bm{\theta}}_{n}\right).$$  (3.36) 
Similar to MCMCABC, $q\left(\bm{\theta}{\bm{\theta}}_{n}\right)$ can be thought of as a way to perturb ${\bm{\theta}}_{n}$. For instance, a common choice is to take $q\left(\bm{\theta}{\bm{\theta}}_{n}\right)$ to be a Gaussian distribution centred at ${\bm{\theta}}_{n}$, which corresponds to adding zeromean Gaussian noise to ${\bm{\theta}}_{n}$.
Based on the above idea, we can run a sequence of $T$ rounds of importancesampling ABC with ${\u03f5}_{1}>\mathrm{\dots}>{\u03f5}_{T}$, and construct the proposal for round $t$ using the importanceweighted population obtained in round $t1$. The first round can use the prior $p\left(\bm{\theta}\right)$ as proposal and a large enough ${\u03f5}_{1}$ so that the acceptance rate is not too low. This procedure is known as sequential Monte Carlo ABC, and is shown in algorithm 6. Further discussion on SMCABC and different variants of the algorithm are provided by Sisson et al. (2007); Bonassi and West (2015); Beaumont et al. (2009); Toni et al. (2009).
An issue with sequential Monte Carlo more broadly (and not just in the context of ABC) is sample degeneracy, which occurs when only few samples have nonnegligible weights. If that’s the case, the effective sample size, i.e. the number of independent samples that would be equivalent to the weighted population, is significantly less than the population size $N$. One strategy for avoiding sample degeneracy, which is shown in algorithm 6, is to estimate the effective sample size after each round of importancesampling ABC; if the estimated effective sample size is less than a threshold (e.g. $N/2$), we resample the population with probabilities given by the weights in order to obtain a new population with equal weights. In the context of ABC in particular, the above resampling may not be necessary, since sampling the next population from the proposal distribution already diversifies its samples. In any case, it’s a good idea to monitor the effective sample size of each population as the algorithm progresses, as a consistently low sample size can serve as a warning of potentially poor performance. In practice, the effective sample size $S$ of a weighted population $\{({w}_{1},{\bm{\theta}}_{1}),\mathrm{\dots},({w}_{N},{\bm{\theta}}_{N})\}$ can be estimated by:
$$\widehat{S}=\frac{1}{{\sum}_{n}{w}_{n}^{2}}.$$  (3.37) 
A justification of the above estimate is provided by Nowozin (2015).
SMCABC is a strong baseline for likelihoodfree inference; its acceptance rate is typically higher than rejection ABC, and it can be easier to tune than MCMCABC. One way to use SMCABC in practice is to start with a tolerance ${\u03f5}_{1}$ that is large enough for rejection ABC to be practical, and exponentially decay it after each round in order to obtain increasingly accurate posterior samples. Nevertheless, even though SMCABC can produce a good proposal distribution after a few rounds, it doesn’t solve the problem of the acceptance probability eventually vanishing as $\u03f5$ becomes small. In practice, as $\u03f5$ decreases in every round, the required number of simulations increases, and can thus become prohibitively large after a few rounds.
3.3 The paper
This section presents the paper Fast $\u03f5$free Inference of Simulator Models with Bayesian Conditional Density Estimation, which is the main contribution of this chapter. In the paper, we cast likelihoodfree inference as densityestimation problem: we estimate the posterior density using a neural density estimator trained on data simulated from the model. We employ a Bayesian mixturedensity network as our neural density estimator, and we describe how to use the network in order to guide simulations and thus accelerate the training process.
The paper was first published as a preprint on arXiv in May 2016. Afterwards, it was accepted for publication at the conference Advances in Neural Information Processing Systems (NeurIPS) in December 2016. There were about $2,500$ submissions to NeurIPS in 2016, of which $568$ were accepted for publication.
Author contributions
The paper is coauthored by me and Iain Murray. We jointly conceived and developed the method. As the leading author, I wrote the code, performed the experiments, and drafted the paper. Iain Murray supervised the project, and revised the final draft.
Differences in notation
For precision, in this chapter I use $p(\cdot )$ for probability densities and $\mathrm{Pr}(\cdot )$ for probability measures. The paper doesn’t make this distinction; instead, it uses the term ‘probability’ for both probability densities and probability measures, and denotes both of them by $p(\cdot )$ or similar. This overloaded notation, common in the machinelearning literature, should be interpreted appropriately depending on context.
Throughout this chapter I use $\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5$ to denote the acceptance condition of ABC; the paper uses $$ instead. The difference is immaterial, since the set $\{\mathbf{x}:\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel =\u03f5\}$ has zero probability measure by assumption.
Finally, the paper uses ${\u27e8f\left(\mathbf{x}\right)\u27e9}_{p\left(\mathbf{x}\right)}$ to denote the expectation of a function $f\left(\mathbf{x}\right)$ with respect to a distribution $p\left(\mathbf{x}\right)$, whereas the rest of the thesis uses ${\mathbb{E}}_{p\left(\mathbf{x}\right)}\left(f\left(\mathbf{x}\right)\right)$.
Minor corrections
Section 2.1 of the paper states that $\mathbf{x}={\mathbf{x}}_{o}$ corresponds to setting $\u03f5=0$ in $$. Strictly speaking, this statement is incorrect, since $$ is the empty set. For this reason, in this chapter I use $\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5$, such that setting $\u03f5$ to zero correctly implies $\mathbf{x}={\mathbf{x}}_{o}$.
Section 4 of the paper misrepresents the work of Fan et al. (2013) as a synthetic likelihood method that trains “a separate density model of $\mathbf{x}$ for each $\bm{\theta}$ by repeatedly simulating the model for fixed $\bm{\theta}$”. This is incorrect; in reality, Fan et al. (2013) use a single conditionaldensity model of $\mathbf{x}$ given $\bm{\theta}$ that is trained on pairs $(\bm{\theta},\mathbf{x})$ independently simulated from a joint density $r\left(\bm{\theta}\right)p\left(\mathbf{x}\bm{\theta}\right)$, where $r\left(\bm{\theta}\right)$ is some reference density.
See pages  of papers/efree.pdf
3.4 Discussion
I conclude this chapter with a discussion of the paper Fast $\u03f5$free Inference of Simulator Models with Bayesian Conditional Density Estimation, which was presented in the previous section. In what follows, I will discuss the paper’s contribution and impact so far, I will compare the proposed method with regression adjustment, I will criticize the paper’s limitations, and I will discuss further advances inspired by the proposed method.
3.4.1 Contribution and impact
The paper establishes neural density estimation as a viable method for likelihoodfree inference. Inference is viewed as a densityestimation problem, where the unknown posterior is the target density to estimate, and the simulator is the source of training data. Unlike ABC methods such as those discussed in section 3.2, likelihoodfree inference via neural density estimation doesn’t involve accepting/rejecting parameter samples, and consequently it doesn’t require specifying a tolerance $\u03f5$. The paper demonstrates that it is possible for neural density models to estimate joint posterior densities accurately, and without the need to replace the exact posterior $p(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$ by an approximate posterior $p(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)$ as typically done in ABC.
Furthermore, the paper demonstrates that likelihoodfree inference via neural density estimation can lead to massive computational savings in simulation cost compared to traditional ABC. This is achieved by a sequential training procedure, where preliminary estimates of the posterior are used as proposals for obtaining more training data. This sequential procedure is similar to the one used in sequential Monte Carlo ABC, where new parameters are proposed by perturbing previous posterior samples obtained with a larger $\u03f5$ (as explained in section 3.2.3). As the paper shows, the proposed method can estimate the joint density of the entire posterior with roughly the same number of simulations as SMCABC needs to produce (the equivalent of) a single independent posterior sample.
According to Google Scholar, the paper has received $41$ citations as of April 2019. The proposed method has directly influenced subsequent developments in likelihoodfree inference, such as Sequential Neural Posterior Estimation (Lueckmann et al., 2017), which will be discussed in more detail in section 3.4.4, Sequential Neural Likelihood (Papamakarios et al., 2019), which is the topic of next chapter, and adaptive Gaussiancopula ABC (Chen and Gutmann, 2019). Finally, variants of the proposed method have been used in applications of likelihoodfree inference in cosmology (Alsing et al., 2018a, 2019) and in neuroscience (Lueckmann et al., 2017).
3.4.2 Comparison with regression adjustment
As mentioned in section 4 of the paper, regression adjustment is closely related to our work, and can be seen as a predecessor to likelihoodfree inference via neural density estimation. Here, I discuss in more depth what regression adjustment is, in what way it is similar to neural density estimation, and also how it differs.
Regression adjustment is typically performed as a postprocessing step after rejection ABC, to correct for the approximation incurred by using a nonzero tolerance $\u03f5$. Suppose we run rejection ABC with some tolerance $\u03f5$, until we obtain a set of $N$ joint samples $\{({\bm{\theta}}_{1},{\mathbf{x}}_{1}),\mathrm{\dots},({\bm{\theta}}_{N},{\mathbf{x}}_{N})\}$. As we have seen, each pair $({\bm{\theta}}_{n},{\mathbf{x}}_{n})$ is an independent joint sample from the following distribution:
$$p(\bm{\theta},\mathbf{x}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)\propto I(\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)p(\mathbf{x}\bm{\theta})p\left(\bm{\theta}\right).$$  (3.38) 
Marginally, each ${\bm{\theta}}_{n}$ is an independent sample from the approximate posterior $p(\bm{\theta}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)$. If $\u03f5=\mathrm{\infty}$ then all simulations are accepted, in which case each ${\bm{\theta}}_{n}$ is an independent sample from the prior $p\left(\bm{\theta}\right)$. The goal of regression adjustment is to transform each sample ${\bm{\theta}}_{n}$ into an ‘adjusted’ sample ${\bm{\theta}}_{n}^{\prime}$, such that ${\bm{\theta}}_{n}^{\prime}$ is an independent sample from the exact posterior $p(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$.
The first step of regression adjustment is to estimate the stochastic mapping from $\mathbf{x}$ to $\bm{\theta}$, where $\mathbf{x}$ and $\bm{\theta}$ are obtained by rejection ABC and are jointly distributed according to $p(\bm{\theta},\mathbf{x}\parallel \mathbf{x}{\mathbf{x}}_{o}\parallel \le \u03f5)$ as described above. Regression adjustment models this mapping with a parametric regressor of the following form:
$$\bm{\theta}={f}_{\mathit{\varphi}}(\mathbf{u},\mathbf{x}),$$  (3.39) 
where $\mathbf{u}$ is random noise from some distribution ${\pi}_{u}\left(\mathbf{u}\right)$ that doesn’t depend on $\mathbf{x}$, and $\mathit{\varphi}$ are the parameters of the regressor. The random noise $\mathbf{u}$ accounts for the stochasticity of the mapping from $\mathbf{x}$ to $\bm{\theta}$. We don’t need to specify the noise distribution ${\pi}_{u}\left(\mathbf{u}\right)$, but we need to design the function ${f}_{\mathit{\varphi}}(\cdot ,\mathbf{x})$ to be invertible, so that given $\bm{\theta}$ and $\mathbf{x}$ we can easily solve for $\mathbf{u}$. For example, Beaumont et al. (2002) used the following linear regressor:
$$\bm{\theta}=\mathrm{\mathbf{A}\mathbf{x}}+\bm{\beta}+\mathbf{u}\mathit{\hspace{1em}}\text{with}\mathit{\hspace{1em}}\mathit{\varphi}=\{\mathbf{A},\bm{\beta}\},$$  (3.40) 
whereas Blum and François (2010) used the following nonlinear regressor:
$$\bm{\theta}={g}_{{\mathit{\varphi}}_{1}}\left(\mathbf{x}\right)+{h}_{{\mathit{\varphi}}_{2}}\left(\mathbf{x}\right)\odot \mathbf{u}\mathit{\hspace{1em}}\text{with}\mathit{\hspace{1em}}\mathit{\varphi}=\{{\mathit{\varphi}}_{1},{\mathit{\varphi}}_{2}\}.$$  (3.41) 
In the above, ${g}_{{\mathit{\varphi}}_{1}}$ and ${h}_{{\mathit{\varphi}}_{2}}$ are neural networks parameterized by ${\mathit{\varphi}}_{1}$ and ${\mathit{\varphi}}_{2}$, and $\odot $ denotes the elementwise product.
In practice, the regressor ${f}_{\mathit{\varphi}}$ is trained on the set of samples $\{({\bm{\theta}}_{1},{\mathbf{x}}_{1}),\mathrm{\dots},({\bm{\theta}}_{N},{\mathbf{x}}_{N})\}$ obtained by rejection ABC. Assuming the noise $\mathbf{u}$ has zero mean, the linear regressor in equation (3.40) can be trained by minimizing the average squared error on the parameters as follows:
$$({\mathbf{A}}^{*},{\bm{\beta}}^{*})=\underset{\mathbf{A},\bm{\beta}}{\mathrm{arg}\mathrm{min}}\frac{1}{N}\sum _{n}{\parallel {\bm{\theta}}_{n}{\mathrm{\mathbf{A}\mathbf{x}}}_{n}\bm{\beta}\parallel}^{2}.$$  (3.42) 
The above optimization problem has a unique solution in closed form. The nonlinear regressor in equation (3.41) can be trained in two stages. In the first stage, we assume that $\mathbf{u}$ has zero mean, and we fit ${\mathit{\varphi}}_{1}$ by minimizing the average squared error on the parameters:
$${\mathit{\varphi}}_{1}^{*}=\underset{{\mathit{\varphi}}_{1}}{\mathrm{arg}\mathrm{min}}\frac{1}{N}\sum _{n}{\parallel {\bm{\theta}}_{n}{g}_{{\mathit{\varphi}}_{1}}\left({\mathbf{x}}_{n}\right)\parallel}^{2}.$$  (3.43) 
In the second stage, we assume that $\mathrm{log}\left\mathbf{u}\right$ has zero mean, and we fit ${\mathit{\varphi}}_{2}$ by minimizing the average squared error on the log absolute residuals:
$${\mathit{\varphi}}_{2}^{*}=\underset{{\mathit{\varphi}}_{2}}{\mathrm{arg}\mathrm{min}}\frac{1}{N}\sum _{n}{\parallel \mathrm{log}\left{\bm{\theta}}_{n}{g}_{{\mathit{\varphi}}_{1}^{*}}\left({\mathbf{x}}_{n}\right)\right\mathrm{log}\left{h}_{{\mathit{\varphi}}_{2}}\left({\mathbf{x}}_{n}\right)\right\parallel}^{2},$$  (3.44) 
where the log and absolutevalue operators are meant to be understood elementwise. The above two optimization problems don’t have a known closedform solution, but can be approximately solved by gradient methods.
The second step of regression adjustment is to use the trained regressor ${f}_{{\mathit{\varphi}}^{*}}$ to adjust the parameter samples $\{{\bm{\theta}}_{1},\mathrm{\dots},{\bm{\theta}}_{N}\}$ obtained by rejection ABC. First, we can obtain $N$ independent samples $\{{\mathbf{u}}_{1},\mathrm{\dots},{\mathbf{u}}_{N}\}$ from ${\pi}_{u}\left(\mathbf{u}\right)$ by solving ${\bm{\theta}}_{n}={f}_{{\mathit{\varphi}}^{*}}({\mathbf{u}}_{n},{\mathbf{x}}_{n})$ for ${\mathbf{u}}_{n}$ separately for each $n$. Then, assuming ${f}_{{\mathit{\varphi}}^{*}}$ captures the exact stochastic mapping from $\mathbf{x}$ to $\bm{\theta}$, we can generate $N$ samples $\{{\bm{\theta}}_{1}^{\prime},\mathrm{\dots},{\bm{\theta}}_{N}^{\prime}\}$ from the exact posterior $p(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$ by simply plugging ${\mathbf{u}}_{n}$ and ${\mathbf{x}}_{o}$ back into the regressor as follows:
$${\bm{\theta}}_{n}^{\prime}={f}_{{\mathit{\varphi}}^{*}}({\mathbf{u}}_{n},{\mathbf{x}}_{o}).$$  (3.45) 
For example, using the linear regressor in equation (3.40) we obtain:
$${\bm{\theta}}_{n}^{\prime}={\mathbf{A}}^{*}\left({\mathbf{x}}_{o}{\mathbf{x}}_{n}\right)+{\bm{\theta}}_{n},$$  (3.46) 
whereas using the nonlinear regressor in equation (3.41) we obtain:
$${\bm{\theta}}_{n}^{\prime}={g}_{{\mathit{\varphi}}_{1}^{*}}\left({\mathbf{x}}_{o}\right)+{h}_{{\mathit{\varphi}}_{2}^{*}}\left({\mathbf{x}}_{o}\right)\odot \frac{{\bm{\theta}}_{n}{g}_{{\mathit{\varphi}}_{1}^{*}}\left({\mathbf{x}}_{n}\right)}{{h}_{{\mathit{\varphi}}_{2}^{*}}\left({\mathbf{x}}_{n}\right)},$$  (3.47) 
where division is meant to be understood elementwise.
In practice, ${f}_{{\mathit{\varphi}}^{*}}$ will be an approximation to the exact stochastic mapping from $\mathbf{x}$ to $\bm{\theta}$, hence the adjusted samples $\{{\bm{\theta}}_{1}^{\prime},\mathrm{\dots},{\bm{\theta}}_{N}^{\prime}\}$ will only approximately follow the exact posterior. For the adjusted samples to be a good description of the exact posterior, ${f}_{{\mathit{\varphi}}^{*}}$ needs to be a good model of the stochastic mapping from $\mathbf{x}$ to $\bm{\theta}$ at least within a distance $\u03f5$ away from ${\mathbf{x}}_{o}$. Consequently, a less flexible regressor (such as the linear regressor described above) requires a smaller $\u03f5$, whereas a more flexible regressor can be used with a larger $\u03f5$.
Similar to regression adjustment, likelihoodfree inference via neural density estimation is based on modelling the stochastic relationship between $\bm{\theta}$ and $\mathbf{x}$. The difference is in the type of model employed and in the way the model is used. Neural density estimation uses a conditionaldensity model ${q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)$ (which in the paper is taken to be a mixturedensity network) with the following characteristics:

1.
We can calculate the conditional density under ${q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)$ for any $\bm{\theta}$ and $\mathbf{x}$.

2.
We can obtain exact independent samples from ${q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)$ for any $\mathbf{x}$.
The first property allows us to train ${q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)$ by maximum likelihood on simulated data, and thus estimate the density of the posterior. The second property allows us to obtain as many samples from the estimated posterior as we wish.
In contrast, regression adjustment models the stochastic relationship between $\bm{\theta}$ and $\mathbf{x}$ as follows:
$$\bm{\theta}={f}_{\mathit{\varphi}}(\mathbf{u},\mathbf{x})\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}\mathbf{u}\sim {\pi}_{u}\left(\mathbf{u}\right).$$  (3.48) 
We can think of the above as a reparameterized version of a neural density estimator in terms of its internal random numbers (the chapter on Masked Autoregressive Flow explores this idea in depth). The crucial difference is that regression adjustment only requires fitting the regressor ${f}_{\mathit{\varphi}}$, and it entirely avoids modelling the noise distribution ${\pi}_{u}\left(\mathbf{u}\right)$. As a consequence, regression adjustment can neither calculate densities under the model, nor generate samples from the model at will, as these operations require knowing ${\pi}_{u}\left(\mathbf{u}\right)$.
In principle, we can use a neural density estimator ${q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)$ to perform regression adjustment, as long as we can write ${q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)$ as an invertible transformation of noise as in equation (3.48). A mixturedensity network cannot be easily expressed this way, but a model such as Masked Autoregressive Flow (or any other normalizing flow) is precisely an invertible transformation of noise by design. Hence, we can use a normalizing flow to estimate the conditional density of $\bm{\theta}$ given $\mathbf{x}$, and then obtain posterior samples by adjusting simulated data. More precisely, given a simulated pair $({\bm{\theta}}_{n},{\mathbf{x}}_{n})$, we can use the trained normalizing flow to compute the noise ${\mathbf{u}}_{n}$ that would have generated ${\bm{\theta}}_{n}$ given ${\mathbf{x}}_{n}$, and then run the flow with noise ${\mathbf{u}}_{n}$ but this time conditioned on ${\mathbf{x}}_{o}$ to obtain a sample ${\bm{\theta}}_{n}^{\prime}$ from the exact posterior $p(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})$.
Nevertheless, I would argue that using a neural density estimator such as a normalizing flow for regression adjustment is an unnecessarily roundabout way of generating posterior samples. If we assume that the flow is a good model of the stochastic relationship between $\bm{\theta}$ and $\mathbf{x}$ (which is what regression adjustment requires), then we may as well obtain posterior samples by sampling the noise directly from ${\pi}_{u}\left(\mathbf{u}\right)$ (recall that the flow explicitly models the noise distribution). In other words, with neural density estimation there is no need to adjust simulated data to obtain posterior samples; we can obtain as many posterior samples as we like by sampling from the neural density model directly.
3.4.3 Limitations and criticism
So far I’ve discussed the contributions of the paper and the impact the paper has had in the field of likelihoodfree inference. In this section, I examine the paper from a critical perspective: I discuss the limitations of the proposed method, and identify its weaknesses.
Recall that the proposed method estimates the posterior in a series of rounds. In each round, parameter samples are proposed from a distribution $\stackrel{~}{p}\left(\bm{\theta}\right)$, then the neural density estimator ${q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)$ is trained on simulated data, and finally the posterior is estimated as follows:
$$\widehat{p}(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})\propto \frac{p\left(\bm{\theta}\right)}{\stackrel{~}{p}\left(\bm{\theta}\right)}{q}_{\mathit{\varphi}}(\bm{\theta}{\mathbf{x}}_{o}).$$  (3.49) 
The above posterior estimate is then used as the proposal for the next round, and the algorithm can thus progress for any number of rounds. In the first round, we may use the prior as proposal.
In order to ensure that the computation of the posterior estimate in equation (3.49) is analytically tractable, the proposed method restricts ${q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)$ to be a conditional Gaussian mixture model (i.e. a mixturedensity network), and assumes that the prior $p\left(\bm{\theta}\right)$ and the proposal $\stackrel{~}{p}\left(\bm{\theta}\right)$ are also Gaussian. The Gaussian assumption can be relaxed to some extent; equation (3.49) remains analytically tractable if we replace Gaussians with any other exponential family. However, the proposed algorithm is still tied to mixturedensity networks and can’t employ more advanced neural density estimators, such as Masked Autoregressive Flow. In addition, the algorithm can’t be used with arbitrary priors, which limits its applicability.
Even though mixturedensity networks can be made arbitrarily flexible, I would argue that, in principle, tying an inference algorithm to a particular neural density estimator should be viewed as a weakness. Having the inference algorithm and the neural density model as separate components of an inference framework encourages modularity and good software design. Furthermore, if an inference algorithm can interface with any neural density model, improvements in neural density estimation (such as those described in the previous chapter) will automatically translate to improvements in the inference algorithm.
An important limitation of the proposed method is that equation (3.49) is not guaranteed to yield a posterior estimate that integrates to $1$. For example, assume that $p\left(\bm{\theta}\right)\propto 1$, that ${q}_{\mathit{\varphi}}\left(\bm{\theta}{\mathbf{x}}_{o}\right)$ has a Gaussian component with covariance $\mathbf{S}$, and that $\stackrel{~}{p}\left(\bm{\theta}\right)$ is a Gaussian with covariance $\alpha \mathbf{S}$ for some $$; in other words, the proposal is narrower than one of the components of ${q}_{\mathit{\varphi}}\left(\bm{\theta}{\mathbf{x}}_{o}\right)$. In this case, dividing ${q}_{\mathit{\varphi}}\left(\bm{\theta}{\mathbf{x}}_{o}\right)$ by the proposal yields a Gaussian component with covariance:
$${\left({\mathbf{S}}^{1}{\left(\alpha \mathbf{S}\right)}^{1}\right)}^{1}=\frac{\alpha}{\alpha 1}\mathbf{S},$$  (3.50) 
which is negativedefinite. When this happens, the algorithm has no good way of continuing to make progress; in practice it terminates early and returns the posterior estimate from the previous round. The problem of negativedefinite covariances resulting from dividing one Gaussian by another is also encountered in expectation propagation (Minka, 2001).
In appendix C, the paper states the following:
[The neural density estimator] ${q}_{\mathit{\varphi}}\left(\bm{\theta}{\mathbf{x}}_{o}\right)$ is trained on parameters sampled from $\stackrel{~}{p}\left(\bm{\theta}\right)$, hence, if trained properly, it tends to be narrower than $\stackrel{~}{p}\left(\bm{\theta}\right)$.
While in practice this is often the case, I should emphasize that there is no guarantee that ${q}_{\mathit{\varphi}}\left(\bm{\theta}{\mathbf{x}}_{o}\right)$ will always be narrower than the proposal. Furthermore, the paper proceeds to state the following:
[The covariance] not being positivedefinite rarely happens, whereas it happening is an indication that the algorithm’s parameters have not been set up properly.
This was indeed my experience in the experiments of the paper. However, as I continued to use the algorithm after the publication of the paper, I found that sometimes negativedefinite covariances are hard to avoid, regardless of how the algorithm’s parameters are set up. In fact, in the next chapter I give two examples of inference tasks in which the algorithm fails after its second round.
Since the publication of the paper, the above two issues, i.e. restricting the type of neural density estimator and terminating due to an improper posterior estimate, have motivated further research that aims to improve upon the proposed algorithm. In section 3.4.4 and in the next chapter, I will discuss methods that were developed to address these two issues.
3.4.4 Sequential Neural Posterior Estimation
About a year after the publication of the paper, the proposed method was extended by Lueckmann et al. (2017), partly in order to address the limitations described in the previous section. Lueckmann et al. (2017) called their method Sequential Neural Posterior Estimation. I thought that Sequential Neural Posterior Estimation was an excellent name indeed, and that it would be an appropriate name for our method as well, which until then had remained nameless. With JanMatthis Lueckmann’s permission, I now use the term SNPE to refer to both methods; if I need to distinguish between them, I use SNPEA for our method and SNPEB for the method of Lueckmann et al. (2017).
Similar to SNPEA, SNPEB estimates the posterior density by training a neural density estimator over multiple rounds, where the posterior estimate of each round becomes the proposal for the next round. The main difference between SNPEA and SNPEB is the strategy employed in order to correct for sampling parameters from the proposal $\stackrel{~}{p}\left(\bm{\theta}\right)$ instead of the prior $p\left(\bm{\theta}\right)$. Recall that SNPEA trains the neural density estimator ${q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)$ on samples from the proposal, and then analytically adjusts ${q}_{\mathit{\varphi}}\left(\bm{\theta}{\mathbf{x}}_{o}\right)$ by multiplying with $p\left(\bm{\theta}\right)$ and dividing by $\stackrel{~}{p}\left(\bm{\theta}\right)$, as shown in equation (3.49). As we’ve seen, this strategy restricts the form of the neural density estimator, and may lead to posterior estimates that don’t integrate to $1$.
In contrast, SNPEB assigns importance weights to the proposed samples, and then trains the neural density estimator on the weighted samples. Let $\{({\bm{\theta}}_{1},{\mathbf{x}}_{1}),\mathrm{\dots},({\bm{\theta}}_{N},{\mathbf{x}}_{N})\}$ be a set of $N$ independent joint samples obtained by:
$${\bm{\theta}}_{n}\sim \stackrel{~}{p}\left(\bm{\theta}\right)\mathit{\hspace{1em}}\text{and}\mathit{\hspace{1em}}{\mathbf{x}}_{n}\sim p\left(\mathbf{x}{\bm{\theta}}_{n}\right).$$  (3.51) 
Assuming $\stackrel{~}{p}\left(\bm{\theta}\right)$ is nonzero in the support of the prior, SNPEB assigns an importance weight ${w}_{n}$ to each joint sample $({\bm{\theta}}_{n},{\mathbf{x}}_{n})$, given by:
$${w}_{n}=\frac{p\left({\bm{\theta}}_{n}\right)}{\stackrel{~}{p}\left({\bm{\theta}}_{n}\right)}.$$  (3.52) 
Then, the neural density estimator is trained by maximizing the average log likelihood on the weighted dataset, given by:
$$L\left(\mathit{\varphi}\right)=\frac{1}{N}\sum _{n}{w}_{n}\mathrm{log}{q}_{\mathit{\varphi}}\left({\bm{\theta}}_{n}{\mathbf{x}}_{n}\right).$$  (3.53) 
As $N\to \mathrm{\infty}$, due to the strong law of large numbers, $L\left(\mathit{\varphi}\right)$ converges almost surely to the following expectation:
$L\left(\mathit{\varphi}\right)\stackrel{a.s.}{\to}\underset{\stackrel{~}{p}\left(\bm{\theta}\right)p\left(\mathbf{x}\bm{\theta}\right)}{\mathbb{E}}\left({\displaystyle \frac{p\left(\bm{\theta}\right)}{\stackrel{~}{p}\left(\bm{\theta}\right)}}\mathrm{log}{q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)\right)=\underset{p(\bm{\theta},\mathbf{x})}{\mathbb{E}}\left(\mathrm{log}{q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)\right).$  (3.54) 
Maximizing the above expectation is equivalent to minimizing ${D}_{\mathrm{KL}}(p(\bm{\theta},\mathbf{x})\parallel {q}_{\mathit{\varphi}}(\bm{\theta}\mathbf{x})p\left(\mathbf{x}\right))$, which happens only if ${q}_{\mathit{\varphi}}\left(\bm{\theta}\mathbf{x}\right)=p\left(\bm{\theta}\mathbf{x}\right)$ almost everywhere. Therefore, the training procedure of SNPEB targets the exact posterior (at least asymptotically), which means that we can estimate the exact posterior simply by:
$$\widehat{p}(\bm{\theta}\mathbf{x}={\mathbf{x}}_{o})={q}_{\mathit{\varphi}}(\bm{\theta}{\mathbf{x}}_{o}).$$  (3.55) 
Unlike SNPEA where the neural density estimator must be a mixturedensity network and the prior and the proposal must be Gaussian, SNPEB imposes no restrictions in the form of the neural density estimator, the proposal or the prior. This means that the proposal doesn’t need to be Gaussian, but can be an arbitrary density model. Moreover, since the neural density estimator targets the posterior directly, we are guaranteed that the posterior estimate is a proper density, and the algorithm can proceed for any number of rounds without the risk of terminating early.
Nonetheless, SNPEB has its own weaknesses. If the proposal $\stackrel{~}{p}\left(\bm{\theta}\right)$ is not a good match to the prior $p\left(\bm{\theta}\right)$ (which is likely in later rounds), the importance weights will have high variance, which means that the average log likelihood $L\left(\mathit{\varphi}\right)$ will have high variance too. As a consequence, training can be unstable, and the trained model may not be an accurate posterior estimate. As we will see in the next chapter, SNPEB typically produces less accurate results than SNPEA, at least when SNPEA doesn’t terminate early.
In conclusion, although SNPEB is an improvement over SNPEA in terms of robustness and flexibility, I would argue that there is still scope for further developments in order to improve the performance and robustness of current methods. In the next chapter, I will discuss such a development, which I refer to as Sequential Neural Likelihood; SNL is an alternative to both SNPEA and SNPEB in that it targets the likelihood instead of the posterior. Nonetheless, SNPE remains an actively researched topic, so it would be reasonable to expect further improvements in the near future.
Chapter 4 Sequential Neural Likelihood: Fast Likelihoodfree Inference with Autoregressive Flows
The previous chapter discussed the challenge of likelihoodfree inference, and described Sequential Neural Posterior Estimation, a method that estimates the unknown posterior with a neural density model. This chapter delves further into likelihoodfree inference via neural density estimation, and introduces Sequential Neural Likelihood, an alternative approach that estimates the intractable likelihood instead. We begin with the paper Sequential Neural Likelihood: Fast Likelihoodfree Inference with Autoregressive Flows, which introduces SNL and is the main contribution of this chapter (section 4.1). We continue with a discussion of the paper, and a comparison with a related approach based on active learning (section 4.2). Finally, we conclude the part on likelihoodfree inference with a summary and a discussion of this and the previous chapter (section 4.3).
4.1 The paper
This section presents the paper Sequential Neural Likelihood: Fast Likelihoodfree Inference with Autoregressive Flows, which is the main contribution of this chapter. The paper introduces Sequential Neural Likelihood, a new method for likelihoodfree inference via neural density estimation that was designed to overcome some of the limitations of SNPE.
The paper was first published as a preprint on arXiv in May 2018. Afterwards, it was accepted for publication at the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) in April 2019. There were $1,111$ submissions to AISTATS that year, of which $360$ were accepted for publication.
Author contributions
The paper is coauthored by me, David Sterratt and Iain Murray. As the leading author, I developed the method, implemented the code, and performed the experiments. David Sterratt implemented the simulator for the Hodgkin–Huxley model, and advised on neuroscience matters. Iain Murray supervised the project and revised the final draft. The paper was written by me, except for sections A.4 and B.4 of the appendix which were written by David Sterratt.
See pages  of papers/snl.pdf
4.2 Discussion
In this section, I further discuss the paper Sequential Neural Likelihood: Fast Likelihoodfree Inference with Autoregressive Flows, which was presented in the previous section. In what follows, I evaluate the contribution and impact the paper has had so far, and I compare Sequential Neural Likelihood with an alternative approach for estimating the simulator’s intractable likelihood based on active learning.
4.2.1 Contribution and impact
The paper builds upon our previous work on Sequential Neural Posterior Estimation (Type A) and Masked Autoregressive Flow. Its main contribution is to identify the limitations of SNPEA and SNPEB, and to introduce a new algorithm, Sequential Neural Likelihood, that overcomes these limitations. Compared to SNPE, SNL has the following advantages:

1.
SNL is more general, since it can be used with any neural density estimator and any prior distribution. In contrast, SNPEA can only be used with mixturedensity networks and exponentialfamily priors.

2.
SNL makes more efficient use of simulations, as it reuses simulated data from previous rounds to train the neural density estimator. In contrast, SNPE only uses the simulations from the last round for training.

3.
SNL is more robust, as it doesn’t need to correct for proposing parameter samples from a distribution other than the prior. As a result, it doesn’t suffer from early termination due to negativedefinite covariances (such as SNPEA) or from high variance due to importance weights (such as SNPEB).
A disadvantage of SNL is that it requires an additional inference step to generate posterior samples. The paper uses MCMC in the form of axisaligned slice sampling; however, in principle any likelihoodbased inference method that can generate samples (such as variational inference) can be used instead. In contrast, SNPE trains a model of the posterior directly, which can generate independent samples without requiring an additional inference step.
The paper suggests using SNL with Masked Autoregressive Flow. MAF is a flexible model for generalpurpose density estimation, which makes it a good default choice. However, I should emphasize that SNL is not tied to a particular model, and it can be used with any neural density estimator. In fact, even though this thesis has focused on continuous parameter and data spaces, SNL can in principle be used even when the parameters and/or the data are discrete.
It is too early to evaluate the impact of the paper, as it was published shortly before this thesis was written. As of April 2019, the paper has received $7$ citations according to Google Scholar. Along with similar methods, SNL has so far been used for likelihoodfree inference in cosmology (Alsing et al., 2019). It remains to be seen whether SNL will prove useful in other applications, and whether it will inspire further research on the topic.
4.2.2 Comparison with activelearning methods
In each round, SNL selects parameters to simulate at by independently proposing from the posterior estimate obtained in the round before. The same strategy is used in SNPE, and a similar strategy is used in SMCABC. As we have seen, proposing parameters from the previous posterior estimate can reduce the number of simulations by orders of magnitude compared to proposing parameters from the prior, and can improve posteriorestimation accuracy significantly. Nonetheless, this strategy was motivated by intuition and empirical validation, rather than a theoretically justified argument. Therefore, it is reasonable to ask whether we can find a more principled strategy that can reduce the number of simulations even further.
In contrast with estimating the posterior, estimating the likelihood can easily accommodate alternative parameteracquisition strategies. As we saw in the previous chapter, when we estimate the posterior with a neural density model it is necessary to correct for the parameteracquisition strategy, otherwise the posterior estimate will be biased. On the other hand, as we discussed in section 3 of the paper, the parameteracquisition strategy does not bias the estimation of the likelihood by a neural density model. Hence, estimating the likelihood instead of the posterior offers increased flexibility in selecting parameters, which enables more sophisticated strategies to be considered.
In principle, we can formulate the problem of selecting parameters as a task of decisionmaking under uncertainty. Ideally, we would like to select the next parameters such that the resulting simulation gives us the most information about what the posterior should be; such a scheme would select the next simulation optimally in an informationtheoretic sense. In the next few paragraphs, I describe a precise theoretical formulation of the above idea.
Suppose that the conditional density of data given parameters is modelled by a Bayesian neural density estimator ${q}_{\mathit{\varphi}}\left(\mathbf{x}\bm{\theta}\right)$. Given a dataset of simulation results $\mathcal{D}=\{({\bm{\theta}}_{1},{\mathbf{x}}_{1}),\mathrm{\dots},({\bm{\theta}}_{N},{\mathbf{x}}_{N})\}$, our beliefs about $\mathit{\varphi}$ are encoded by a distribution $p\left(\mathit{\varphi}\mathcal{D}\right)$. Under this Bayesian model, the predictive distribution of $\mathbf{x}$ given $\bm{\theta}$ is:
$$p\left(\mathbf{x}\bm{\theta},\mathcal{D}\right)=\underset{p\left(\mathit{\varphi}\mathcal{D}\right)}{\mathbb{E}}\left({q}_{\mathit{\varphi}}\left(\mathbf{x}\bm{\theta}\right)\right),$$  (4.1) 
whereas the predictive distribution of $\bm{\theta}$ given $\mathbf{x}$ is:
$$p\left(\bm{\theta}\mathbf{x},\mathcal{D}\right)=\underset{p\left(\mathit{\varphi}\mathcal{D}\right)}{\mathbb{E}}\left(\frac{{q}_{\mathit{\varphi}}\left(\mathbf{x}\bm{\theta}\right)p\left(\bm{\theta}\right)}{\int {q}_{\mathit{\varphi}}\left(\mathbf{x}{\bm{\theta}}^{\prime}\right)p\left({\bm{\theta}}^{\prime}\right)\mathrm{d}{\bm{\theta}}^{\prime}}\right).$$  (4.2) 
Given observed data ${\mathbf{x}}_{o}$, the distribution $p\left(\bm{\theta}{\mathbf{x}}_{o},\mathcal{D}\right)$ is the predictive posterior under the Bayesian model. The predictive posterior encodes two distinct kinds of uncertainty:

1.
Our irreducible uncertainty about what parameters might have generated ${\mathbf{x}}_{o}$, due to the fact that the simulator is stochastic.

2.
Our uncertainty about what the posterior should be, due to not having seen enough simulation results. This type of uncertainty can be reduced by simulating more data.
Our goal is to select the next parameters to simulate at, such that the second kind of uncertainty is maximally reduced.
Now, suppose we obtain a new simulation result $({\bm{\theta}}^{\prime},{\mathbf{x}}^{\prime})$ and hence an updated dataset ${\mathcal{D}}^{\prime}=\mathcal{D}\cup \left\{({\bm{\theta}}^{\prime},{\mathbf{x}}^{\prime})\right\}$. In that case, our beliefs about $\mathit{\varphi}$ will be updated by Bayes’ rule:
$$p\left(\mathit{\varphi}{\mathcal{D}}^{\prime}\right)\propto {q}_{\mathit{\varphi}}\left({\mathbf{x}}^{\prime}{\bm{\theta}}^{\prime}\right)p\left(\mathit{\varphi}\mathcal{D}\right),$$  (4.3) 
and the predictive posterior will be updated to use $p\left(\mathit{\varphi}{\mathcal{D}}^{\prime}\right)$ instead of $p\left(\mathit{\varphi}\mathcal{D}\right)$. We want to select parameters ${\bm{\theta}}^{\prime}$ to simulate at such that, on average, the updated predictive posterior $p\left(\bm{\theta}{\mathbf{x}}_{o},{\mathcal{D}}^{\prime}\right)$ becomes as certain as possible. We can quantify how uncertain the updated predictive posterior is by its entropy:
$$H\left(\bm{\theta}{\mathbf{x}}_{o},{\mathcal{D}}^{\prime}\right)=\underset{p\left(\bm{\theta}{\mathbf{x}}_{o},{\mathcal{D}}^{\prime}\right)}{\mathbb{E}}\left(\mathrm{log}p\left(\bm{\theta}{\mathbf{x}}_{o},{\mathcal{D}}^{\prime}\right)\right).$$  (4.4) 
The larger the entropy, the more uncertain the predictive posterior is. Hence, under the above framework, the optimal parameters ${\bm{\theta}}^{*}$ to simulate at will be those that minimize the expected predictiveposterior entropy:
$${\bm{\theta}}^{*}=\underset{{\bm{\theta}}^{\prime}}{\mathrm{arg}\mathrm{min}}\underset{p\left({\mathbf{x}}^{\prime}{\bm{\theta}}^{\prime},\mathcal{D}\right)}{\mathbb{E}}\left(H\left(\bm{\theta}{\mathbf{x}}_{o},{\mathcal{D}}^{\prime}\right)\right).$$  (4.5) 
The above expectation is taken with respect to our predictive distribution of ${\mathbf{x}}^{\prime}$ given ${\bm{\theta}}^{\prime}$, and not with respect to the actual simulator. Intuitively, this can be though of as running the simulator in our imagination rather than in reality, hence we must take into account our uncertainty over the outcome ${\mathbf{x}}^{\prime}$ due to not knowing the simulator’s likelihood exactly.
We can justify the above strategy in terms of the mutual information between $\bm{\theta}$ (the parameters that might have generated ${\mathbf{x}}_{o}$) and ${\mathbf{x}}^{\prime}$ (the data generated by simulating at ${\bm{\theta}}^{\prime}$). This mutual information can be written as:
$$\mathrm{\mathit{M}\mathit{I}}(\bm{\theta},{\mathbf{x}}^{\prime}{\bm{\theta}}^{\prime},{\mathbf{x}}_{o},\mathcal{D})=H\left(\bm{\theta}{\mathbf{x}}_{o},\mathcal{D}\right)\underset{p\left({\mathbf{x}}^{\prime}{\bm{\theta}}^{\prime},\mathcal{D}\right)}{\mathbb{E}}\left(H\left(\bm{\theta}{\mathbf{x}}_{o},{\mathcal{D}}^{\prime}\right)\right).$$  (4.6) 
As we can see, the mutual information between $\bm{\theta}$ and ${\mathbf{x}}^{\prime}$ is equal to the expected reduction in the predictiveposterior entropy due to simulating at ${\bm{\theta}}^{\prime}$. Hence, selecting the next parameters by minimizing the updated expected predictiveposterior entropy is equivalent to selecting the next parameters such that the simulated data gives us on average the most information about what the parameters that produced the observed data might be. The above strategy is a special case of the framework of Järvenpää et al. (2018), if we take the loss function that appears in their framework to be the entropy of the updated predictive posterior.
Although optimal in an informationtheoretic sense, the above strategy is intractable, as it involves a number of highdimensional integrals and the global optimization of an objective. In order to implement the above method, we would need to:

1.
train a Bayesian neural density model,

2.
compute the predictive distribution of $\mathbf{x}$ given $\bm{\theta}$ under the model,

3.
compute the predictive posterior and its entropy, and

4.
globally minimize the expected predictiveposterior entropy with respect to the next parameters to simulate at.
With current methods, each of the above steps can only be partially or approximately solved. It wouldn’t be inaccurate to say that the above strategy reduces the problem of likelihoodfree inference to an even harder problem.
Despite being intractable, the above strategy is useful as a guiding principle for the development of other parameteracquisition strategies, which may approximate the optimal strategy to a certain extent. One such heuristic is the MaxVar rule (Järvenpää et al., 2018; Lueckmann et al., 2018); given a Bayesian neural density estimator ${q}_{\mathit{\varphi}}\left(\mathbf{x}\bm{\theta}\right)$, the MaxVar rule uses the variance of the unnormalized posterior density at parameters ${\bm{\theta}}^{\prime}$ (which is due to the uncertainty about $\mathit{\varphi}$) as a proxy for the information we would gain by simulating at ${\bm{\theta}}^{\prime}$. According to the MaxVar rule, we select the next parameter to simulate at by maximizing the above variance, that is:
$${\bm{\theta}}^{*}=\underset{{\bm{\theta}}^{\prime}}{\mathrm{arg}\mathrm{max}}\underset{p\left(\mathit{\varphi}\mathcal{D}\right)}{\mathbb{V}}\left({q}_{\mathit{\varphi}}\left({\mathbf{x}}_{o}{\bm{\theta}}^{\prime}\right)p\left({\bm{\theta}}^{\prime}\right)\right).$$  (4.7) 
Assuming we can (approximately) generate samples $\{{\mathit{\varphi}}_{1},\mathrm{\dots},{\mathit{\varphi}}_{M}\}$ from $p\left(\mathit{\varphi}\mathcal{D}\right)$ (using e.g. MCMC or variational inference), we can approximate the above variance using the empirical variance on the samples as follows:
$$\underset{p\left(\mathit{\varphi}\mathcal{D}\right)}{\mathbb{V}}\left({q}_{\mathit{\varphi}}\left({\mathbf{x}}_{o}{\bm{\theta}}^{\prime}\right)p\left({\bm{\theta}}^{\prime}\right)\right)\approx \frac{1}{M}\sum _{m}{\left({q}_{{\mathit{\varphi}}_{m}}\left({\mathbf{x}}_{o}{\bm{\theta}}^{\prime}\right)p\left({\bm{\theta}}^{\prime}\right)\right)}^{2}{\left(\frac{1}{M}\sum _{m}{q}_{{\mathit{\varphi}}_{m}}\left({\mathbf{x}}_{o}{\bm{\theta}}^{\prime}\right)p\left({\bm{\theta}}^{\prime}\right)\right)}^{2}.$$  (4.8) 
The above empirical variance is differentiable with respect to ${\bm{\theta}}^{\prime}$, so it can be maximized with gradientbased methods and automatic differentiation.
Following the publication of SNL, Conor Durkan, Iain Murray and I coauthored a paper (Durkan et al., 2018) in which we compared the parameteracquisition strategy of SNL and SNPE with the MaxVar rule described above. The paper, titled Sequential Neural Methods for Likelihoodfree Inference was presented in December 2018 at the Bayesian Deep Learning Workshop, held at the conference Advances in Neural Information Processing Systems (NeurIPS). Conor Durkan was the leading author: he performed the experiments and wrote the paper. Iain Murray and I served as advisors.
The paper compares SNPEB, SNL and MaxVar. In the experiments, all three algorithms use the same Bayesian neural density estimator, namely a mixturedensity network trained with variational dropout, which is the same as that used in the previous chapter. Figure 4.1, reproduced from the original paper with Conor Durkan’s permission, shows the negative log posterior density of the true parameters versus the number of simulations. As we can see, despite being to a certain extent theoretically motivated, MaxVar doesn’t improve upon SNL. Both SNL and MaxVar improve upon SNPEB, which agrees with the rest of our experiments on SNL that were presented in this chapter.
In addition to the tradeoff between accuracy and simulation cost, it is worth looking at the wallclock time of each algorithm. Table 4.1 shows the wallclock time (in hours) of SNL and MaxVar for each simulator model, using a total of ${10}^{4}$ simulations. As we can see, SNL is about $10$ times faster than MaxVar. The reason for MaxVar being significantly slower than SNL is that MaxVar needs to solve an optimization problem for every parameter it selects. In contrast, SNL proposes parameters via MCMC, which is more efficient.
Toy model  Lotka–Volterra  M/G/1 queue  

SNL  $7.48$  $8.27$  $7.83$ 
MaxVar  $91.73$  $89.39$  $70.16$ 
The above comparison shows that SNL takes about the same number of simulations to achieve a given accuracy as a more sophisticated method that is based on active learning. Yet, SNL is an order of magnitude faster in terms of wallclock time than the activelearning alternative. This comparison doesn’t mean that activelearning methods aren’t worthwhile in general; the more expensive a simulator is to run, the more important the parameteracquisition strategy becomes. However, I would argue that this comparison highlights the importance of evaluating the cost of the parameteracquisition strategy against the extent to which it makes a better use of simulations. After all, time spent on optimizing for the next parameters to simulate at may also be spent on running more simulations or training a better model.
4.3 Summary and conclusions
Simulators are useful modelling tools, because they are interpretable, flexible, and a good match to our understanding of mechanistic processes in the physical world. However, the likelihood of a simulator model is often intractable, which makes traditional Bayesian inference over the model’s parameters challenging. In this and the previous chapter, I discussed likelihoodfree inference, i.e. Bayesian inference based on simulations, and I introduced efficient likelihoodfree inference methods based on neural density estimation.
Approximate Bayesian computation is a family of methods that have been traditionally used for likelihoodfree inference. ABC methods are based on repeatedly simulating the model, and rejecting simulations that don’t match the observed data. ABC methods don’t target the exact posterior but an approximate posterior, obtained by an alternative observation that the simulated data is at most a distance $\u03f5$ away from the observed data. The parameter $\u03f5$ trades off efficiency for accuracy; ABC becomes exact as $\u03f5$ approaches zero, but at the same time the simulation cost increases dramatically.
In the previous chapter, I presented a method for likelihoodfree inference that was later given the name Sequential Neural Posterior Estimation (Type A). SNPEA trains a Bayesian mixturedensity network on simulated data in order to estimate the exact posterior. The algorithm progresses over multiple rounds; in each round, parameters to simulate at are proposed by the posterior estimate obtained in the round before. Unlike ABC, SNPEA doesn’t reject simulations, and doesn’t involve setting a distance $\u03f5$. Experiments showed that SNPEA achieves orders of magnitude improvement over ABC methods such as rejection ABC, MCMCABC and SMCABC.
SNPEA has two drawbacks: it is tied to mixturedensity networks, and it may terminate early due to the posterior estimate becoming improper. A variant of SNPEA, which I refer to as SNPEB, was proposed by Lueckmann et al. (2017) to overcome these limitations. However, as we demonstrated in this chapter, SNPEB is typically less accurate than SNPEA (at least when SNPEA doesn’t terminate early) due to increased variance in the posterior estimate.
To overcome the limitations of both SNPEA and SNPEB, this chapter presents Sequential Neural Likelihood, a method that is similar to SNPE but targets the likelihood instead of the posterior. In our experiments, SNL was more accurate and more robust than both SNPEA and SNPEB. In addition, SNL enables the use of diagnostics such as a twosample goodnessoffit test between the simulator and the neural density estimator. A comparison between SNL and an activelearning method for selecting parameters, performed by Durkan et al. (2018), showed that SNL is at least as simulationefficient as the activelearning method, but significantly faster in terms of wallclock time.
As I discussed in the previous chapter, when using ABC it is often necessary to transform data into lowerdimensional summary statistics, in order to maintain the acceptance rate at a high enough level. One limitation of this thesis is that it didn’t explore how the methods that are based on neural density estimation scale with the dimensionality of the parameters or the data. Nonetheless, research in deep learning has demonstrated that neural networks with suitable architectures scale well to highdimensional inputs or outputs. It is reasonable to expect that, with suitable neural architectures, SNPE or SNL may scale better than ABC to highdimensional parameters or data.
Some preliminary work in this direction already exists. For example, Lueckmann et al. (2017) used a recurrent neural network with SNPE in order to estimate the posterior directly from timeseries data. Similarly, Chan et al. (2018) used an exchangeable neural network to estimate the posterior when the data is a set of exchangeable (e.g. independent) datapoints. As I argued in the chapter on Masked Autoregressive Flow, building invariances in our neural architectures that reflect reasonable assumptions about the data is a key element in scaling neural networks to high dimensions. As far as I’m aware, a careful evaluation of how neural likelihoodfree methods such as SNPE and SNL scale to high dimensions hasn’t been performed yet. However, given the progress of deeplearning research, I would argue that engineering neural architectures to use with SNPE and SNL, or alternative neural methods, is a promising direction for future work.
Having explored in the last two chapters neural methods both for estimating the posterior and for estimating the likelihood, it is natural to ask which approach is better. Despite the success of SNL over SNPE, I would argue that there is no definite answer to this question, as each approach has its own strengths and weaknesses. Ultimately the right approach depends on what the user is interested in. Hence, I would argue that it is worthwhile to continue exploring both approaches, as well as improving our current methods and developing new ones.
Bibliography
 Abadi et al. (2015) M. Abadi et al. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/.
 Agostinelli et al. (2003) S. Agostinelli et al. Geant4—a simulation toolkit. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 506(3):250–303, 2003.
 AlRfou et al. (2016) R. AlRfou et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv:1605.02688, 2016.
 Alemi et al. (2018) A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy. Fixing a broken ELBO. Proceedings of the 35th International Conference on Machine Learning, 2018.
 Alsing et al. (2018a) J. Alsing, B. D. Wandelt, and S. M. Feeney. Massive optimal data compression and density estimation for scalable, likelihoodfree inference in cosmology. Monthly Notices of the Royal Astronomical Society, 477(3):2874–2885, 2018a.
 Alsing et al. (2018b) J. Alsing, B. D. Wandelt, and S. M. Feeney. Optimal proposals for approximate Bayesian computation. arXiv:1808.06040, 2018b.
 Alsing et al. (2019) J. Alsing, T. Charnock, S. M. Feeney, and B. D. Wandelt. Fast likelihoodfree cosmology with neural density estimators and active learning. arXiv:1903.00007, 2019.
 Andrieu and Roberts (2009) C. Andrieu and G. O. Roberts. The pseudomarginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
 Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. Proceedings of the 34th International Conference on Machine Learning, 2017.
 Basri and Jacobs (2003) R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):218–233, 2003.
 Bauer and Mnih (2019) M. Bauer and A. Mnih. Resampled priors for variational autoencoders. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
 Beaumont (2010) M. A. Beaumont. Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41(1):379–406, 2010.
 Beaumont et al. (2002) M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162:2025–2035, 2002.
 Beaumont et al. (2009) M. A. Beaumont, J.M. Cornuet, J.M. Marin, and C. P. Robert. Adaptive approximate Bayesian computation. Biometrika, 96(4):983–990, 2009.
 Behrmann et al. (2018) J. Behrmann, W. Grathwohl, R. T. Q. Chen, D. K. Duvenaud, and J.H. Jacobsen. Invertible residual networks. arXiv:1811.00995, 2018.
 Billingsley (1995) P. Billingsley. Probability and Measure. Wiley, 1995.
 Bishop (2006) C. M. Bishop. Pattern recognition and machine learning. Springer New York, 2006.
 Blei et al. (2017) D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 Blum and François (2010) M. G. B. Blum and O. François. Nonlinear regression models for approximate Bayesian computation. Statistics and Computing, 20(1):63–73, 2010.
 Blum et al. (2013) M. G. B. Blum, M. A. Nunes, D. Prangle, and S. A. Sisson. A comparative review of dimension reduction methods in approximate Bayesian computation. Statistical Science, 28(2):189–208, 2013.
 Bonassi and West (2015) F. V. Bonassi and M. West. Sequential Monte Carlo with adaptive weights for approximate Bayesian computation. Bayesian Analysis, 10(1):171–187, 2015.
 Bornn et al. (2017) L. Bornn, N. S. Pillai, A. Smith, and D. Woodard. The use of a single pseudosample in approximate Bayesian computation. Statistics and Computing, 27(3):583–590, 2017.
 Bottou (2012) L. Bottou. Stochastic gradient descent tricks. Neural Networks, Tricks of the Trade, Reloaded, 7700:430–445, 2012.
 Brehmer et al. (2018a) J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez. Constraining effective field theories with machine learning. Physical Review Letters, 121(11):111801, 2018a.
 Brehmer et al. (2018b) J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez. A guide to constraining effective field theories with machine learning. Physical Review Letters, 98(5):052004, 2018b.
 Brehmer et al. (2018c) J. Brehmer, G. Louppe, J. Pavez, and K. Cranmer. Mining gold from implicit models to improve likelihoodfree inference. arXiv:1805.12244, 2018c.
 Brock et al. (2019) A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. Proceedings of the 7th International Conference on Learning Representations, 2019.
 Brubaker et al. (2012) M. Brubaker, M. Salzmann, and R. Urtasun. A family of MCMC methods on implicitly defined manifolds. Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, 2012.
 Buesing et al. (2018) L. Buesing, T. Weber, S. Racanière, S. M. A. Eslami, D. J. Rezende, D. P. Reichert, F. Viola, F. Besse, K. Gregor, D. Hassabis, and D. Wierstra. Learning and querying fast generative models for reinforcement learning. arXiv:1802.03006, 2018.
 Cappé and Moulines (2008) O. Cappé and E. Moulines. Online EM algorithm for latent data models. Journal of the Royal Statistical Society, Series B, 71(3):593–613, 2008.
 Caticha (2004) A. Caticha. Relative entropy and inductive inference. AIP Conference Proceedings, 707(1):75–96, 2004.
 Chan et al. (2018) J. Chan, V. Perrone, J. Spence, P. Jenkins, S. Mathieson, and Y. Song. A likelihoodfree inference framework for population genetic data using exchangeable neural networks. Advances in Neural Information Processing Systems 31, 2018.
 Charnock et al. (2018) T. Charnock, G. Lavaux, and B. D. Wandelt. Automatic physical inference with information maximizing neural networks. Physical Review D, 97(8), 2018.
 Chen et al. (2018) R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems 31, 2018.
 Chen and Gutmann (2019) Y. Chen and M. U. Gutmann. Adaptive Gaussian copula ABC. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
 Choi et al. (2019) H. Choi, E. Jang, and A. A. Alemi. WAIC, but why? Generative ensembles for robust anomaly detection. arXiv:1810.01392, 2019.
 Chwialkowski et al. (2016) K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. Proceedings of the 33rd International Conference on Machine Learning, 2016.
 CusumanoTowner et al. (2017) M. F. CusumanoTowner, A. Radul, D. Wingate, and V. K. Mansinghka. Probabilistic programs for inferring the goals of autonomous agents. arXiv:1704.04977, 2017.
 Danihelka et al. (2017) I. Danihelka, B. Lakshminarayanan, B. Uria, D. Wierstra, and P. Dayan. Comparison of maximum likelihood and GANbased training of Real NVPs. arXiv:1705.05263, 2017.
 De Cao et al. (2019) N. De Cao, I. Titov, and W. Aziz. Block neural autoregressive flow. arXiv:1904.04676, 2019.
 Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
 Devlin et al. (2018) J. Devlin, M.W. Chang, K. Lee, and K. Toutanova. BERT: Pretraining of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
 Dillon et al. (2017) J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. A. Alemi, M. D. Hoffman, and R. A. Saurous. TensorFlow distributions. arXiv:1711.10604, 2017.
 Dinh et al. (2017) L. Dinh, J. SohlDickstein, and S. Bengio. Density estimation using Real NVP. Proceedings of the 5th International Conference on Learning Representations, 2017.
 Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
 Durkan et al. (2018) C. Durkan, G. Papamakarios, and I. Murray. Sequential neural methods for likelihoodfree inference. Bayesian Deep Learning Workshop at Neural Information Processing Systems, 2018.
 Dziugaite et al. (2015) G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, 2015.
 Edwards and Storkey (2017) H. Edwards and A. J. Storkey. Towards a neural statistician. Proceedings of the 5th International Conference on Learning Representations, 2017.
 Epanechnikov (1969) V. A. Epanechnikov. Nonparametric estimation of a multivariate probability density. Theory of Probability and its Applications, 14(1):153–158, 1969.
 Eslami et al. (2016) S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, K. Kavukcuoglu, and G. E. Hinton. Attend, infer, repeat: Fast scene understanding with generative models. Advances in Neural Information Processing Systems 29, 2016.
 Fan et al. (2013) Y. Fan, D. J. Nott, and S. A. Sisson. Approximate Bayesian computation via regression density estimation. Stat, 2(1):34–48, 2013.
 Fearnhead and Prangle (2012) P. Fearnhead and D. Prangle. Constructing summary statistics for approximate Bayesian computation: Semiautomatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B, 74(3):419–474, 2012.
 Germain et al. (2015) M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked autoencoder for distribution estimation. Proceedings of the 32nd International Conference on Machine Learning, 2015.
 Goodfellow et al. (2014) I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems 27, 2014.
 Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
 Graham and Storkey (2017) M. M. Graham and A. J. Storkey. Asymptotically exact inference in differentiable generative models. Electronic Journal of Statistics, 11(2):5105–5164, 2017.
 Grathwohl et al. (2018) W. Grathwohl, R. T. Q. Chen, J. Betterncourt, I. Sutskever, and D. K. Duvenaud. FFJORD: Freeform continuous dynamics for scalable reversible generative models. Proceedings of the 7th International Conference on Learning Representations, 2018.
 Gregor et al. (2019) K. Gregor, G. Papamakarios, F. Besse, L. Buesing, and T. Weber. Temporal difference variational autoencoder. Proceedings of the 7th International Conference on Learning Representations, 2019.
 Gretton et al. (2012) A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel twosample test. Joural of Machine Learning Research, 13(1):723–773, 2012.
 Grover et al. (2018) A. Grover, M. Dhar, and S. Ermon. FlowGAN: Combining maximum likelihood and adversarial learning in generative models. Proceedings of 32nd AAAI Conference on Artificial Intelligence, 2018.
 Gu et al. (2015) S. Gu, Z. Ghahramani, and R. E. Turner. Neural adaptive sequential Monte Carlo. Advances in Neural Information Processing Systems 28, 2015.
 Gutmann and Hyvärinen (2012) M. U. Gutmann and A. Hyvärinen. Noisecontrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307–361, 2012.
 Hastie et al. (2001) T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer New York, 2001.
 He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 Hinton (2002) G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
 Hinton et al. (1997) G. E. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8(1):65–74, 1997.
 Ho et al. (2019) J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. Flow++: Improving flowbased generative models with variational dequantization and architecture design. arXiv:1902.00275, 2019.
 Hoogeboom et al. (2019) E. Hoogeboom, R. van den Berg, and M. Welling. Emerging convolutions for generative normalizing flows. arXiv:1901.11137, 2019.
 Huang et al. (2018) C.W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. Proceedings of the 35th International Conference on Machine Learning, 2018.
 Hutchinson (1990) M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics—Simulation and Computation, 19(2):433–450, 1990.
 Hyvärinen (2005) A. Hyvärinen. Estimation of nonnormalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.
 Hyvärinen and Pajunen (1999) A. Hyvärinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
 Järvenpää et al. (2018) M. Järvenpää, M. U. Gutmann, A. Pleska, A. Vehtari, and P. Marttinen. Efficient acquisition rules for modelbased approximate Bayesian computation. Bayesian Analysis, 2018.
 Jitkrittum et al. (2017) W. Jitkrittum, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton. A lineartime kernel goodnessoffit test. Advances in Neural Information Processing Systems 30, 2017.
 Karras et al. (2018a) T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. Proceedings of the 6th International Conference on Learning Representations, 2018a.
 Karras et al. (2018b) T. Karras, S. Laine, and T. Aila. A stylebased generator architecture for generative adversarial networks. arXiv:1812.04948, 2018b.
 Keskar et al. (2017) N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On largebatch training for deep learning: Generalization gap and sharp minima. Proceedings of 5th International Conference on Learning Representations, 2017.
 Kim et al. (2018) S. Kim, S. gil Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv:1811.02155, 2018.
 Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations, 2015.
 Kingma and Dhariwal (2018) D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible $1\times 1$ convolutions. Advances in Neural Information Processing Systems 31, 2018.
 Kingma and Welling (2014) D. P. Kingma and M. Welling. Autoencoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations, 2014.
 Kingma et al. (2016) D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems 29, 2016.
 Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, 2012.
 Kucukelbir et al. (2015) A. Kucukelbir, R. Ranganath, A. Gelman, and D. M. Blei. Automatic variational inference in Stan. Advances in Neural Information Processing Systems 28, 2015.
 Kulkarni et al. (2015) T. D. Kulkarni, P. Kohli, J. B. Tenenbaum, and V. K. Mansinghka. Picture: A probabilistic programming language for scene perception. IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 Li and Grathwohl (2018) X. Li and W. Grathwohl. Training Glow with constant memory cost. Bayesian Deep Learning Workshop at Neural Information Processing Systems, 2018.
 Liu et al. (2016) Q. Liu, J. D. Lee, and M. Jordan. A kernelized Stein discrepancy for goodnessoffit tests. Proceedings of the 33rd International Conference on International Conference on Machine Learning, 2016.
 Lueckmann et al. (2017) J.M. Lueckmann, P. J. Goncalves, G. Bassetto, K. Öcal, M. Nonnenmacher, and J. H. Macke. Flexible statistical inference for mechanistic models of neural dynamics. Advances in Neural Information Processing Systems 30, 2017.
 Lueckmann et al. (2018) J.M. Lueckmann, G. Bassetto, T. Karaletsos, and J. H. Macke. Likelihoodfree inference with emulator networks. Symposium on Advances in Approximate Bayesian Inference at Neural Information Processing Systems, 2018.
 MacKay (2002) D. J. C. MacKay. Information Theory, Inference & Learning Algorithms. Cambridge University Press, 2002.
 Mansinghka et al. (2013) V. K. Mansinghka, T. D. Kulkarni, Y. N. Perov, and J. B. Tenenbaum. Approximate Bayesian image interpretation using generative probabilistic graphics programs. Advances in Neural Information Processing Systems 26, 2013.
 Marjoram et al. (2003) P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, 2003.
 Markram et al. (2015) H. Markram et al. Reconstruction and simulation of neocortical microcircuitry. Cell, 163:456–492, 2015.
 McLachlan and Basford (1988) G. J. McLachlan and K. E. Basford. Mixture models: Inference and applications to clustering. Marcel Dekker, 1988.
 Menick and Kalchbrenner (2019) J. Menick and N. Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. Proceedings of the 7th International Conference on Learning Representations, 2019.
 Minka (2001) T. P. Minka. Expectation propagation for approximate Bayesian inference. Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, 2001.
 Müller et al. (2018) T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák. Neural importance sampling. arXiv:1808.03856, 2018.
 Murray (2007) I. Murray. Advances in Markov chain Monte Carlo methods. PhD thesis, Gatsby computational neuroscience unit, University College London, 2007.
 Nash and Durkan (2019) C. Nash and C. Durkan. Autoregressive energy machines. arXiv:1904.05626, 2019.
 Nash et al. (2017) C. Nash, S. M. A. Eslami, C. Burgess, I. Higgins, D. Zoran, T. Weber, and P. Battaglia. The multientity variational autoencoder. Learning Disentangled Features Workshop at Neural Information Processing Systems, 2017.
 Neal (1993) R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRGTR931, Department of Computer Science, University of Toronto, 1993.
 Nowozin (2015) S. Nowozin. Effective sample size in importance sampling, Aug. 2015. URL http://www.nowozin.net/sebastian/blog/effectivesamplesizeinimportancesampling.html. Published online.
 Oliva et al. (2018) J. Oliva, A. Dubey, M. Zaheer, B. Poczos, R. Salakhutdinov, E. Xing, and J. Schneider. Transformation autoregressive networks. Proceedings of the 35th International Conference on Machine Learning, 2018.
 Paige and Wood (2016) B. Paige and F. Wood. Inference networks for sequential Monte Carlo in graphical models. Proceedings of the 33rd International Conference on Machine Learning, 2016.
 Papamakarios (2015) G. Papamakarios. Distilling model knowledge. MSc by Research thesis, School of Informatics, University of Edinburgh, 2015.
 Papamakarios (2018) G. Papamakarios. Preprocessed datasets for MAF experiments, 2018. URL https://doi.org/10.5281/zenodo.1161203.
 Papamakarios and Murray (2015) G. Papamakarios and I. Murray. Distilling intractable generative models. Probabilistic Integration Workshop at Neural Information Processing Systems, 2015.
 Papamakarios and Murray (2016) G. Papamakarios and I. Murray. Fast $\u03f5$free inference of simulation models with Bayesian conditional density estimation. Advances in Neural Information Processing Systems 29, 2016.
 Papamakarios et al. (2017) G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems 30, 2017.
 Papamakarios et al. (2019) G. Papamakarios, D. C. Sterratt, and I. Murray. Sequential neural likelihood: Fast likelihoodfree inference with autoregressive flows. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
 Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. Autodiff Workshop at Neural Information Processing Systems, 2017.
 Pospischil et al. (2008) M. Pospischil, M. ToledoRodriguez, C. Monier, Z. Piwkowska, T. Bal, Y. Frégnac, H. Markram, and A. Destexhe. Minimal Hodgkin–Huxley type models for different classes of cortical and thalamic neurons. Biological Cybernetics, 99(4):427–441, 2008.
 Prangle (2017) D. Prangle. Adapting the ABC distance function. Bayesian Analysis, 12(1):289–309, 2017.
 Prangle et al. (2014) D. Prangle, P. Fearnhead, M. P. Cox, P. J. Biggs, and N. P. French. Semiautomatic selection of summary statistics for ABC model choice. Statistical applications in genetics and molecular biology, 13(1):67–82, 2014.
 Prenger et al. (2018) R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flowbased generative network for speech synthesis. arXiv:1811.00002, 2018.
 Pritchard et al. (1999) J. K. Pritchard, M. T. Seielstad, A. PerezLezaun, and M. W. Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791–1798, 1999.
 Qian (1999) N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.
 Radford et al. (2016) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. Proceedings of the 4th International Conference on Learning Representations, 2016.
 Ranganath et al. (2014) R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 2014.
 Ratmann et al. (2007) O. Ratmann, O. Jørgensen, T. Hinkley, M. Stumpf, S. Richardson, and C. Wiuf. Using likelihoodfree inference to compare evolutionary dynamics of the protein networks of H. pylori and P. falciparum. PLoS Computational Biology, 3(11):2266–2278, 2007.
 Reddi et al. (2018) S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. Proceedings of 6th International Conference on Learning Representations, 2018.
 Rezende (2018) D. J. Rezende. Short notes on divergence measures, July 2018. URL https://danilorezende.com/wpcontent/uploads/2018/07/divergences.pdf. Published online.
 Rezende and Mohamed (2015) D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. Proceedings of the 32nd International Conference on Machine Learning, 2015.
 Rezende et al. (2014) D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. Proceedings of the 31st International Conference on Machine Learning, 2014.
 Romaszko et al. (2017) L. Romaszko, C. K. I. Williams, P. Moreno, and P. Kohli. Visionasinversegraphics: Obtaining a rich 3D explanation of a scene from a single image. IEEE International Conference on Computer Vision Workshop, 2017.
 Salakhutdinov and Murray (2008) R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. Proceedings of the 25th Annual International Conference on Machine Learning, 2008.
 Salimans et al. (2017) T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. Proceedings of the 5th International Conference on Learning Representations, 2017.
 Salman et al. (2018) H. Salman, P. Yadollahpour, T. Fletcher, and K. Batmanghelich. Deep diffeomorphic normalizing flows. arXiv:1810.03256, 2018.
 Schafer and Freeman (2012) C. M. Schafer and P. E. Freeman. Likelihoodfree inference in cosmology: Potential for the estimation of luminosity functions. Statistical Challenges in Modern Astronomy V, pages 3–19, 2012.
 Schroecker et al. (2019) Y. Schroecker, M. Vecerik, and J. Scholz. Generative predecessor models for sampleefficient imitation learning. Proceedings of 7th International Conference on Learning Representations, 2019.
 Scott (1992) D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 1992.
 Silver et al. (2016) D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Silverman (1986) B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.
 Sisson et al. (2007) S. A. Sisson, Y. Fan, and M. M. Tanaka. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 104(6):1760–1765, 2007.
 Sjöstrand et al. (2008) T. Sjöstrand, S. Mrenna, and P. Skands. A brief introduction to PYTHIA 8.1. Computer Physics Communications, 178(11):852–867, 2008.
 Sterratt et al. (2011) D. C. Sterratt, B. Graham, A. Gillies, and D. Willshaw. Principles of Computational Modelling in Neuroscience. Cambridge University Press, 2011.
 Tipping and Bishop (1999) M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 21(3):611–622, 1999.
 Toni et al. (2009) T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M. P. H. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of The Royal Society Interface, 6(31):187–202, 2009.
 Tran et al. (2018) D. Tran, M. D. Hoffman, D. Moore, C. Suter, S. Vasudevan, and A. Radul. Simple, distributed, and accelerated probabilistic programming. Advances in Neural Information Processing Systems 31, 2018.
 van den Berg et al. (2018) R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling. Sylvester normalizing flows for variational inference. Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, 2018.
 van den Oord et al. (2016a) A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016a.
 van den Oord et al. (2016b) A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves. Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems 29, 2016b.
 Vikram et al. (2019) S. Vikram, M. D. Hoffman, and M. J. Johnson. The LORACs prior for VAEs: Letting the trees speak for the data. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
 Wasserman (2010) L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Publishing Company, 2010.
 Welling et al. (2004) M. Welling, C. K. I. Williams, and F. V. Agakov. Extreme components analysis. Advances in Neural Information Processing Systems 16, 2004.
 Wilkinson (2011) D. J. Wilkinson. Stochastic Modelling for Systems Biology, Second Edition. Taylor & Francis, 2011.
 Wilkinson (2013) R. D. Wilkinson. Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Statistical applications in genetics and molecular biology, 12(2):129–141, 2013.
 Williams and Agakov (2002) C. K. I. Williams and F. V. Agakov. Products of Gaussians and probabilistic minor component analysis. Neural Computation, 14(5):1169–1182, 2002.
 Wood (2010) S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, 2010.
 Wu et al. (2015) J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. B. Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. Advances in Neural Information Processing Systems 28, 2015.
 Zeiler (2012) M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv:1212.5701, 2012.
 Zhang et al. (2018) L. Zhang, W. E, and L. Wang. Monge–Ampère flow for generative modeling. arXiv:1809.10188, 2018.
 Ziegler and Rush (2019) Z. M. Ziegler and A. M. Rush. Latent normalizing flows for discrete sequences. arXiv:1901.10548, 2019.
 Zoran and Weiss (2011) D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. Proceedings of the 13rd International Conference on Computer Vision, 2011.