Abstract
A normalizing flow models a complex probability density as an invertibletransformation of a simple base density. Flows based on either coupling orautoregressive transforms both offer exact density evaluation and sampling, butrely on the parameterization of an easily invertible elementwisetransformation, whose choice determines the flexibility of these models.Building upon recent work, we propose a fullydifferentiable module based onmonotonic rationalquadratic splines, which enhances the flexibility of bothcoupling and autoregressive transforms while retaining analytic invertibility.We demonstrate that neural spline flows improve density estimation, variationalinference, and generative modeling of images.
Quick Read (beta)
Neural Spline Flows
Abstract
A normalizing flow models a complex probability density as an invertible transformation of a simple base density. Flows based on either coupling or autoregressive transforms both offer exact density evaluation and sampling, but rely on the parameterization of an easily invertible elementwise transformation, whose choice determines the flexibility of these models. Building upon recent work, we propose a fullydifferentiable module based on monotonic rationalquadratic splines, which enhances the flexibility of both coupling and autoregressive transforms while retaining analytic invertibility. We demonstrate that neural spline flows improve density estimation, variational inference, and generative modeling of images.
plus 2pt minus 2pt
Neural Spline Flows
Conor Durkan^{†}^{†}thanks: Equal contribution Artur Bekasov^{1}^{1}footnotemark: 1 Iain Murray George Papamakarios School of Informatics, University of Edinburgh {conor.durkan, artur.bekasov, i.murray, g.papamakarios}@ed.ac.uk
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
Models that can reason about the joint distribution of highdimensional random variables are central to modern unsupervised machine learning. Explicit density evaluation is required in many statistical procedures, while synthesis of novel examples can enable agents to imagine and plan in an environment prior to choosing a action. In recent years, the variational autoencoder [VAE, Kingma and Welling, 2014, Rezende et al., 2014] and generative adversarial network [GAN, Goodfellow et al., 2014] have received particular attention in the generativemodeling community, and both are capable of sampling with a single forward pass of a neural network. However, these models do not offer exact density evaluation, and can be difficult to train. On the other hand, autoregressive density estimators [Uria et al., 2013, Germain et al., 2015, van den Oord et al., 2016c, b, Salimans et al., 2017, van den Oord et al., 2016a] can be trained by maximum likelihood, but sampling requires a sequential loop over the output dimensions.
Flowbased models present an alternative approach to the above methods, and in some cases provide both exact density evaluation and sampling in a single neuralnetwork pass. A normalizing flow models data $\mathbf{x}$ as the output of an invertible, differentiable transformation $\mathbf{f}$ of noise $\mathbf{u}$:
$$\mathbf{x}=\mathbf{f}\left(\mathbf{u}\right)\mathit{\hspace{1em}}\text{where}\mathit{\hspace{1em}}\mathbf{u}\sim \pi \left(\mathbf{u}\right).$$  (1) 
The probability density of $\mathbf{x}$ under the flow is obtained by a change of variables:
$$p\left(\mathbf{x}\right)=\pi \left({\mathbf{f}}^{1}\left(\mathbf{x}\right)\right)\leftdet\left(\frac{\partial {\mathbf{f}}^{1}}{\partial \mathbf{x}}\right)\right.$$  (2) 
Intuitively, the function $\mathbf{f}$ compresses and expands the density of the noise distribution $\pi \left(\mathbf{u}\right)$, and this change is quantified by the determinant of the Jacobian of the transformation. The noise distribution $\pi \left(\mathbf{u}\right)$ is typically chosen to be simple, such as a standard normal, whereas the transformation $\mathbf{f}$ and its inverse ${\mathbf{f}}^{1}$ are often implemented by composing a series of invertible neuralnetwork modules. Given a dataset $\mathcal{D}={\left\{{\mathbf{x}}^{(n)}\right\}}_{n=1}^{N}$, the flow is trained by maximizing the total log likelihood ${\sum}_{n}\mathrm{log}p\left({\mathbf{x}}^{(n)}\right)$ with respect to the parameters of the transformation $\mathbf{f}$. In recent years, normalizing flows have received widespread attention in the machinelearning literature, seeing successful use in density estimation [Dinh et al., 2017, Papamakarios et al., 2017], variational inference [Kingma et al., 2016, Rezende and Mohamed, 2015, Louizos and Welling, 2017, van den Berg et al., 2018], image, audio and video generation [Kingma and Dhariwal, 2018, Kim et al., 2018, Prenger et al., 2018, Kumar et al., 2019], likelihoodfree inference [Papamakarios et al., 2019], and learning maximumentropy distributions [LoaizaGanem et al., 2017].
A flow is defined by specifying the bijective function $\mathbf{f}$ or its inverse ${\mathbf{f}}^{1}$, usually with a neural network. Depending on the flow’s intended use cases, there are practical constraints in addition to formal invertibility:

•
To train a density estimator, we need to be able to evaluate the Jacobian determinant and the inverse function ${\mathbf{f}}^{1}$ quickly. We don’t evaluate $\mathbf{f}$, so the flow is usually defined by specifying ${\mathbf{f}}^{1}$.

•
If we wish to draw samples using creftype 1, we would like $\mathbf{f}$ to be available analytically, rather than having to invert ${\mathbf{f}}^{1}$ with iterative or approximate methods.

•
Ideally, we would like both $\mathbf{f}$ and ${\mathbf{f}}^{1}$ to require only a single pass of a neural network to compute, so that both density evaluation and sampling can be performed quickly.
Autoregressive flows such as inverse autoregressive flow [IAF, Kingma et al., 2016] or masked autoregressive flow [MAF, Papamakarios et al., 2017] are $D$ times slower to invert than to evaluate, where $D$ is the dimensionality of $\mathbf{x}$. Subsequent work which enhances their flexibility has resulted in models which do not have an analytic inverse, and require numerical optimization to invert [Huang et al., 2018]. Flows based on coupling layers [NICE, RealNVP, Dinh et al., 2015, 2017] have an analytic onepass inverse, but are often less flexible than their autoregressive counterparts.
In this work, we propose a fullydifferentiable module based on monotonic rationalquadratic splines which has an analytic inverse. The module acts as a dropin replacement for the affine or additive transformations commonly found in coupling and autoregressive transforms. We demonstrate that this module significantly enhances the flexibility of both classes of flows, and in some cases brings the performance of coupling transforms on par with the bestknown autoregressive flows. An illustration of our proposed transform is shown in creftype 1.
2 Background
2.1 Coupling transforms
A coupling transform $\mathit{\varphi}$ [Dinh et al., 2015] maps an input $\mathbf{x}$ to an output $\mathbf{y}$ in the following way:

1.
Split the input $\mathbf{x}$ into two parts, $\mathbf{x}=[{\mathbf{x}}_{1:d1},{\mathbf{x}}_{d:D}]$.

2.
Compute parameters $\bm{\theta}=\text{NN}({\mathbf{x}}_{1:d1})$, where NN is an arbitrary neural network.

3.
Compute ${y}_{i}={g}_{{\bm{\theta}}_{i}}({x}_{i})$ for $i=d,\mathrm{\dots},D$ in parallel, where ${g}_{{\bm{\theta}}_{i}}$ is an invertible function parameterized by ${\bm{\theta}}_{i}$.

4.
Set ${\mathbf{y}}_{1:d1}={\mathbf{x}}_{1:d1}$, and return $\mathbf{y}=[{\mathbf{y}}_{1:d1},{\mathbf{y}}_{d:D}]$.
The Jacobian matrix of a coupling transform is lower triangular, since ${\mathbf{y}}_{d:D}$ is given by transforming ${\mathbf{x}}_{d:D}$ elementwise as a function of ${\mathbf{x}}_{1:d1}$, and ${\mathbf{y}}_{1:d1}$ is equal to ${\mathbf{x}}_{1:d1}$. Thus, the Jacobian determinant of the coupling transform $\mathit{\varphi}$ is given by $det\left(\frac{\partial \mathit{\varphi}}{\partial \mathbf{x}}\right)={\prod}_{i=d}^{D}\frac{\partial {g}_{{\bm{\theta}}_{i}}}{\partial {x}_{i}}$, the product of the diagonal elements of the Jacobian.
Coupling transforms solve two important problems for normalizing flows: they have a tractable Jacobian determinant, and they can be inverted exactly in a single pass. The inverse of a coupling transform can be easily computed by running steps 1–4 above, this time inputting $\mathbf{y}$, and using ${g}_{{\bm{\theta}}_{i}}^{1}$ to compute ${\mathbf{x}}_{d:D}$ in step 3. Multiple coupling layers can also be composed in a natural way to construct a normalizing flow with increased flexibility. A coupling transform can also be viewed as a special case of an autoregressive transform where we perform two splits of the input data instead of $D$, as noted by Papamakarios et al. [2017]. In this way, advances in flows based on coupling transforms can be applied to autoregressive flows, and vice versa.
2.2 Invertible elementwise transformations
Affine/additive
Typically, the function ${g}_{{\bm{\theta}}_{i}}$ takes the form of an additive [Dinh et al., 2015] or affine [Dinh et al., 2017] transformation for computational ease. The affine transformation is given by:
$${g}_{{\bm{\theta}}_{i}}({x}_{i})={\alpha}_{i}{x}_{i}+{\beta}_{i},\text{where}\mathit{\hspace{1em}}{\bm{\theta}}_{i}=\{{\alpha}_{i},{\beta}_{i}\},$$  (3) 
and ${\alpha}_{i}$ is usually constrained to be positive. The additive transformation corresponds to the special case ${\alpha}_{i}=1$. Both the affine and additive transformations are easy to invert, but they lack flexibility. Recalling that the base distribution of a flow is typically simple, flowbased models may struggle to model multimodal or discontinuous densities using just affine or additive transformations, since they may find it difficult to compress and expand the density in a suitably nonlinear fashion (for an illustration, see creftype C.1). We aim to choose a more flexible ${g}_{{\bm{\theta}}_{i}}$, that is still differentiable and easy to invert.
Polynomial splines
Recently, Müller et al. [2018] proposed a powerful generalization of the above affine transformations, based on monotonic piecewise polynomials. The idea is to restrict the input domain of ${g}_{{\bm{\theta}}_{i}}$ to the interval $[0,1]$, partition the input domain into $K$ bins, and define ${g}_{{\bm{\theta}}_{i}}$ to be a simple polynomial segment within each bin. Müller et al. [2018] restrict themselves to monotonicallyincreasing linear and quadratic polynomial segments, whose coefficients are parameterized by ${\bm{\theta}}_{i}$. Moreover, the polynomial segments are restricted to match at the bin boundaries so that ${g}_{{\bm{\theta}}_{i}}$ is continuous. Functions of this form, which interpolate between data using piecewise polynomials, are known as polynomial splines.
Cubic splines
In a previous iteration of this work [Durkan et al., 2019], we explored the cubicspline flow, a natural extension to the framework of Müller et al. [2018]. We proposed to implement ${g}_{{\bm{\theta}}_{i}}$ as a monotonic cubic spline Steffen [1990], where ${g}_{{\bm{\theta}}_{i}}$ is defined to be a monotonicallyincreasing cubic polynomial in each bin. By composing coupling layers featuring elementwise monotonic cubicspline transforms with invertible linear transformations, we found flows of this type to be much more flexible than the standard couplinglayer models in the style of RealNVP [Dinh et al., 2017], achieving similar results to autoregressive models on a suite of densityestimation tasks.
Like Müller et al. [2018], our spline transform and its inverse were defined only on the interval $[0,1]$. To ensure that the input is always between $0$ and $1$, we placed a sigmoid transformation before each coupling layer, and a logit transformation after each coupling layer. These transformations allow the spline transform to be composed with linear layers, which have an unconstrained domain. However, the limitations of 32bit floating point precision mean that in practice the sigmoid saturates for inputs outside the approximate range of $[13,13]$, which results in numerical difficulties. In addition, computing the inverse of the transform requires inverting a cubic polynomial, which is prone to numerical instability if not carefully treated [Blinn, 2007]. In creftype 3.1 we propose a modified method based on rationalquadratic splines which overcomes these difficulties.
2.3 Invertible linear transformations
To ensure all input variables can interact with each other, it is common to randomly permute the dimensions of intermediate layers in a normalizing flow. Permutation is an invertible linear transformation, with absolute determinant equal to $1$. Kingma and Dhariwal [2018] generalize permutations to a more general class of linear transformations, demonstrating improvements on a range of image tasks. In particular, they parameterize a linear transformation with matrix $\mathbf{W}$ in terms of its LUdecomposition $\mathbf{W}=\mathrm{\mathbf{P}\mathbf{L}\mathbf{U}}$, where $\mathbf{P}$ is a fixed permutation matrix, $\mathbf{L}$ is lower triangular with ones on the diagonal, and $\mathbf{U}$ is upper triangular. By restricting the diagonal elements of $\mathbf{U}$ to be positive, $\mathbf{W}$ is guaranteed to be invertible.
By making use of the LUdecomposition, both the determinant and the inverse of the linear transformation can be computed efficiently. First, the determinant of $\mathbf{W}$ can be calculated in $\mathcal{O}\left(D\right)$ time as the product of the diagonal elements of $\mathbf{U}$. Second, inverting the linear transformation can be done by solving two triangular systems, one for $\mathbf{U}$ and one for $\mathbf{L}$, each of which costs $\mathcal{O}({D}^{2}M)$ time where $M$ is the batch size. Alternatively, we can pay a onetime cost of $\mathcal{O}({D}^{3})$ to explicitly compute ${\mathbf{W}}^{1}$, which can then be cached for reuse.
3 Method
3.1 Monotonic rationalquadratic transforms
We propose to implement the function ${g}_{{\bm{\theta}}_{i}}$ using monotonic rationalquadratic splines as a building block, where each bin is defined by a monotonicallyincreasing rationalquadratic function. A rationalquadratic function takes the form of a quotient of two quadratic polynomials. Rationalquadratic functions are easily differentiable, more flexible than a polynomial in that they have an infinite Taylorseries expansion, and since we consider only monotonic segments of these functions, they are also analytically invertible. In our implementation, we use the method of Gregory and Delbourgo [1982] to parameterize a monotonic rationalquadratic spline. The spline itself maps an interval $[B,B]$ to $[B,B]$. We define the transformation outside this range as the identity, resulting in linear ‘tails’, so that the overall transformation can take unconstrained inputs.
The spline uses $K$ different rationalquadratic functions, with boundaries set by $K+1$ coordinates ${\left\{({x}^{(k)},{y}^{(k)})\right\}}_{k=0}^{K}$ known as knots. The knots monotonically increase between $({x}^{(0)},{y}^{(0)})=(B,B)$ and $(x{}^{(K)},y{}^{(K)})=(B,B)$. We give the spline $K1$ arbitrary positive values for the derivatives at the internal points, and set the boundary derivatives to $1$ to match the linear tails, which we found to be important for stable training. Then, the method constructs a monotonic, continuouslydifferentiable, rationalquadratic spline which passes through the knots, with the given derivatives at the knots. The transform is illustrated in creftype 1, and creftype A.1 gives full details.
Implementation
The practical implementation of the monotonic rationalquadratic coupling transform is as follows:

1.
A neural network NN takes ${\mathbf{x}}_{1:d1}$ as input and outputs an unconstrained parameter vector ${\bm{\theta}}_{i}$ of length $3K1$ for each $i=d,\mathrm{\dots},D$.

2.
Vector ${\bm{\theta}}_{i}$ is partitioned as ${\bm{\theta}}_{i}=[{\bm{\theta}}_{i}^{w},{\bm{\theta}}_{i}^{h},{\bm{\theta}}_{i}^{d}]$, where ${\bm{\theta}}_{i}^{w}$ and ${\bm{\theta}}_{i}^{h}$ have length $K$, and ${\bm{\theta}}_{i}^{d}$ has length $K1$.

3.
Vectors ${\bm{\theta}}_{i}^{w}$ and ${\bm{\theta}}_{i}^{h}$ are each passed through a softmax and multiplied by $2B$; the outputs are interpreted as the widths and heights of the $K$ bins, which must be positive and span the $[B,B]$ interval. Cumulative sums of the $K$ bin widths and heights, each starting at $B$, yield the $K+1$ knots ${\left\{({x}^{(k)},{y}^{(k)})\right\}}_{k=0}^{K}$.

4.
The vector ${\bm{\theta}}_{i}^{d}$ is passed through a softplus function and is interpreted as the values of the derivatives at the internal knots.
Evaluating a rationalquadratic spline transform at location $x$ requires finding the bin in which $x$ lies, which can be done efficiently with binary search, since the bins are sorted. The Jacobian determinant can be computed in closedform as a product of quotient derivatives, while the inverse requires solving a quadratic equation whose coefficients depend on the value to invert; we provide details of these procedures in creftype A.2 and creftype A.3. Unlike the additive and affine transformations, which have limited flexibility, a monotonic spline with sufficiently many bins can approximate any continuous monotonic function on the specified interval $[B,B]$, yet has a closedform, tractable Jacobian determinant, and can be inverted analytically. Finally, our parameterization is fullydifferentiable, which allows for training by gradient methods.
The above formulation can also easily be adapted for autoregressive transforms; each ${\bm{\theta}}_{i}$ can be computed as a function of ${\mathbf{x}}_{1:i1}$ using an autoregressive neural network, and then all elements of $\mathbf{x}$ can be transformed at once. Inspired by this, we also introduce a set of splines for our coupling layers which act elementwise on ${\mathbf{x}}_{1:d1}$ (the typically nontransformed variables), and whose parameters are optimized directly by stochastic gradient descent. This means that our coupling layer transforms
Training data  Flow density  Flow samples 

all elements of $\mathbf{x}$ at once as follows:
${\bm{\theta}}_{1:d1}$  $=\text{Trainable parameters}$  (4)  
${\bm{\theta}}_{d:D}$  $=\text{NN}({\mathbf{x}}_{1:d1})$  (5)  
${y}_{i}$  $={g}_{{\bm{\theta}}_{i}}({x}_{i})\mathit{\hspace{1em}}\text{for}i=1,\mathrm{\dots},D.$  (6) 
creftypecap 2 demonstrates the flexibility of our rationalquadratic coupling transform on synthetic twodimensional datasets. Using just two coupling layers, each with $K=128$ bins, the monotonic rationalquadratic spline transforms have no issue fitting complex, discontinuous densities with potentially hundreds of modes. In contrast, a coupling layer with affine transformations has significant difficulty with these tasks (see creftype C.1).
3.2 Neural spline flows
The monotonic rationalquadratic spline transforms described in the previous section act as dropin replacements for affine or additive transformations in both coupling and autoregressive transforms. When combined with alternating invertible linear transformations, we refer to the resulting class of normalizing flows as rationalquadratic neural spline flows (RQNSF), which may feature coupling layers, RQNSF (C), or autoregressive layers, RQNSF (AR). RQNSF (C) corresponds to Glow [Kingma and Dhariwal, 2018] with affine or additive transformations replaced with monotonic rationalquadratic transforms, where Glow itself is exactly RealNVP with permutations replaced by invertible linear transformations. RQNSF (AR) corresponds to either IAF or MAF, depending on whether the flow parameterizes $\mathbf{f}$ or ${\mathbf{f}}^{1}$, again with affine transformations replaced by monotonic rationalquadratic transforms, and also with permutations replaced with invertible linear layers. Overall, RQNSF resembles a traditional feedforward neural network architecture, alternating between linear transformations and elementwise nonlinearities, while retaining an exact, analytic inverse. In the case of RQNSF (C), the inverse is available in a single neuralnetwork pass.
4 Related Work
Invertible linear transformations
Invertible linear transformations have found significant use in normalizing flows. Glow [Kingma and Dhariwal, 2018] replaces the permutation operation of RealNVP with an LUdecomposed linear transformation interpreted as a $1\times 1$ convolution, yielding superior performance for image modeling. WaveGlow [Prenger et al., 2018] and FloWaveNet [Kim et al., 2018] have also successfully adapted Glow for generative modeling of audio. Expanding on the invertible $1\times 1$ convolution presented in Glow, Hoogeboom et al. [2019] propose the emerging convolution, based on composing autoregressive convolutions in a manner analogous to an LUdecomposition, and the periodic convolution, which uses multiplication in the Fourier domain to perform convolution. Hoogeboom et al. [2019] also introduce linear transformations based on the QRdecomposition, where the orthogonal matrix is parameterized by a sequence of Householder transformations [Tomczak and Welling, 2016].
Invertible elementwise transformations
Outside of those discussed in creftype 2.2, there has been much recent work in developing more flexible invertible elementwise transformations for normalizing flows. Flow++ [Ho et al., 2019] uses the CDF of a mixture of logistic distributions as a monotonic transformation in coupling layers, but requires bisection search to compute an inverse, since a closed form is not available. Nonlinear squared flow [Ziegler and Rush, 2019] adds an inversequadratic perturbation to an affine transformation in an autoregressive flow, which is invertible under certain restrictions of the parameterization. Computing this inverse requires solving a cubic polynomial, and the overall transform is less flexible than a monotonic rationalquadratic spline. Sumofsquares polynomial flow [SOS, Jaini et al., 2019] parameterizes a monotonic transformation by specifying the coefficients of a polynomial of some chosen degree which can be written as a sum of squares. For lowdegree polynomials, an analytic inverse may be available, but the method would require an iterative solution in general.
Neural autoregressive flow [NAF, Huang et al., 2018] replaces the affine transformation in MAF by parameterizing a monotonic neural network for each dimension. This greatly enhances the flexibility of the transformation, but the resulting model is again not analytically invertible. Block neural autoregressive flow [BlockNAF, De Cao et al., 2019] directly fits an autoregressive monotonic neural network endtoend rather than parameterizing a sequence for each dimension as in NAF, but is also not analytically invertible.
Continuoustime flows
Rather than constructing a normalizing flow as a series of discrete steps, it is also possible to use a continuoustime flow, where the transformation from noise $\mathbf{u}$ to data $\mathbf{x}$ is described by an ordinary differential equation. Deep diffeomorphic flow [Salman et al., 2018] is one such instance, where the model is trained by backpropagation through an Euler integrator, and the Jacobian is computed approximately using a truncated power series and the Hutchison trace estimator [Hutchinson, 1990]. Neural ordinary differential equations [Neural ODEs, Chen et al., 2018] define an additional ODE which describes the trajectory of the flow’s gradient, avoiding the need to backpropagate through an ODE solver. A third ODE can be used to track the evolution of the log density, and the entire system can be solved with a suitable integrator. The resulting continuoustime flow is known as FFJORD [Grathwohl et al., 2018]. Like flows based on coupling layers, FFJORD is also invertible in ‘one pass’, but here this term refers to solving a system of ODEs, rather than performing a single neuralnetwork pass.
5 Experiments
Model  POWER  GAS  HEPMASS  MINIBOONE  BSDS300 

FFJORD${}^{\star}$  $0.46$  $8.59$  $14.92$  $10.43$  $157.40$ 
Glow  $0.42\pm 0.01$  $12.24\pm 0.03$  $16.99\pm 0.02$  $10.55\pm 0.45$  $156.95\pm 0.28$ 
QNSF (C)  $0.64\pm 0.01$  $12.80\pm 0.02$  $15.35\pm 0.02$  $9.35\pm 0.44$  $157.65\pm 0.28$ 
RQNSF (C)  $0.64\pm 0.01$  $13.09\pm 0.02$  $14.75\pm 0.03$  $9.67\pm 0.47$  $157.54\pm 0.28$ 
MAF  $0.45\pm 0.01$  $12.35\pm 0.02$  $17.03\pm 0.02$  $10.92\pm 0.46$  $156.95\pm 0.28$ 
QNSF (AR)  $0.66\pm 0.01$  $12.91\pm 0.02$  $14.67\pm 0.03$  $9.72\pm 0.47$  $157.42\pm 0.28$ 
NAF${}^{\star \u2020}$  $0.62\pm 0.01$  $11.96\pm 0.33$  $15.09\pm 0.40$  $8.86\pm 0.15$  $157.73\pm 0.04$ 
BlockNAF${}^{\star \u2020}$  $0.61\pm 0.01$  $12.06\pm 0.09$  $14.71\pm 0.38$  $8.95\pm 0.07$  $157.36\pm 0.03$ 
SOS${}^{\star \u2020}$  $0.60\pm 0.01$  $11.99\pm 0.41$  $15.15\pm 0.10$  $8.90\pm 0.11$  $157.48\pm 0.41$ 
RQNSF (AR)  $0.66\pm 0.01$  $13.09\pm 0.02$  $14.01\pm 0.03$  $9.22\pm 0.48$  $157.31\pm 0.28$ 
In our experiments, the neural network NN which computes the parameters of the elementwise transformations is a residual network [He et al., 2016a] with preactivation residual blocks [He et al., 2016b]. For autoregressive transformations, the layers must be masked so as to preserve autoregressive structure, and so we use the ResMADE architecture outlined by Nash and Durkan [2019]. Preliminary results indicated only minor differences in setting the tail bound $B$ within the range $[1,5]$, and so we fix a value $B=3$ across experiments, and find this to work robustly. We also fix the number of bins $K=8$ across our experiments, unless otherwise noted. We implement all invertible linear transformations using the LUdecomposition, where the permutation matrix $\mathbf{P}$ is fixed at the beginning of training, and the product $\mathrm{\mathbf{L}\mathbf{U}}$ is initialized to the identity. For all nonimage experiments, we define a flow ‘step’ as the composition of an invertible linear transformation with either a coupling or autoregressive transform, and we use 10 steps per flow in all our experiments, unless otherwise noted. All flows use a standardnormal noise distribution. We use the Adam optimizer [Kingma and Ba, 2014], and anneal the learning rate according to a cosine schedule [Loshchilov and Hutter, 2017]. In some cases, we find applying dropout [Srivastava et al., 2014] in the residual blocks beneficial for regularization. Full experimental details are provided in creftype B. Code is available online at https://github.com/bayesiains/nsf.
5.1 Density estimation of tabular data
We first evaluate our proposed flows using a selection of datasets from the UCI machinelearning repository Dheeru and Karra Taniskidou [2017] and BSDS300 collection of natural images [Martin et al., 2001]. We follow the experimental setup and preprocessing of Papamakarios et al. [2017], who make their data available online [Papamakarios, 2018]. We also update their MAF results using our codebase with ResMADE and invertible linear layers instead of permutations, providing a stronger baseline. For comparison, we modify the quadratic splines of Müller et al. [2018] to match the rationalquadratic transforms, by defining them on the range $[B,B]$ instead of $[0,1]$, and adding linear tails, also matching the boundary derivatives as in the rationalquadratic case. We denote this model QNSF. Our results are shown in creftype 1, where the midrule separates flows with onepass inverse from autoregressive flows.
Both RQNSF (C) and RQNSF (AR) achieve stateoftheart results for a normalizing flow on the Power, Gas, and Hepmass datasets, tied with QNSF (C) and QNSF (AR) on the Power dataset. Moreover, RQNSF (C) significantly outperforms both Glow and FFJORD, achieving scores competitive with the best autoregressive models. These results close the gap between autoregressive flows and flows based on coupling layers, and demonstrate that, in some cases, it may not be necessary to sacrifice onepass sampling for densityestimation performance.
5.2 Improving the variational autoencoder
Next, we examine our proposed flows in the context of the variational autoencoder [VAE, Kingma and Welling, 2014, Rezende et al., 2014], where they can act as both flexible prior and approximate posterior distributions. For our experiments, we use dynamically binarized versions of the MNIST dataset of handwritten digits [LeCun et al., 1998], and the EMNIST dataset variant featuring handwritten letters [Cohen et al., 2017]. We measure the capacity of our flows to improve over the commonly used baseline of a standardnormal prior and diagonalnormal approximate posterior, as well as over either coupling or autoregressive distributions with affine transformations. Quantitative results are shown in creftype 2, and image samples in creftype C.
All models improve significantly over the baseline, but perform very similarly otherwise, with most featuring overlapping error bars. Considering the disparity in densityestimation performance in the previous section, this is likely due to flows with affine transformations being sufficient to model the latent space for these datasets, with little scope for RQNSF flows to demonstrate their increased flexibility. Nevertheless, it is worthwhile to highlight that RQNSF (C) is the first class of model which can potentially match the flexibility of autoregressive models, and which requires no modification for use as either a prior or approximate posterior, due to its onepass invertibility.
MNIST  EMNIST  
Posterior/Prior  ELBO  $\mathrm{log}p(\mathbf{x})$  ELBO  $\mathrm{log}p(\mathbf{x})$ 
Baseline  $85.61\pm 0.51$  $81.31\pm 0.43$  $125.89\pm 0.41$  $120.88\pm 0.38$ 
Glow  $82.25\pm 0.46$  $79.72\pm 0.42$  $120.04\pm 0.40$  $117.54\pm 0.38$ 
RQNSF (C)  $82.08\pm 0.46$  $79.63\pm 0.42$  $119.74\pm 0.40$  $117.35\pm 0.38$ 
IAF/MAF  $82.56\pm 0.48$  $79.95\pm 0.43$  $119.85\pm 0.40$  $117.47\pm 0.38$ 
RQNSF (AR)  $82.14\pm 0.47$  $79.71\pm 0.43$  $119.49\pm 0.40$  $117.28\pm 0.38$ 
5.3 Generative modeling of images
Finally, we evaluate neural spline flows as generative models of images, measuring their capacity to improve upon baseline models with affine transforms. In this section, we focus solely on flows with a onepass inverse in the style of RealNVP [Dinh et al., 2017] and Glow [Kingma and Dhariwal, 2018]. We use the CIFAR10 [Krizhevsky and Hinton, 2009] and downsampled $64\times 64$ ImageNet [van den Oord et al., 2016c, Russakovsky et al., 2015] datasets, with original 8bit colour depth and with reduced 5bit colour depth. We use Glowlike architectures with either affine (in the baseline model) or rationalquadratic coupling transforms, and provide full experimental detail in creftype B. Quantitative results are shown in creftype 3, and samples are shown in creftype 3 and creftype C.
RQNSF (C) improves upon the affine baseline in three out of four tasks, and the improvement is most significant on the 8bit version of ImageNet64. At the same time, RQNSF (C) achieves scores that are competitive with the original Glow model, while significantly reducing the number of parameters required, in some cases by almost an order of magnitude. creftypecap 3 demonstrates that the model is capable of producing diverse, globally coherent samples which closely resemble real data. There is potential to further improve our results by replacing the uniform dequantization used in Glow with variational dequantization, and using more powerful networks with gating and selfattention mechanisms to parameterize the coupling transforms, both of which are explored by Ho et al. [2019].
CIFAR10 5bit  CIFAR10 8bit  ImageNet64 5bit  ImageNet64 8bit  

Model  BPD  Params  BPD  Params  BPD  Params  BPD  Params 
Baseline  $1.70$  5.2M  $3.41$  11.1M  $1.81$  14.3M  $3.91$  14.3M 
RQNSF (C)  $1.70$  5.3M  $3.38$  11.8M  $1.77$  15.6M  $3.82$  15.6M 
Glow${}^{\star}$  $1.67$  44.0M  $3.35$  44.0M  $1.76$  110.9M  $3.81$  110.9M 
6 Discussion
Longstanding probabilistic models such as copulas [Elidan, 2013] and Gaussianization [Chen and Gopinath, 2001] can simply represent complex marginal distributions that would require many layers of transformations in flowbased models like RealNVP and Glow. Differentiable splinebased coupling layers allow these flows, which are powerful ways to represent highdimensional dependencies, to model distributions with complex shapes more quickly. Our results show that when we have enough data, the extra flexibility of splinebased layers leads to better generalization.
For tabular density estimation, both RQNSF (C) and RQNSF (AR) excel on Power, Gas, and Hepmass, the datasets with the highest ratio of data points to dimensionality from the five considered. In image experiments, RQNSF (C) achieves the best results on the ImageNet dataset, which has over an order of magnitude more data points than CIFAR10. When the dimension is increased without a corresponding increase in dataset size, RQNSF still performs competitively with other approaches, but does not outperform them.
Overall, neural spline flows demonstrate that there is significant performance to be gained by upgrading the commonlyused affine transformations in coupling and autoregressive layers, without the need to sacrifice analytic invertibility. Monotonic spline transforms enable models based on coupling layers to achieve densityestimation performance on par with the best autoregressive flows, while retaining exact onepass sampling. These models strike a novel middle ground between flexibility and practicality, providing a useful offtheshelf tool for the enhancement of architectures like the variational autoencoder, while also improving parameter efficiency in generative modeling.
Rationalquadratic transforms are also a useful differentiable and invertible module in their own right, which could be included in many models that can be trained endtoend. For instance, monotonic warping functions with a tractable Jacobian determinant are useful for supervised learning [Snelson et al., 2004]. More generally, invertibility can be useful for training very large networks, since activations can be recomputed onthefly for backpropagation, meaning gradient computation requires memory which is constant instead of linear in the depth of the network [Gomez et al., 2017, MacKay et al., 2018]. Monotonic splines are one way of constructing invertible elementwise transformations, but there may be others. The benefits of research in this direction are clear, and so we look forward to future work in this area.
Acknowledgements
This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh. George Papamakarios was also supported by Microsoft Research through its PhD Scholarship Programme.
References
 Blinn [2007] J. F. Blinn. How to solve a cubic equation, part 5: Back to numerics. IEEE Computer Graphics and Applications, 27(3):78–89, 2007.
 Burda et al. [2016] Y. Burda, R. B. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. International Conference on Learning Representations, 2016.
 Chen et al. [2018] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 2018.
 Chen and Gopinath [2001] S. S. Chen and R. A. Gopinath. Gaussianization. Advances in Neural Information Processing Systems, 2001.
 Cohen et al. [2017] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik. EMNIST: an extension of MNIST to handwritten letters. arXiv:1702.05373, 2017.
 De Cao et al. [2019] N. De Cao, I. Titov, and W. Aziz. Block neural autoregressive flow. Conference on Uncertainty in Artificial Intelligence, 2019.
 Dheeru and Karra Taniskidou [2017] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Dinh et al. [2015] L. Dinh, D. Krueger, and Y. Bengio. NICE: Nonlinear independent components estimation. International Conference on Learning Representations, Workshop track, 2015.
 Dinh et al. [2017] L. Dinh, J. SohlDickstein, and S. Bengio. Density estimation using Real NVP. International Conference on Learning Representations, 2017.
 Durkan et al. [2019] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Cubicspline flows. Workshop on Invertible Neural Networks and Normalizing Flows, International Conference on Machine Learning, 2019.
 Elidan [2013] G. Elidan. Copulas in machine learning. Copulae in Mathematical and Quantitative Finance, 2013.
 Germain et al. [2015] M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked autoencoder for distribution estimation. International Conference on Machine Learning, 2015.
 Gomez et al. [2017] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse. The reversible residual network: Backpropagation without storing activations. Advances in Neural Information Processing Systems, 2017.
 Goodfellow et al. [2014] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 2014.
 Grathwohl et al. [2018] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. K. Duvenaud. FFJORD: Freeform continuous dynamics for scalable reversible generative models. International Conference on Learning Representations, 2018.
 Gregory and Delbourgo [1982] J. Gregory and R. Delbourgo. Piecewise rational quadratic interpolation to monotonic data. IMA Journal of Numerical Analysis, 2(2):123–130, 1982.
 He et al. [2016a] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2016a.
 He et al. [2016b] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. European Conference on Computer Vision, 2016b.
 Ho et al. [2019] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. Flow++: Improving flowbased generative models with variational dequantization and architecture design. International Conference on Machine Learning, 2019.
 Hoogeboom et al. [2019] E. Hoogeboom, R. van den Berg, and M. Welling. Emerging convolutions for generative normalizing flows. International Conference on Machine Learning, 2019.
 Huang et al. [2018] C.W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. International Conference on Machine Learning, 2018.
 Hutchinson [1990] M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics  Simulation and Computation, 19(2):433–450, 1990.
 Ioffe and Szegedy [2015] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning, 2015.
 Jaini et al. [2019] P. Jaini, K. A. Selby, and Y. Yu. Sumofsquares polynomial flow. International Conference on Machine Learning, 2019.
 Kim et al. [2018] S. Kim, S.g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv:1811.02155, 2018.
 Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.
 Kingma and Dhariwal [2018] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible $1\times 1$ convolutions. Advances in Neural Information Processing Systems, 2018.
 Kingma and Welling [2014] D. P. Kingma and M. Welling. Autoencoding variational Bayes. International Conference on Learning Representations, 2014.
 Kingma et al. [2016] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems, 2016.
 Krizhevsky and Hinton [2009] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
 Kumar et al. [2019] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma. VideoFlow: A flowbased generative model for video. arXiv:1903.01434, 2019.
 LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 LoaizaGanem et al. [2017] G. LoaizaGanem, Y. Gao, and J. P. Cunningham. Maximum entropy flow networks. International Conference on Learning Representations, 2017.
 Loshchilov and Hutter [2017] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. International Conference on Learning Representations, 2017.
 Louizos and Welling [2017] C. Louizos and M. Welling. Multiplicative normalizing flows for variational Bayesian neural networks. International Conference on Machine Learning, 2017.
 MacKay et al. [2018] M. MacKay, P. Vicol, J. Ba, and R. B. Grosse. Reversible recurrent neural networks. Advances in Neural Information Processing Systems, 2018.
 Martin et al. [2001] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. International Conference on Computer Vision, 2001.
 Müller et al. [2018] T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák. Neural importance sampling. arXiv:1808.03856, 2018.
 Nash and Durkan [2019] C. Nash and C. Durkan. Autoregressive energy machines. International Conference on Machine Learning, 2019.
 Papamakarios [2018] G. Papamakarios. Preprocessed datasets for MAF experiments, 2018. URL https://doi.org/10.5281/zenodo.1161203.
 Papamakarios et al. [2017] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems, 2017.
 Papamakarios et al. [2019] G. Papamakarios, D. C. Sterratt, and I. Murray. Sequential neural likelihood: Fast likelihoodfree inference with autoregressive flows. International Conference on Artificial Intelligence and Statistics, 2019.
 Prenger et al. [2018] R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flowbased generative network for speech synthesis. arXiv:1811.00002, 2018.
 Rezende and Mohamed [2015] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. International Conference on Machine Learning, 2015.
 Rezende and Viola [2018] D. J. Rezende and F. Viola. Taming VAEs. arXiv:1810.00597, 2018.
 Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning, 2014.
 Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Salimans et al. [2017] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. International Conference on Learning Representations, 2017.
 Salman et al. [2018] H. Salman, P. Yadollahpour, T. Fletcher, and K. Batmanghelich. Deep diffeomorphic normalizing flows. arXiv:1810.03256, 2018.
 Snelson et al. [2004] E. Snelson, C. E. Rasmussen, and Z. Ghahramani. Warped Gaussian processes. Advances in Neural Information Processing Systems, 2004.
 Srivastava et al. [2014] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Steffen [1990] M. Steffen. A simple method for monotonic interpolation in one dimension. Astronomy and Astrophysics, 239:443, 1990.
 Tomczak and Welling [2016] J. M. Tomczak and M. Welling. Improving variational autoencoders using Householder flow. arXiv:1611.09630, 2016.
 Uria et al. [2013] B. Uria, I. Murray, and H. Larochelle. RNADE: The realvalued neural autoregressive densityestimator. Advances in Neural Information Processing Systems, 2013.
 van den Berg et al. [2018] R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling. Sylvester normalizing flows for variational inference. Conference on Uncertainty in Artificial Intelligence, 2018.
 van den Oord et al. [2016a] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. ISCA Speech Synthesis Workshop, 2016a.
 van den Oord et al. [2016b] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves. Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems, 2016b.
 van den Oord et al. [2016c] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. International Conference on Machine Learning, 2016c.
 Ziegler and Rush [2019] Z. M. Ziegler and A. M. Rush. Latent normalizing flows for discrete sequences. arXiv:1901.10548, 2019.
Appendix A Monotonic rationalquadratic transforms
A.1 Parameterization of the spline
We here include an outline of the method of Gregory and Delbourgo [1982] which closely matches the original paper. Let ${\left\{({x}^{(k)},{y}^{(k)})\right\}}_{k=0}^{K}$ be a given set of knot points in the plane which satisfy
$({x}^{(0)},{y}^{(0)})=(B,B),$  (7)  
$({x}^{(K)},{y}^{(K)})=(B,B),$  (8)  
$$  (9) 
Let ${\left\{{d}^{(k)}\right\}}_{k=0}^{K}$ be the nonnegative derivative values at these knot points (we take ${d}^{(0)}={d}^{(K)}=1$ so that the spline matches the derivative of the linear tails). Given these quantities, the algorithm of Gregory and Delbourgo [1982] defines a monotonic rationalquadratic spline which passes through each knot and has the given derivative value at each knot as follows:

1.
Let ${w}^{(k)}={x}^{(k+1)}{x}^{(k)}$ be the bin widths, and ${s}^{(k)}=\left({y}^{(k+1)}{y}^{(k)}\right)/{w}^{(k)}$ be the slopes of the lines joining the coordinates.

2.
For $x\in [{x}^{(k)},{x}^{(k+1)}]$, let $\xi =\left(x{x}^{(k)}\right)/{w}^{(k)}$, so that $\xi \in [0,1]$.

3.
Then, for $x\in [{x}^{(k)},{x}^{(k+1)}]$, $k=0,\mathrm{\dots},K1$, define
$g(x)={\displaystyle \frac{{\alpha}^{(k)}(\xi )}{{\beta}^{(k)}(\xi )}},$ (10) where
${\alpha}^{(k)}(\xi )$ $={s}^{(k)}{y}^{(k+1)}{\xi}^{2}+\left[{y}^{(k)}{d}^{(k+1)}+{y}^{(k+1)}{d}^{(k)}\right]\xi (1\xi )+{s}^{(k)}{y}^{(k)}{(1\xi )}^{2},$ (11) ${\beta}^{(k)}(\xi )$ $={s}^{(k)}{\xi}^{2}+\left[{d}^{(k+1)}+{d}^{(k)}\right]\xi (1\xi )+{s}^{(k)}{(1\xi )}^{2}.$ (12)
Gregory and Delbourgo [1982] note that ${\beta}^{(k)}(\xi )$ can be rewritten as
${\beta}^{(k)}(\xi )={s}^{(k)}+\left[{d}^{(k+1)}+{d}^{(k)}2{s}^{(k)}\right]\xi (1\xi ),$  (13) 
so that the quotient can be written as
$\frac{{\alpha}^{(k)}(\xi )}{{\beta}^{(k)}(\xi )}}={y}^{(k)}+{\displaystyle \frac{({y}^{(k+1)}{y}^{(k)})\left[{s}^{(k)}{\xi}^{2}+{d}^{(k)}\xi (1\xi )\right]}{{s}^{(k)}+\left[{d}^{(k+1)}+{d}^{(k)}2{s}^{(k)}\right]\xi (1\xi )}},$  (14) 
which is less prone to numerical issues, especially for small values of ${s}^{(k)}$. Gregory and Delbourgo [1982] show that the spline defined by creftype 10 interpolates between the given knots, satisfies the derivative constraints at the knot points, and is monotonic on each bin. Rationalquadratic functions also provide flexibility over previous approaches: it is not possible to match arbitrary values and derivatives of a function at two boundary knots with a quadratic polynomial, or a monotonic segment of a cubic polynomial.
A.2 Computing the derivative
The derivative of creftype 10 is given by the quotient rule:
$\frac{\mathrm{d}}{\mathrm{d}x}}\left[{\displaystyle \frac{{\alpha}^{(k)}(\xi )}{{\beta}^{(k)}(\xi )}}\right]={\displaystyle \frac{\mathrm{d}}{\mathrm{d}\xi}}\left[{\displaystyle \frac{{\alpha}^{(k)}(\xi )}{{\beta}^{(k)}(\xi )}}\right]{\displaystyle \frac{\mathrm{d}\xi}{\mathrm{d}x}}={\displaystyle \frac{1}{{w}^{(k)}}}{\displaystyle \frac{{\beta}^{(k)}(\xi )\frac{\mathrm{d}}{\mathrm{d}\xi}\left[{\alpha}^{(k)}(\xi )\right]{\alpha}^{(k)}(\xi )\frac{\mathrm{d}}{\mathrm{d}\xi}\left[{\beta}^{(k)}(\xi )\right]}{{\left[{\beta}^{(k)}(\xi )\right]}^{2}}}.$  (15) 
It can be shown that
${\beta}^{(k)}(\xi ){\displaystyle \frac{\mathrm{d}{\alpha}^{(k)}(\xi )}{\mathrm{d}\xi}}{\alpha}^{(k)}(\xi ){\displaystyle \frac{\mathrm{d}{\beta}^{(k)}(\xi )}{\mathrm{d}\xi}}={w}^{(k)}{\left({s}^{(k)}\right)}^{2}\left[{d}^{(k+1)}{\xi}^{2}+2{s}^{(k)}\xi (1\xi )+{d}^{(k)}{(1\xi )}^{2}\right],$  (16) 
so that
$\frac{\mathrm{d}}{\mathrm{d}x}}\left[{\displaystyle \frac{{\alpha}^{(k)}(\xi )}{{\beta}^{(k)}(\xi )}}\right]={\displaystyle \frac{{\left({s}^{(k)}\right)}^{2}\left[{d}^{(k+1)}{\xi}^{2}+2{s}^{(k)}\xi (1\xi )+{d}^{(k)}{(1\xi )}^{2}\right]}{{\left[{s}^{(k)}+\left[{d}^{(k+1)}+{d}^{(k)}2{s}^{(k)}\right]\xi (1\xi )\right]}^{2}}}.$  (17) 
Since the rationalquadratic transform is monotonic and acts elementwise, the logarithm of the absolute value of the determinant of its Jacobian is given by a sum of the logarithm of creftype 17 for each transformed $x$.
A.3 Computing the inverse
Computing the inverse of a monotonic rationalquadratic transformation when the value to invert lies in the tails is trivial. The problem of inversion is thus reduced to computing the inverse of the monotonic rationalquadratic spline. Consider a rationalquadratic function
$y={\displaystyle \frac{\alpha (\xi (x))}{\beta (\xi (x))}}={\displaystyle \frac{{\alpha}_{0}+{\alpha}_{1}x+{\alpha}_{2}{x}^{2}}{{\beta}_{0}+{\beta}_{1}x+{\beta}_{2}{x}^{2}}},$  (18) 
which arises as the result of the algorithm outlined in creftype A.1. The coefficients are such that the function is monotonicallyincreasing in its associated bin. Inverting the function involves solving a quadratic equation:
$q(x)$  $=\alpha (\xi (x))y\beta (\xi (x))$  (19)  
$=a{x}^{2}+bx+c=\mathrm{\hspace{0.33em}0},$  (20) 
where the coefficients depend on the target output $y$:
$a={\alpha}_{2}{\beta}_{2}y,b={\alpha}_{1}{\beta}_{1}y,c={\alpha}_{0}{\beta}_{0}y.$  (21) 
Only one of the two solutions lies in the function’s associated bin. To identify the solution in general, we identify that along the line of corresponding $(x,y)$ values, $q(x)=0$ and so $\frac{\mathrm{d}q}{\mathrm{d}x}=0$, where
$\frac{\mathrm{d}q}{\mathrm{d}x}$  $={\displaystyle \frac{\partial q}{\partial x}}+{\displaystyle \frac{\partial q}{\partial y}}{\displaystyle \frac{\partial y}{\partial x}}$  (22)  
$={\displaystyle \frac{\partial q}{\partial x}}\underset{\text{+ve}}{\underset{\u23df}{\beta (\xi (x))}}\underset{\text{+ve}}{\underset{\u23df}{{\displaystyle \frac{\partial y}{\partial x}}}}=0.$  (23) 
We substituted a partial derivative of creftype 19, noted creftype 12 is positive, and noted that the spline $y(x)$ is monotonic and increasing. To satisfy creftype 23, $\frac{\partial q}{\partial x}>0$, which corresponds to this solution to creftype 20:
$x={\displaystyle \frac{b+\sqrt{{b}^{2}4ac}}{2a}}={\displaystyle \frac{2c}{b\sqrt{{b}^{2}4ac}}},$  (24) 
where the first form is more commonly quoted, but the second form is numerically more precise when $4ac$ is small.
We can rearrange creftype 14 to show
$a$  $=\left({y}^{(k+1)}{y}^{(k)}\right)\left[{s}^{(k)}{d}^{(k)}\right]+\left(y{y}^{(k)}\right)\left[{d}^{(k+1)}+{d}^{(k)}2{s}^{(k)}\right],$  (25)  
$b$  $=\left({y}^{(k+1)}{y}^{(k)}\right){d}^{(k)}\left(y{y}^{(k)}\right)\left[{d}^{(k+1)}+{d}^{(k)}2{s}^{(k)}\right],$  (26)  
$c$  $={s}^{(k)}\left(y{y}^{(k)}\right).$  (27) 
Appendix B Experimental details
B.1 Tabular density estimation
Model selection is performed using the standard validation splits for these datasets. We clip the norm of gradients to the range $[5,5]$, and find this helps stabilize training. We modify MAF by replacing permutations with invertible linear layers. Hyperparameter settings are shown for coupling flows in creftype 4 and autoregressive flows in creftype 5. We include the dimensionality and number of training data points in each table for reference. For higher dimensional datasets such as Hepmass and BSDS300, we found increasing the number of coupling layers beneficial. This was not necessary for Miniboone, where overfitting was an issue due to the low number of data points.
Power  Gas  Hepmass  Miniboone  BSDS300  
Dimension  6  8  21  43  63 
Train data points  1,615,917  852,174  315,123  29,556  1,000,000 
Batch size  512  512  256  128  512 
Training steps  400,000  400,000  400,000  200,000  400,000 
Learning rate  0.0005  0.0005  0.0005  0.0003  0.0005 
Flow steps  10  10  20  10  20 
Residual blocks  2  2  1  1  1 
Hidden features  256  256  128  32  128 
Bins  8  8  8  4  8 
Dropout  0.0  0.1  0.2  0.2  0.2 
Power  Gas  Hepmass  Miniboone  BSDS300  
Dimension  6  8  21  43  63 
Train data points  1,615,917  852,174  315,123  29,556  1,000,000 
Batch size  512  512  512  64  512 
Training steps  400,000  400,000  400,000  250,000  400,000 
Learning rate  0.0005  0.0005  0.0005  0.0003  0.0005 
Flow steps  10  10  10  10  10 
Residual blocks  2  2  2  1  2 
Hidden features  256  256  256  64  512 
Bins  8  8  8  4  8 
Dropout  0.0  0.1  0.2  0.2  0.2 
B.2 Improving the variational autoencoder
We use the Adam optimizer [Kingma and Ba, 2014] with default hyperparameters, annealing an initial learning rate of $0.0005$ to $0$ using a cosine schedule [Loshchilov and Hutter, 2017] over $150,000$ training steps with batch size $256$. We use a ‘warmup’ phase for the KL divergence term of the loss, where the multiplier for this term is initialized to $0.5$ and linearly increased to $1$ over the first $10\%$ of training. This modification initially reduces the penalty incurred by the approximate posterior in deviating from the prior, and similar schemes have been shown to improve VAE training dynamics [Rezende and Viola, 2018]. Model selection is performed using a heldout validation set of $10,000$ samples for MNIST, and $20,000$ samples for EMNIST.
We use 32 latent features, and residual nets use 2 blocks, with 64 latent features for coupling layers, and 128 latent features for autoregressive layers. Both coupling and autoregressive flows use 10 steps. As with the tabular densityestimation experiments, we modify IAF [Kingma et al., 2016] and MAF [Papamakarios et al., 2017] by replacing permutations with invertible linear layers using an LUdecomposition. All NSF models use 8 bins. The encoder and decoder architectures are set up exactly as described by Nash and Durkan [2019], and are similar to those used in IAF [Kingma et al., 2016] and NAF [Huang et al., 2018].
Conditioning the approximate posterior distribution $q(\mathbf{z}\mathbf{x})$ follows a multistage procedure. First, the encoder computes a context vector $\mathbf{h}$ of dimension $64$ as a function of the input $\mathbf{x}$. This vector is then mapped to the mean and diagonal covariance of a Gaussian distribution in the latent space. Then, $\mathbf{h}$ is also given as input to the residual nets in each of the flow’s coupling or autoregressive layers, where it is concatenated with the input $\mathbf{z}$, mapped to the required number of hidden features, and also used to modulate the additive update of each residual block with a sigmoid gate. We found this scheme to work well across experiments.
B.3 Generative modeling of images
For imagemodeling experiments we use a Glowlike model architecture introduced by Kingma and Dhariwal [2018, Section 3]. This involves stacking multiple steps for each level in the multiscale architecture of Dinh et al. [2017], where each step consists of an actnorm layer, an invertible $1\times 1$ convolution and a coupling transform. For our RQNSF (C) model, we make the following modifications to the original Glow model: we replace affine coupling transforms with rationalquadratic coupling transforms, we go back to residual convolutional networks as used in RealNVP Dinh et al. [2017], and we use an additional $1\times 1$ convolution at the end of each level of transforms. The basline model is the same as RQNSF (C), except that it uses affine coupling transforms instead of rationalquadratic ones. For CIFAR10 experiments we do not factor out dimensions at the end of each level, but still use the squeezing operation to trade spatial resolution for depth.
For all experiments we use 3 residual blocks and batch normalization [Ioffe and Szegedy, 2015] in the residual networks which parameterize the coupling transforms. We use 7 steps per level for all experiments, resulting in a total of 21 coupling transforms for CIFAR10, and 28 coupling transform for ImageNet64 (Glow models used by Kingma and Dhariwal [2018] use 96 and 192 affine coupling transforms for CIFAR10 and ImageNet64 respectively).
We use the Adam [Kingma and Ba, 2014] optimizer with default ${\beta}_{1}$ and ${\beta}_{2}$ values. An initial learning rate of 0.0005 is annealed to 0 following a cosine schedule [Loshchilov and Hutter, 2017]. We train for $100,000$ steps for 5bit experiments, and for $200,000$ steps for 8bit experiments. To track the performance of our models, we split off 1% of the training data to use as a development set. Due to the resource requirements of the experiments, we perform a limited manual hyperparameter exploration. Final values are reported in creftype 6.
We use a single NVIDIA Tesla P100 GPU card per CIFAR10 experiment, and two such cards per ImageNet64 experiment. Training for $200,000$ steps takes about $5$ days with this setup.
Dataset  Batch Size  Levels  Hidden Channels  Bins  Dropout  

CIFAR10  5bit  512  3  64  2  0.2 
8bit  512  3  96  4  0.2  
ImageNet64  5bit  256  4  96  8  0.1 
8bit  256  4  96  8  0.0 
Appendix C Additional experimental results
C.1 Affine coupling transforms for 2D datasets
Training data  Flow density  Flow samples 

C.2 Samples
Image samples for VAE experiments are shown in creftype 5. Additional samples for generative imagemodeling experiments are shown in creftype 6.
(a) MNIST  (b) EMNISTletters 