### Abstract

Normalising flows (NFS) map two density functions via a differentiablebijection whose Jacobian determinant can be computed efficiently. Recently, asan alternative to hand-crafted bijections, Huang et al. (2018) proposed neuralautoregressive flow (NAF) which is a universal approximator for densityfunctions. Their flow is a neural network (NN) whose parameters are predictedby another NN. The latter grows quadratically with the size of the former andthus an efficient technique for parametrization is needed. We propose blockneural autoregressive flow (B-NAF), a much more compact universal approximatorof density functions, where we model a bijection directly using a singlefeed-forward network. Invertibility is ensured by carefully designing eachaffine transformation with block matrices that make the flow autoregressive and(strictly) monotone. We compare B-NAF to NAF and other established flows ondensity estimation and approximate inference for latent variable models. Ourproposed flow is competitive across datasets while using orders of magnitudefewer parameters.

### Quick Read (beta)

# Block Neural Autoregressive Flow

###### Abstract

Normalising flows (NFs) map two density functions via a differentiable bijection whose Jacobian determinant can be computed efficiently. Recently, as an alternative to hand-crafted bijections, Huang et al. (2018) proposed neural autoregressive flow (NAF) which is a universal approximator for density functions. Their flow is a neural network (NN) whose parameters are predicted by another NN. The latter grows quadratically with the size of the former and thus an efficient technique for parametrization is needed. We propose *block neural autoregressive flow* (B-NAF), a much more compact universal approximator of density functions, where we model a bijection directly using a single feed-forward network. Invertibility is ensured by carefully designing each affine transformation with block matrices that make the flow autoregressive and (strictly) monotone. We compare B-NAF to NAF and other established flows on density estimation and approximate inference for latent variable models. Our proposed flow is competitive across datasets while using orders of magnitude fewer parameters.

oddsidemargin has been altered.

textheight has been altered.

marginparsep has been altered.

paperwidth has been altered.

textwidth has been altered.

marginparwidth has been altered.

marginparpush has been altered.

paperheight has been altered.

The page layout violates the UAI style.
Please do not change the page layout, or include packages like geometry,
savetrees, or fullpage, which change it for you.
We’re not able to reliably undo arbitrary changes to the style. Please remove
the offending package(s), or layout-changing commands and try again.

Block Neural Autoregressive Flow

Nicola De Cao University of Edinburgh University of Amsterdam [email protected] Wilker Aziz University of Amsterdam [email protected] Ivan Titov University of Edinburgh University of Amsterdam [email protected]

## 1 INTRODUCTION

Normalizing flows (NFs) map two probability density functions via an invertible transformation with tractable Jacobian (Tabak et al., 2010). They have been employed in contexts where we need to model a complex density while maintaining efficient sampling and/or density assessments. In density estimation, for example, NFs are used to map observations from a complex (and unknown) data distribution to samples from a simple base distribution (Rippel and Adams, 2013). In variational inference, NFs map from a simple fixed random source (e.g. a standard Gaussian) to a complex posterior approximation while allowing for reparametrized gradient estimates and density assessments (Rezende and Mohamed, 2015).

Much of the research in NFs focuses on designing expressive transformations while satisfying practical constraints. In particular, autoregressive flows (AFs) decompose a joint distribution over $y\in {\mathbb{R}}^{d}$ into a product of $d$ univariate conditionals. A transformation $y=f(x)$, that realizes such a decomposition, has a lower triangular Jacobian with the determinant (necessary for application of the change of variables theorem for densities) computable in $O(d)$-time. Kingma et al. (2016) proposed *inverse autoregressive flows* (IAFs), an AF based on transforming each conditional by a composition of a finite number of trivially invertible affine transformations.

Recently, Huang et al. (2018) introduced *neural autoregressive flows* (NAFs). They replace IAF’s transformation by a learned bijection realized as a strictly monotonic neural network. Notably, they prove that their method is a universal approximator of real and continuous distributions. Though, whereas parametrizing affine transformations in an IAF requires predicting $d$ pairs of scalars per step of the flow, parametrizing a NAF requires predicting all the parameters of a feed-forward *transformer* network. The *conditioner* network which parametrizes the transformer grows quadratically with the width of the transformer network, thus efficient parametrization techniques are necessary. A NAF is an instance of a hyper-network (Ha et al., 2017).

#### Contribution

We propose *block neural autoregressive flows* (B-NAFs)
^{1}^{1}
1
https://github.com/nicola-decao/BNAF,
which are AFs based on a novel transformer network which transforms conditionals directly, i.e. without the need for a conditioner network. To do that we exploit the fact that invertibility only requires $\text{pdv}*{y}_{i}{x}_{i}>0$, and therefore, careful design of a feed-forward network can ensure that the transformation is both autoregressive (with unconstrained manipulation of $$) and strictly monotone (with positive $\text{pdv}*{y}_{i}{x}_{i}$). We do so by organizing the weight matrices of dense layers in block matrices that independently transform subsets of the variables and constrain these blocks to guarantee that $\text{pdv}*{y}_{i}{x}_{j}=0$ for $j>i$ and that $\text{pdv}*{y}_{i}{x}_{i}>0$. Our B-NAFs are much more compact than NAFs while remaining universal approximators of density functions. We evaluate them both on density estimation and variational inference showing performance on par with state-of-the-art NFs, including NAFs, IAFs, and Sylvester flows (van den Berg et al., 2018), while using orders of magnitude fewer parameters.

## 2 BACKGROUND

In this section, we provide an introduction to normalizing flows and their applications (§ 2.1). Then, we motivate autoregressive flows in § 2.2 and present the necessary background for our contributions (§ 2.3).

### 2.1 NORMALIZING FLOW

A (finite) normalizing flow is a bijective function $f:\mathcal{X}\to \mathcal{Y}$ between two continuous random variables $X\in \mathcal{X}\subseteq {\mathbb{R}}^{d}$ and $Y\in \mathcal{Y}\subseteq {\mathbb{R}}^{d}$ (Tabak et al., 2010). The change of variables theorem expresses a relation between the probability density functions ${p}_{Y}(y)$ and ${p}_{X}(x)$:

$${p}_{Y}(y)={p}_{X}(x)\text{abs}det\mathbf{J}_{f(x)}{}^{-1},$$ | (1) |

where $y=f(x)$, and $\text{abs}det{\mathbf{J}}_{f(x)}$ is the absolute value of the determinant of the Jacobian of $f$ evaluated at $x$. The Jacobian matrix is defined as ${\left({\mathbf{J}}_{f(x)}\right)}_{ij}=\text{pdv}*f{(x)}_{i}{x}_{j}$. The determinant quantifies how $f$ locally expands or contracts regions of $\mathcal{X}$. Note that a composition of invertible functions remain invertible, thus a composition of NFs is itself a normalizing flow.

#### Density estimation

In parametric density estimation, NFs map draws from complex distributions to draws from simple ones allowing assessments of a complex likelihood. Effectively, observations $x\sim {p}_{\text{data}}$ are modeled as draws from an NF ${p}_{X|\theta}$ whose parameters $\theta $ are chosen to minimize the Kullback-Leibler divergence $\mathbb{K}\mathbb{L}({p}_{\text{data}}\parallel {p}_{X|\theta})$ of the model ${p}_{X|\theta}$ from ${p}_{\text{data}}$:

$$\mathbb{H}({p}_{\text{data}})-{\mathbb{E}}_{{p}_{\text{data}}}\left[\mathrm{log}{p}_{Y}(f(x))\text{abs}det{\mathbf{J}}_{{f}_{\theta}(x)}\right],$$ | (2) |

where $\mathbb{H}$ indicates the entropy. Minimizing such KL is equivalent to maximizing the log-likelihood of observations (see Appendix A for details of such derivation).

#### Variational inference

In variational inference for deep generative models, NFs map draws from a simple density ${q}_{X}$, e.g. $\mathcal{N}(0,I)$, to draws from a complex (multimodal) density ${q}_{Y|\theta}(y)$. The parameters $\theta $ of the flow are estimated to minimize the KL divergence $\mathbb{K}\mathbb{L}({q}_{Y|\theta}\parallel {p}_{Y})$ of the true posterior ${p}_{Y}$ from ${q}_{Y|\theta}$,

$${\mathbb{E}}_{{q}_{X}(x)}\left[\mathrm{log}\frac{{q}_{X}(x)\text{abs}det\mathbf{J}_{{f}_{\theta}(x)}{}^{-1}}{{p}_{Y}({f}_{\theta}(x))}\right],$$ | (3) |

and note that this enables backpropagation through Monte Carlo (MC) estimates of the KL.

#### Tractability of NFs

When optimizing normalizing flows with stochastic gradient descent, we have to evaluate a base density and compute the gradient with respect to its inputs. This poses no challenge as we have the flexibility to choose a simple density (e.g. uniform or Gaussian). In addition, for every $x$ (i.e. an observation in parametric density estimation (Equation 2) or a base sample in variational inference (Equation 3)), the term $|det{\mathbf{J}}_{{f}_{\theta}(x)}|$ has to be evaluated and differentiated with respect to the parameters $\theta $. Note that ${f}_{\theta}(x)$ is required to be invertible, as expressive as possible, and ideally fast to compute. In general, it is non-trivial to construct invertible transformations with efficiently computable $|det{\mathbf{J}}_{{f}_{\theta}(x)}|$. Computing the determinant of a generic Jacobian ${\mathbf{J}}_{{f}_{\theta}(x)}\in {\mathbb{R}}^{d\times d}$ runs in $\mathcal{O}({d}^{3})$-time. Our work and current research on NFs aim at constructing parametrized flows which meet efficiency requirements while maximizing the expressiveness of the densities they can represent.

### 2.2 AUTOREGRESSIVE FLOWS

We can construct $f(x)$ such that its Jacobian is lower triangular, and thus has determinant ${\prod}_{i=1}^{d}\text{pdv}*f{(x)}_{i}{x}_{i}$, which is computed in time $\mathcal{O}(d)$. Flows based on autoregressive transformations meet precisely this requirement (Kingma et al., 2016; Oliva et al., 2018; Huang et al., 2018). For a multivariate random variable $X=\u27e8{X}_{1},\mathrm{\dots},{X}_{d}\u27e9$ with $d>1$, we can use the chain rule to express the joint probability of $x$ as product of $d$ univariate conditional densities:

$$ | (4) |

When we then apply a normalizing flow to each univariate density, we have an autoregressive flow.
Specifically, we can use a set of functions ${f}^{(i)}$ that can be decomposed via *conditioners* ${c}^{(i)}$, and invertible *transformers* ${t}^{(i)}$:

$$ | (5) |

where each transformer ${t}^{(i)}$ must be an invertible function with respect to ${x}_{i}$, and each ${c}^{(i)}$ is an unrestricted function. The resulting flow has a lower triangular Jacobian since each ${y}_{i}$ depends only on ${x}_{\le i}$. The flow is invertible when the Jacobian is constructed to have non-zero diagonal entries.

### 2.3 NEURAL AUTOREGRESSIVE FLOW

The invertibility of the flow as a whole depends on each ${t}^{(i)}$ being an invertible function of ${x}_{i}$.
For example, Dinh et al. (2014) and Kingma et al. (2016) model each ${t}^{(i)}$ as an affine transformation whose parameters are predicted by ${c}^{(i)}$. As argued by Huang et al. (2018), these transformations were constructed to be trivially invertible, but their simplicity leads to a cap on expressiveness of $f$, thus requiring complex conditioners and a composition of multiple flows. They propose instead to learn a complex bijection using a neural network monotonic in ${x}_{i}$ — this only requires constraining ${t}^{(i)}$ to having non-negative weights and using strictly increasing activation functions. Figure 0(a) outlines a NAF. Each conditioner ${c}^{(i)}$ is an unrestricted function of $$. To parametrize a monotonically increasing transformer network ${t}^{(i)}$, the outputs of each conditioner ${c}^{(i)}$ are mapped to the positive real coordinate space by application of an appropriate activation (e.g. $\mathrm{exp}$). The result is a flexible transformation with lower triangular Jacobian whose diagonal elements are positive.^{2}^{2}
2
Note that the expressiveness of a NAF comes at the cost of analytic invertibility, i.e. though each ${t}^{(i)}$ is bijective, thus invertible in principle, inverting the network itself is non-trivial.

For efficient computation of all pseudo-parameters, as Huang et al. (2018) call the conditioners’ outputs, they use a masked autoregressive network (Germain et al., 2015). The Jacobian of a NAF is computed using the chain rule on ${f}_{\theta}$ through all its hidden layers ${\{{h}^{(\mathrm{\ell})}\}}_{\mathrm{\ell}=1}^{l}$:

$${\mathbf{J}}_{{f}_{\theta}(x)}=\left[{\text{grad}}_{{h}^{(l)}}{y}^{{(}}\right]\left[{\text{grad}}_{{h}^{(l-1)}}{h}^{(l)}\right]\mathrm{\dots}\left[{\text{grad}}_{x}{h}^{(1)}\right].$$ | (6) |

Since ${f}_{\theta}$ is autoregressive, ${\mathbf{J}}_{{f}_{\theta}(x)}$ is lower triangular and only the diagonal needs to be computed, i.e. $\text{pdv}*{y}_{i}{x}_{i}$ for each $i$. Thus, this operation requires only computing the derivatives of each ${t}^{(i)}$, reducing the time complexity.

Because the universal approximation theorem for densities holds for NAFs (Huang et al., 2018), increasing the expressiveness of a NAF is only a matter of employing larger transformer networks. However, the conditioner grows quadratically with the size of the transformer network and a combination of restricting the size of the transformer and a technique similar to *conditional weight normalization* (Krueger et al., 2017) is necessary to reduce the number of parameters. In §3 we propose to parametrize the transformer network directly, i.e. without a conditioner network, by exploiting the fact that the monotonicity constraint only requires $\text{pdv}*{y}_{i}{x}_{i}>0$, and therefore, careful design of a single feed-forward network can directly realize a transformation that is both autoregressive (with unconstrained manipulation of $$) and strictly monotone (with positive $\text{pdv}*{y}_{i}{x}_{i}$).

## 3 BLOCK NEURAL AUTOREGRESSIVE FLOW

In the spirit of NAFs, we model each ${f}_{\theta}^{(i)}({x}_{\le i})$
as a neural network with parameters $\theta $, but differently from NAFs, we do not predict $\theta $ using a conditioner network, and instead, we learn $\theta $ directly. In dense layers of ${f}_{\theta}^{(i)}$, we employ affine transformations with strictly positive weights to process ${x}_{i}$. This ensures strict monotonicity and thus invertibility of each ${f}_{\theta}^{(i)}$ with respect to ${x}_{i}$. However, we do not impose this constraint on affine transformations of $$. Additionally, we need to always use invertible activation functions to ensure that the whole network is bijective (e.g., $\mathrm{tanh}$ or LeakyReLU). Each ${f}_{\theta}^{(i)}$ is then a univariate flow implemented as an arbitrarily wide and deep neural network which can approximate any invertible transformation. Much like other AFs, we can efficiently compute all ${f}_{\theta}^{(i)}$ in parallel by employing a single masked autoregressive network (Germain et al., 2015). In the next section, we show how to construct each affine transformation using block matrices. From now on, we will refer to our novel family of flows as *block neural autoregressive flows* (B-NAFs).

### 3.1 AFFINE TRANSFORMATIONS WITH BLOCK MATRICES

For each affine transformation of $x$, we parametrize the bias term freely and we construct the weight matrix $W\in {\mathbb{R}}^{ad\times bd}$ as a lower triangular *block* matrix for some $a,b\ge 1$. We use $d\times (d+1)/2$ blocks ${B}_{ij}\in {\mathbb{R}}^{a\times b}$ for $i\in \{1,..,d\}$ and $1\le j\le i$. We let ${B}_{ij}$ (with $i>j$) be freely parametrized and constrain diagonal blocks to be strictly positive applying an element-wise function $g:\mathbb{R}\to {\mathbb{R}}_{>0}$ to each of them. Thus:

$$W=\left[\begin{array}{cccc}\hfill g({B}_{11})\hfill & \hfill 0\hfill & \hfill \mathrm{\dots}\hfill & \hfill 0\hfill \\ \hfill {B}_{21}\hfill & \hfill g({B}_{22})\hfill & \hfill \mathrm{\dots}\hfill & \hfill 0\hfill \\ \hfill \mathrm{\vdots}\hfill & \hfill \mathrm{\vdots}\hfill & \hfill \mathrm{\ddots}\hfill & \hfill \mathrm{\vdots}\hfill \\ \hfill {B}_{d1}\hfill & \hfill {B}_{d2}\hfill & \hfill \mathrm{\dots}\hfill & \hfill g({B}_{dd})\hfill \end{array}\right],$$ | (7) |

where we chose $g(\cdot )=\mathrm{exp}(\cdot )$. Since the flow has to preserve the input dimensionality, the first and the last affine transformations in the network must have $b=1$ and $a=1$, respectively. Inside the network, the size of $a$ and $b$ can grow arbitrarily.

The intuition behind the construction of $W$ is that every row of blocks ${B}_{i1},..,{B}_{ii}$ is a set of affine transformations (projections) that are processing ${x}_{\le i}$. In particular, blocks in the upper triangular part of $W$ are set to zero to make the flow autoregressive. Since the blocks ${B}_{ii}$ are mapped to ${\mathbb{R}}_{>0}$ through $g$, each transformation in such set is strictly monotonic for ${x}_{i}$ and unconstrained on $$.

#### B-NAF with masked networks

In practice, a more convenient parameterization of $W$ consists of using a full matrix $\widehat{W}\in {\mathbb{R}}^{ad\times bd}$ which is then transformed applying two *masking* operations. One mask ${M}_{d}\in {\{0,1\}}^{ad\times bd}$ selects only elements in the diagonal blocks, and a second one ${M}_{o}$ selects only off-diagonal and lower diagonal blocks. Thus, for each layer $\mathrm{\ell}$ we get

$${W}^{(\mathrm{\ell})}=g\left({\widehat{W}}^{(\mathrm{\ell})}\right)\odot {M}_{d}^{(\mathrm{\ell})}+{\widehat{W}}^{(\mathrm{\ell})}\odot {M}_{o}^{(\mathrm{\ell})},$$ | (8) |

where $\odot $ is the element-wise product. Figure 0(b) shows an outline of our block neural autoregressive flow.

Since each weight matrix ${W}^{(\mathrm{\ell})}$ has some strictly positive and some zero entries, we need to take care of a proper initialization which should take that into account. Indeed, weights are usually initialized to have a zero centred normally distributed output with variance dependent on the output dimensionality (Glorot and Bengio, 2010). Instead of carefully designing a new initialization technique to take care of this, we choose to initialize all blocks with a simple distribution and to apply weight normalization (Salimans and Kingma, 2016) to better cope the effect of such initialization. See Appendix C for more details.

When constructing a stacked flow though a composition of $n$ B-NAF transformations, we add gated residual connections for improving stability such that the composition is ${\widehat{f}}_{n}\circ \mathrm{\cdots}\circ {\widehat{f}}_{2}\circ {\widehat{f}}_{1}$ where ${\widehat{f}}_{i}(x)=\alpha {f}_{i}(x)+(1-\alpha )x$ and $\alpha \in (0,1)$ is a trainable scalar parameter.

### 3.2 AUTOREGRESSIVENESS AND INVERTIBILITY

In this section, we show that our flow ${f}_{\theta}:{\mathbb{R}}^{d}\to {\mathbb{R}}^{d}$ meets the following requirements: i) its Jacobian ${\mathbf{J}}_{{f}_{\theta}(x)}$ is lower triangular (needed for efficiency in computing its determinant), and ii) the diagonal entries of such Jacobian are positive (to ensure that ${f}_{\theta}$ is a bijection).

###### Proposition 1.

The final Jacobian ${\mathrm{J}}_{{f}_{\theta}\mathit{}\mathrm{(}x\mathrm{)}}$ of such transformation is lower triangular.

###### Proof sketch 1.

When applying the chain rule (Equation 6), the Jacobian of each affine transformation is $W$ (Equation (8), a lower triangular block matrix), whereas the Jacobian of each element-wise activation function is a diagonal matrix. A matrix multiplication between a lower triangular block matrix and a diagonal matrix yields a lower triangular block matrix, and a multiplication between two lower triangular block matrices results in a lower triangular block matrix. Therefore, after multiplying all matrices in the chain, the overall Jacobian is lower triangular.

###### Proposition 2.

When using strictly increasing activation functions (e.g., $\mathrm{tanh}$ or LeakyReLU), the diagonal entries of ${\mathrm{J}}_{{f}_{\theta}\mathit{}\mathrm{(}x\mathrm{)}}$ are strictly positive.

###### Proof sketch 2.

When applying the chain rule (Equation 6), the Jacobian of each affine transformation has strictly positive values in its diagonal blocks where the Jacobian of each element-wise activation function is a diagonal matrix with strictly positive elements. When using matrix multiplication between two lower triangular block matrices (or one diagonal and one lower triangular block matrix) $C\mathrm{=}A\mathit{}B$ the resulting blocks on the diagonal of $C$ are the result of a multiplication between only diagonal blocks of $A\mathrm{,}B$. Indeed, such resulting blocks depend only on blocks of the same row and column partition. Using the notation of Equation 7, the resulting diagonal blocks of $C$ are ${B}_{i\mathit{}i}^{\mathrm{(}C\mathrm{)}}\mathrm{=}g\mathit{}\mathrm{(}{B}_{i\mathit{}i}^{\mathrm{(}A\mathrm{)}}\mathrm{)}\mathit{}g\mathit{}\mathrm{(}{B}_{i\mathit{}i}^{\mathrm{(}B\mathrm{)}}\mathrm{)}$. Therefore, they are always positive. Eventually, using the chain rule, the final Jacobian is a lower triangular matrix with strictly positive elements in its diagonal.

### 3.3 EFFICIENT JACOBIAN COMPUTATION

Proposition 1 is particularly useful for an efficient computation of $det{\mathbf{J}}_{{f}_{\theta}(x)}$ since we only need the product of its diagonal elements $\text{pdv}*{y}_{i}{x}_{i}$. Thus, we can avoid computing the other entries. Since the determinant is the result of a product of positive values, we also remove the absolute-value operation resulting in

$$\mathrm{log}|det{\mathbf{J}}_{{f}_{\theta}(x)}|=\sum _{i=0}^{d}\mathrm{log}{\left({\mathbf{J}}_{{f}_{\theta}(x)}\right)}_{ii}.$$ | (9) |

Additionally, as per Proposition 2, when using matrix multiplication, elements in the diagonal blocks (or entries) depend only on diagonal blocks of the same row and column partition. Since all diagonal blocks/entries are positive, we compute them directly in the log-domain to have more numerically stable operations:

$\mathrm{log}$ | ${\left({\mathbf{J}}_{{f}_{\theta}(x)}\right)}_{ii}=\mathrm{log}g({B}_{ii}^{(\mathrm{\ell})})\star $ | (10) | ||

$\mathrm{log}{\mathbf{J}}_{{\sigma}^{(\mathrm{\ell})}({h}_{\alpha}^{(\mathrm{\ell}-1)})}\star \mathrm{\cdots}\star \mathrm{log}g({B}_{ii}^{(1)}),$ |

where $\star $ denotes the log-matrix multiplication, ${\sigma}^{(\mathrm{\ell})}$ the strictly increasing non-linear activation function at layer $\mathrm{\ell}$, and $\alpha $ indicates the set of indices corresponding to diagonal elements that depend on ${x}_{i}$. Notice that, since we chose $g(\cdot )=\mathrm{exp}(\cdot )$ we can remove all redundant operations $\mathrm{log}g(\cdot )$. The log-matrix multiplication $C=A\star B$ of two matrices $A\in {\mathbb{R}}^{m\times n}=\mathrm{log}\widehat{A}$ and $B\in {\mathbb{R}}^{n\times p}=\mathrm{log}\widehat{B}$ can be implemented with a stable *log-sum-exp* operation since

$${C}_{ij}=\mathrm{log}\sum _{k=1}^{n}\mathrm{exp}\left({\widehat{A}}_{ik}+{\widehat{B}}_{kj}\right).$$ | (11) |

A similar idea is also employed in NAF.

### 3.4 UNIVERSAL DENSITY APPROXIMATOR

In this section, we expose an intuitive proof sketch that our block neural autoregressive flow can approximate any real continuous probability density function (PDF).

Given a multivariate real and continuous random variable $X=\u27e8{X}_{1},\mathrm{\dots},{X}_{d}\u27e9$, its joint distribution can be factorized into a set of univariate conditional distributions (as in Equation 4), using an arbitrary ordering of the variables, and we can define a set of univariate conditional cumulative distribution functions (CDFs) $$. According to Hyvärinen and Pajunen (1999), such decomposition exists and each individual ${Y}_{i}$ is independent as well as uniformly distributed in $[0,1]$. Therefore, we can see ${F}_{X}$ as a particular *normalizing flow* that maps $X\in {\mathbb{R}}^{n}$ to $Y\in {[0,1]}^{n}$ where the distribution ${p}_{Y}$ is uniform in the hyper-cube ${[0,1]}^{n}$. Note that ${F}_{X}$ is an autoregressive function and its Jacobian has a positive diagonal since $$.

If each univariate flow ${f}_{\theta}^{(i)}$ (see Equation 5) can approximate any invertible univariate conditional CDF, then ${f}_{\theta}$ can approximate any PDF (Huang et al., 2018). Note that in general, a CDF $$ is non-decreasing, thus not necessary invertible (Park and Park, 2018). Using B-NAF, each CDF is approximated with an arbitrarily large neural network and the output can be eventually mapped to $(0,1)$ with a sigmoidal function. Recalling that we only use positive weights for processing ${x}_{i}$, a neural network with non-negative weights is an universal approximator of monotonic functions (Daniels and Velikova, 2010). We use *strictly* positive weights to approximate a *strictly* monotonic function for ${x}_{i}$ and we use arbitrary weights for $$ (as there is no monotonicity constraint for them). Therefore, B-NAF can approximate any invertible CDF, and thus its corresponding PDF.

## 4 RELATED WORK

Current research on NFs focuses on constructing expressive parametrized invertible trasformations with tractable Jacobians. Rezende and Mohamed (2015) were the first to suggest the use of parameterized flows in the context of variational inference proposing two parametric families: the planar and the radial flow. A drawback and bottleneck of such flows is that their power comes from stacking a large number of such transformations. More recently, van den Berg et al. (2018) generalized the use of planar flows showing improvements without increasing the number of transformations, and instead, by making each transformation more expressive.

In the context of density estimation, Germain et al. (2015) proposed MADE, a masked feed-forward network that efficiently computes an autoregressive transformation. MADEs are important building blocks in AFs, such as the inverse autoregressive flows (IAFs) introduced by Kingma et al. (2016). IAFs are based on trivially invertible affine transformations of the preceding coordinates of the input vector. The parameters of each transformation (a location and a positive scale) are predicted in parallel with a MADE, and therefore IAFs have a lower triangular Jacobian whose determinant is fast to evaluate. IAFs extend the parametric families available for approximate posterior inference in variational inference. Neural autoregressive flow (NAF) by Huang et al. (2018) extend IAFs by generalizing the bijective transformation to one that can approximate any monotonically increasing function. They have been used both for parametric density estimation and approximate inference. In §2.3 we explain NAFs in detail and contrast them with our proposed B-NAFs (also, see Figure 1).

Larochelle and Murray (2011) were among the first to employ neural networks for autoregressive density estimation (NADE), in particular, for high-dimensional binary data. Non-linear independent components estimation (NICE) explored the direction of learning a map from high-dimensional data to a latent space with a simpler factorized distribution (Dinh et al., 2014). Papamakarios et al. (2017) proposed masked autoregressive flows (MAFs) as a generalization of real non-volume-preserving flows (Real NVP) by Dinh et al. (2017) showing improvements on density estimation.

In this work, we are modelling a discrete normalizing flow since at each transformation a discrete step is made. Continuous normalizing flows (CNF) were proposed by Chen et al. (2018) and modelled through a network that instead of predicting the output of the transformation predicts its derivative. The resulting transformation is computed using an ODE solver. Grathwohl et al. (2019) further improved such formulation proposing free-form Jacobian of reversible dynamics (FFJORD).

Orthogonal work has been done for constructing powerful invertible function such as invertible $1\times 1$ convolution (Glow) by Kingma and Dhariwal (2018), invertible $d\times d$ convolutions (Hoogeboom et al., 2019), and invertible residual networks (Behrmann et al., 2018). Additionally, Oliva et al. (2018) investigated different possibilities for the *conditioner* of an autoregressive transformation (e.g., recurrent neural networks).

## 5 EXPERIMENTS

Model | POWER$\uparrow $ | GAS$\uparrow $ | HEPMASS$\uparrow $ | MINIBOONE$\uparrow $ | BSDS300$\uparrow $ |
---|---|---|---|---|---|

$d$=6;$N$=2,049,280 | $d$=8;$N$=1,052,065 | $d$=21;$N$=525,123 | $d$=43;$N$=36,488 | $d$=63;$N$=1,300,000 | |

Real NVP | $0.17$$\pm .01$ | $8.33$$\pm .14$ | $-18.71$$\pm .02$ | $-13.55$$\pm .49$ | $153.28$$\pm 1.78$ |

Glow | $0.17$$\pm .01$ | $8.15$$\pm .40$ | $-18.92$$\pm .08$ | $-11.35$$\pm .07$ | $155.07$$\pm .03$ |

MADE MoG | $0.40$$\pm .01$ | $8.47$$\pm .02$ | $-15.15$$\pm .02$ | $-12.27$$\pm .47$ | $153.71$$\pm .28$ |

MAF-affine | $0.24$$\pm .01$ | $10.08$$\pm .02$ | $-17.73$$\pm .02$ | $-12.24$$\pm .45$ | $155.69$$\pm .28$ |

MAF-affine MoG | $0.30$$\pm .01$ | $9.59$$\pm .02$ | $-17.39$$\pm .02$ | $-11.68$$\pm .44$ | $156.36$$\pm .28$ |

FFJORD | $0.46$$\pm .01$ | $8.59$$\pm .12$ | $-14.92$$\pm .08$ | $-10.43$$\pm .04$ | $157.40$$\pm .19$ |

NAF-DDSF | $0.62$$\pm .01$ | $11.96$$\pm .33$ | $-15.09$$\pm .40$ | $-8.86$$\pm .15$ | $157.43$$\pm .30$ |

TAN | $0.60$$\pm .01$ | $12.06$$\pm .02$ | $-13.78$$\pm .02$ | $-11.01$$\pm .48$ | $159.80$$\pm .07$ |

Ours | $0.61$$\pm .01$ | $12.06$$\pm .09$ | $-14.71$$\pm .38$ | $-8.95$$\pm .07$ | $157.36$$\pm .03$ |

### 5.1 DENSITY ESTIMATION ON TOY 2D DATA

In this experiment, we use our B-NAF to perform density estimation on $2$-dimensional data as this helps us visualizing the model capabilities to learn. We use the same toy data as Grathwohl et al. (2019) comparing the results with Glow (Kingma and Dhariwal, 2018), as they do. Given samples from a dataset with empirical distribution ${p}_{\text{data}}$, we parametrize a density ${p}_{X|\theta}$ with a normalizing flow ${p}_{X|\theta}(x)={p}_{Y}({f}_{\theta}(x))|det{\text{\mathbf{J}}}_{{f}_{\theta}(x)}|$ using B-NAF with ${p}_{Y}$ a standard Normal distribution. We train for $20$k iterations a single flow of B-NAF with $3$ hidden layers of $100$ units each using maximum likelihood estimation (i.e., maximizing ${\mathbb{E}}_{{p}_{\text{data}}}[\mathrm{log}{p}_{X|\theta}(x)]$, see Appendix A for more details and derivation of the objective). We used Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of $\alpha ={10}^{-1}$ (and decay of $0.5$ with patience of $2$k steps), default ${\beta}_{1},{\beta}_{2}$, and a batch size of $200$. We took figures of Glow from (Grathwohl et al., 2019) who trained such models with $100$ layers.

#### Results

The learned distributions of both Glow and our method can be seen in Figure 2. Glow is capable of learning a multi-modal distribution, but it has issues assigning the correct density in areas of low probability between disconnected regions. Our model is instead able to perfectly capture both multi-modality and discontinuities.

### 5.2 DENSITY MATCHING ON TOY 2D DATA

In this experiment, we use B-NAF to perform density matching on $2$-dimensional target energy functions to visualize the model capabilities of matching them. We use the same energy functions described by Rezende and Mohamed (2015) comparing the results with them (using planar flows). For this task, we train a parameterized flow minimizing the KL divergence between the learned ${q}_{Y|\theta}$ and the given target ${p}_{Y}$. We used a single flow using a B-NAF with $2$ hidden layers of $100$ units each. We train by minimizing $\mathbb{K}\mathbb{L}({q}_{Y|\theta}\parallel {p}_{Y})$ (see Appendix B for a detailed derivation) using Monte Carlo sampling. We optimized using Adam for $20$k iterations with an initial learning rate of $\alpha ={10}^{-2}$ (and decay of $0.5$ with patience of $2$k steps), default ${\beta}_{1},{\beta}_{2}$, and a batch size of $200$. Planar flow figures were taken from Chen et al. (2018). Note that planar flows were trained for 500k iterations using RMSProp (Tieleman and Hinton, 2012).

#### Results

Figure 3 shows that our model perfectly matches all target distributions. Indeed, on functions $3$ and $4$ it looks like B-NAF can better learn the density in certain areas. The model capacity of planar normalizing flows is determined by their depth ($K$) and Rezende and Mohamed (2015) had to stack $32$ flows to match the energy function reasonably well. Deeper networks are harder to optimize, and our flow matches all the targets using a neural network with only $2$ hidden layers.

### 5.3 REAL DATA DENSITY ESTIMATION

In this experiment, we use a B-NAF to perform density estimation on $5$ real datasets. Similarly to Section 5.1, we train using MLE maximizing ${\mathbb{E}}_{{p}_{\text{data}}}[\mathrm{log}{p}_{X|\theta}(x)]$. We compare our results against Real NVP (Dinh et al., 2017), Glow (Kingma and Dhariwal, 2018), MADE (Germain et al., 2015), MAF (Papamakarios et al., 2017), TAN (Oliva et al., 2018), NAF (Huang et al., 2018), and FFJORD (Grathwohl et al., 2019). For our B-NAF, we stacked $5$ flows and we employed a small grid search on the number of layers and the size of hidden units per flow ($L\in \{1,2\}$ and $H\in \{10d,20d,40d\}$, respectively, where $d$ is the input size of datapoints which is different for each dataset). When stacking B-NAF flows, the elements of each output vector are permuted so that a different set of elements is considered at each flow. This technique is not novel and it is also used by Dinh et al. (2017); Papamakarios et al. (2017); Kingma et al. (2016). We trained using Adam with Polyak averaging (with $\varphi =0.998$) as in NAF (Polyak and Juditsky, 1992). We also applied an exponentially decaying learning rate schedule (from $\alpha ={10}^{-2}$ with rate $\lambda =0.5$) based on no-improvement with patience of $20$ epochs. We trained until convergence (but maximum $1$k epochs), stopping after $100$ epochs without improvement on validation set.

#### Datasets

Following Papamakarios et al. (2017), we perform unconditional density estimation on four datasets (Dua and Karra Taniskidou, 2017) from UCI machine learning repository^{3}^{3}
3
http://archive.ics.uci.edu/ml as well as one dataset of patches of images (Martin et al., 2001): POWER containing electric power consumption in a household over a period of $4$ years, GAS containing logs of $8$ chemical sensors exposed to a mixture of gases, HEPMASS, a dataset from a Monte Carlo simulation for high energy physics experiments, MINIBOONE that contains examples of electron neutrino and muon neutrino, and BSDS300 which is obtained by extracting random patches from the homonym datasets of natural images.

#### Results

Table 1 shows the results of this experiment reporting log-likelihood on test set. In all datasets, our B-NAF is better than Real NVP, Glow, MADE, and MAF and it performs comparable or better to NAF. B-NAF also outperforms FFJORD in all dataset except on BSDS300 where there is a marginal difference ($$) between the two methods. On GAS and HEPMASS, B-NAF performs better than most of the other models and even better than NAF. In the other datasets, the gap in performance compared to NAF is marginal. We observed that in most datasets, the best performing model was the largest one in the grid search ($L=2$ and $H=40d$). It is possible that we do a too narrow hyper-parameter search compared to what other methods do. For instance, FFJORD results come from a wider grid search than ours. Grathwohl et al. (2019), Huang et al. (2018), and Oliva et al. (2018) also varied the number of flows during tuning.

We compare NAF and our B-NAF in terms of the number of parameters employed and report the ratio between the two for each dataset in Table 2. For datasets with low-dimensional datapoints (i.e, GAS and POWER) our model uses a comparable number of parameters to NAF. For high-dimensional datapoints the gap between the parameters used by NAF and B-NAF grows, with B-NAF much smaller, as we intended. For instance, on both HEPMASS and MINIBOONE, our models have marginal differences in performance with NAF while having respectively $\sim 18\times $ and $\sim 40\times $ fewer parameters than NAF. This evidence supports our argument that NAF models are over-parametrized and it is possible to achieve similar performance with an order of magnitude fewer parameters. Besides, when training models on GPUs, being memory efficiency allows to train more models in parallel on the same device. Additionally, in general, a normalizing flow can be a component of a larger architecture that might require more memory than the flow itself (as the models for experiments in the next Section).

Dataset | NAF (5) | NAF (10) | |
---|---|---|---|

POWER | ($d=6$) | $2.29$ | $4.57$ |

GAS | ($d=8$) | $1.30$ | $2.60$ |

HEPMASS | ($d=21$) | $17.94$ | $35.88$ |

MINIBOONE | ($d=43$) | $43.97$ | $87.91$ |

BSDS300 | ($d=63$) | $8.24$ | $16.48$ |

Model | MNIST | Freyfaces | Omniglot | Caltech 101 | ||||
---|---|---|---|---|---|---|---|---|

-ELBO$\downarrow $ | NLL$\downarrow $ | -ELBO$\downarrow $ | NLL$\downarrow $ | -ELBO$\downarrow $ | NLL$\downarrow $ | -ELBO$\downarrow $ | NLL$\downarrow $ | |

VAE | $86.55$$\pm .06$ | $82.14$$\pm .07$ | $4.53$$\pm .02$ | $4.40$$\pm .03$ | $104.28$$\pm .39$ | $97.25$$\pm .23$ | $110.80$$\pm .46$ | $99.62$$\pm .74$ |

Planar | $86.06$$\pm .31$ | $81.91$$\pm .22$ | $4.40$$\pm .06$ | $4.31$$\pm .06$ | $102.65$$\pm .42$ | $96.04$$\pm .28$ | $109.66$$\pm .42$ | $98.53$$\pm .68$ |

IAF | $84.20$$\pm .17$ | $80.79$$\pm .12$ | $4.47$$\pm .05$ | $4.38$$\pm .04$ | $102.41$$\pm .04$ | $96.08$$\pm .16$ | $111.58$$\pm .38$ | $99.92$$\pm .30$ |

Sylvester | $83.32$$\pm .06$ | $80.22$$\pm .03$ | $4.45$$\pm .04$ | $4.35$$\pm .04$ | $99.00$$\pm .04$ | $93.77$$\pm .03$ | $104.62$$\pm .29$ | $93.82$$\pm .62$ |

Ours | $83.59$$\pm .15$ | $80.71$$\pm .09$ | $4.42$$\pm .05$ | $4.33$$\pm .04$ | $100.08$$\pm .07$ | $94.83$$\pm .10$ | $105.42$$\pm .49$ | $94.91$$\pm .51$ |

### 5.4 VARIATIONAL AUTO-ENCODERS

An interesting application of our framework is modelling more flexible posterior distributions in a variational auto-encoder (VAE) setting (Kingma and Welling, 2013). In this setting, we assume that an observation $x$ (i.e., the data) is drawn from the marginal of a deep latent model, i.e. $X\sim {p}_{X|\theta}$, where ${p}_{X|\theta}(x)=\int {p}_{Z}(z){p}_{X|Z,\theta}(x|z)\text{dd}z$ where $Z\sim \mathcal{N}(0,I)$ is unobserved. The goal is performing maximum likelihood estimation of the marginal. Since $Z$ is not observed, maximizing the objective would require marginalization over the latent variables, which is generally intractable. Using variational inference (Jordan et al., 1999), we can maximize a lower bound on log-likelihood:

$$\mathrm{log}{p}_{X|\theta}(x)\ge {\mathbb{E}}_{{q}_{Z|X,\varphi}(z)}\left[\mathrm{log}\frac{{p}_{XZ|\theta}(x,z)}{{q}_{Z|X,\varphi}(z|x)}\right],$$ | (12) |

where ${p}_{X|Z,\theta}$ and ${q}_{Z|X,\varphi}$ are parametrized via neural networks with learnable parameters $\theta $ and $\varphi $ (Kingma and Welling, 2013), in particular, ${q}_{Z|X,\varphi}$ is an approximation to the intractable posterior ${p}_{Z|X,\theta}$.
This bound is called the evidence lower bound (ELBO), and maximizing the ELBO is equivalent to minimizing $\mathbb{K}\mathbb{L}({q}_{Z|X,\varphi}\parallel {p}_{Z|X,\theta})$. The more expressive the approximating family is, the more likely we are to obtain a tight bound. Recent literature approaches tighter bounds by approximating the posterior with normalizing flows. Also note that NFs reparametrize ${q}_{Z|X,\varphi}(z|x)={q}_{Y}({f}_{\varphi}(z;x))\text{abs}det{\mathbf{J}}_{{f}_{\varphi}(z;x)}$ via a simpler fixed base distribution, e.g. a standard Gaussian, and therefore we can follow stochastic gradient estimates of the ELBO with respect to both sets of parameters. In this experiment, we use our flow for posterior approximation showing that B-NAF compares with recently proposed NFs for variational inference. We reproduce experiments by van den Berg et al. (2018) (Sylvester flows or SNF) while replacing their flow with ours. We keep the encoder and decoder networks exactly the same to fairly compare with all models trained with such procedure. We compare our B-NAF to their flows on the same $4$ datasets as well as to a normal VAE (Kingma and Welling, 2013), planar flows (Rezende and Mohamed, 2015), and IAFs (Kingma et al., 2016).^{4}^{4}
4
Although also Huang et al. (2018) proposed an experiment with VAEs for NAF, they used only one dataset (MNIST) employed a different encoder/decoder architecture than van den Berg et al. (2018). Therefore, results are not comparable.

In this experiment, the input dimensionality of the flow is fixed to $d=64$. We employed a small grid search on the MNIST dataset on the number of flows $K\in \{4,8\}$, and on thee size of hidden units per flow $H\in \{2d,4d,8d\}$ while keeping the number of layers fixed at $L=1$. The elements of each output vector are permuted after each B-NAF flow (as we do in § 5.3). We keep the best hyper-parameters of this search for the other datasets. We train using Adamax with $\alpha =5\cdot {10}^{-4}$. We point to Appendix A of van den Berg et al. (2018) for details on the network architectures for the encoder and decoder.

#### Datasets

Following van den Berg et al. (2018) we carried our experiments on $4$ datasets: statically binarized MNIST (Larochelle and Murray, 2011), Freyfaces^{5}^{5}
5
http://www.cs.nyu.edu/~roweis/data/frey_rawface.mat, Omniglot (Lake et al., 2015) and Caltech 101 Silhouettes (Marlin et al., 2010). All those datasets consist of black and white images of different sizes.

#### Amortizing flow parameters

When using NFs in an amortized inference setting, the parameters of each flow are not learned directly but predicted with another function from each datapoint (Rezende and Mohamed, 2015). In our case, we do not amortize all parameters of B-NAF since that would require very large predictors and we want to keep our flow memory efficient. Alternatively, every affine matrix $W\in {\mathbb{R}}^{n\times m}$ is shared among all datapoints. Then, for each affine transformation, we achieve a degree of amortization by predicting $3$ vectors, the bias $b\in {\mathbb{R}}^{n}$ and $2$ vectors ${v}_{1}\in {\mathbb{R}}^{n}$ and ${v}_{2}\in {\mathbb{R}}^{m}$ that we multiply row- and column-wise respectively to $W$.

#### Results

Table 3 shows the results of these experiments. From the grid search, it turned out that the best B-NAF model has $K=8$ (flows) and $H=4d$ (hidden units). Note that the best models reported by van den Berg et al. (2018) used $16$ flows. Our model is quite flexible without being as deep as other models. Results show that B-NAF is better than normal VAE, planar flows, and IAFs in all four datasets. Although B-NAF performs slightly worse than Sylvester flows, van den Berg et al. (2018) applied a full amortization for the parameters of the flow, while we do not. They proposed two alternative parametrizations to construct Sylvester flows: orthogonal SNF and Houseolder SNF. For each datapoint, SNF has to predict from $50.7$k to $76.8$k values (depending on the parametrization) to fully amortize parameters of the flow, while we use only $7.7$k (i.e., $6.64\times $ to $10.0\times $ fewer). Notably, recalling that these are not trainable parameters, we use $6.16\times $ (orthogonal SNF) and $9.35\times $ (Householder SNF) fewer trainable parameters as well. Besides, we also use $14.45\times $ fewer parameters than IAF. This shows that IAF and SNF are over-parametrized too, and it is possible to achieve similar performance in the context of variational inference with an order of magnitude fewer parameters.

## 6 CONCLUSIONS

We present a new family of flexible normalizing flows, *block neural autoregressive flows*. B-NAFs are universal approximators of density functions and maintain an efficient parametrization. Our flow is based on directly parametrizing a transformation that guarantees autoregressiveness and monotonicity without the need for large conditioner networks and without compromising parallelism. Compared to the only other flow (to the best of our knowledge) which is also a universal approximator, our B-NAFs require orders of magnitudes fewer parameters. We validate B-NAFs on parametric density estimation on toy and real datasets, as well as, on approximate posterior inference for deep latent variable models, showing favorable performance across datasets and against various established flows. For future work, we are interested in at least two directions. One concerns gaining access to the inverse of the flow—note that, while B-NAFs and NAFs are invertible in principle, their inverses are not available in closed form. Another concerns deep generative models with large decoders (e.g. in natural language processing applications): since we achieve high flexibility at a low memory footprint our flows seem to be a good fit.

### Acknowledgements

We would like to thank George Papamakarios and Luca Falorsi for insightful discussions. This project is supported by SAP Innovation Center Network, ERC Starting Grant BroadSem (678254) and the Dutch Organization for Scientific Research (NWO) VIDI 639.022.518. Wilker Aziz is supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299 (Gourmet).

## References

- Behrmann et al. (2018) Behrmann, J., Duvenaud, D., and Jacobsen, J.-H. (2018). Invertible residual networks. arXiv preprint arXiv:1811.00995.
- Chen et al. (2018) Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. (2018). Neural ordinary differential equations. Advances in Neural Information Processing Systems.
- Daniels and Velikova (2010) Daniels, H. and Velikova, M. (2010). Monotone and partially monotone neural networks. IEEE Transactions on Neural Networks, 21(6):906–917.
- Dinh et al. (2014) Dinh, L., Krueger, D., and Bengio, Y. (2014). Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
- Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using real NVP. Proceedings of the 5th International Conference on Learning Representations (ICLR).
- Dua and Karra Taniskidou (2017) Dua, D. and Karra Taniskidou, E. (2017). UCI machine learning repository.
- Germain et al. (2015) Germain, M., Gregor, K., Murray, I., and Larochelle, H. (2015). Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889.
- Glorot and Bengio (2010) Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256.
- Grathwohl et al. (2019) Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. (2019). FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models. International Conference on Learning Representations.
- Ha et al. (2017) Ha, D., Dai, A., and Le, Q. V. (2017). Hypernetworks. International Conference on Learning Representations.
- Hoogeboom et al. (2019) Hoogeboom, E., Berg, R. v. d., and Welling, M. (2019). Emerging convolutions for generative normalizing flows. arXiv preprint arXiv:1901.11137.
- Huang et al. (2018) Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. (2018). Neural autoregressive flows. International Conference on Learning Representations.
- Hyvärinen and Pajunen (1999) Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439.
- Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37(2):183–233.
- Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR).
- Kingma and Dhariwal (2018) Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245.
- Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751.
- Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. International Conference on Learning Representations.
- Krueger et al. (2017) Krueger, D., Huang, C.-W., Islam, R., Turner, R., Lacoste, A., and Courville, A. (2017). Bayesian hypernetworks. arXiv preprint arXiv:1710.04759.
- Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
- Larochelle and Murray (2011) Larochelle, H. and Murray, I. (2011). The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37.
- Marlin et al. (2010) Marlin, B., Swersky, K., Chen, B., and Freitas, N. (2010). Inductive principles for restricted boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 509–516.
- Martin et al. (2001) Martin, D., Fowlkes, C., Tal, D., and Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference, volume 2, pages 416–423. IEEE.
- Oliva et al. (2018) Oliva, J., Dubey, A., Zaheer, M., Poczos, B., Salakhutdinov, R., Xing, E., and Schneider, J. (2018). Transformation autoregressive networks. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3898–3907, Stockholmsmässan, Stockholm Sweden. PMLR.
- Papamakarios et al. (2017) Papamakarios, G., Pavlakou, T., and Murray, I. (2017). Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347.
- Park and Park (2018) Park, K. I. and Park (2018). Fundamentals of Probability and Stochastic Processes with Applications to Communications. Springer.
- Polyak and Juditsky (1992) Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.
- Rezende and Mohamed (2015) Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pages 1530–1538. JMLR. org.
- Rippel and Adams (2013) Rippel, O. and Adams, R. P. (2013). High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125.
- Salimans and Kingma (2016) Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909.
- Tabak et al. (2010) Tabak, E. G., Vanden-Eijnden, E., et al. (2010). Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233.
- Tieleman and Hinton (2012) Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop, coursera: Neural networks for machine learning. University of Toronto, Technical Report.
- van den Berg et al. (2018) van den Berg, R., Hasenclever, L., Tomczak, J. M., and Welling, M. (2018). Sylvester normalizing flows for variational inference. 34th Conference on Uncertainty in Artificial Intelligence (UAI18).

## Appendix A OBJECTIVE FOR DENSITY ESTIMATION

When performing density estimation for a random variable $X$, we only have access to samples from the unknown target distribution $X\sim {p}_{\star}$ (i.e., the unknown data distribution) but we do not have access to ${p}_{\star}$ directly (Papamakarios et al., 2017). Using Equation 1, we can use a normalizing flow to transform a complex parametric model ${p}_{X|\theta}$ of the target distribution into a simpler distribution ${p}_{Y}$ (i.e., a uniform or a Normal distribution), which can be easily evaluated. In this case, we will learn the parameters $\theta $ of the model by minimizing $\mathbb{K}\mathbb{L}({p}_{\star}\parallel {p}_{X|\theta})$:

${\theta}^{*}$ | $=\underset{\theta}{\mathrm{min}}\mathbb{K}\mathbb{L}({p}_{\star}\parallel {p}_{X|\theta})$ | (13) | ||

$=\underset{\theta}{\mathrm{min}}{\mathbb{E}}_{{p}_{\star}(x)}\left[\mathrm{log}{\displaystyle \frac{{p}_{\star}(x)}{{p}_{X|\theta}(x)}}\right]$ | (14) | |||

$=\underset{\theta}{\mathrm{min}}\underset{\text{=constant}}{\underset{\u23df}{{\mathbb{E}}_{{p}_{\star}(x)}[\mathrm{log}{p}_{\star}(x)]}}-{\mathbb{E}}_{{p}_{\star}(x)}[\mathrm{log}{p}_{X|\theta}(x)]$ | (15) | |||

$=\underset{\theta}{\mathrm{max}}{\mathbb{E}}_{{p}_{\star}(x)}\left[\mathrm{log}{p}_{Y}({f}_{\theta}(x))+\mathrm{log}\text{abs}det{\mathbf{J}}_{{f}_{\theta}(x)}\right].$ | (16) |

where ${p}_{X|\theta}(x)={p}_{Y}(y)\text{abs}det{\mathbf{J}}_{{f}_{\theta}(x)}$ and $y={f}_{\theta}(t)$. Notice that minimizing the KL is equivalent of doing maximum likelihood estimation (MLE).

## Appendix B OBJECTIVE FOR DENSITY MATCHING

We can learn how to sample from a complex target distribution ${p}_{\star}$ (or, more generally, an energy function) for which we have access to its analytical form but we do not have an available sampling procedure. Using Equation 1, we can use a normalizing flow to transform samples from a simple distribution ${p}_{X}$, which we can easily evaluate and sample from, to a complex one (the target). In this case, we estimate $\theta $ by minimizing $\mathbb{K}\mathbb{L}({p}_{Y|\theta}\parallel {p}_{\star})$:

${\theta}^{*}$ | $=\underset{\theta}{\mathrm{min}}\mathbb{K}\mathbb{L}({p}_{Y|\theta}\parallel {p}_{\star})$ | (17) | ||

$=\underset{\theta}{\mathrm{min}}{\mathbb{E}}_{{p}_{Y|\theta}(y)}\left[\mathrm{log}{\displaystyle \frac{{p}_{Y|\theta}(y)}{{p}_{\star}(y)}}\right]$ | (18) | |||

$=\underset{\theta}{\mathrm{min}}{\mathbb{E}}_{{p}_{Y|\theta}(y)}[\mathrm{log}{p}_{Y|\theta}(y)-\mathrm{log}{p}_{\star}(y)]$ | (19) | |||

$=\underset{\theta}{\mathrm{min}}{\mathbb{E}}_{{p}_{X}(x)}[\mathrm{log}{p}_{X}(x)-\mathrm{log}\text{abs}det{\mathbf{J}}_{{f}_{\theta}(x)}-\mathrm{log}{p}_{\star}({f}_{\theta}(x))].$ | (20) |

where ${p}_{Y|\theta}(y)={p}_{X}(x)\text{abs}det\mathbf{J}_{{f}_{\theta}(x)}{}^{-1}$ and $y={f}_{\theta}(x)$. Notice that in general, with normalizing flows, it is possible to learn a flexible distribution from which we can sample and evaluate the density of its samples. These two proprieties are particularly useful in the context of variational inference (Rezende and Mohamed, 2015).

## Appendix C WEIGHT INITIALIZATION AND NORMALIZATION

Since the weight matrix $W$ has some strictly positive and some zero entries, we need to take care of a proper initialization. Indeed, it is well known that principled parameters initialization benefits not only training but also the generalization of neural networks (Glorot and Bengio, 2010). For instance, Xavier initialization is commonly used and it takes into account the size of the input and output spaces in the affine transformations. However, since we have some zero entries, we cannot benefit from it. We choose instead to initialize all blocks with a simple distribution and to apply weight normalization (Salimans and Kingma, 2016) to better regulate the effect of such initialization. Weight normalization decomposes each row $w\in {\mathbb{R}}^{b\cdot d}$ of $W$ in terms of the new parameters using $w=\mathrm{exp}(s)\cdot v/\parallel v\parallel $ where $v$ has the same dimensionality of $w$ and $s$ is a scalar. We initialize $v$ with a simple Normal distribution of zero mean and unit variance and $s=\mathrm{log}(u)$ with $u\sim \mathcal{U}(0,1)$. Such reparametrization disentangles the direction and magnitude of $w$ and it is known to improve and speed up optimization.