Abstract
In this paper, we study deep diagonal circulant neural networks, that is deepneural networks in which all weight matrices are the product of diagonal andcirculant ones. We show that these networks outperform the recently introduceddeep networks with other types of structured layers. Besides introducingprincipled techniques for training these models, we provide theoreticalguarantees regarding their expressivity. Indeed, we prove that the functionspace spanned by diagonal circulant networks of bounded depth includes the onespanned by dense networks with specific properties on their rank. We conduct athorough experimental study to compare the performance of deep diagonalcirculant networks with state of the art models based on structured matricesand with dense models. We show that our models achieve better accuracy thantheir structured alternatives while required 2x fewer weights as the next bestapproach. Finally we train deep diagonal circulant networks to build a compactand accurate models on a real world video classification dataset with over 3.8million training examples.
Quick Read (beta)
On the Expressive Power of Deep Diagonal Circulant Neural Networks
Abstract
In this paper, we study deep diagonal circulant neural networks, that is deep neural networks in which all weight matrices are the product of diagonal and circulant ones. We show that these networks outperform the recently introduced deep networks with other types of structured layers. Besides introducing principled techniques for training these models, we provide theoretical guarantees regarding their expressivity. Indeed, we prove that the function space spanned by diagonal circulant networks of bounded depth includes the one spanned by dense networks with specific properties on their rank. We conduct a thorough experimental study to compare the performance of deep diagonal circulant networks with state of the art models based on structured matrices and with dense models. We show that our models achieve better accuracy than their structured alternatives while required 2x fewer weights as the next best approach. Finally we train deep diagonal circulant networks to build a compact and accurate models on a real world video classification dataset with over 3.8 million training examples.
positioning \usetikzlibraryarrows.meta \pgfplotssetcompat=1.14
On the Expressive Power of Deep Diagonal Circulant Neural Networks
Alexandre Araujo${}^{\mathrm{1}\mathrm{,}\mathrm{2}}$ Benjamin Negrevergne${}^{\mathrm{2}}$ Yann Chevaleyre${}^{\mathrm{2}}$ Jamal Atif${}^{\mathrm{2}}$ ${}^{1}$Wavestone, Paris, France ${}^{2}$Université ParisDauphine, PSL Research University, CNRS, LAMSADE, Paris, France
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
Deep neural networks with structured matrices (e.g. low rank matrices, Toeplitz matrices, LDR, etc.) have been been the focus of intensive research in the last years, in particular for building compact models with reduced memory footprints. Despite a number of successful results, it remains unclear whether structured layers alone are expressive enough to replace all dense layers in deep feedforward neural networks. In particular the following questions remain open: “What is the expressive power of compact representations compared to the original ones?”, “What functions can or cannot be approximated by neural networks when all dense matrices are replaced by a small set of structured matrices?” and “How can we efficiently train deep neural networks with a large number of structured layers?”.
In this paper, we provide a principled answer to the questions mentioned above on the particular case of deep neural networks based on diagonal and circulant matrices (a.k.a. Diagonalcirculant networks or DCNNs). The idea of using diagonal and circulant matrices together comes from a series of results in linear algebra by MüllerQuade et al. (1998) and Huhtanen & Perämäki (2015). The most recent result from Huhtanen & Perämäki demonstrates that any matrix $A$ in ${\u2102}^{n\times n}$ can be decomposed into the product of $2n1$ alternating diagonal and circulant matrices. This result was then used by Moczulski et al. (2015) to introduce the AFDF structured layer, which is the building block of DCNNs.
We follow the same line of research as Moczulski et al. (2015) and bring new contributions. Remarkably, we were able to train deep DCNNs (up to 40 layers) and apply this framework in the context of a large scale realworld scneraio (the YouTube8M video classification dataset). In addition, we endow these networks with strong theoretical guarantees by showing that DCNNs with ReLU activations and with bounded width and small depth can approximate any dense neural network with arbitrary precision. More precisely, we make the following contributions:

1.
We enhance the results by Huhtanen & Perämäki (2015) by introducing a new bound on the number diagonalcirculant required to approximate a matrix that depends on its rank.

2.
We introduce several results regarding DCNNs with ReLU activations. In particular, we are able to demonstrate that a DCNN with bounded width and small depth can approximate any dense networks with ReLU activations.

3.
We describe a theoretically sound initialization procedure for DCNN. These results simplify the training and removes a large number of complex hyperparameters which made DCNNs difficult to train in practice up to now.

4.
We provide a number of empirical insights to explain the behaviour of DCNNs. More specifically, we show the impact of the number of the nonlinearities in the network on the convergence rate and the accuracy of the network.

5.
By combining all these insights, we are able to train large and deep DCNNs. We demonstrate the good performance of DCNNs on a large scale application (the YouTube8M video classification problem) and obtain very competitive accuracy .
2 Related Work
Structured matrices exhibit a number of good properties which have been exploited by deep learning practitioners, mainly to compress large neural networks architectures into smaller ones. For example Hinrichs & Vybíral (2011) have demonstrated that a single circulant matrix can be used to approximate the JohsonLindenstrauss transform, often used in machine learning to perform dimensionality reduction. Building upon this result, Cheng et al. (2015) proposed to replace the weight matrix of a fully connected layer by a circulant matrix effectively replacing the complex transform modeled by the fully connected layer by a simple dimensionality reduction. Despite the reduction of expressivity, the resulting network demonstrated good accuracy using only a fraction of its original size (90% reduction).
ACDC. Moczulski et al. (2015) have introduced two Structured Efficient Linear Layers (SELL) called AFDF and ACDC. The AFDF structured layer benefits from the theoretical results introduced by Huhtanen & Perämäki and can be seen the building block of DCNNs. However, Moczulski et al. (2015) only experiment using ACDC, a different type of layer that does not involve circulant matrices. As far as we can tell, the theoretical guarantees available for the AFDF layer do not apply on the ACDC layer since the cosine transform does not diagonalize circulant matrices (Sanchez et al., 1995). Another possible limit of the ACDC paper is that they only train large neural networks involving ACDC layers combined with many other expressive layers. Although the resulting network demonstrates good accuracy, it is difficult the characterize the true contribution of the ACDC layers in this setting.
Low displacement rank structures. More recently, Thomas et al. (2018) have generalized these works by proposing neural networks with lowdisplacement rank matrices (LDR), that are structured matrices encompassing a large family of structured matrices, including Toeplitzlike, Vandermondelike, Cauchylike and more notably DCNNs. To obtain this result, LDR represents a structured matrix using two displacement operators and a lowrank residual. Despite being elegant and general, we found that the LDR framework suffers from several limits which are inherent to its generality, and makes it difficult to use in the context of large and deep neural networks. First, the training procedure for learning LDR matrices is highly involved and implies many complex mathematical objects such as Krylov matrices. Then, as acknowledged by the authors, the number of parameters required to represent a given structured matrix (e.g. a Toeplitz matrix) in practice is unnecessarily high (higher than required in theory).
Other compression techniques. Besides structured matrices, a variety of techniques have been proposed to build more compact deep learning models. These include model distillation (Hinton et al., 2015), Tensor Train (Novikov et al., 2015), Lowrank decomposition (Denil et al., 2013), to mention a few. However, Circulant networks show good performances in several contexts (the interested reader can refer to the results reported by Moczulski et al. (2015) and Thomas et al. (2018)).
3 A primer on circulant matrices and a new result
An nbyn circulant matrix $C$ is a matrix where each row is a cyclic right shift of the previous one as illustrated below.
$$C=circ(c)=\left[\begin{array}{ccccc}\hfill {c}_{0}\hfill & \hfill {c}_{n1}\hfill & \hfill {c}_{n2}\hfill & \hfill \mathrm{\dots}\hfill & \hfill {c}_{1}\hfill \\ \hfill {c}_{1}\hfill & \hfill {c}_{0}\hfill & \hfill {c}_{n1}\hfill & & \hfill {c}_{2}\hfill \\ \hfill {c}_{2}\hfill & \hfill {c}_{1}\hfill & \hfill {c}_{0}\hfill & & \hfill {c}_{3}\hfill \\ \hfill \mathrm{\vdots}\hfill & & & \hfill \mathrm{\ddots}\hfill & \hfill \mathrm{\vdots}\hfill \\ \hfill {c}_{n1}\hfill & \hfill {c}_{n2}\hfill & \hfill {c}_{n3}\hfill & & \hfill {c}_{0}\hfill \end{array}\right]$$ 
Circulant matrices exhibit several interesting properties from the perspective of numerical computations. Most importantly, any $n$by$n$ circulant matrix $C$ can be represented using only $n$ coefficients instead of the ${n}^{2}$ coefficients required to represent classical unstructured matrices. In addition, the matrixvector product is simplified from $O({n}^{2})$ to $O(nlog(n))$ using the convolution theorem.
As we will show in this paper, circulant matrices also have a strong expressive power. So far, we know that a single circulant matrix can be used to represent a variety of important linear transforms such as random projections (Hinrichs & Vybíral, 2011). When they are combined with diagonal matrices, they can also be used as building blocks to represent any linear transform (Schmid et al., 2000; Huhtanen & Perämäki, 2015) with an arbitrary precision. Huhtanen & Perämäki were able to bound the number of factors that is required to approximate any matrix $A$ with arbitrary precision. We recall this result in Theorem 1 as it is the starting point of our theoretical analysis (note that in the rest of the paper, $\parallel \cdot \parallel $ denotes to the ${\mathrm{\ell}}_{2}$ norm when applied to vectors, and to the operator norm when applied to matrices).
Theorem 1.
(Reformulation Huhtanen & Perämäki (2015)) For every matrix $A\mathrm{\in}{\mathrm{C}}^{n\mathrm{\times}n}$, for any $\u03f5\mathrm{>}\mathrm{0}$ , there exists a sequence of matrices ${B}_{\mathrm{1}}\mathit{}\mathrm{\dots}\mathit{}{B}_{\mathrm{2}\mathit{}n\mathrm{}\mathrm{1}}$ where ${B}_{i}$ is a circulant matrix if $i$ is odd, and a diagonal matrix otherwise, such that $$.
Unfortunately, this theorem is of little use to understand the expressive power of diagonalcirculant matrices when they are used in deep neural networks. This is because: 1) the bound does not depend on the matrix $A$, 2) the theorem does not provide any insights regarding the expressive power of $m$ diagonalcirculant factors when $m$ is much lower than $2n1$ as it is the case in most practical scenarios we consider in this paper.
In the following theorem, we enhance the result by Huhtanen & Perämäki by expressing the number of factors required to approximate $A$, as a function of the rank of $A$. This is useful when one deals with lowrank matrices, which is common in machine learning problems.
Theorem 2.
(Rankbased circulant decomposition) Let $A\mathrm{\in}{\mathrm{C}}^{n\mathrm{\times}n}$ be a matrix of rank at most $k$. Assume that $n$ can be divided by $k$. For any $\u03f5\mathrm{>}\mathrm{0}$, there exists a sequence of $\mathrm{4}\mathit{}k\mathrm{+}\mathrm{1}$ matrices ${B}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{B}_{\mathrm{4}\mathit{}k\mathrm{+}\mathrm{1}}\mathrm{,}$ where ${B}_{i}$ is a circulant matrix if $i$ is odd, and a diagonal matrix otherwise, such that $$
A direct consequence of Theorem 2, is that if the number of diagonalcirculant factors is set to a value $K$, we can only approximate a matrix $A$ whose rank is at most $\frac{K1}{4}$. As we will show, this result will be useful to analyze neural networks based on diagonal and circulant matrices.
4 Theoretical analysis of Diagonal Circulant Neural Networks (DCNNs)
4.1 The expressive power of DCNNs with bounded width and small depth
Zhao et al. (2017) have shown that circulant networks with 2 layers and unbounded width are universal approximators. However, results on unbounded networks offer weak guarantees and two important questions have remained open until now: 1) Can we approximate any function with a boundedwidth circulant networks? 2) What function can we approximate with a circulant network that has a bounded width and a small depth? We answer these two questions in this section.
First, we introduce some necessary definitions regarding neural networks and we provide a theoretical analysis of their approximation capabilities.
Definition 1 (Deep ReLU network).
Given $L$ weight matrices $W\mathrm{=}\mathrm{(}{W}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{W}_{L}\mathrm{)}$ with ${W}_{i}\mathrm{\in}{\mathrm{C}}^{n\mathrm{\times}n}$ and $L$ bias vectors $b\mathrm{=}\mathrm{(}{b}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{b}_{L}\mathrm{)}$ with ${b}_{i}\mathrm{\in}{\mathrm{C}}^{n}$, a deep ReLU network is a function ${f}_{{W}_{L}\mathrm{,}{b}_{L}}\mathrm{:}{\mathrm{C}}^{n}\mathrm{\to}{\mathrm{C}}^{n}$ such that ${f}_{W\mathrm{,}b}\mathit{}\mathrm{(}x\mathrm{)}\mathrm{=}\mathrm{(}{f}_{{W}_{L}\mathrm{,}{b}_{L}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{{W}_{\mathrm{1}}\mathrm{,}{b}_{\mathrm{1}}}\mathrm{)}\mathit{}\mathrm{(}x\mathrm{)}$ where ${f}_{{W}_{i}\mathrm{,}{b}_{i}}\mathit{}\mathrm{(}x\mathrm{)}\mathrm{=}\varphi \mathit{}\mathrm{(}{W}_{i}\mathit{}x\mathrm{+}{b}_{i}\mathrm{)}$ and $\varphi \mathrm{(}\mathrm{.}\mathrm{)}$ is a ReLU nonlinearity ^{1}^{1} 1 Because our networks deal with complex numbers, we use an extension of the ReLU function to the complex domain. The most straightforward extension defined in (Trabelsi et al., 2018) is as follows: $\mathrm{ReLU}\mathit{}\mathrm{(}z\mathrm{)}\mathrm{=}\mathrm{ReLU}\mathit{}\mathrm{\left(}\mathrm{R}\mathit{}\mathrm{(}z\mathrm{)}\mathrm{\right)}\mathrm{+}i\mathit{}\mathrm{ReLU}\mathit{}\mathrm{\left(}\mathrm{I}\mathit{}\mathrm{(}z\mathrm{)}\mathrm{\right)}$, where $\mathrm{R}$ and $\mathrm{I}$ refer to the real and imaginary parts of $z$. In the rest of this paper, we call $L$ and $n$ respectively the depth and the width of the network. Moreover, we call total rank $k$, the sum of the ranks of the matrices ${W}_{\mathrm{1}}\mathit{}\mathrm{\dots}\mathit{}{W}_{L}$. i.e. $k\mathrm{=}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{L}r\mathit{}a\mathit{}n\mathit{}k\mathit{}\mathrm{(}{W}_{i}\mathrm{)}$.
We also need to introduce DCNNs, similarly to Moczulski et al. (2015).
Definition 2 (Diagonal Circulant Neural Networks).
Given $L$ diagonal matrices $D\mathrm{=}\mathrm{(}{D}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{D}_{L}\mathrm{)}$ with ${D}_{i}\mathrm{\in}{\mathrm{C}}^{n\mathrm{\times}n}$, $L$ circulant matrices $C\mathrm{=}\mathrm{(}{C}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{C}_{L}\mathrm{)}$ with ${C}_{i}\mathrm{\in}{\mathrm{C}}^{n\mathrm{\times}n}$ and $L$ bias vectors $b\mathrm{=}\mathrm{(}{b}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{b}_{L}\mathrm{)}$ with ${b}_{i}\mathrm{\in}{\mathrm{C}}^{n}$, a Diagonal Circulant Neural Networks (DCNN) is a function ${f}_{{W}_{L}\mathrm{,}{b}_{L}}\mathrm{:}{\mathrm{C}}^{n}\mathrm{\to}{\mathrm{C}}^{n}$ such that ${f}_{D\mathrm{,}C\mathrm{,}b}\mathit{}\mathrm{(}x\mathrm{)}\mathrm{=}\mathrm{(}{f}_{{D}_{L}\mathrm{,}{C}_{L}\mathrm{,}{b}_{L}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{{D}_{\mathrm{1}}\mathrm{,}{C}_{\mathrm{1}}\mathrm{,}{b}_{\mathrm{1}}}\mathrm{)}\mathit{}\mathrm{(}x\mathrm{)}$ where ${f}_{{D}_{i}\mathrm{,}{C}_{i}\mathrm{,}{b}_{i}}\mathit{}\mathrm{(}x\mathrm{)}\mathrm{=}{\varphi}_{i}\mathit{}\mathrm{(}{D}_{i}\mathit{}{C}_{i}\mathit{}x\mathrm{+}{b}_{i}\mathrm{)}$ and where ${\varphi}_{i}\mathrm{(}\mathrm{.}\mathrm{)}$ is a ReLU nonlinearity or the identity function.
We can now show that boundedwidth DCNNs can approximate any Deep ReLU Network, and as a corollary, that they are universal approximators.
Lemma 1.
Let $\mathrm{N}$ be a deep ReLU network of width $n$ and depth $L$, and let $\mathrm{X}\mathrm{\subset}{\mathrm{C}}^{n}$ be a bounded set. For any $\u03f5\mathrm{>}\mathrm{0}$, there exists a DCNN ${\mathrm{N}}^{\mathrm{\prime}}$ of width $n$ and of depth $\mathrm{(}\mathrm{2}\mathit{}n\mathrm{}\mathrm{1}\mathrm{)}\mathit{}L$ such that $$ for all $x\mathrm{\in}\mathrm{X}$.
The proof is in the supplemental material. We can now state the universal approximation corollary:
Corollary 1.
Bounded width DCNNs are universal approximators in the following sense: for any continuous function $f\mathrm{:}{\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{]}}^{n}\mathrm{\to}{\mathrm{R}}_{\mathrm{+}}$ of bounded supremum norm, for any $\u03f5\mathrm{>}\mathrm{0}$, there exists a DCNN ${\mathrm{N}}_{\u03f5}$ of width $n\mathrm{+}\mathrm{3}$ such that $\mathrm{\forall}x\mathrm{\in}{\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{]}}^{n\mathrm{+}\mathrm{3}}$, $$, where ${\mathrm{(}\mathrm{\cdot}\mathrm{)}}_{i}$ represents the ${i}^{t\mathit{}h}$ component of a vector.
This is a first result, however $(2n+5)L$ is not a small depth (in our experiments, $n$ can be over 300 000), and a number of work provided empirical evidences that DCNN with small depth can offer good performances (e.g. Araujo et al. (2018); Cheng et al. (2015)). To improve our result, we introduce our main theorem which studies the approximation properties of these small depth networks.
Theorem 3.
(Main result: Rankbased expressive power of DCNNs) Let $\mathrm{N}$ be a deep ReLU network of width $n$, depth $L$ and a total rank $k$ and assume $n$ is a power of $\mathrm{2}$. Let $\mathrm{X}\mathrm{\subset}{\mathrm{C}}^{n}$ be a bounded set. Then, for any $\u03f5\mathrm{>}\mathrm{0}$, there exists a DCNN with ReLU activation ${\mathrm{N}}^{\mathrm{\prime}}$ of width $n$ such that $$ for all $x\mathrm{\in}\mathrm{X}$ and the depth of ${\mathrm{N}}^{\mathrm{\prime}}$ is bounded by $\mathrm{9}\mathit{}k$.
Remark that in the theorem, we require that $n$ is a power of $2$. We conjecture that the result still holds even without this condition.
This result refines Lemma 1, and answer our second question: a DCNN of bounded width and small depth can approximate a Deep ReLU network of low total rank. Note that the converse is not true: because $n$by$n$ circulant matrix can be of rank $n$, approximating a DCNN of depth $1$ can require a deep ReLU network of total rank equals to $n$.
For the sake of clarity, we highlight the significance of these results in the following property:
Property 1.
Let ${\mathrm{R}}_{k\mathrm{,}n}$ be the set of all functions $f\mathrm{:}{\mathrm{R}}^{n}\mathrm{\to}{\mathrm{R}}^{n}$ representable by deep ReLU networks of total rank at most $k$ and let ${\mathrm{C}}_{l\mathrm{,}n}$ the set of all functions $f\mathrm{:}{\mathrm{R}}^{n}\mathrm{\to}{\mathrm{R}}^{n}$ representable by deep diagonalcirculant networks of depth at most $l$, then:
$\forall k,\exists l,\forall n$  ${\mathcal{R}}_{k,n}\u228a{\mathcal{C}}_{l,n}$  
$\forall l,\mathrm{\nexists}k,\forall n$  ${\mathcal{C}}_{l,n}\subseteq {\mathcal{R}}_{k,n}$ 
We illustrate the meaning of this Property 1 using Figure 4.1. As we can see, the set ${\mathcal{R}}_{k,n}$ of all the functions representable by a deep ReLU network of total rank $k$ is strictly included in the set ${\mathcal{C}}_{9k}$ of all circulant networks of depth $9k$ (as by Theorem 3).
Finally, what if we choose to use small depth networks to approximate deep ReLU networks where matrices are not of low rank? To answer this question, we first need to show the negative impact of replacing matrices by their low rank approximators in neural networks:
Proposition 4.
Let $\mathrm{N}\mathrm{=}{f}_{{A}_{L}\mathrm{,}{b}_{L}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{{A}_{\mathrm{1}}\mathrm{,}{b}_{\mathrm{1}}}$ be a Deep ReLU network, where ${A}_{i}\mathrm{\in}{\mathrm{C}}^{n\mathrm{\times}n}\mathrm{,}{b}_{i}\mathrm{\in}{\mathrm{C}}^{n}$ for all $i\mathrm{\in}\mathrm{[}L\mathrm{]}$. Let ${\stackrel{\mathrm{~}}{A}}_{i}$ be the matrix obtained by an SVD approximation of rank $k$ of matrix ${A}_{i}$. Let ${\sigma}_{i\mathrm{,}j}$ be the ${j}^{t\mathit{}h}$ singular value of ${A}_{i}$. Define $\stackrel{\mathrm{~}}{\mathrm{N}}\mathrm{=}{f}_{\stackrel{\mathrm{~}}{{A}_{L}}\mathrm{,}{b}_{L}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{\stackrel{\mathrm{~}}{{A}_{\mathrm{1}}}\mathrm{,}{b}_{\mathrm{1}}}$. Then, for any $x\mathrm{\in}{\mathrm{C}}^{n}$, we have:
$$\parallel \mathcal{N}\left(x\right)\stackrel{~}{\mathcal{N}}\left(x\right)\parallel \le \frac{\left({\sigma}_{max,1}^{L}1\right)R{\sigma}_{max,k}}{{\sigma}_{max,1}1}$$ 
where $R$ is an upper bound on norm of the output of any layer in $\mathrm{N}$, and ${\sigma}_{m\mathit{}a\mathit{}x\mathrm{,}j}\mathrm{=}{\mathrm{max}}_{i}\mathit{}{\sigma}_{i\mathrm{,}j}$.
Proposition 4 shows that we can approximate matrices in a neural network by low rank matrices, and control the approximation error. In general, the term ${\sigma}_{max,1}^{L}$ could seem large, but in practice, it is likely that most singular values in deep neural network are small in order to avoid divergent behaviors. We can now prove the result on DCNNs:
Corollary 2.
Consider any deep ReLU network $\mathrm{N}\mathrm{=}{f}_{{A}_{L}\mathrm{,}{b}_{L}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{{A}_{\mathrm{1}}\mathrm{,}{b}_{\mathrm{1}}}$ of depth $L$ and width $n$. Let ${\sigma}_{m\mathit{}a\mathit{}x\mathrm{,}j}\mathrm{=}{\mathrm{max}}_{i}\mathit{}{\sigma}_{i\mathrm{,}j}$ where ${\sigma}_{i\mathrm{,}j}$ is the ${j}^{t\mathit{}h}$ singular value of ${A}_{i}$. Let $\mathrm{X}\mathrm{\subset}{\mathrm{R}}^{n}$ be a bounded set. Let $k$ be an integer dividing $n$. There exists a DCNN ${\mathrm{N}}^{\mathrm{\prime}}\mathrm{=}{f}_{{D}_{m}\mathit{}{C}_{m}\mathrm{,}{b}_{m}^{\mathrm{\prime}}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{{D}_{\mathrm{1}}\mathit{}{C}_{\mathrm{1}}\mathrm{,}{b}_{\mathrm{1}}^{\mathrm{\prime}}}$ of width $n$ and of depth $m\mathrm{=}L\mathit{}\mathrm{(}\mathrm{4}\mathit{}k\mathrm{+}\mathrm{1}\mathrm{)}$, such that for any $x\mathrm{\in}\mathrm{X}$:
$$ 
where $R$ is an upper bound on the norm of the outputs of each layer in $\mathrm{N}$.
4.2 Initialization of deep circulant layers
Training DCNNs has revealed to be a challenging problem. To facilitate training and convergence, we propose the following initialization procedure which is a variant of Xavier initialization. First, for each circulant matrix $C=circ({c}_{1}\mathrm{\dots}{c}_{n})$, each ${c}_{i}$ is randomly drawn from $\mathcal{N}(0,{\sigma}^{2})$, with $\sigma =\sqrt{\frac{2}{n}}$. Next, for each diagonal matrix $D=diag({d}_{1}\mathrm{\dots}{d}_{n})$, each ${d}_{i}$ is drawn randomly and uniformly from $\{1,1\}$ for all $i$. Finally, all biases in the network are randomly drawn from $\mathcal{N}(0,{\sigma}^{\prime 2})$, for some small value of ${\sigma}^{\prime}$.
The following proposition states that the covariance matrix at the output of any layer in a DCNN is constant.
Proposition 5.
Let $\mathrm{N}$ be a DCNN of depth $L$ initialized according to our procedure, with ${\sigma}^{\mathrm{\prime}}\mathrm{=}\mathrm{0}$. Assume that all layers $\mathrm{1}$ to $L\mathrm{}\mathrm{1}$ have ReLU activation functions, and that the last layer has the identity activation function. Then, for any $x\mathrm{\in}{\mathrm{R}}^{n}$, the covariance matrix of $\mathrm{N}\mathit{}\mathrm{(}x\mathrm{)}$ is $\frac{\mathrm{2}\mathrm{.}I\mathit{}d}{n}\mathit{}{\mathrm{\parallel}x\mathrm{\parallel}}_{\mathrm{2}}^{\mathrm{2}}$. Moreover, note that this covariance does not depend on the depth of the network.
With this initialization, the covariance matrix at the output of any layer does not depend on the depth of the network. This ensures that the signal is propagated accross the network without vanishing nor exploding.
5 Empirical evaluation
This experimental section aims at answering the following questions: (Q1) How to train DCNNs and tune their hyper parameters? (Q2) How do DCNNs compare to other approaches such as ACDC, LDR or other structured approaches? (Q3) How do DCNNs performs in the context of large scale realworld machine learning applications? All the details of our experiments are provided as supplemental material. We conducted experiments using architectures with real and complex weights. Since our results were consistently better using real valued matrices, we report our results on real values in this section. We provide some results with complex valued matrices in the supplemental material.
5.1 How to train very deep DCNNs (Q1)
We empirically found that reducing the number of nonlinearities in the networks simplifies the training of deep neural networks. To support this claim, we conduct a series of experiments on various DCNNs with a varying number of ReLU activations (to reduce the number of nonlinearities, we replace some ReLU activations with the identity function). In a second experiment, we replace the ReLU activations with LeakyReLU activations and vary the slope of the Leaky ReLU (a higher slope means an activation function that is closer to a linear function). The results of this experiment are presented in Figure 5.1 and 5.1. In 5.1, “ReLU(DC)” means that we interleave on ReLU activation functions between every diagonalcirculant matrix, whereas ReLU(DCDC) means we interleave a ReLU activation every other block etc. In both 5.1, Figure 5.1, we observe that reducing the nonlinearity of the networks can be used to train deeper networks. This is an interesting result, since we can use this technique to adjust the number of parameters in the network, without facing training difficulties. We obtain a maximum accuracy of 0.56 with one ReLU every three layers and leakyReLUs with a slope of 0.5. We hence rely on this setting in the rest of this experimental section.
5.2 Comparison with other structured approaches (Q2)
Comparison with Dense networks, Toeplitz networks and Low Rank networks. We now compare DCNNs with other stateoftheart structured networks by measuring the accuracy on CIFAR10. Our baseline is a dense feedforward network with a fixed number of weights (9 million weights). We compare with DCNNs and with DTNNs (see below), Toeplitz networks, and LowRank networks (Yu et al., 2017). We first consider Toeplitz networks which are stacked Toeplitz matrices interleaved with ReLU activations since Toeplitz matrices are closely related to circulant matrices. Since Toeplitz networks have a different structure (they do not include diagonal matrices), we also experiment using DTNNs, a variant of DCNNs where all the circulant matrices have been replaced by Toeplitz matrices. Finally we conduct experiments using networks based on lowrank matrices as they are also closely related to our work. For each approach, we report the accuracy of several networks with a varying depth ranging from 1 to 40 (DCNNs, Toeplitz networks) and from 1 to 30 (from DTNNs). For lowrank networks, we used a fixed depth network and increased the rank of each matrix from 7 to 40. We also tried to increase the depth of low rank matrices, but we found that deep lowrank networks are difficult to train so we do not report the results here. We compare all the networks based on the number of weights from 21K (0.2% of the dense network) to 370K weights (4% of the dense network) and we report the results in Figure 5.1. First we can see that the size of the networks correlates positively with their accuracy which demonstrate successful training in all cases. We can also see that the DCNNs achieves the maximum accuracy of 56% with 20 layers ($\sim $ 200K weights) which as as good as the dense networks with only 2% of the number of weights. Other approaches also offer good performance but they are not able to reach the accuracy of a dense network.
Comparison with ACDC (Moczulski et al., 2015). In Section 2, we have discussed the differences between the ACDC framework and our approach from a theoretical perspective. In this section, we conduct experiments to compare the performance of DCNNs with neural networks based on ACDC layers. We first reproduce the experimental setting from Moczulski et al. (2015), and compare both approaches using only linear networks (i.e. networks without any ReLU activations). The results are presented in Figure 5.2. On this simple setting, both architectures demonstrate good performance, however, DCNNs offer better convergence rate. In Figure 5.2, we compare neural networks with ReLU activations on CIFAR10. We found that networks which are based only on ACDC layers are difficult to train and offer poor accuracy on CIFAR. (We have tried different initialization schemes including the one from the original paper, and the one we propose in this paper.) In contrast, deep DCNNs can be trained and offer good performance without additional dense layers (these results are in line with our experiments on the YouTube8M dataset). We can conclude that DCNNs are able to model complex relations at a low cost.
Method  #Parameters  Accuracy 

Dense  9 470 986  0.562 
DCNN $\mathrm{(}\mathrm{5}\mathit{}\mathrm{l}\mathit{}\mathrm{a}\mathit{}\mathrm{y}\mathit{}\mathrm{e}\mathit{}\mathrm{r}\mathit{}\mathrm{s}\mathrm{)}$  49 172  0.543 
DCNN $\mathrm{(}\mathrm{2}\mathit{}\mathrm{l}\mathit{}\mathrm{a}\mathit{}\mathrm{y}\mathit{}\mathrm{e}\mathit{}\mathrm{r}\mathit{}\mathrm{s}\mathrm{)}$  21 524  0.536 
LDRTD $(r=2)$  64 522  0.511 
LDR – Toeplitzlike $(r=3)$  52 234  0.496 
LDR – Hankellike $(r=3)$  52 234  0.483 
Comparison with LDR networks (Thomas et al., 2018). We now compare DCNNs with the LDR framework using the network configuration experimented in the original paper: a single LDR structured layer followed by a dense layer. In the LDR framework, we can change the size of a network by adjusting the rank of the residual matrix, effectively capturing matrices with a structure that is close to a known structure but not exactly (e.g. in the LDR framework, Toeplitz matrices can be encoded with a residual matrice with rank=2, so a matrix that can be encoded with a residual of rank=3 can be seen as Toeplitzlike.). The results are presented in Table 1 and demonstrate that DCNNs outperforms all LDR networks both in terms in size and accuracy.
5.3 DCNNs for largescale video classification on the YouTube8M dataset (Q3)
To understand the performance of deep DCNNs on large scale applications, we conducted experiments on the YouTube8M video classification with 3.8 training examples introduced by (AbuElHaija et al., 2016b). Notice that we favour this experiment over ImageNet applications because modern image classification architectures involve a large number of convolutional layers, and compressing convolutional layers is out of our scope. Also, as mentioned earlier, testing the performance of DCNN architectures mixed with a large number of expressive layers makes little sense.
The YouTube8M includes two datasets describing 8 million labeled videos. Both datasets contain audio and video features for each video. In the first dataset (aggregated) all audio and video features have been aggregated every 300 frames. The second dataset (full) contains the descriptors for all the frames. To compare the models we use the GAP metric (Global Average Precision) proposed by (AbuElHaija et al., 2016b). On the simpler (aggregated) dataset we compared offtheshelf DCNNs with a dense baseline with 5.7M weights. On the full dataset, we designed three new compact architectures based on the stateoftheart architecture introduced by (AbuElHaija et al., 2016b). The original architecture is based on 3 blocks, a Deep BagofFrame embedding block (DBoF), a dimensionalilty reduction block (FC), and a Mixture of Experts block (MoE). We create three compact architectures by replacing one block by a more compact DCNN block in each architecture. Finally, we compared the performance of each compact architecture to the original one.
Table 2 shows the results of our experiments on the YouTube8M dataset in terms of number of weights, compression rate and GAP. The compact fully connected layer achieves a compression rate of 9.5 while having a close performance. On the full dataset, replacing the DBoF block reduces the size of the network without impacting the accuracy. We obtain the best compression ratio by replacing the MoE block with DCNNs (26%) of the size of the original dataset with a GAP score of 0.805 (95% of the score obtained with the original architecture). We conclude that DCNN are both theoretically sound and of practical interest in real, large scale applications.
Dataset  Architecture  #Weights  Compress.  [email protected] 
(% of baseline size)  
Aggregated  Dense  5.7M    0.773 
4 DC layers  25 410  0.44  0.599  
32 DC layers  122 178  2.11  0.685  
4 DC layers + 1 dense  4.46M  77  0.747  
Full  Original architecture  45M    0.846 
DBoF block replaced with DCNN  36M  80  0.838  
FC block replaced with DCNN  41M  91  0.845  
MoE block replaced with DCNN  12M  26  0.805 
6 Conclusion
This paper deals with the training of diagonal circulant neural networks. To the best of our knowledge, training such networks with a large number of layers had not been done before. We also endowed this kind of models with theoretical guarantees, hence enriching and refining previous theoretical work from the literature. More importantly, we showed that DCNNs outperform their competing structured alternatives, including the very recent general approach based on LDR networks. Our results suggest that stacking diagonal circulant layers with non linearities improves the convergence rate and the final accuracy of the network. Formally proving these statements constitutes the future directions of this work. As future work, we would like to generalize the good results of DCNNs to convolutions neural networks. We also believe that circulant matrices deserve a particular attention in deep learning because of their strong ties with convolutions: a circulant matrix operator is equivalent to the convolution operator with circular paddings (as shown in [5]). This fact makes any contribution to the area of circulant matrices particularly relevant to the field of deep learning with impacts beyond the problem of designing compact models. As future work, we would like to generalize our results to deep convolutional neural networks.
References
 Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 AbuElHaija et al. (2016a) AbuElHaija, S., Kothari, N., Lee, J., Natsev, A. P., Toderici, G., Varadarajan, B., and Vijayanarasimhan, S. Youtube8m: A largescale video classification benchmark. In arXiv:1609.08675, 2016a.
 AbuElHaija et al. (2016b) AbuElHaija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., and Vijayanarasimhan, S. Youtube8m: A largescale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016b.
 Araujo et al. (2018) Araujo, A., Negrevergne, B., Chevaleyre, Y., and Atif, J. Training compact deep learning models for video classification using circulant matrices. In The 2nd Workshop on YouTube8M LargeScale Video Understanding at ECCV 2018, 2018.
 Chen et al. (2015) Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., and Chen, Y. Compressing neural networks with the hashing trick. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning  Volume 37, ICML’15, pp. 2285–2294. JMLR.org, 2015.
 Cheng et al. (2015) Cheng, Y., Yu, F. X., Feris, R. S., Kumar, S., Choudhary, A., and Chang, S. F. An exploration of parameter redundancy in deep networks with circulant projections. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2857–2865, Dec 2015.
 Denil et al. (2013) Denil, M., Shakibi, B., Dinh, L., Ranzato, M. A., and de Freitas, N. Predicting parameters in deep learning. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 2148–2156. Curran Associates, Inc., 2013.
 Hanin (2017) Hanin, B. Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations. ArXiv eprints, August 2017.
 Hinrichs & Vybíral (2011) Hinrichs, A. and Vybíral, J. Johnsonlindenstrauss lemma for circulant matrices. Random Structures & Algorithms, 39(3):391–398, 2011.
 Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
 Huhtanen & Perämäki (2015) Huhtanen, M. and Perämäki, A. Factoring matrices into the product of circulant and diagonal matrices. Journal of Fourier Analysis and Applications, 21(5):1018–1033, Oct 2015. ISSN 15315851. doi: 10.1007/s0004101593950.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.
 Miech et al. (2017) Miech, A., Laptev, I., and Sivic, J. Learnable pooling with context gating for video classification. CoRR, abs/1706.06905, 2017.
 Moczulski et al. (2015) Moczulski, M., Denil, M., Appleyard, J., and de Freitas, N. Acdc: A structured efficient linear layer. arXiv preprint arXiv:1511.05946, 2015.
 MüllerQuade et al. (1998) MüllerQuade, J., Aagedal, H., Beth, T., and Schmid, M. Algorithmic design of diffractive optical systems for information processing. Physica D: Nonlinear Phenomena, 120(12):196–205, 1998.
 Novikov et al. (2015) Novikov, A., Podoprikhin, D., Osokin, A., and Vetrov, D. P. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450, 2015.
 Sanchez et al. (1995) Sanchez, V., Garcia, P., Peinado, A. M., Segura, J. C., and Rubio, A. J. Diagonalizing properties of the discrete cosine transforms. IEEE transactions on Signal Processing, 43(11):2631–2641, 1995.
 Schmid et al. (2000) Schmid, M., Steinwandt, R., MüllerQuade, J., Rötteler, M., and Beth, T. Decomposing a matrix into circulant and diagonal factors. Linear Algebra and its Applications, 306(13):131–143, 2000.
 Thomas et al. (2018) Thomas, A., Gu, A., Dao, T., Rudra, A., and Ré, C. Learning compressed transforms with low displacement rank. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., CesaBianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9066–9078. Curran Associates, Inc., 2018.
 Trabelsi et al. (2018) Trabelsi, C., Bilaniuk, O., Zhang, Y., Serdyuk, D., Subramanian, S., Santos, J. F., Mehri, S., Rostamzadeh, N., Bengio, Y., and Pal, C. J. Deep complex networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30  May 3, 2018, Conference Track Proceedings, 2018.
 Yang et al. (2015) Yang, Z., Moczulski, M., Denil, M., d. Freitas, N., Smola, A., Song, L., and Wang, Z. Deep fried convnets. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1476–1483, Dec 2015. doi: 10.1109/ICCV.2015.173.
 Yu et al. (2017) Yu, X., Liu, T., Wang, X., and Tao, D. On compressing deep models by low rank and sparse decomposition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 67–76, July 2017. doi: 10.1109/CVPR.2017.15.
 Zhao et al. (2017) Zhao, L., Liao, S., Wang, Y., Li, Z., Tang, J., and Yuan, B. Theoretical properties for neural networks with weight matrices of low displacement rank. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 4082–4090, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
0.5cm0.5cm
Supplemental Material
1 Notations
We note $\Re (z)$ and $\Im (z)$ the real and imaginary parts the complex number $z$. We note ${(\cdot )}_{t}$ is the ${t}^{th}$ component of a vector. Let $\mathbf{i}$ be the imaginary number defined by ${\mathbf{i}}^{2}=1$. Define ${\mathrm{\U0001d7cf}}_{n}$ as the nvector of ones. Also, we note $[n]=\{1,\mathrm{\dots},n\}$. The rectified linear unit on the complex domain is defined by $ReLU(z)=\mathrm{max}(0,\Re (z))+\mathbf{i}\mathrm{max}(0,\Im (z))$. The notation $\cdot $ refers to the complex modulus. Finally, define the cyclic shift matrix $S\in {\mathbb{R}}^{n\times n}$ as follows:
$$S=\left[\begin{array}{ccccc}\hfill 0\hfill & & & & \hfill 1\hfill \\ \hfill 1\hfill & \hfill 0\hfill & & & \\ & \hfill 1\hfill & \hfill \mathrm{\ddots}\hfill & & \\ & & \hfill \mathrm{\ddots}\hfill & \hfill 0\hfill & \\ & & & \hfill 1\hfill & \hfill 0\hfill \end{array}\right]$$ 
2 Proofs of Section 3
Theorem 1.
(Reformulation Huhtanen & Perämäki (2015)) For any given matrix $A\mathrm{\in}{\mathrm{C}}^{n\mathrm{\times}n}$, for any $\u03f5\mathrm{>}\mathrm{0}$, there exists a sequence of matrices ${B}_{\mathrm{1}}\mathit{}\mathrm{\dots}\mathit{}{B}_{\mathrm{2}\mathit{}n\mathrm{}\mathrm{1}}$ where ${B}_{i}$ is a circulant matrix if $i$ is odd, and a diagonal matrix otherwise, such that $$. Moreover, if $A$ can be decomposed as $A\mathrm{=}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{k}{D}_{i}\mathit{}{S}^{i\mathrm{}\mathrm{1}}$ where $S$ is the cyclicshift matrix and ${D}_{\mathrm{1}}\mathit{}\mathrm{\dots}\mathit{}{D}_{k}$ are diagonal matrices, then $A$ can be written as a product ${B}_{\mathrm{1}}\mathit{}{B}_{\mathrm{2}}\mathit{}\mathrm{\dots}\mathit{}{B}_{\mathrm{2}\mathit{}k\mathrm{}\mathrm{1}}$ where ${B}_{i}$ is a circulant matrix if $i$ is odd, and a diagonal matrix otherwise.
Theorem 2.
(Rankbased circulant decomposition) Let $A\mathrm{\in}{\mathrm{C}}^{n\mathrm{\times}n}$ be a matrix of rank at most $k$. Assume that $n$ can be divided by $k$. For any $\u03f5\mathrm{>}\mathrm{0}$, there exists a sequence of $\mathrm{4}\mathit{}k\mathrm{+}\mathrm{1}$ matrices ${B}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{B}_{\mathrm{4}\mathit{}k\mathrm{+}\mathrm{1}}\mathrm{,}$ where ${B}_{i}$ is a circulant matrix if $i$ is odd, and a diagonal matrix otherwise, such that $$
Proof.
(Theorem 2) Let $U\mathrm{\Sigma}{V}^{T}$ be the SVD decomposition of $M$ where $U,V$ and $\mathrm{\Sigma}$ are $n\times n$ matrices. Because $M$ is of rank $k$, the last $nk$ columns of $U$ and $V$ are null. In the following, we will first decompose $U$ into a product of matrices $WRO$, where $R$ and $O$ are respectively circulant and diagonal matrices, and $W$ is a matrix which will be further decomposed into a product of diagonal and circulant matrices. Then, we will apply the same decomposition technique to $V$. Ultimately, we will get a product of $4k+2$ matrices alternatively diagonal and circulant.
Let $R=circ({r}_{1}\mathrm{\dots}{r}_{n})$. Let $O$ be a $n\times n$ diagonal matrix where ${O}_{i,i}=1$ if $i\le k$ and $0$ otherwise. The $k$ first columns of the product $RO$ will be equal to that of $R$, and the $nk$ last colomns of $RO$ will be zeros. For example, if $k=2$, we have:
$$RO=\left(\begin{array}{ccccc}\hfill {r}_{1}\hfill & \hfill {r}_{n}\hfill & \hfill 0\hfill & \hfill \mathrm{\cdots}\hfill & \hfill 0\hfill \\ \hfill {r}_{2}\hfill & \hfill {r}_{1}\hfill & & & \\ \hfill {r}_{3}\hfill & \hfill {r}_{2}\hfill & \hfill \mathrm{\vdots}\hfill & & \hfill \mathrm{\vdots}\hfill \\ \hfill \mathrm{\vdots}\hfill & \hfill \mathrm{\vdots}\hfill & & & \\ \hfill {r}_{n}\hfill & \hfill {r}_{n1}\hfill & \hfill 0\hfill & \hfill \mathrm{\cdots}\hfill & \hfill 0\hfill \end{array}\right)$$ 
Let us define $k$ diagonal matrices ${D}_{i}=diag({d}_{i1}\mathrm{\dots}{d}_{in})$ for $i\in [k]$. For now, the values of ${d}_{ij}$ are unknown, but we will show how to compute them. Let $W={\sum}_{i=1}^{k}{D}_{i}{S}^{i1}$. Note that the $nk$ last columns of the product $WRO$ will be zeros. For example, with $k=2$, we have:
$$W=\left[\begin{array}{ccccc}\hfill {d}_{1,1}\hfill & & & & \hfill {d}_{2,1}\hfill \\ \hfill {d}_{2,2}\hfill & \hfill {d}_{1,2}\hfill & & & \\ & \hfill {d}_{2,3}\hfill & \hfill \mathrm{\ddots}\hfill & & \\ & & \hfill \mathrm{\ddots}\hfill & & \\ & & & \hfill {d}_{2,n}\hfill & \hfill {d}_{1,n}\hfill \end{array}\right]$$ 
$$WRO=\left(\begin{array}{ccccc}\hfill {r}_{1}{d}_{11}+{r}_{n}{d}_{21}\hfill & \hfill {r}_{n}{d}_{11}+{r}_{n1}{d}_{21}\hfill & \hfill 0\hfill & \hfill \mathrm{\cdots}\hfill & \hfill 0\hfill \\ \hfill {r}_{2}{d}_{12}+{r}_{1}{d}_{22}\hfill & \hfill {r}_{1}{d}_{12}+{r}_{n}{d}_{22}\hfill & & & \\ & & \hfill \mathrm{\vdots}\hfill & & \hfill \mathrm{\vdots}\hfill \\ \hfill \mathrm{\vdots}\hfill & \hfill \mathrm{\vdots}\hfill & & & \\ \hfill {r}_{n}{d}_{1n}+{r}_{n1}{d}_{2n}\hfill & \hfill {r}_{n1}{d}_{1n}+{r}_{n2}{d}_{2n}\hfill & \hfill 0\hfill & \hfill \mathrm{\cdots}\hfill & \hfill 0\hfill \end{array}\right)$$ 
We want to find the values of ${d}_{ij}$ such that $WRO=U$. We can formulate this as linear equation system. In case $k=2$, we get:
$$\left(\begin{array}{cccccccc}\hfill {r}_{n}\hfill & \hfill {r}_{1}\hfill & & & & & & \\ \hfill {r}_{n1}\hfill & \hfill {r}_{n}\hfill & & & & & & \\ & & \hfill {r}_{1}\hfill & \hfill {r}_{2}\hfill & & & & \\ & & \hfill {r}_{n}\hfill & \hfill {r}_{1}\hfill & & & & \\ & & & & \hfill {r}_{2}\hfill & \hfill {r}_{3}\hfill & & \\ & & & & \hfill {r}_{1}\hfill & \hfill {r}_{2}\hfill & & \\ & & & & & & \hfill \mathrm{\ddots}\hfill & \\ & & & & & & & \hfill \mathrm{\ddots}\hfill \end{array}\right)\times \left(\begin{array}{c}\hfill {d}_{2,1}\hfill \\ \hfill {d}_{1,1}\hfill \\ \hfill {d}_{2,2}\hfill \\ \hfill {d}_{1,2}\hfill \\ \hfill {d}_{2,3}\hfill \\ \hfill {d}_{1,3}\hfill \\ \hfill \mathrm{\vdots}\hfill \\ \hfill \mathrm{\vdots}\hfill \end{array}\right)=\left(\begin{array}{c}\hfill {U}_{1,1}\hfill \\ \hfill {U}_{1,2}\hfill \\ \hfill {U}_{2,1}\hfill \\ \hfill {U}_{2,2}\hfill \\ \\ \\ \hfill \mathrm{\vdots}\hfill \\ \end{array}\right)$$ 
The ${i}^{th}$ bloc of the blocdiagonal matrix is a Toeplitz matrix induced by a subsequence of length $k$ of $({r}_{1},\mathrm{\dots}{r}_{n},{r}_{1}\mathrm{\dots}{r}_{n})$. Set ${r}_{j}=1$ for all $j\in \{k,2k,3k,\mathrm{\dots}n\}$ and set ${r}_{j}=0$ for all other values of $j$. Then it is easy to see that each bloc is a permutation of the identity matrix. Thus, all blocs are invertible. This entails that the block diagonal matrix above is also invertible. So by solving this set of linear equations, we find ${d}_{1,1}\mathrm{\dots}{d}_{k,n}$ such that $WRO=U$. We can apply the same idea to factorize $V={W}^{\prime}.R.O$ for some matrix ${W}^{\prime}$. Finally, we get
$$A=U\mathrm{\Sigma}{V}^{T}=WRO\mathrm{\Sigma}{O}^{T}{R}^{T}{W}^{{}^{\prime}T}$$ 
Thanks to Theorem 1, $W$ and ${W}^{\prime}$ can both be factorized in a product of $2k1$ circulant and diagonal matrices. Note that $O\mathrm{\Sigma}{O}^{T}$ is diagonal, because all three are diagonal. Overall, $A$ can be represented with a product of $4k+2$ matrices, alternatively diagonal and circulant. ∎
3 Proofs of Section 4
Lemma 1.
Let ${W}_{L}\mathrm{,}\mathrm{\dots}\mathit{}{W}_{\mathrm{1}}\mathrm{\in}{\mathrm{C}}^{n\mathrm{\times}n}$, $b\mathrm{\in}{\mathrm{C}}^{n}$ and let $\mathrm{X}\mathrm{\subset}{\mathrm{C}}^{n}$ be a bounded set. There exists ${\beta}_{L}\mathit{}\mathrm{\dots}\mathit{}{\beta}_{\mathrm{1}}\mathrm{\in}{\mathrm{C}}^{n}$ such that for all $x\mathrm{\in}\mathrm{X}$ we have ${f}_{{W}_{L}\mathrm{,}{\beta}_{L}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{{W}_{\mathrm{1}}\mathrm{,}{\beta}_{\mathrm{1}}}\mathit{}\mathrm{(}x\mathrm{)}\mathrm{=}R\mathit{}e\mathit{}L\mathit{}U\mathit{}\mathrm{\left(}{W}_{L}\mathit{}{W}_{L\mathrm{}\mathrm{1}}\mathit{}\mathrm{\dots}\mathit{}{W}_{\mathrm{1}}\mathit{}x\mathrm{+}b\mathrm{\right)}$.
Proof.
(Lemma 1) Define $S=\{{\left(\left({\prod}_{k=1}^{j}{W}_{k}\right)x\right)}_{t}:x\in \mathcal{X},t\in [n],j\in [L]\}$. Let $\mathrm{\Omega}=\mathrm{max}\left\{\Re (v):v\in S\right\}+\mathbf{i}\mathrm{max}\left\{\Im (v):v\in S\right\}$. Intuitively, the real and imaginary parts of $\mathrm{\Omega}$ are the largest any activation in the network can have. Define ${h}_{j}(x)={W}_{j}x+{\beta}_{j}$. Let ${\beta}_{1}=\mathrm{\Omega}{\mathrm{\U0001d7cf}}_{n}$. Clearly, for all $x\in \mathcal{X}$ we have ${h}_{1}(x)\ge 0$, so $ReLU\circ {h}_{1}(x)={h}_{1}(x)$. More generally, for all $$ define ${\beta}_{j+1}={\mathrm{\U0001d7cf}}_{n}\mathrm{\Omega}{W}_{j+1}{\beta}_{j}$. It is easy to see that for all $$ we have ${h}_{j}\circ \mathrm{\dots}\circ {h}_{1}(x)={W}_{j}{W}_{j1}\mathrm{\dots}{W}_{1}x+{\mathrm{\U0001d7cf}}_{n}\mathrm{\Omega}$. This guarantees that for all $$, ${h}_{j}\circ \mathrm{\dots}\circ {h}_{1}(x)=ReLU\circ {h}_{j}\circ \mathrm{\dots}\circ ReLU\circ {h}_{1}(x)$. Finally, define ${\beta}_{L}=b{A}_{L}{\beta}_{L1}$. We have, $ReLU\circ {h}_{L}\circ \mathrm{\dots}\circ ReLU\circ {h}_{1}(x)=ReLU\left({W}_{j}\mathrm{\dots}{W}_{1}x+b\right)$. ∎
Lemma 2.
Let $\mathrm{N}$ be a deep ReLU network of width $n$ and depth $L$, and let $\mathrm{X}\mathrm{\subset}{\mathrm{C}}^{n}$ be a bounded set. For any $\u03f5\mathrm{>}\mathrm{0}$, there exists a DCNN ${\mathrm{N}}^{\mathrm{\prime}}$ of width $n$ and of depth $\mathrm{(}\mathrm{2}\mathit{}n\mathrm{}\mathrm{1}\mathrm{)}\mathit{}L$ such that $$ for all $x\mathrm{\in}\mathrm{X}$.
Proof.
(Lemma 1) Assume $\mathcal{N}={f}_{{W}_{L},{b}_{L}}\circ \mathrm{\dots}\circ {f}_{{W}_{1},{b}_{1}}$. By theorem 1, for any ${\u03f5}^{\prime}>0$, any matrix ${W}_{i}$, there exists a sequence of $2n1$ matrices ${C}_{i,n}{D}_{i,n1}{C}_{i,n1}\mathrm{\dots}{D}_{i,1}{C}_{i,1}$ such that $$, where ${D}_{i,1}$ is the identity matrix. By lemma 1, we know that there exists ${\left\{{\beta}_{ij}\right\}}_{i\in [L],j\in [n]}$ such that for all $i\in [L]$, ${f}_{{D}_{in}{C}_{in},{\beta}_{in}}\circ \mathrm{\dots}\circ {f}_{{D}_{i1}{C}_{i1},{\beta}_{i1}}(x)=ReLU\left({D}_{in}{C}_{in}\mathrm{\dots}{C}_{i1}x+{b}_{i}\right)$.
Now if ${\u03f5}^{\prime}$ tends to zero, $\parallel {f}_{{D}_{in}{C}_{in},{\beta}_{in}}\circ \mathrm{\dots}\circ {f}_{{D}_{i1}{C}_{i1},{\beta}_{i1}}ReLU\left({W}_{i}x+{b}_{i}\right)\parallel $ will also tend to zero for any $x\in \mathcal{X}$, because the ReLU function is continuous and $\mathcal{X}$ is bounded. Let ${\mathcal{N}}^{\prime}={f}_{{D}_{1n}{C}_{1n},{\beta}_{1n}}\circ \mathrm{\dots}\circ {f}_{{D}_{i1}{C}_{i1},{\beta}_{i1}}$. Again, because all functions are continuous, for all $x\in \mathcal{X}$, $\parallel \mathcal{N}(x){\mathcal{N}}^{\prime}(x)\parallel $ tends to zero as ${\u03f5}^{\prime}$ tends to zero. ∎
Corollary 1.
Bounded width DCNNs are universal approximators in the following sense: for any continuous function $f\mathrm{:}{\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{]}}^{n}\mathrm{\to}{\mathrm{R}}_{\mathrm{+}}$ of bounded supremum norm, for any $\u03f5\mathrm{>}\mathrm{0}$, there exists a DCNN ${\mathrm{N}}_{\u03f5}$ of width $n\mathrm{+}\mathrm{3}$ such that $\mathrm{\forall}x\mathrm{\in}{\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{]}}^{n\mathrm{+}\mathrm{3}}$, $$, where ${\mathrm{(}\mathrm{\cdot}\mathrm{)}}_{i}$ represents the ${i}^{t\mathit{}h}$ component of a vector.
Proof.
(Corollary 1) It has been shown recently in (Hanin, 2017) that for any continuous function $f:{[0,1]}^{n}\to {\mathbb{R}}_{+}$ of bounded supremum norm, for any $\u03f5>0$, there exists a dense neural network $\mathcal{N}$ with an input layer of width $n$, an output layer of width $1$, hidden layers of width $n+3$ and ReLU activations such that $$. From $\mathcal{N}$, we can easily build a deep ReLU network ${\mathcal{N}}^{\prime}$ of width exactly $n+3$, such that $\forall x\in {[0,1]}^{n+3}$, $$. Thanks to lemma 1, this last network can be approximated arbitrarily well by a DCNN of width $n+3$. ∎
Theorem 3.
(Rankbased expressive power of diagonal circulant neural networks)
Let $\mathrm{N}\mathrm{:}{f}_{{W}_{L}\mathrm{,}{b}_{L}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{{W}_{\mathrm{1}}\mathrm{,}{b}_{\mathrm{1}}}$ be a deep ReLU network of width $n$, depth $L$ and a total rank $k$. Assume $n$ is a power of $\mathrm{2}$. Let $\mathrm{X}\mathrm{\subset}{\mathrm{C}}^{n}$ be a bounded set. For any $\u03f5\mathrm{>}\mathrm{0}$, there exists a DCNN ${\mathrm{N}}^{\mathrm{\prime}}$ of width $n$ such that $$ for all $x\mathrm{\in}\mathrm{X}$. In addition, the depth of ${\mathrm{N}}^{\mathrm{\prime}}$ is bounded by $\mathrm{9}\mathit{}k$. Moreover, if the rank of each matrix ${A}_{i}$ divides $n$, then the depth of ${\mathrm{N}}^{\mathrm{\prime}}$ is bounded by $L\mathrm{+}\mathrm{4}\mathit{}k$.
Proof.
(Theorem 3) Let ${k}_{1}\mathrm{\dots}{k}_{L}$ be the ranks of matrices ${W}_{1}\mathrm{\dots}{W}_{L}$, which are $n$by$n$ matrices. For all $i$, there exists ${k}_{i}^{\prime}\in \{{k}_{i}\mathrm{\dots}2{k}_{i}\}$ such that ${k}_{i}^{\prime}$ is a power of $2$. Due to the fact that $n$ is also a power of $2$, ${k}_{i}^{\prime}$ divides $n$. By theorem 2, for all $i$ each matrix ${W}_{i}$ can be decomposed as an alternating product of diagonalcirculant matrices ${B}_{i,1}\mathrm{\dots}{B}_{i,4{k}_{i}^{\prime}+1}$ such that $$. Using the exact same technique as in lemma 1, we can build a DCNN ${\mathcal{N}}^{\prime}$ using matrices ${B}_{1,1}\mathrm{\dots}{B}_{L,4{k}_{L}^{\prime}+1}$, such that $$ for all $x\in \mathcal{X}$. The total number of layers is ${\sum}_{i}\left(4{k}_{i}^{\prime}+1\right)\le L+8{\sum}_{i}{k}_{i}\le L+8.\text{total rank}\le 9.\text{total rank}$.
∎
Proposition 4.
Let $\mathrm{N}\mathrm{=}{f}_{{W}_{L}\mathrm{,}{b}_{L}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{{W}_{\mathrm{1}}\mathrm{,}{b}_{\mathrm{1}}}$ be a Deep ReLU network, where ${W}_{i}\mathrm{\in}{\mathrm{C}}^{n\mathrm{\times}n}\mathrm{,}{b}_{i}\mathrm{\in}{\mathrm{C}}^{n}$ for all $i\mathrm{\in}\mathrm{[}L\mathrm{]}$. Let ${\stackrel{\mathrm{~}}{W}}_{i}$ be the matrix obtained by an SVD approximation of rank $k$ of matrix ${W}_{i}$. Let ${\sigma}_{i\mathrm{,}j}$ be the ${j}^{t\mathit{}h}$ singular value of ${W}_{i}$. Define $\stackrel{\mathrm{~}}{\mathrm{N}}\mathrm{=}{f}_{\stackrel{\mathrm{~}}{{W}_{L}}\mathrm{,}{b}_{L}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{\stackrel{\mathrm{~}}{{W}_{\mathrm{1}}}\mathrm{,}{b}_{\mathrm{1}}}$. Then, for any $x\mathrm{\in}{\mathrm{C}}^{n}$, we have:
$$\parallel \mathcal{N}\left(x\right)\stackrel{~}{\mathcal{N}}\left(x\right)\parallel \le \frac{\left({\sigma}_{max,1}^{L}1\right)R{\sigma}_{max,k}}{{\sigma}_{max,1}1}$$ 
where $R$ is an upper bound on norm of the output of any layer in $\mathrm{N}$, and ${\sigma}_{m\mathit{}a\mathit{}x\mathrm{,}j}\mathrm{=}{\mathrm{max}}_{i}\mathit{}{\sigma}_{i\mathrm{,}j}$.
Proof.
(Proposition 4) Let ${x}_{0}\in {\u2102}^{n}$ and ${\stackrel{~}{x}}_{0}={x}_{0}$. For all $i\in [L]$, define ${x}_{i}=ReLU\left({W}_{i}{x}_{i1}+b\right)$ and ${\stackrel{~}{x}}_{i}=ReLU\left(\stackrel{~}{{W}_{i}}{\stackrel{~}{x}}_{i1}+b\right)$. By lemma 3, we have
$$\parallel {x}_{i}{\stackrel{~}{x}}_{i}\parallel \le {\sigma}_{i,k+1}\parallel {x}_{i1}\parallel +{\sigma}_{i,1}\parallel {x}_{i1}{\stackrel{~}{x}}_{i1}\parallel $$ 
Observe that for any sequence ${a}_{0},{a}_{1}\mathrm{\dots}$ defined recurrently by ${a}_{0}=0$ and ${a}_{i}=r{a}_{i1}+s$, the recurrence relation can be unfold as follows: ${a}_{i}=\frac{s\left({r}^{i}1\right)}{r1}$. We can apply this formula to bound our error as follows:
$$\parallel {x}_{l}{\stackrel{~}{x}}_{l}\parallel \le \frac{\left({\sigma}_{max,1}^{l}1\right){\sigma}_{max,k}{\mathrm{max}}_{i}\parallel {x}_{i}\parallel}{{\sigma}_{max,1}1}$$ 
∎
Lemma 3.
Let $W\mathrm{\in}{\mathrm{C}}^{n\mathrm{\times}n}$ with singular values ${\sigma}_{\mathrm{1}}\mathit{}\mathrm{\dots}\mathit{}{\sigma}_{n}$, and let $x\mathrm{,}\stackrel{\mathrm{~}}{x}\mathrm{\in}{\mathrm{C}}^{n}$. Let $\stackrel{\mathrm{~}}{W}$ be the matrix obtained by a SVD approximation of rank $k$ of matrix $W$. Then we have:
$$\parallel ReLU\left(Wx+b\right)ReLU\left(\stackrel{~}{W}\stackrel{~}{x}+b\right)\parallel \le {\sigma}_{k+1}\parallel x\parallel +{\sigma}_{1}\parallel \stackrel{~}{x}x\parallel $$ 
Proof.
(Lemma 3) Recall that ${\parallel W\parallel}_{2}={sup}_{z}\frac{{\parallel Wz\parallel}_{2}}{{\parallel z\parallel}_{2}}={\sigma}_{1}={\parallel \stackrel{~}{W}\parallel}_{2}$, because ${\sigma}_{1}$ is the greatest singular value of both $W$ and $\stackrel{~}{W}$. Also, note that ${\parallel W\stackrel{~}{W}\parallel}_{2}={\sigma}_{k+1}$. Let us bound the formula without ReLUs:
$\parallel \left(Wx+b\right)\left(\stackrel{~}{W}\stackrel{~}{x}+b\right)\parallel $  $=\parallel \left(Wx+b\right)\left(\stackrel{~}{W}\stackrel{~}{x}+b\right)\parallel $  
$=\parallel Wx\stackrel{~}{W}x\stackrel{~}{W}\left(\stackrel{~}{x}x\right)\parallel $  
$\le \parallel \left(W\stackrel{~}{W}\right)x\parallel +{\parallel \stackrel{~}{W}\parallel}_{2}\parallel \stackrel{~}{x}x\parallel $  
$\le \parallel x\parallel {\sigma}_{k+1}+{\sigma}_{1}\parallel \stackrel{~}{x}x\parallel $ 
Finally, it is easy to see that for any pair of vectors $a,b\in {\u2102}^{n}$, we have $\parallel ReLU(a)ReLU(b)\parallel \le \parallel ab\parallel $. This concludes the proof. ∎
Corollary 2.
Consider any deep ReLU network $\mathrm{N}\mathrm{=}{f}_{{W}_{L}\mathrm{,}{b}_{L}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{{W}_{\mathrm{1}}\mathrm{,}{b}_{\mathrm{1}}}$ of depth $L$ and width $n$. Let ${\sigma}_{m\mathit{}a\mathit{}x\mathrm{,}j}\mathrm{=}{\mathrm{max}}_{i}\mathit{}{\sigma}_{i\mathrm{,}j}$ where ${\sigma}_{i\mathrm{,}j}$ is the ${j}^{t\mathit{}h}$ singular value of ${W}_{i}$. Let $\mathrm{X}\mathrm{\subset}{\mathrm{C}}^{n}$ be a bounded set. Let $k$ be an integer dividing $n$. There exists a DCNN ${\mathrm{N}}^{\mathrm{\prime}}\mathrm{=}{f}_{{D}_{m}\mathit{}{C}_{m}\mathrm{,}{b}_{m}^{\mathrm{\prime}}}\mathrm{\circ}\mathrm{\dots}\mathrm{\circ}{f}_{{D}_{\mathrm{1}}\mathit{}{C}_{\mathrm{1}}\mathrm{,}{b}_{\mathrm{1}}^{\mathrm{\prime}}}$ of width $n$ and of depth $m\mathrm{=}L\mathit{}\mathrm{(}\mathrm{4}\mathit{}k\mathrm{+}\mathrm{1}\mathrm{)}$, such that for any $x\mathrm{\in}\mathrm{X}$:
$$ 
where $R$ is an upper bound on the norm of the outputs of each layer in $\mathrm{N}$.
Proof.
(Corollary 2) Let $\stackrel{~}{\mathcal{N}}={f}_{\stackrel{~}{{W}_{L}},{b}_{L}}\circ \mathrm{\dots}\circ {f}_{\stackrel{~}{{W}_{1}},{b}_{1}}$, where each ${\stackrel{~}{W}}_{i}$ is the matrix obtained by an SVD approximation of rank $k$ of matrix ${W}_{i}$. With Proposition 4, we have an error bound on $\parallel \mathcal{N}\left(x\right)\stackrel{~}{\mathcal{N}}\left(x\right)\parallel $. Now each matrix ${\stackrel{~}{W}}_{i}$ can be replaced by a product of $k$ diagonalcirculant matrices. By theorem 3, this product yields a DCNN of depth $m=L(4k+1)$, strictly equivalent to $\stackrel{~}{\mathcal{N}}$ on $\mathcal{X}$. The result follows. ∎
4 Proof of Subsection 4.2
Proposition 5.
Let $\mathrm{N}$ be a DCNN of depth $L$ initialized according to our procedure, with ${\sigma}^{\mathrm{\prime}}\mathrm{=}\mathrm{0}$. Assume that all layers $\mathrm{1}$ to $L\mathrm{}\mathrm{1}$ have ReLU activation functions, and that the last layer has the identity activation function. Then, for any $x\mathrm{\in}{\mathrm{R}}^{n}$, the covariance matrix of $\mathrm{N}\mathit{}\mathrm{(}x\mathrm{)}$ is $\frac{\mathrm{2}\mathrm{.}I\mathit{}d}{n}\mathit{}{\mathrm{\parallel}x\mathrm{\parallel}}_{\mathrm{2}}^{\mathrm{2}}$. Moreover, note that this covariance does not depend on the depth of the network.
Proof.
(Proposition 5) Let $\mathcal{N}={f}_{{D}_{L},{C}_{L}}\circ \mathrm{\dots}\circ {f}_{{D}_{1},{C}_{1}}$ be a $L$ layer DCNN. All matrices are initialized as described in the statement of the proposition. Let $y={D}_{1}{C}_{1}x$. Lemma 4 shows that $cov({y}_{i},{y}_{{i}^{\prime}})=0$ for $i\ne {i}^{\prime}$ and $var({y}_{i})=\frac{2}{n}{\parallel x\parallel}_{2}^{2}$. For any $j\le L$, define ${z}^{j}={f}_{{D}_{j},{C}_{j}}\circ \mathrm{\dots}\circ {f}_{{D}_{1},{C}_{1}}(x)$. By a recursive application of lemma 4, we get that then $cov({z}_{i}^{j},{z}_{{i}^{\prime}}^{j})=0$ and $var({z}_{i}^{j})=\frac{2}{n}{\parallel x\parallel}_{2}^{2}$. ∎
Lemma 4.
Let ${c}_{\mathrm{1}}\mathit{}\mathrm{\dots}\mathit{}{c}_{n}\mathrm{,}{d}_{\mathrm{1}}\mathit{}\mathrm{\dots}\mathit{}{d}_{n}\mathrm{,}{b}_{\mathrm{1}}\mathit{}\mathrm{\dots}\mathit{}{b}_{n}$ be random variables in $\mathrm{R}$ such that ${c}_{i}\mathrm{\sim}\mathrm{N}\mathit{}\mathrm{(}\mathrm{0}\mathrm{,}{\sigma}^{\mathrm{2}}\mathrm{)}$, ${b}_{i}\mathrm{\sim}\mathrm{N}\mathit{}\mathrm{(}\mathrm{0}\mathrm{,}{\sigma}^{\mathrm{\prime}\mathit{}\mathrm{2}}\mathrm{)}$ and ${d}_{i}\mathrm{\sim}\mathrm{\{}\mathrm{}\mathrm{1}\mathrm{,}\mathrm{1}\mathrm{\}}$ uniformly. Define $C\mathrm{=}c\mathit{}i\mathit{}r\mathit{}c\mathit{}\mathrm{(}{c}_{\mathrm{1}}\mathit{}\mathrm{\dots}\mathit{}{c}_{n}\mathrm{)}$ and $D\mathrm{=}d\mathit{}i\mathit{}a\mathit{}g\mathit{}\mathrm{(}{d}_{\mathrm{1}}\mathit{}\mathrm{\dots}\mathit{}{d}_{n}\mathrm{)}$. Define $y\mathrm{=}D\mathit{}C\mathit{}u$ and $z\mathrm{=}C\mathit{}D\mathit{}u$ for some vector $u$ in ${\mathrm{R}}^{n}$. Also define $\overline{y}\mathrm{=}y\mathrm{+}b$ and $\overline{z}\mathrm{=}z\mathrm{+}b$. Then, for all $i$, the p.d.f. of ${y}_{i}$, ${\overline{y}}_{i}$, ${z}_{i}$ and ${\overline{z}}_{i}$ are symmetric. Also:

•
Assume ${u}_{1}\mathrm{\dots}{u}_{n}$ is fixed. Then, we have for $i\ne {i}^{\prime}:$
$cov({y}_{i},{y}_{{i}^{\prime}})$ $=cov({z}_{i},{z}_{{i}^{\prime}})=cov({\overline{y}}_{i},{\overline{y}}_{{i}^{\prime}})=cov({\overline{z}}_{i},{\overline{z}}_{{i}^{\prime}})=0$ $var({y}_{i})$ $=var({z}_{i})={\displaystyle \sum _{j}}{u}_{j}^{2}{\sigma}^{2}$ $var({\overline{y}}_{i})$ $=var({\overline{z}}_{i})={\sigma}^{\prime 2}+{\displaystyle \sum _{j}}{u}_{j}^{2}{\sigma}^{2}$ 
•
Let ${x}_{1}\mathrm{\dots}{x}_{n}$ be random variables in $\mathbb{R}$ such that the p.d.f. of ${x}_{i}$ is symmetric for all $i$, and let ${u}_{i}=ReLU({x}_{i})$. We have for $i\ne {i}^{\prime}:$
$cov({y}_{i},{y}_{{i}^{\prime}})$ $=cov({z}_{i},{z}_{{i}^{\prime}})=cov({\overline{y}}_{i},{\overline{y}}_{{i}^{\prime}})=cov({\overline{z}}_{i},{\overline{z}}_{{i}^{\prime}})=0$ $var({y}_{i})$ $=var({z}_{i})={\displaystyle \frac{1}{2}}{\displaystyle \sum _{j}}var({x}_{i}).{\sigma}^{2}$ $var({\overline{y}}_{i})$ $=var({\overline{z}}_{i})={\sigma}^{\prime 2}+{\displaystyle \frac{1}{2}}{\displaystyle \sum _{j}}var({x}_{i}).{\sigma}^{2}$
Proof.
(Lemma 4) By an abuse of notation, we write ${c}_{0}={c}_{n},{c}_{1}={c}_{n1}$ and so on. First, note that: ${y}_{i}={\sum}_{j=1}^{n}{c}_{ji}{u}_{j}{d}_{j}$ and ${z}_{i}={\sum}_{j=1}^{n}{c}_{ji}{u}_{j}{d}_{i}$. Observe that each term ${c}_{ji}{u}_{j}{d}_{j}$ and ${c}_{ji}{u}_{j}{d}_{i}$ have symmetric p.d.f. because of ${d}_{i}$ and ${d}_{j}$. Thus, ${y}_{i}$ and ${z}_{i}$ have symmetric p.d.f. Now let us compute the covariance.
$cov({y}_{i},{y}_{{i}^{\prime}})$  $={\displaystyle \sum _{j,{j}^{\prime}=1}^{n}}cov({c}_{ji}{u}_{j}{d}_{j},{c}_{{j}^{\prime}{i}^{\prime}}{u}_{{j}^{\prime}}{d}_{{j}^{\prime}})$  
$={\displaystyle \sum _{j,{j}^{\prime}=1}^{n}}\mathbb{E}\left[{c}_{ji}{u}_{j}{d}_{j}{c}_{{j}^{\prime}{i}^{\prime}}{u}_{{j}^{\prime}}{d}_{{j}^{\prime}}\right]\mathbb{E}\left[{c}_{ji}{u}_{j}{d}_{j}\right]\mathbb{E}\left[{c}_{{j}^{\prime}{i}^{\prime}}{u}_{{j}^{\prime}}{d}_{{j}^{\prime}}\right]$ 
Observe that $\mathbb{E}\left[{c}_{ji}{u}_{j}{d}_{j}\right]=\mathbb{E}\left[{c}_{ji}{u}_{j}\right]\mathbb{E}\left[{d}_{j}\right]=0$ because ${d}_{j}$ is independent from ${c}_{ji}{u}_{j}$. Also, observe that if $j\ne {j}^{\prime}$ then $\mathbb{E}\left[{d}_{j}{d}_{{j}^{\prime}}\right]=0$ and thus $\mathbb{E}\left[{c}_{ji}{u}_{j}{d}_{j}{c}_{{j}^{\prime}{i}^{\prime}}{u}_{{j}^{\prime}}{d}_{{j}^{\prime}}\right]=\mathbb{E}\left[{d}_{j}{d}_{{j}^{\prime}}\right]\mathbb{E}\left[{c}_{ji}{u}_{j}{c}_{{j}^{\prime}{i}^{\prime}}{u}_{{j}^{\prime}}\right]=0$. Thus, the only non null terms are those for which $j={j}^{\prime}$. We get:
$cov({y}_{i},{y}_{{i}^{\prime}})$  $={\displaystyle \sum _{j=1}^{n}}\mathbb{E}\left[{c}_{ji}{u}_{j}{d}_{j}{c}_{j{i}^{\prime}}{u}_{j}{d}_{j}\right]$  
$={\displaystyle \sum _{j=1}^{n}}\mathbb{E}\left[{c}_{ji}{c}_{j{i}^{\prime}}{u}_{j}^{2}\right]$ 
Assume $u$ is a fixed vector. Then, $var({y}_{i})={\sum}_{j=1}^{n}{u}_{j}^{2}{\sigma}^{2}$ and $cov({y}_{i},{y}_{{i}^{\prime}})=0$ for $i\ne {i}^{\prime}$ because ${c}_{ji}$ is independent from ${c}_{j{i}^{\prime}}$.
Now assume that ${u}_{j}=ReLU({x}_{j})$ where ${x}_{j}$ is a r.v. Clearly, ${u}_{j}^{2}$ is independent from ${c}_{ji}$ and ${c}_{j{i}^{\prime}}$. Thus:
$cov({y}_{i},{y}_{{i}^{\prime}})$  $={\displaystyle \sum _{j=1}^{n}}\mathbb{E}\left[{c}_{ji}{c}_{j{i}^{\prime}}\right]\mathbb{E}\left[{u}_{j}^{2}\right]$ 
For $i\ne {i}^{\prime}$, then ${c}_{ji}$ and ${c}_{j{i}^{\prime}}$ are independent, and thus $\mathbb{E}\left[{c}_{ji}{c}_{j{i}^{\prime}}\right]=\mathbb{E}\left[{c}_{ji}\right]\mathbb{E}\left[{c}_{j{i}^{\prime}}\right]=0$. Therefore, $cov({y}_{i},{y}_{{i}^{\prime}})=0$ if $i\ne {i}^{\prime}$. Let us compute the variance. We get $var({y}_{i})={\sum}_{j=1}^{n}var({c}_{ji}).\mathbb{E}\left[{u}_{j}^{2}\right]$. Because the p.d.f. of ${x}_{j}$ is symmetric, $\mathbb{E}\left[{x}_{j}^{2}\right]=2\mathbb{E}\left[{u}_{j}^{2}\right]$ and $\mathbb{E}\left[{x}_{j}\right]=0$. Thus, $var({y}_{i})=\frac{1}{2}{\sum}_{j=1}^{n}var({c}_{ji}).\mathbb{E}\left[{x}_{j}^{2}\right]=\frac{1}{2}{\sum}_{j=1}^{n}var({c}_{ji}).var({x}_{j})$.
Finally, note that $cov({\overline{y}}_{i},{\overline{y}}_{{i}^{\prime}})=cov({y}_{i},{y}_{{i}^{\prime}})+cov({b}_{i},{b}_{{i}^{\prime}})$. This yields the covariances of $\overline{y}$.
To derive $cov({z}_{i},{z}_{{i}^{\prime}})$ and $cov({\overline{z}}_{i},{\overline{z}}_{{i}^{\prime}})$ , the required calculus is nearly identical. We let the reader check by himself/herself. ∎
5 Additional Empirical Evaluation
5.1 Experimental Setup
Datasets:
We conducted our experiments using three datasets. The first series of experiments (for Q1 and Q2) were conducted with a synthetic dataset and on the CIFAR10 image classification dataset (Krizhevsky & Hinton, 2009). The synthetic dataset has been created in order to reproduce the experiment on the regression linear problem proposed by Moczulski et al. (2015). We draw $X$, $Y$ and $W$ from a uniform distribution between [1, +1] and $\u03f5$ from a normal distribution with mean 0 and variance $0.01$. The relationship between $X$ and $Y$ is define by $Y=XW+\u03f5$.
The final experiments (Q3) were conducted on the YouTube8M video dataset. The YouTube8M dataset (AbuElHaija et al., 2016a) is a very large video classification dataset with over 3.8 million training examples and 3800 classes. For each video, 300 images and audio samples are extract and processed by an Inception Network. The YouTube8M dataset consists of the 300 video vectors of shape 1024 and 300 audio vectors of shape 128. The first set of experiments (Section 5.3.1) use this dataset with a mean aggregation and concatenation of all video and audio input.
Architectures & HyperParameters:
The first set of experiments use full connected neural networks with the dense matrices replaced by structured ones (Diagonal circulant, Diagonal Toeplitz, Toeplitz, LDR, etc.). For all our experiments, we train all networks for 200 epochs, a batch size of 200, Leaky ReLU activation with a different slope. We minimize the Cross Entropy Loss with Adam optimizer and use a piecewise constant learning rate of $5\times 10e5$, $2.5\times 10e5$, $5\times 10e6$ and $1\times 10e6$ after respectively 40000, 60000 and 80000 steps.
For the YouTube8M dataset experiments, we built a neural network based on the stateoftheart architecture initially proposed by AbuElHaija et al. (2016b) (see Figure 5.1) and later improved by Miech et al. (2017). Remark that no convolution layer is involved in this application since the input vectors are embeddings of video frames processed using stateoftheart convolutional neural networks trained on ImageNet.
We trained our models with the CrossEntropy loss and used Adam optimizer with a 0.0002 learning rate and a 0.8 exponential decay every 4 million examples. All fully connected layers are composed of 512 units. DBoF, NetVLAD and NetFV are respectively 8192, 64 and 64 of cluster size for video frames and 4096, 32, 32 for audio frames. We used 4 mixtures for the MoE Layer. We used all the available 300 frames for the DBoF embedding. In order to stabilize and accelerate the training, we used batch normalization before each non linear activation and gradient clipping.
Unless otherwise mentioned, we consider fully DCNNs: we replace all dense layers with diagonal circulant layers. This is an important distinction from the previous experiments from (Moczulski et al., 2015) that were typically ran on a large VGG network with only few layers replaced with structured ones. Since these networks are generally highly redundant, the contribution of the structured layer is difficult to quantify.
Reproducibility of experiments: All our experiments are developed with TensorFlow version 1.12 (Abadi et al., 2015). The code is available as Supplemental Material and will be open sourced upon acceptance of the paper. The archive contains a readme file containing a small documentation on how to run the experiments, a configuration file which defines the architecture and the hyperparameters of the experiments, python scripts which generate a bash command to run the experiments. The code contains DCNN, DTNN, Toeplitz networks, ACDC networks, low rank networks and the YouTube8M networks. The source code for LDR can be found online^{2}^{2} 2 https://github.com/HazyResearch/structurednets.
Motivation for using real neural networks instead of complex ones:
We choose to train real neural networks because the complex version of the same networks would use twice the number of parameters. As such for equal number of parameters the complex one would have half the number of layers. We made a comparison between the two and found that the real networks with double the number of layer achieve better results (see Table3)
Layers  #Weights  Accuracy  

Complex  2  43 048  0.520 
4  79 912  0.523  
5  98 344  0.524  
Real  2  21 524  0.536 
5  49 172  0.543  
10  95 252  0.544 
5.2 Additional results on the YouTube8M dataset
Figure 5.2 present the validation GAP given the number of epochs on the YouTube8M dataset. We observe that the model with the fully connected block converted to diagonal circulant layer achieve the same GAP score of the dense model while converging faster.
5.3 Comparison with other compression based approaches
For completeness, we provide a comparison with other compression based approaches. Table 4 shows the test error of DCNN against other know compression techniques on the MNIST datasets. We can observe that DCNN outperform easily HashNet (Chen et al., 2015) and Dark Knowledge (Hinton et al., 2015) with fewer number of parameters. The architecture with Fast Food (FF) (Yang et al., 2015) achieves better performance but with convolutional layers and only $1$ Fast Food Layer as the last Softmax layer.
Architecture  Settings  #Params  Error (%) 

LeNet (Lecun et al., 1998)    4 257 674  0.61 
DCNN  8 DC layers  25 620  1.74 
10 DC layers  31 764  1.60  
Fast Food (FF) (Yang et al., 2015)  Conv + FF 1024 Softmax layer  38 821  0.71 
Conv + FF 2048 Softmax Layer  52 124  0.71  
HashNet (Chen et al., 2015)  3 layers, 1/64 compress. factor  46 875  2.79 
5 layers, 1/64 compress. factor  78 125  1.99  
Dark Knowledge (Hinton et al., 2015)  3 layers, 1/64 compress.factor  46 875  6.32 
5 layers, 1/64 compress. factor  78 125  2.16 