On the Expressive Power of Deep Diagonal Circulant Neural Networks

  • 2019-06-11 16:53:27
  • Alexandre Araujo, Benjamin Negrevergne, Yann Chevaleyre, Jamal Atif
  • 0

Abstract

In this paper, we study deep diagonal circulant neural networks, that is deepneural networks in which all weight matrices are the product of diagonal andcirculant ones. We show that these networks outperform the recently introduceddeep networks with other types of structured layers. Besides introducingprincipled techniques for training these models, we provide theoreticalguarantees regarding their expressivity. Indeed, we prove that the functionspace spanned by diagonal circulant networks of bounded depth includes the onespanned by dense networks with specific properties on their rank. We conduct athorough experimental study to compare the performance of deep diagonalcirculant networks with state of the art models based on structured matricesand with dense models. We show that our models achieve better accuracy thantheir structured alternatives while required 2x fewer weights as the next bestapproach. Finally we train deep diagonal circulant networks to build a compactand accurate models on a real world video classification dataset with over 3.8million training examples.

 

Quick Read (beta)

On the Expressive Power of Deep Diagonal Circulant Neural Networks

Alexandre Araujo1,2 Benjamin Negrevergne2 Yann Chevaleyre2 Jamal Atif2

1Wavestone, Paris, France 
2Université Paris-Dauphine, PSL Research University, CNRS, LAMSADE, Paris, France
Abstract

In this paper, we study deep diagonal circulant neural networks, that is deep neural networks in which all weight matrices are the product of diagonal and circulant ones. We show that these networks outperform the recently introduced deep networks with other types of structured layers. Besides introducing principled techniques for training these models, we provide theoretical guarantees regarding their expressivity. Indeed, we prove that the function space spanned by diagonal circulant networks of bounded depth includes the one spanned by dense networks with specific properties on their rank. We conduct a thorough experimental study to compare the performance of deep diagonal circulant networks with state of the art models based on structured matrices and with dense models. We show that our models achieve better accuracy than their structured alternatives while required 2x fewer weights as the next best approach. Finally we train deep diagonal circulant networks to build a compact and accurate models on a real world video classification dataset with over 3.8 million training examples.

\usetikzlibrary

positioning \usetikzlibraryarrows.meta \pgfplotssetcompat=1.14

 

On the Expressive Power of Deep Diagonal Circulant Neural Networks


  Alexandre Araujo1,2 Benjamin Negrevergne2 Yann Chevaleyre2 Jamal Atif2 1Wavestone, Paris, France 2Université Paris-Dauphine, PSL Research University, CNRS, LAMSADE, Paris, France

\@float

noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

Deep neural networks with structured matrices (e.g. low rank matrices, Toeplitz matrices, LDR, etc.) have been been the focus of intensive research in the last years, in particular for building compact models with reduced memory footprints. Despite a number of successful results, it remains unclear whether structured layers alone are expressive enough to replace all dense layers in deep feed-forward neural networks. In particular the following questions remain open: “What is the expressive power of compact representations compared to the original ones?”, “What functions can or cannot be approximated by neural networks when all dense matrices are replaced by a small set of structured matrices?” and “How can we efficiently train deep neural networks with a large number of structured layers?”.

In this paper, we provide a principled answer to the questions mentioned above on the particular case of deep neural networks based on diagonal and circulant matrices (a.k.a. Diagonal-circulant networks or DCNNs). The idea of using diagonal and circulant matrices together comes from a series of results in linear algebra by Müller-Quade et al. (1998) and Huhtanen & Perämäki (2015). The most recent result from Huhtanen & Perämäki demonstrates that any matrix A in n×n can be decomposed into the product of 2n-1 alternating diagonal and circulant matrices. This result was then used by Moczulski et al. (2015) to introduce the AFDF structured layer, which is the building block of DCNNs.

We follow the same line of research as Moczulski et al. (2015) and bring new contributions. Remarkably, we were able to train deep DCNNs (up to 40 layers) and apply this framework in the context of a large scale real-world scneraio (the YouTube-8M video classification dataset). In addition, we endow these networks with strong theoretical guarantees by showing that DCNNs with ReLU activations and with bounded width and small depth can approximate any dense neural network with arbitrary precision. More precisely, we make the following contributions:

  1. 1.

    We enhance the results by Huhtanen & Perämäki (2015) by introducing a new bound on the number diagonal-circulant required to approximate a matrix that depends on its rank.

  2. 2.

    We introduce several results regarding DCNNs with ReLU activations. In particular, we are able to demonstrate that a DCNN with bounded width and small depth can approximate any dense networks with ReLU activations.

  3. 3.

    We describe a theoretically sound initialization procedure for DCNN. These results simplify the training and removes a large number of complex hyperparameters which made DCNNs difficult to train in practice up to now.

  4. 4.

    We provide a number of empirical insights to explain the behaviour of DCNNs. More specifically, we show the impact of the number of the non-linearities in the network on the convergence rate and the accuracy of the network.

  5. 5.

    By combining all these insights, we are able to train large and deep DCNNs. We demonstrate the good performance of DCNNs on a large scale application (the YouTube-8M video classification problem) and obtain very competitive accuracy .

2 Related Work

Structured matrices exhibit a number of good properties which have been exploited by deep learning practitioners, mainly to compress large neural networks architectures into smaller ones. For example Hinrichs & Vybíral (2011) have demonstrated that a single circulant matrix can be used to approximate the Johson-Lindenstrauss transform, often used in machine learning to perform dimensionality reduction. Building upon this result, Cheng et al. (2015) proposed to replace the weight matrix of a fully connected layer by a circulant matrix effectively replacing the complex transform modeled by the fully connected layer by a simple dimensionality reduction. Despite the reduction of expressivity, the resulting network demonstrated good accuracy using only a fraction of its original size (90% reduction).

ACDC. Moczulski et al. (2015) have introduced two Structured Efficient Linear Layers (SELL) called AFDF and ACDC. The AFDF structured layer benefits from the theoretical results introduced by Huhtanen & Perämäki and can be seen the building block of DCNNs. However, Moczulski et al. (2015) only experiment using ACDC, a different type of layer that does not involve circulant matrices. As far as we can tell, the theoretical guarantees available for the AFDF layer do not apply on the ACDC layer since the cosine transform does not diagonalize circulant matrices (Sanchez et al., 1995). Another possible limit of the ACDC paper is that they only train large neural networks involving ACDC layers combined with many other expressive layers. Although the resulting network demonstrates good accuracy, it is difficult the characterize the true contribution of the ACDC layers in this setting.

Low displacement rank structures. More recently, Thomas et al. (2018) have generalized these works by proposing neural networks with low-displacement rank matrices (LDR), that are structured matrices encompassing a large family of structured matrices, including Toeplitz-like, Vandermonde-like, Cauchy-like and more notably DCNNs. To obtain this result, LDR represents a structured matrix using two displacement operators and a low-rank residual. Despite being elegant and general, we found that the LDR framework suffers from several limits which are inherent to its generality, and makes it difficult to use in the context of large and deep neural networks. First, the training procedure for learning LDR matrices is highly involved and implies many complex mathematical objects such as Krylov matrices. Then, as acknowledged by the authors, the number of parameters required to represent a given structured matrix (e.g. a Toeplitz matrix) in practice is unnecessarily high (higher than required in theory).

Other compression techniques. Besides structured matrices, a variety of techniques have been proposed to build more compact deep learning models. These include model distillation (Hinton et al., 2015), Tensor Train (Novikov et al., 2015), Low-rank decomposition (Denil et al., 2013), to mention a few. However, Circulant networks show good performances in several contexts (the interested reader can refer to the results reported by Moczulski et al. (2015) and Thomas et al. (2018)).

3 A primer on circulant matrices and a new result

An n-by-n circulant matrix C is a matrix where each row is a cyclic right shift of the previous one as illustrated below.

C=circ(c)=[c0cn-1cn-2c1c1c0cn-1c2c2c1c0c3cn-1cn-2cn-3c0]

Circulant matrices exhibit several interesting properties from the perspective of numerical computations. Most importantly, any n-by-n circulant matrix C can be represented using only n coefficients instead of the n2 coefficients required to represent classical unstructured matrices. In addition, the matrix-vector product is simplified from O(n2) to O(nlog(n)) using the convolution theorem.

As we will show in this paper, circulant matrices also have a strong expressive power. So far, we know that a single circulant matrix can be used to represent a variety of important linear transforms such as random projections (Hinrichs & Vybíral, 2011). When they are combined with diagonal matrices, they can also be used as building blocks to represent any linear transform (Schmid et al., 2000; Huhtanen & Perämäki, 2015) with an arbitrary precision. Huhtanen & Perämäki were able to bound the number of factors that is required to approximate any matrix A with arbitrary precision. We recall this result in Theorem 1 as it is the starting point of our theoretical analysis (note that in the rest of the paper, denotes to the 2 norm when applied to vectors, and to the operator norm when applied to matrices).

Theorem 1.

(Reformulation Huhtanen & Perämäki (2015)) For every matrix ACn×n, for any ϵ>0 , there exists a sequence of matrices B1B2n-1 where Bi is a circulant matrix if i is odd, and a diagonal matrix otherwise, such that B1B2B2n-1-A<ϵ.

Unfortunately, this theorem is of little use to understand the expressive power of diagonal-circulant matrices when they are used in deep neural networks. This is because: 1) the bound does not depend on the matrix A, 2) the theorem does not provide any insights regarding the expressive power of m diagonal-circulant factors when m is much lower than 2n-1 as it is the case in most practical scenarios we consider in this paper.

In the following theorem, we enhance the result by Huhtanen & Perämäki by expressing the number of factors required to approximate A, as a function of the rank of A. This is useful when one deals with low-rank matrices, which is common in machine learning problems.

Theorem 2.

(Rank-based circulant decomposition) Let ACn×n be a matrix of rank at most k. Assume that n can be divided by k. For any ϵ>0, there exists a sequence of 4k+1 matrices B1,,B4k+1, where Bi is a circulant matrix if i is odd, and a diagonal matrix otherwise, such that B1B2B4k+1-A<ϵ

A direct consequence of Theorem 2, is that if the number of diagonal-circulant factors is set to a value K, we can only approximate a matrix A whose rank is at most K-14. As we will show, this result will be useful to analyze neural networks based on diagonal and circulant matrices.

4 Theoretical analysis of Diagonal Circulant Neural Networks (DCNNs)

4.1 The expressive power of DCNNs with bounded width and small depth

Zhao et al. (2017) have shown that circulant networks with 2 layers and unbounded width are universal approximators. However, results on unbounded networks offer weak guarantees and two important questions have remained open until now: 1) Can we approximate any function with a bounded-width circulant networks? 2) What function can we approximate with a circulant network that has a bounded width and a small depth? We answer these two questions in this section.

First, we introduce some necessary definitions regarding neural networks and we provide a theoretical analysis of their approximation capabilities.

Definition 1 (Deep ReLU network).

Given L weight matrices W=(W1,,WL) with WiCn×n and L bias vectors b=(b1,,bL) with biCn, a deep ReLU network is a function fWL,bL:CnCn such that fW,b(x)=(fWL,bLfW1,b1)(x) where fWi,bi(x)=ϕ(Wix+bi) and ϕ(.) is a ReLU non-linearity 11 1 Because our networks deal with complex numbers, we use an extension of the ReLU function to the complex domain. The most straightforward extension defined in (Trabelsi et al., 2018) is as follows: ReLU(z)=ReLU(R(z))+iReLU(I(z)), where R and I refer to the real and imaginary parts of z. In the rest of this paper, we call L and n respectively the depth and the width of the network. Moreover, we call total rank k, the sum of the ranks of the matrices W1WL. i.e. k=i=1Lrank(Wi).

We also need to introduce DCNNs, similarly to Moczulski et al. (2015).

Definition 2 (Diagonal Circulant Neural Networks).

Given L diagonal matrices D=(D1,,DL) with DiCn×n, L circulant matrices C=(C1,,CL) with CiCn×n and L bias vectors b=(b1,,bL) with biCn, a Diagonal Circulant Neural Networks (DCNN) is a function fWL,bL:CnCn such that fD,C,b(x)=(fDL,CL,bLfD1,C1,b1)(x) where fDi,Ci,bi(x)=ϕi(DiCix+bi) and where ϕi(.) is a ReLU non-linearity or the identity function.

We can now show that bounded-width DCNNs can approximate any Deep ReLU Network, and as a corollary, that they are universal approximators.

Lemma 1.

Let N be a deep ReLU network of width n and depth L, and let XCn be a bounded set. For any ϵ>0, there exists a DCNN N of width n and of depth (2n-1)L such that N(x)-N(x)<ϵ for all xX.

The proof is in the supplemental material. We can now state the universal approximation corollary:

Corollary 1.

Bounded width DCNNs are universal approximators in the following sense: for any continuous function f:[0,1]nR+ of bounded supremum norm, for any ϵ>0, there exists a DCNN Nϵ of width n+3 such that x[0,1]n+3, |f(x1xn)-(Nϵ(x))1|<ϵ, where ()i represents the ith component of a vector.

This is a first result, however (2n+5)L is not a small depth (in our experiments, n can be over 300 000), and a number of work provided empirical evidences that DCNN with small depth can offer good performances (e.g. Araujo et al. (2018); Cheng et al. (2015)). To improve our result, we introduce our main theorem which studies the approximation properties of these small depth networks.

Theorem 3.

(Main result: Rank-based expressive power of DCNNs) Let N be a deep ReLU network of width n, depth L and a total rank k and assume n is a power of 2. Let XCn be a bounded set. Then, for any ϵ>0, there exists a DCNN with ReLU activation N of width n such that N(x)-N(x)<ϵ for all xX and the depth of N is bounded by 9k.

Remark that in the theorem, we require that n is a power of 2. We conjecture that the result still holds even without this condition.

This result refines Lemma 1, and answer our second question: a DCNN of bounded width and small depth can approximate a Deep ReLU network of low total rank. Note that the converse is not true: because n-by-n circulant matrix can be of rank n, approximating a DCNN of depth 1 can require a deep ReLU network of total rank equals to n.

For the sake of clarity, we highlight the significance of these results in the following property:

Property 1.

Let Rk,n be the set of all functions f:RnRn representable by deep ReLU networks of total rank at most k and let Cl,n the set of all functions f:RnRn representable by deep diagonal-circulant networks of depth at most l, then:

k,l,n k,n𝒞l,n
l,k,n 𝒞l,nk,n

We illustrate the meaning of this Property 1 using Figure 4.1. As we can see, the set k,n of all the functions representable by a deep ReLU network of total rank k is strictly included in the set 𝒞9k of all circulant networks of depth 9k (as by Theorem 3).

\tikzset

>=Latex[width=2mm,length=2mm], base/.style = rectangle, draw=black, text centered, font=, other/.style = base, fill=none, minimum width=1.7cm, minimum height=0.7cm, ellipse/.style = base {tikzpicture}[every node/.style=fill=white, font=, align=center,scale=0.6]

\draw

(0,0) circle (2.5cm); \draw(0,0) circle (2.0cm); \draw(0,0) circle (1.5cm); \draw(0,0) circle (1.0cm); \draw(0,0) circle (0.5cm);

\draw

[ellipse, rotate=30, fill=gray, opacity=0.5] (0.1, -1.1) ellipse (2.0cm and 0.9cm); \draw[ellipse, rotate=30, fill=gray, opacity=0.5] (0.0, -0.8) ellipse (1.0cm and 0.45cm);

\node

[other, draw=none] at (0.20, 0.20) 𝒞1,n; \node[other, draw=none] at (0.55, 0.55) ; \node[other, draw=none] at (0.90, 0.90) 𝒞9,n; \node[other, draw=none] at (1.25, 1.25) ; \node[other, draw=none] at (1.60, 1.60) 𝒞18,n;

\node

[other, draw=none] at (0.4, -0.6) 1,n; \node[other, draw=none] at (1.0, -1.4) 2,n;

Figure 4.1: Illustration of Property 1 for a fixed width n.

Finally, what if we choose to use small depth networks to approximate deep ReLU networks where matrices are not of low rank? To answer this question, we first need to show the negative impact of replacing matrices by their low rank approximators in neural networks:

Proposition 4.

Let N=fAL,bLfA1,b1 be a Deep ReLU network, where AiCn×n,biCn for all i[L]. Let A~i be the matrix obtained by an SVD approximation of rank k of matrix Ai. Let σi,j be the jth singular value of Ai. Define N~=fAL~,bLfA1~,b1. Then, for any xCn, we have:

𝒩(x)-𝒩~(x)(σmax,1L-1)Rσmax,kσmax,1-1

where R is an upper bound on norm of the output of any layer in N, and σmax,j=maxiσi,j.

Proposition 4 shows that we can approximate matrices in a neural network by low rank matrices, and control the approximation error. In general, the term σmax,1L could seem large, but in practice, it is likely that most singular values in deep neural network are small in order to avoid divergent behaviors. We can now prove the result on DCNNs:

Corollary 2.

Consider any deep ReLU network N=fAL,bLfA1,b1 of depth L and width n. Let σmax,j=maxiσi,j where σi,j is the jth singular value of Ai. Let XRn be a bounded set. Let k be an integer dividing n. There exists a DCNN N=fDmCm,bmfD1C1,b1 of width n and of depth m=L(4k+1), such that for any xX:

𝒩(x)-𝒩(x)<(σmax,1L-1)Rσmax,kσmax,1-1

where R is an upper bound on the norm of the outputs of each layer in N.

4.2 Initialization of deep circulant layers

Training DCNNs has revealed to be a challenging problem. To facilitate training and convergence, we propose the following initialization procedure which is a variant of Xavier initialization. First, for each circulant matrix C=circ(c1cn), each ci is randomly drawn from 𝒩(0,σ2), with σ=2n. Next, for each diagonal matrix D=diag(d1dn), each di is drawn randomly and uniformly from {-1,1} for all i. Finally, all biases in the network are randomly drawn from 𝒩(0,σ2), for some small value of σ.

The following proposition states that the covariance matrix at the output of any layer in a DCNN is constant.

Proposition 5.

Let N be a DCNN of depth L initialized according to our procedure, with σ=0. Assume that all layers 1 to L-1 have ReLU activation functions, and that the last layer has the identity activation function. Then, for any xRn, the covariance matrix of N(x) is 2.Idnx22. Moreover, note that this covariance does not depend on the depth of the network.

With this initialization, the covariance matrix at the output of any layer does not depend on the depth of the network. This ensures that the signal is propagated accross the network without vanishing nor exploding.

5 Empirical evaluation

This experimental section aims at answering the following questions: (Q1) How to train DCNNs and tune their hyper parameters? (Q2) How do DCNNs compare to other approaches such as ACDC, LDR or other structured approaches? (Q3) How do DCNNs performs in the context of large scale real-world machine learning applications? All the details of our experiments are provided as supplemental material. We conducted experiments using architectures with real and complex weights. Since our results were consistently better using real valued matrices, we report our results on real values in this section. We provide some results with complex valued matrices in the supplemental material.

{tikzpicture}

[scale=0.45] {axis}[ legend cell align=left, xlabel=#layers, ylabel=Test Accuracy, xmin=2, xmax=40, legend pos=south west, ymajorgrids=true, grid style=dashed, ] \addplot[mark=square, color=blue, line width=0.4mm] table [y=accuracy, x=layers]data/cifar10/factor/1_factor.dat; \addplot[mark=triangle, color=brown, line width=0.4mm] table [y=accuracy, x=layers]data/cifar10/factor/2_factor.dat; \addplot[mark=o, color=gray, line width=0.4mm] table [y=accuracy, x=layers]data/cifar10/factor/3_factor.dat; \legend ReLU(DC), ReLU(DCDC), ReLU(DCDCDC),

{tikzpicture}

[scale=0.45] {axis}[ legend cell align=left, xlabel=#layers, ylabel=Test Accuracy, xmin=2, xmax=40, legend pos=south west, ymajorgrids=true, grid style=dashed, ] \addplot[mark=square, color=blue, line width=0.4mm] table [y=accuracy, x=layers]data/cifar10/leaky_relu/slope_0.2.dat; \addplot[mark=triangle, color=brown, line width=0.4mm] table [y=accuracy, x=layers]data/cifar10/leaky_relu/slope_0.3.dat; \addplot[mark=o, color=gray, line width=0.4mm] table [y=accuracy, x=layers]data/cifar10/leaky_relu/slope_0.5.dat; \legend Leaky ReLU 0.2, Leaky ReLU 0.3, Leaky ReLU 0.5,

{tikzpicture}

[scale=0.45] {axis}[ legend cell align=left, xlabel=#weights (x1000) , ylabel=Test Accuracy, xmin=21, xmax=370, ymin=0.2, ymax=0.6, legend pos=south east, ymajorgrids=true, grid style=dashed, ] \addplot[color=red, line width=0.25mm, dashed] table [y=accuracy, x=weights]data/cifar10/type/dense.dat; \addplot[mark=triangle, color=blue, line width=0.4mm] table [y=accuracy, x=weights]data/cifar10/type/circulant.dat; \addplot[mark=square, color=red, line width=0.4mm] table [y=accuracy, x=weights]data/cifar10/type/diag_toeplitz.dat; \addplot[mark=o, color=gray, line width=0.4mm] table [y=accuracy, x=weights]data/cifar10/type/toeplitz.dat; \addplot[mark=diamond, color=brown, line width=0.4mm] table [y=accuracy, x=weights]data/cifar10/type/low_rank.dat; \legend Dense (9M weights), DCNN, DTNN, Toeplitz network, Low Rank network,

Figure 5.1: Experiments on training DCNNs and other structured neural networks on CIFAR-10. Figure 5.1: impact of increasing the number of ReLU activations in a DCNN. Deep DCNNs with fewer ReLUs are easier to train. Figure 5.1: impact of increasing the slope of a Leaky-ReLU in DCNNs. Deep DCNNs with a larger slope are easier to train. Figure 5.1: network size vs. accuracy compared on Dense networks, DCNNs (our approach), DTNNs (our approach), neural networks based on Toeplitz matrices and neural networks based on Low Rank-based matrices. DCNNs outperforms alternatives structured approaches.

5.1 How to train very deep DCNNs (Q1)

We empirically found that reducing the number of non-linearities in the networks simplifies the training of deep neural networks. To support this claim, we conduct a series of experiments on various DCNNs with a varying number of ReLU activations (to reduce the number of non-linearities, we replace some ReLU activations with the identity function). In a second experiment, we replace the ReLU activations with Leaky-ReLU activations and vary the slope of the Leaky ReLU (a higher slope means an activation function that is closer to a linear function). The results of this experiment are presented in Figure 5.1 and 5.1. In 5.1, “ReLU(DC)” means that we interleave on ReLU activation functions between every diagonal-circulant matrix, whereas ReLU(DCDC) means we interleave a ReLU activation every other block etc. In both 5.1, Figure 5.1, we observe that reducing the non-linearity of the networks can be used to train deeper networks. This is an interesting result, since we can use this technique to adjust the number of parameters in the network, without facing training difficulties. We obtain a maximum accuracy of 0.56 with one ReLU every three layers and leaky-ReLUs with a slope of 0.5. We hence rely on this setting in the rest of this experimental section.

5.2 Comparison with other structured approaches (Q2)

Comparison with Dense networks, Toeplitz networks and Low Rank networks. We now compare DCNNs with other state-of-the-art structured networks by measuring the accuracy on CIFAR-10. Our baseline is a dense feed-forward network with a fixed number of weights (9 million weights). We compare with DCNNs and with DTNNs (see below), Toeplitz networks, and Low-Rank networks (Yu et al., 2017). We first consider Toeplitz networks which are stacked Toeplitz matrices interleaved with ReLU activations since Toeplitz matrices are closely related to circulant matrices. Since Toeplitz networks have a different structure (they do not include diagonal matrices), we also experiment using DTNNs, a variant of DCNNs where all the circulant matrices have been replaced by Toeplitz matrices. Finally we conduct experiments using networks based on low-rank matrices as they are also closely related to our work. For each approach, we report the accuracy of several networks with a varying depth ranging from 1 to 40 (DCNNs, Toeplitz networks) and from 1 to 30 (from DTNNs). For low-rank networks, we used a fixed depth network and increased the rank of each matrix from 7 to 40. We also tried to increase the depth of low rank matrices, but we found that deep low-rank networks are difficult to train so we do not report the results here. We compare all the networks based on the number of weights from 21K (0.2% of the dense network) to 370K weights (4% of the dense network) and we report the results in Figure 5.1. First we can see that the size of the networks correlates positively with their accuracy which demonstrate successful training in all cases. We can also see that the DCNNs achieves the maximum accuracy of 56% with 20 layers ( 200K weights) which as as good as the dense networks with only 2% of the number of weights. Other approaches also offer good performance but they are not able to reach the accuracy of a dense network.

Figure 5.2: Comparison of DCNNs and ACDC networks on two different tasks. Figure 5.2 shows the evolution of the training loss on a regression task with synthetic data. Figure 5.2 shows the test accuracy on the CIFAR-10 dataset.

Comparison with ACDC (Moczulski et al., 2015). In Section 2, we have discussed the differences between the ACDC framework and our approach from a theoretical perspective. In this section, we conduct experiments to compare the performance of DCNNs with neural networks based on ACDC layers. We first reproduce the experimental setting from Moczulski et al. (2015), and compare both approaches using only linear networks (i.e. networks without any ReLU activations). The results are presented in Figure 5.2. On this simple setting, both architectures demonstrate good performance, however, DCNNs offer better convergence rate. In Figure 5.2, we compare neural networks with ReLU activations on CIFAR-10. We found that networks which are based only on ACDC layers are difficult to train and offer poor accuracy on CIFAR. (We have tried different initialization schemes including the one from the original paper, and the one we propose in this paper.) In contrast, deep DCNNs can be trained and offer good performance without additional dense layers (these results are in line with our experiments on the YouTube-8M dataset). We can conclude that DCNNs are able to model complex relations at a low cost.

Table 1: LDR networks compared with DCNNs on CIFAR-10. DCNNs outperform all LDR configurations, with fewer weights. (Remark: the numbers may differ from the original experiments by Thomas et al. because we use the original dataset instead of a monochrome version)
Method #Parameters Accuracy
Dense 9 470 986 0.562
DCNN (5layers) 49 172 0.543
DCNN (2layers) 21 524 0.536
LDR-TD (r=2) 64 522 0.511
LDR – Toeplitz-like (r=3) 52 234 0.496
LDR – Hankel-like (r=3) 52 234 0.483

Comparison with LDR networks (Thomas et al., 2018). We now compare DCNNs with the LDR framework using the network configuration experimented in the original paper: a single LDR structured layer followed by a dense layer. In the LDR framework, we can change the size of a network by adjusting the rank of the residual matrix, effectively capturing matrices with a structure that is close to a known structure but not exactly (e.g. in the LDR framework, Toeplitz matrices can be encoded with a residual matrice with rank=2, so a matrix that can be encoded with a residual of rank=3 can be seen as Toeplitz-like.). The results are presented in Table 1 and demonstrate that DCNNs outperforms all LDR networks both in terms in size and accuracy.

5.3 DCNNs for large-scale video classification on the YouTube-8M dataset (Q3)

To understand the performance of deep DCNNs on large scale applications, we conducted experiments on the YouTube-8M video classification with 3.8 training examples introduced by (Abu-El-Haija et al., 2016b). Notice that we favour this experiment over ImageNet applications because modern image classification architectures involve a large number of convolutional layers, and compressing convolutional layers is out of our scope. Also, as mentioned earlier, testing the performance of DCNN architectures mixed with a large number of expressive layers makes little sense.

The YouTube-8M includes two datasets describing 8 million labeled videos. Both datasets contain audio and video features for each video. In the first dataset (aggregated) all audio and video features have been aggregated every 300 frames. The second dataset (full) contains the descriptors for all the frames. To compare the models we use the GAP metric (Global Average Precision) proposed by (Abu-El-Haija et al., 2016b). On the simpler (aggregated) dataset we compared off-the-shelf DCNNs with a dense baseline with 5.7M weights. On the full dataset, we designed three new compact architectures based on the state-of-the-art architecture introduced by (Abu-El-Haija et al., 2016b). The original architecture is based on 3 blocks, a Deep Bag-of-Frame embedding block (DBoF), a dimensionalilty reduction block (FC), and a Mixture of Experts block (MoE). We create three compact architectures by replacing one block by a more compact DCNN block in each architecture. Finally, we compared the performance of each compact architecture to the original one.

Table 2 shows the results of our experiments on the YouTube-8M dataset in terms of number of weights, compression rate and GAP. The compact fully connected layer achieves a compression rate of 9.5 while having a close performance. On the full dataset, replacing the DBoF block reduces the size of the network without impacting the accuracy. We obtain the best compression ratio by replacing the MoE block with DCNNs (26%) of the size of the original dataset with a GAP score of 0.805 (95% of the score obtained with the original architecture). We conclude that DCNN are both theoretically sound and of practical interest in real, large scale applications.

Table 2: GAP score of several architectures on the YouTube-8M datasets.
Dataset Architecture #Weights Compress. [email protected]
(% of baseline size)
Aggregated Dense 5.7M - 0.773
4 DC layers 25 410 0.44 0.599
32 DC layers 122 178 2.11 0.685
4 DC layers + 1 dense 4.46M 77 0.747
Full Original architecture 45M - 0.846
DBoF block replaced with DCNN 36M 80 0.838
FC block replaced with DCNN 41M 91 0.845
MoE block replaced with DCNN 12M 26 0.805

6 Conclusion

This paper deals with the training of diagonal circulant neural networks. To the best of our knowledge, training such networks with a large number of layers had not been done before. We also endowed this kind of models with theoretical guarantees, hence enriching and refining previous theoretical work from the literature. More importantly, we showed that DCNNs outperform their competing structured alternatives, including the very recent general approach based on LDR networks. Our results suggest that stacking diagonal circulant layers with non linearities improves the convergence rate and the final accuracy of the network. Formally proving these statements constitutes the future directions of this work. As future work, we would like to generalize the good results of DCNNs to convolutions neural networks. We also believe that circulant matrices deserve a particular attention in deep learning because of their strong ties with convolutions: a circulant matrix operator is equivalent to the convolution operator with circular paddings (as shown in [5]). This fact makes any contribution to the area of circulant matrices particularly relevant to the field of deep learning with impacts beyond the problem of designing compact models. As future work, we would like to generalize our results to deep convolutional neural networks.

References

  • Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • Abu-El-Haija et al. (2016a) Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, A. P., Toderici, G., Varadarajan, B., and Vijayanarasimhan, S. Youtube-8m: A large-scale video classification benchmark. In arXiv:1609.08675, 2016a.
  • Abu-El-Haija et al. (2016b) Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., and Vijayanarasimhan, S. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016b.
  • Araujo et al. (2018) Araujo, A., Negrevergne, B., Chevaleyre, Y., and Atif, J. Training compact deep learning models for video classification using circulant matrices. In The 2nd Workshop on YouTube-8M Large-Scale Video Understanding at ECCV 2018, 2018.
  • Chen et al. (2015) Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., and Chen, Y. Compressing neural networks with the hashing trick. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 2285–2294. JMLR.org, 2015.
  • Cheng et al. (2015) Cheng, Y., Yu, F. X., Feris, R. S., Kumar, S., Choudhary, A., and Chang, S. F. An exploration of parameter redundancy in deep networks with circulant projections. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2857–2865, Dec 2015.
  • Denil et al. (2013) Denil, M., Shakibi, B., Dinh, L., Ranzato, M. A., and de Freitas, N. Predicting parameters in deep learning. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 2148–2156. Curran Associates, Inc., 2013.
  • Hanin (2017) Hanin, B. Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations. ArXiv e-prints, August 2017.
  • Hinrichs & Vybíral (2011) Hinrichs, A. and Vybíral, J. Johnson-lindenstrauss lemma for circulant matrices. Random Structures & Algorithms, 39(3):391–398, 2011.
  • Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
  • Huhtanen & Perämäki (2015) Huhtanen, M. and Perämäki, A. Factoring matrices into the product of circulant and diagonal matrices. Journal of Fourier Analysis and Applications, 21(5):1018–1033, Oct 2015. ISSN 1531-5851. doi: 10.1007/s00041-015-9395-0.
  • Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.
  • Miech et al. (2017) Miech, A., Laptev, I., and Sivic, J. Learnable pooling with context gating for video classification. CoRR, abs/1706.06905, 2017.
  • Moczulski et al. (2015) Moczulski, M., Denil, M., Appleyard, J., and de Freitas, N. Acdc: A structured efficient linear layer. arXiv preprint arXiv:1511.05946, 2015.
  • Müller-Quade et al. (1998) Müller-Quade, J., Aagedal, H., Beth, T., and Schmid, M. Algorithmic design of diffractive optical systems for information processing. Physica D: Nonlinear Phenomena, 120(1-2):196–205, 1998.
  • Novikov et al. (2015) Novikov, A., Podoprikhin, D., Osokin, A., and Vetrov, D. P. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450, 2015.
  • Sanchez et al. (1995) Sanchez, V., Garcia, P., Peinado, A. M., Segura, J. C., and Rubio, A. J. Diagonalizing properties of the discrete cosine transforms. IEEE transactions on Signal Processing, 43(11):2631–2641, 1995.
  • Schmid et al. (2000) Schmid, M., Steinwandt, R., Müller-Quade, J., Rötteler, M., and Beth, T. Decomposing a matrix into circulant and diagonal factors. Linear Algebra and its Applications, 306(1-3):131–143, 2000.
  • Thomas et al. (2018) Thomas, A., Gu, A., Dao, T., Rudra, A., and Ré, C. Learning compressed transforms with low displacement rank. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9066–9078. Curran Associates, Inc., 2018.
  • Trabelsi et al. (2018) Trabelsi, C., Bilaniuk, O., Zhang, Y., Serdyuk, D., Subramanian, S., Santos, J. F., Mehri, S., Rostamzadeh, N., Bengio, Y., and Pal, C. J. Deep complex networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
  • Yang et al. (2015) Yang, Z., Moczulski, M., Denil, M., d. Freitas, N., Smola, A., Song, L., and Wang, Z. Deep fried convnets. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1476–1483, Dec 2015. doi: 10.1109/ICCV.2015.173.
  • Yu et al. (2017) Yu, X., Liu, T., Wang, X., and Tao, D. On compressing deep models by low rank and sparse decomposition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 67–76, July 2017. doi: 10.1109/CVPR.2017.15.
  • Zhao et al. (2017) Zhao, L., Liao, S., Wang, Y., Li, Z., Tang, J., and Yuan, B. Theoretical properties for neural networks with weight matrices of low displacement rank. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 4082–4090, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
{adjustwidth}

-0.5cm-0.5cm

Supplemental Material

1 Notations

We note (z) and (z) the real and imaginary parts the complex number z. We note ()t is the tth component of a vector. Let 𝐢 be the imaginary number defined by 𝐢2=-1. Define 𝟏n as the n-vector of ones. Also, we note [n]={1,,n}. The rectified linear unit on the complex domain is defined by ReLU(z)=max(0,(z))+𝐢max(0,(z)). The notation || refers to the complex modulus. Finally, define the cyclic shift matrix Sn×n as follows:

S=[01101010]

2 Proofs of Section 3

Theorem 1.

(Reformulation Huhtanen & Perämäki (2015)) For any given matrix ACn×n, for any ϵ>0, there exists a sequence of matrices B1B2n-1 where Bi is a circulant matrix if i is odd, and a diagonal matrix otherwise, such that B1B2B2n-1-A<ϵ. Moreover, if A can be decomposed as A=i=1kDiSi-1 where S is the cyclic-shift matrix and D1Dk are diagonal matrices, then A can be written as a product B1B2B2k-1 where Bi is a circulant matrix if i is odd, and a diagonal matrix otherwise.

Theorem 2.

(Rank-based circulant decomposition) Let ACn×n be a matrix of rank at most k. Assume that n can be divided by k. For any ϵ>0, there exists a sequence of 4k+1 matrices B1,,B4k+1, where Bi is a circulant matrix if i is odd, and a diagonal matrix otherwise, such that B1B2B4k+1-A<ϵ

Proof.

(Theorem 2) Let UΣVT be the SVD decomposition of M where U,V and Σ are n×n matrices. Because M is of rank k, the last n-k columns of U and V are null. In the following, we will first decompose U into a product of matrices WRO, where R and O are respectively circulant and diagonal matrices, and W is a matrix which will be further decomposed into a product of diagonal and circulant matrices. Then, we will apply the same decomposition technique to V. Ultimately, we will get a product of 4k+2 matrices alternatively diagonal and circulant.

Let R=circ(r1rn). Let O be a n×n diagonal matrix where Oi,i=1 if ik and 0 otherwise. The k first columns of the product RO will be equal to that of R, and the n-k last colomns of RO will be zeros. For example, if k=2, we have:

RO=(r1rn00r2r1r3r2rnrn-100)

Let us define k diagonal matrices Di=diag(di1din) for i[k]. For now, the values of dij are unknown, but we will show how to compute them. Let W=i=1kDiSi-1. Note that the n-k last columns of the product WRO will be zeros. For example, with k=2, we have:

W=[d1,1d2,1d2,2d1,2d2,3d2,nd1,n]
WRO=(r1d11+rnd21rnd11+rn-1d2100r2d12+r1d22r1d12+rnd22rnd1n+rn-1d2nrn-1d1n+rn-2d2n00)

We want to find the values of dij such that WRO=U. We can formulate this as linear equation system. In case k=2, we get:

(rnr1rn-1rnr1r2rnr1r2r3r1r2)×(d2,1d1,1d2,2d1,2d2,3d1,3)=(U1,1U1,2U2,1U2,2)

The ith bloc of the bloc-diagonal matrix is a Toeplitz matrix induced by a subsequence of length k of (r1,rn,r1rn). Set rj=1 for all j{k,2k,3k,n} and set rj=0 for all other values of j. Then it is easy to see that each bloc is a permutation of the identity matrix. Thus, all blocs are invertible. This entails that the block diagonal matrix above is also invertible. So by solving this set of linear equations, we find d1,1dk,n such that WRO=U. We can apply the same idea to factorize V=W.R.O for some matrix W. Finally, we get

A=UΣVT=WROΣOTRTWT

Thanks to Theorem 1, W and W can both be factorized in a product of 2k-1 circulant and diagonal matrices. Note that OΣOT is diagonal, because all three are diagonal. Overall, A can be represented with a product of 4k+2 matrices, alternatively diagonal and circulant. ∎

3 Proofs of Section 4

Lemma 1.

Let WL,W1Cn×n, bCn and let XCn be a bounded set. There exists βLβ1Cn such that for all xX we have fWL,βLfW1,β1(x)=ReLU(WLWL-1W1x+b).

Proof.

(Lemma 1) Define S={((k=1jWk)x)t:x𝒳,t[n],j[L]}. Let Ω=max{(v):vS}+𝐢max{(v):vS}. Intuitively, the real and imaginary parts of Ω are the largest any activation in the network can have. Define hj(x)=Wjx+βj. Let β1=Ω𝟏n. Clearly, for all x𝒳 we have h1(x)0, so ReLUh1(x)=h1(x). More generally, for all j<n-1 define βj+1=𝟏nΩ-Wj+1βj. It is easy to see that for all j<n we have hjh1(x)=WjWj-1W1x+𝟏nΩ. This guarantees that for all j<n, hjh1(x)=ReLUhjReLUh1(x). Finally, define βL=b-ALβL-1. We have, ReLUhLReLUh1(x)=ReLU(WjW1x+b). ∎

Lemma 2.

Let N be a deep ReLU network of width n and depth L, and let XCn be a bounded set. For any ϵ>0, there exists a DCNN N of width n and of depth (2n-1)L such that N(x)-N(x)<ϵ for all xX.

Proof.

(Lemma 1) Assume 𝒩=fWL,bLfW1,b1. By theorem 1, for any ϵ>0, any matrix Wi, there exists a sequence of 2n-1 matrices Ci,nDi,n-1Ci,n-1Di,1Ci,1 such that j=0n-1Di,n-jCi,n-j-Wi<ϵ, where Di,1 is the identity matrix. By lemma 1, we know that there exists {βij}i[L],j[n] such that for all i[L], fDinCin,βinfDi1Ci1,βi1(x)=ReLU(DinCinCi1x+bi).

Now if ϵ tends to zero, fDinCin,βinfDi1Ci1,βi1-ReLU(Wix+bi) will also tend to zero for any x𝒳, because the ReLU function is continuous and 𝒳 is bounded. Let 𝒩=fD1nC1n,β1nfDi1Ci1,βi1. Again, because all functions are continuous, for all x𝒳, 𝒩(x)-𝒩(x) tends to zero as ϵ tends to zero. ∎

Corollary 1.

Bounded width DCNNs are universal approximators in the following sense: for any continuous function f:[0,1]nR+ of bounded supremum norm, for any ϵ>0, there exists a DCNN Nϵ of width n+3 such that x[0,1]n+3, |f(x1xn)-(Nϵ(x))1|<ϵ, where ()i represents the ith component of a vector.

Proof.

(Corollary 1) It has been shown recently in (Hanin, 2017) that for any continuous function f:[0,1]n+ of bounded supremum norm, for any ϵ>0, there exists a dense neural network 𝒩 with an input layer of width n, an output layer of width 1, hidden layers of width n+3 and ReLU activations such that x[0,1]n,|f(x)-𝒩(x)|<ϵ. From 𝒩, we can easily build a deep ReLU network 𝒩 of width exactly n+3, such that x[0,1]n+3, |f(x1xn)-(𝒩(x))1|<ϵ. Thanks to lemma 1, this last network can be approximated arbitrarily well by a DCNN of width n+3. ∎

Theorem 3.

(Rank-based expressive power of diagonal circulant neural networks)
Let N:fWL,bLfW1,b1 be a deep ReLU network of width n, depth L and a total rank k. Assume n is a power of 2. Let XCn be a bounded set. For any ϵ>0, there exists a DCNN N of width n such that N(x)-N(x)<ϵ for all xX. In addition, the depth of N is bounded by 9k. Moreover, if the rank of each matrix Ai divides n, then the depth of N is bounded by L+4k.

Proof.

(Theorem 3) Let k1kL be the ranks of matrices W1WL, which are n-by-n matrices. For all i, there exists ki{ki2ki} such that ki is a power of 2. Due to the fact that n is also a power of 2, ki divides n. By theorem 2, for all i each matrix Wi can be decomposed as an alternating product of diagonal-circulant matrices Bi,1Bi,4ki+1 such that Wi-Bi,1××Bi,4ki+1<ϵ. Using the exact same technique as in lemma 1, we can build a DCNN 𝒩 using matrices B1,1BL,4kL+1, such that 𝒩(x)-𝒩(x)<ϵ for all x𝒳. The total number of layers is i(4ki+1)L+8ikiL+8.total rank9.total rank.

Proposition 4.

Let N=fWL,bLfW1,b1 be a Deep ReLU network, where WiCn×n,biCn for all i[L]. Let W~i be the matrix obtained by an SVD approximation of rank k of matrix Wi. Let σi,j be the jth singular value of Wi. Define N~=fWL~,bLfW1~,b1. Then, for any xCn, we have:

𝒩(x)-𝒩~(x)(σmax,1L-1)Rσmax,kσmax,1-1

where R is an upper bound on norm of the output of any layer in N, and σmax,j=maxiσi,j.

Proof.

(Proposition 4) Let x0n and x~0=x0. For all i[L], define xi=ReLU(Wixi-1+b) and x~i=ReLU(Wi~x~i-1+b). By lemma 3, we have

xi-x~iσi,k+1xi-1+σi,1xi-1-x~i-1

Observe that for any sequence a0,a1 defined recurrently by a0=0 and ai=rai-1+s, the recurrence relation can be unfold as follows: ai=s(ri-1)r-1. We can apply this formula to bound our error as follows:

xl-x~l(σmax,1l-1)σmax,kmaxixiσmax,1-1

Lemma 3.

Let WCn×n with singular values σ1σn, and let x,x~Cn. Let W~ be the matrix obtained by a SVD approximation of rank k of matrix W. Then we have:

ReLU(Wx+b)-ReLU(W~x~+b)σk+1x+σ1x~-x
Proof.

(Lemma 3) Recall that W2=supzWz2z2=σ1=W~2, because σ1 is the greatest singular value of both W and W~. Also, note that W-W~2=σk+1. Let us bound the formula without ReLUs:

(Wx+b)-(W~x~+b) =(Wx+b)-(W~x~+b)
=Wx-W~x-W~(x~-x)
(W-W~)x+W~2x~-x
xσk+1+σ1x~-x

Finally, it is easy to see that for any pair of vectors a,bn, we have ReLU(a)-ReLU(b)a-b. This concludes the proof. ∎

Corollary 2.

Consider any deep ReLU network N=fWL,bLfW1,b1 of depth L and width n. Let σmax,j=maxiσi,j where σi,j is the jth singular value of Wi. Let XCn be a bounded set. Let k be an integer dividing n. There exists a DCNN N=fDmCm,bmfD1C1,b1 of width n and of depth m=L(4k+1), such that for any xX:

𝒩(x)-𝒩(x)<(σmax,1L-1)Rσmax,kσmax,1-1

where R is an upper bound on the norm of the outputs of each layer in N.

Proof.

(Corollary 2) Let 𝒩~=fWL~,bLfW1~,b1, where each W~i is the matrix obtained by an SVD approximation of rank k of matrix Wi. With Proposition 4, we have an error bound on 𝒩(x)-𝒩~(x). Now each matrix W~i can be replaced by a product of k diagonal-circulant matrices. By theorem 3, this product yields a DCNN of depth m=L(4k+1), strictly equivalent to 𝒩~ on 𝒳. The result follows. ∎

4 Proof of Subsection 4.2

Proposition 5.

Let N be a DCNN of depth L initialized according to our procedure, with σ=0. Assume that all layers 1 to L-1 have ReLU activation functions, and that the last layer has the identity activation function. Then, for any xRn, the covariance matrix of N(x) is 2.Idnx22. Moreover, note that this covariance does not depend on the depth of the network.

Proof.

(Proposition 5) Let 𝒩=fDL,CLfD1,C1 be a L layer DCNN. All matrices are initialized as described in the statement of the proposition. Let y=D1C1x. Lemma 4 shows that cov(yi,yi)=0 for ii and var(yi)=2nx22. For any jL, define zj=fDj,CjfD1,C1(x). By a recursive application of lemma 4, we get that then cov(zij,zij)=0 and var(zij)=2nx22. ∎

Lemma 4.

Let c1cn,d1dn,b1bn be random variables in R such that ciN(0,σ2), biN(0,σ2) and di{-1,1} uniformly. Define C=circ(c1cn) and D=diag(d1dn). Define y=DCu and z=CDu for some vector u in Rn. Also define y¯=y+b and z¯=z+b. Then, for all i, the p.d.f. of yi, y¯i, zi and z¯i are symmetric. Also:

  • Assume u1un is fixed. Then, we have for ii:

    cov(yi,yi) =cov(zi,zi)=cov(y¯i,y¯i)=cov(z¯i,z¯i)=0
    var(yi) =var(zi)=juj2σ2
    var(y¯i) =var(z¯i)=σ2+juj2σ2
  • Let x1xn be random variables in such that the p.d.f. of xi is symmetric for all i, and let ui=ReLU(xi). We have for ii:

    cov(yi,yi) =cov(zi,zi)=cov(y¯i,y¯i)=cov(z¯i,z¯i)=0
    var(yi) =var(zi)=12jvar(xi).σ2
    var(y¯i) =var(z¯i)=σ2+12jvar(xi).σ2
Proof.

(Lemma 4) By an abuse of notation, we write c0=cn,c-1=cn-1 and so on. First, note that: yi=j=1ncj-iujdj and zi=j=1ncj-iujdi. Observe that each term cj-iujdj and cj-iujdi have symmetric p.d.f. because of di and dj. Thus, yi and zi have symmetric p.d.f. Now let us compute the covariance.

cov(yi,yi) =j,j=1ncov(cj-iujdj,cj-iujdj)
=j,j=1n𝔼[cj-iujdjcj-iujdj]-𝔼[cj-iujdj]𝔼[cj-iujdj]

Observe that 𝔼[cj-iujdj]=𝔼[cj-iuj]𝔼[dj]=0 because dj is independent from cj-iuj. Also, observe that if jj then 𝔼[djdj]=0 and thus 𝔼[cj-iujdjcj-iujdj]=𝔼[djdj]𝔼[cj-iujcj-iuj]=0. Thus, the only non null terms are those for which j=j. We get:

cov(yi,yi) =j=1n𝔼[cj-iujdjcj-iujdj]
=j=1n𝔼[cj-icj-iuj2]

Assume u is a fixed vector. Then, var(yi)=j=1nuj2σ2 and cov(yi,yi)=0 for ii because cj-i is independent from cj-i.

Now assume that uj=ReLU(xj) where xj is a r.v. Clearly, uj2 is independent from cj-i and cj-i. Thus:

cov(yi,yi) =j=1n𝔼[cj-icj-i]𝔼[uj2]

For ii, then cj-i and cj-i are independent, and thus 𝔼[cj-icj-i]=𝔼[cj-i]𝔼[cj-i]=0. Therefore, cov(yi,yi)=0 if ii. Let us compute the variance. We get var(yi)=j=1nvar(cj-i).𝔼[uj2]. Because the p.d.f. of xj is symmetric, 𝔼[xj2]=2𝔼[uj2] and 𝔼[xj]=0. Thus, var(yi)=12j=1nvar(cj-i).𝔼[xj2]=12j=1nvar(cj-i).var(xj).

Finally, note that cov(y¯i,y¯i)=cov(yi,yi)+cov(bi,bi). This yields the covariances of y¯.

To derive cov(zi,zi) and cov(z¯i,z¯i) , the required calculus is nearly identical. We let the reader check by himself/herself. ∎

5 Additional Empirical Evaluation

5.1 Experimental Setup

Datasets:

We conducted our experiments using three datasets. The first series of experiments (for Q1 and Q2) were conducted with a synthetic dataset and on the CIFAR-10 image classification dataset (Krizhevsky & Hinton, 2009). The synthetic dataset has been created in order to reproduce the experiment on the regression linear problem proposed by Moczulski et al. (2015). We draw X, Y and W from a uniform distribution between [-1, +1] and ϵ from a normal distribution with mean 0 and variance 0.01. The relationship between X and Y is define by Y=XW+ϵ.

The final experiments (Q3) were conducted on the YouTube-8M video dataset. The YouTube-8M dataset (Abu-El-Haija et al., 2016a) is a very large video classification dataset with over 3.8 million training examples and 3800 classes. For each video, 300 images and audio samples are extract and processed by an Inception Network. The YouTube-8M dataset consists of the 300 video vectors of shape 1024 and 300 audio vectors of shape 128. The first set of experiments (Section 5.3.1) use this dataset with a mean aggregation and concatenation of all video and audio input.

Architectures & Hyper-Parameters:

The first set of experiments use full connected neural networks with the dense matrices replaced by structured ones (Diagonal circulant, Diagonal Toeplitz, Toeplitz, LDR, etc.). For all our experiments, we train all networks for 200 epochs, a batch size of 200, Leaky ReLU activation with a different slope. We minimize the Cross Entropy Loss with Adam optimizer and use a piecewise constant learning rate of 5×10e-5, 2.5×10e-5, 5×10e-6 and 1×10e-6 after respectively 40000, 60000 and 80000 steps.

\tikzset

>=Latex[width=2mm,length=2mm], base/.style = rectangle, draw=black, text centered, font=, box/.style = base, rounded corners, text depth=3cm, minimum height=4cm, minimum width=3cm, transparent/.style = rectangle, draw=black, circulant/.style = base, fill=yellow!30, embedding/.style = base, fill=blue!30, minimum width=2.5cm, minimum height=1cm, other/.style = base, fill=white!30, minimum width=2cm, minimum height=1cm, fc/.style = base, fill=orange!30, minimum width=1.5cm, minimum height=1cm, gating/.style = base, fill=green!30, minimum width=2cm, text width=2cm, minimum height=1cm, moe/.style = base, fill=purple!30, minimum width=1.5cm, minimum height=1cm,

{tikzpicture}

[every node/.style=fill=white, font=, align=center]

\draw

(0.0, +2.) node [other, draw=none] Embedding; \draw(+3.7, +2.) node [other, draw=none] Dim Reduction; \draw(+8.0, +2.) node [other, draw=none] Classification;

\draw

(0, +0.8) node [embedding] Video; \draw(0, -0.8) node [embedding] Audio;

\draw

(+2.5, +0.8) node (fc) [fc] FC; \draw(+2.5, -0.8) node (fc) [fc] FC;

\draw

(+4.75, 0) node (fc) [other] concat; \draw(+7.0, 0) node (moe) [moe] MoE; \draw(+9.25, 0) node (gating2) [gating] Context Gating;

\draw

(+1.5, +2) [dashed] – (+1.5, -1.7); \draw(+6, +2) [dashed] – (+6, -1.7);

Figure 5.1: This figure shows the state-of-the-art neural network architecture, initially proposed by Abu-El-Haija et al. (2016b) and later improved by Miech et al. (2017), used in our experiment.

For the YouTube-8M dataset experiments, we built a neural network based on the state-of-the-art architecture initially proposed by Abu-El-Haija et al. (2016b) (see Figure 5.1) and later improved by Miech et al. (2017). Remark that no convolution layer is involved in this application since the input vectors are embeddings of video frames processed using state-of-the-art convolutional neural networks trained on ImageNet.

We trained our models with the CrossEntropy loss and used Adam optimizer with a 0.0002 learning rate and a 0.8 exponential decay every 4 million examples. All fully connected layers are composed of 512 units. DBoF, NetVLAD and NetFV are respectively 8192, 64 and 64 of cluster size for video frames and 4096, 32, 32 for audio frames. We used 4 mixtures for the MoE Layer. We used all the available 300 frames for the DBoF embedding. In order to stabilize and accelerate the training, we used batch normalization before each non linear activation and gradient clipping.

Unless otherwise mentioned, we consider fully DCNNs: we replace all dense layers with diagonal circulant layers. This is an important distinction from the previous experiments from (Moczulski et al., 2015) that were typically ran on a large VGG network with only few layers replaced with structured ones. Since these networks are generally highly redundant, the contribution of the structured layer is difficult to quantify.

Reproducibility of experiments: All our experiments are developed with TensorFlow version 1.12 (Abadi et al., 2015). The code is available as Supplemental Material and will be open sourced upon acceptance of the paper. The archive contains a readme file containing a small documentation on how to run the experiments, a configuration file which defines the architecture and the hyper-parameters of the experiments, python scripts which generate a bash command to run the experiments. The code contains DCNN, DTNN, Toeplitz networks, ACDC networks, low rank networks and the YouTube-8M networks. The source code for LDR can be found online22 2 https://github.com/HazyResearch/structured-nets.

Motivation for using real neural networks instead of complex ones:

We choose to train real neural networks because the complex version of the same networks would use twice the number of parameters. As such for equal number of parameters the complex one would have half the number of layers. We made a comparison between the two and found that the real networks with double the number of layer achieve better results (see Table3)

Table 3: Accuracy on the CIFAR-10 dataset with complex and real diagonal circulant networks.
Layers #Weights Accuracy
Complex 2 43 048 0.520
4 79 912 0.523
5 98 344 0.524
Real 2 21 524 0.536
5 49 172 0.543
10 95 252 0.544

5.2 Additional results on the YouTube-8M dataset

Figure 5.2 present the validation GAP given the number of epochs on the YouTube-8M dataset. We observe that the model with the fully connected block converted to diagonal circulant layer achieve the same GAP score of the dense model while converging faster.

{tikzpicture}

[scale=0.6] {axis}[ width=0.85height=\axisdefaultheight, title= Comparison of the effect of compactness over different layers with the base model , legend columns=2, legend style=fill=white, draw=black, /tikz/column 2/.style= column sep=5pt, ,, legend cell align=left, xlabel=Epochs, ylabel=Validation GAP, xmin=0, xmax=7, ymin=0.77, ymax=0.87, xtick=0,1,2,3,4,5,6,7, ytick=0.77,0.78,0.79,0.8,0.81,0.82,0.83,0.84,0.85,0.86,0.87, legend pos=north west, ymajorgrids=true, grid style=dashed, ] \addplot[color=red] table [y=gap, x=epoch]layers/dense.dat; \addplot[color=yellow] table [y=gap, x=epoch]layers/compact_dbof.dat; \addplot[color=blue] table [y=gap, x=epoch]layers/compact_fc.dat; \addplot[color=green] table [y=gap, x=epoch]layers/compact_moe.dat; \legend Dense Model, Model with compact DBoF, Model with compact FC, Model with compact MoE,

Figure 5.2: Validation GAP according to the number of epochs for different compact models.

5.3 Comparison with other compression based approaches

For completeness, we provide a comparison with other compression based approaches. Table 4 shows the test error of DCNN against other know compression techniques on the MNIST datasets. We can observe that DCNN outperform easily HashNet (Chen et al., 2015) and Dark Knowledge (Hinton et al., 2015) with fewer number of parameters. The architecture with Fast Food (FF) (Yang et al., 2015) achieves better performance but with convolutional layers and only 1 Fast Food Layer as the last Softmax layer.

Table 4: Comparison with compression based approaches
Architecture Settings #Params Error (%)
LeNet (Lecun et al., 1998) - 4 257 674 0.61
DCNN 8 DC layers 25 620 1.74
10 DC layers 31 764 1.60
Fast Food (FF) (Yang et al., 2015) Conv + FF 1024 Softmax layer 38 821 0.71
Conv + FF 2048 Softmax Layer 52 124 0.71
HashNet (Chen et al., 2015) 3 layers, 1/64 compress. factor 46 875 2.79
5 layers, 1/64 compress. factor 78 125 1.99
Dark Knowledge (Hinton et al., 2015) 3 layers, 1/64 compress.factor 46 875 6.32
5 layers, 1/64 compress. factor 78 125 2.16