Deep Network classification by Scattering and Homotopy dictionary learning

  • 2019-10-08 17:47:44
  • John Zarka, Louis Thiry, Tomás Angles, Stéphane Mallat
  • 8


We introduce a sparse scattering deep convolutional neural network, whichprovides a simple model to analyze properties of deep representation learningfor classification. Learning a single dictionary matrix with a classifieryields a higher classification accuracy than AlexNet over the ImageNetILSVRC2012 dataset. The network first applies a scattering transform whichlinearizes variabilities due to geometric transformations such as translationsand small deformations. A sparse l1 dictionary coding reduces intra-classvariability while preserving class separation through projections over unionsof linear spaces. It is implemented in a deep convolutional network with ahomotopy algorithm having an exponential convergence. A convergence proof isgiven in a general framework including ALISTA. Classification results areanalyzed over ImageNet.


Quick Read (beta)

Deep Network Classification by Scattering and Homotopy Dictionary Learning

John Zarka, Louis Thiry, Tomás Angles
Département d’informatique, ENS, CNRS, PSL University, Paris, France
&Stéphane Mallat
Collège de France, Paris, France
ENS, PSL University, Paris, France
Flatiron Institute, New York, USA
[email protected]

We introduce a sparse scattering deep convolutional neural network, which provides a simple model to analyze properties of deep representation learning for classification. Learning a single dictionary matrix with a classifier yields a higher classification accuracy than AlexNet over the ImageNet ILSVRC2012 dataset. The network first applies a scattering transform which linearizes variabilities due to geometric transformations such as translations and small deformations. A sparse 𝐥𝟏 dictionary coding reduces intra-class variability while preserving class separation through projections over unions of linear spaces. It is implemented in a deep convolutional network with a homotopy algorithm having an exponential convergence. A convergence proof is given in a general framework including ALISTA. Classification results are analyzed over ImageNet.

Deep Network Classification by Scattering and Homotopy Dictionary Learning

John Zarka, Louis Thiry, Tomás Angles
Département d’informatique, ENS, CNRS, PSL University, Paris, France
Stéphane Mallat
Collège de France, Paris, France
ENS, PSL University, Paris, France
Flatiron Institute, New York, USA
[email protected]

1 Introduction

Deep convolutional networks have spectacular applications to classification and regression (LeCun et al., 2015), but they are a black box which are hard to analyze mathematically because of their architecture complexity. We introduce a simplified convolutional neural network illustrated in Figure 1, whose learning can be reduced to a single dictionary matrix and a classifier. Despite its simplicity, it applies to complex image classification and reaches a higher accuracy than AlexNet (Krizhevsky et al., 2012) over ImageNet ILSVRC2012. It is a cascade of well understood mathematical operators, and thus provides a simplified mathematical framework to analyze classification performances.

Figure 1: A sparse scattering network is composed of a scattering transform S followed by an optional linear operator L which reduces its dimensionality. A sparse code approximation of scattering coefficients is computed in a dictionary D. The dictionary D and the classifier are jointly learned by minimizing the classification loss with a stochastic gradient descent.

Intra-class variabilities due to geometric image transformations such as translations or small deformations are linearized by a scattering transform (Bruna and Mallat, 2013) which is invertible. Scattering transforms include no learning. They are effective representations to classify relatively simple images such as digits in MNIST, textures (Bruna and Mallat, 2013) or small CIFAR images (Oyallon and Mallat, 2014). Learning deep convolutional networks however gives a much higher accuracy over complex databases such as ImageNet. A fundamental issue is to understand the source of this improvement. This paper shows that it can be captured by a sparse 𝐥𝟏 code in a dictionary D optimized by supervised learning. It is implemented with a deep convolutional network architecture. The sparse code eliminates non-informative image components and projects each class in unions of linear spaces. The classification accuracy is considerably improved and goes beyond AlexNet over ImageNet 2012.

Dictionary learning for classification was introduced in Mairal et al. (2009) and implemented with deep convolutional neural network architectures by several authors (Sulam et al., 2018; Mahdizadehaghdam et al., 2019; Sun et al., 2018). These algorithms have been applied to simpler image classification problems such as MNIST or CIFAR but no results were published on large datasets such as ImageNet on which they do not seem to scale. This is due to their complexity and the need to cascade several sparse codes, which leads to complex structures. We show that a single dictionary learning is sufficient if applied to scattering coefficients as opposed to raw data. A major issue is to compute the sparse code with a small network. We introduce a new architecture based on homotopy continuation, which leads to exponential convergence. It is thus implemented in a small convolutional network. The ALISTA (Liu et al., 2019) sparse code is incorporated in this framework. The main contributions of the paper are summarized below:

  • A Sparse Scattering network architecture, illustrated in Figure 1, where the classification is performed over a sparse code in a learned dictionary of scattering coefficients. It outperforms AlexNet over ImageNet 2012.

  • A new dictionary learning algorithm with homotopy sparse coding, optimized by gradient descent in a deep convolutional network.

  • A proof of exponential convergence of ALISTA (Liu et al., 2019) in presence of noise.

We explain the implementation and mathematical properties of each element of the sparse scattering network. Section 2 briefly reviews multiscale scattering transforms. Section 3 introduces homotopy dictionary learning for classification, with a proof of exponential convergence under appropriate assumptions. Section 4 analyzes image classification results of sparse scattering networks on ImageNet 2012.

2 Scattering Transform

A scattering transform is a cascade of wavelet transforms and ReLU or modulus non-linearities. It can be interpreted as a deep convolutional network with predefined wavelet filters (Mallat, 2016). For images, wavelet filters are calculated from a mother complex wavelet ψ whose average is zero. It is rotated by r-θ, dilated by 2j and its phase is shifted by α:


We choose a Morlet wavelet as in Bruna and Mallat (2013) to produce a sparse set of non-negligible wavelet coefficients. A ReLU is written ρ(a)=max(a,0).

Scattering coefficients of order m=1 are computed by averaging rectified wavelet coefficients with a subsampling stride of 2J:


where ϕJ is a Gaussian dilated by 2J (Bruna and Mallat, 2013).

The averaging by ϕJ eliminates the variations of ρ(xψj,θ,α) at scales smaller than 2J. This information is recovered by computing their variations at all scales 2j<2J, with a second wavelet transform. Scattering coefficients of order two are:


To reduce the dimension of scattering vectors, we define phase invariant second order scattering coefficients with a complex modulus instead of a phase sensitive ReLU:


The scattering representation includes order 1 coefficients and order 2 phase invariant coefficients. In this paper, we choose J=4 and hence 4 scales 1jJ, 8 angles θ and 4 phases α on [0,2π]. Scattering coefficients are computed with the software package Kymatio (Andreux et al., 2018). They preserve the image information and x can be recovered from Sx (Oyallon et al., 2019). For computational efficiency, the dimension of scattering vectors can be reduced by a factor 6 with a linear operator L which preserves the ability to recover a close approximation of x from LS(x). The dimension reduction operator L of Figure 1 is computed by preserving the principal directions of a PCA calculated on the training image databasis, or is optimized by gradient descent together with the other network parameters.

The scattering transform is Lipschitz continuous to translations and deformations (Mallat, 2012). Intra-class variablities due to translations and deformations smaller than 2J are linearized. Good classification accuracies are obtained with a linear classifier over scattering coefficients in image databases where intra-class variabilities are dominated by translations and deformations. This is the case for digits in MNIST or texture images (Bruna and Mallat, 2013). However it does not take into account variabilities of pattern structures and clutter which dominate complex image databases. To remove this clutter while preserving class separation requires some form of supervised learning as in deep convolutional networks. When applied to raw image data, dictionary learning often computes wavelet-like filters as in the first layer of deep neural networks (Krizhevsky et al., 2012). This is not sufficient to obtain high classification accuracy over complex image databases. The sparse scattering network of Figure 1 computes a sparse code of scattering representation β=LS(x), in a dictionary D optimized by minimizing the classification loss. For this purpose, the next section introduces a homotopy dictionary learning algorithm, implemented in a small convolutional network.

3 Homotopy Dictionary Learning for Classification

Task-driven dictionary learning for classification with sparse coding was proposed in Mairal et al. (2011). We introduce a small convolutional network architecture to implement a sparse 𝐥𝟏 code and learn the dictionary with a homotopy continuation on thresholds. ALISTA (Liu et al., 2019) is also shown to be a homotopy sparse coding whose exponential convergence is proved under more general conditions. Next section reviews dictionary learning for classification. Homotopy sparse coding algorithms are studied in Section 3.2.

3.1 Dictionary Learning

Unless specified, all norms are Euclidean norms. A sparse code approximates a vector β with a linear combination of a minimum number of columns Dm of a dictionary matrix D, which are normalized Dm=1. A sparse code is a vector α0 of minimum support which has a bounded error Dα0-βσ. Such sparse codes have been used to optimize signal compression and to remove noise, to solve inverse problems in compressive sensing (Candes et al., 2006), and for classification (Mairal et al., 2011).

Minimizing the support of a code α amounts to minimizing its 𝐥𝟎 “norm” which is not convex. This non-convex optimization is convexified by replacing the 𝐥𝟎 norm by an 𝐥𝟏 norm α1=m|α(m)|. It is solved by minimizing a convex Lagrangian with a multiplier λ* which depends on the error bound Dα-βσ:

α1(β)=argminα12Dα-β2+λ*α1. (1)

The sparse code α1(β,D,λ*) also depends upon the dictionary D and λ*, we omit these two last variables in the equation above for readability. One can prove (Donoho and Elad, 2006) that α1 has the same support as the minimum support sparse code α0 if the support size s and the dictionary coherence satisfy:

sμ(D)<1/2whereμ(D)=maxmm|DmtDm|. (2)
Sparse approximation versus sparse code

Sparse coding was first introduced for denoising (Donoho and Elad, 2006). The sparse approximation Dα1(β) is a non-linear filtering which preserves the “signal” components of β represented by few large amplitude coefficients. It eliminates the “noise” corresponding to incoherent components of β whose correlations with all dictionary vectors Dm are below λ*. It can also be interpreted as a projection in a union of linear spaces, each of which corresponding to a sparse code support.

For classification, we need to reduce intra-class variabilities and preserve or increase class separability. Intra-class variabilites may be interpreted as “noise” for the classification whereas image transformations from one class to another correspond to the “signal” we want to preserve. By defining sparse representations of training vectors βi=LS(xi) with different supports for different classes, it projects each class in different unions of linear spaces, which reduces intra-class variabilites while preserving separation. The dictionary learning optimizes the choice of D to obtain sparse codes with discriminative supports.

The classification is usually performed from the sparse code α1(β). We will see that a classification applied on the reconstructed sparse approximation Dα1(β) has nearly the same accuracy. Indeed, the linear operator D can preserve separated linear spaces.

Dictionary learning by gradient descent

Given a set of inputs and labels {xi,yi}, task-driven dictionary learning minimizes a classification loss (α1(xi,D,λ*),yi,W) that takes as input the sparse code α1(xi,D,λ*) of the input xi, the label yi and the classification parameters W. Thus, the loss depends upon the dictionary D, the Lagrange multiplier λ* which adjusts the sparsity level, and the classification parameters W. All these parameters can be jointly optimized by stochastic gradient descent to minimize the loss. This requires to compute the sparse code α1(xi,D,λ*) and its derivatives w.r.t D and λ*, which can be done by implementing the sparse coding in a deep convolutional network where the sparse code α1 is computed in the forward pass and the derivatives of α1 w.r.t D and λ* are computed in the backward pass. For this purpose, next section introduces a homotopy iterated soft thresholding network architecture.

3.2 Homotopy Iterated Soft Thresholding Network

This section introduces an efficient convolutional network architecture to compute sparse codes and learn dictionaries. Iterative Soft-Thresholding Algorithms (ISTA) (Daubechies et al., 2004), and FISTA (Beck and Teboulle, 2009) can be implemented with deep neural networks but they require many layers because of their slow convergence. LISTA algorithm (Gregor and LeCun, 2010) and its more recent version ALISTA (Liu et al., 2019) accelerate this convergence by introducing an auxiliary matrix which is adapted to the statistics of the input and to the properties of the dictionary. For ALISTA, it leads to exponential convergence under appropriate hypotheses. However, we shall see that this auxiliary matrix prevents from using this approach to learn a dictionary which minimizes a classification loss with a sparse 𝐥𝟏 code. We introduce a dictionary learning based on a homotopy Iterated Soft Thresholding Continuation (Jiao et al., 2017), which has the same exponential convergence without an auxiliary matrix. We shall see that ALISTA can also be considered as a homotopy continuation algorithm. We give a proof of exponential convergence for non-zero Lagrange multipliers λ* in this general framework.

Iterated Soft Thresholding

ISTA alternates a gradient step on the quadratic term of the 𝐥𝟏 Lagrangian (1) and a soft-thresholding Tλ(a)=sign(a)max(|a|-λ,0):

αn+1=Tϵλ*(αn+ϵDt(β-Dαn))withϵ<1DtD2,2, (3)

where .2,2 is the spectral norm and α0=0. The first iteration computes a non-sparse code α1=Tϵλ*(ϵDtβ)=ϵTλ*(Dtβ) which is progressively sparsified through iterated thresholdings. After n iterations, the sparse code αn has an error in O(n-1). FISTA (Beck and Teboulle, 2009) accelerates the error decay to O(n-2), which remains slow. Each iteration of ISTA and FISTA is computed with linear operators and a soft thresholding and can thus be implemented with one layer in a deep network (Papyan et al., 2017). However, the total number N of layers must be large to achieve a small error, and it requires to compute spectral norms during training, which is slow.

Figure 2: A homotopy iterated soft thresholding network computes an 𝐥𝟏 sparse code with exponentially decreasing thresholds λn=λmaxγ-n with γ=(λmax/λ*)1/N.
Homotopy Iterated Thresholding and ALISTA

Homotopy continuation algorithms introduced in Osborne et al. (2000), minimize the 𝐥𝟏 Lagrangian (1) by progressively decreasing the Lagrange multiplier. This optimization path is opposite to ISTA and FISTA since it goes from a very sparse initial solution towards a less sparse but optimal one, similarly to matching pursuit algorithms (Davis et al., 1997; Donoho and Tsaig, 2008). Homotopy algorithms are particularly efficient if the final Lagrange multiplier λ* is large so that the optimal solution is very sparse. We shall see that it is the case for classification.

The homotopy Iterative Soft-Thresholding Continuation (ISTC) of Jiao, Jin and Lu (Jiao et al., 2017) algorithm adjusts the decay rate of an exponentially decreasing sequence of Lagrange multipliers λn for nN:

αn+1=Tλn+1(αn+Dt(β-Dαn))withλn=λmaxγ-nandγ=(λmaxλ*)1/N. (4)

After N iterations, they prove that αN has the same support as the optimal sparse code α0, if λmaxDtβ, if the dictionary coherence condition (2) is satisfied, and if γ is sufficiently close to 1. Figure 2 illustrates the implementation of this sparse coding algorithm in a deep network of depth N, with side connections. For image classification we use a convolutional translation invariant dictionary, which defines a deep convolutional network. This convolutional network is used to compute the sparse code of scattering coefficients LS(x) in Figure 1.

ALISTA can be considered as a generalization of the homotopy ISTC algorithm, which replaces Dt by an auxiliary matrix Wt. We shall also study whether this flexibility can improve results. Each column Wm of W is normalized by |WmtDm|=1. The iteration (4) is thus rewritten

αn+1=Tλn+1(αn+Wt(β-Dαn))withλn=λmaxγ-nandγ=(λmaxλ*)1/N. (5)

The following theorem extends the convergence result of homotopy ISTC algorithm, by replacing the coherence of D by the mutual coherence of W and D


This theorem also extends the ALISTA exponential convergence result in the general setting where the sparse code introduces a reconstruction error, which may be interpreted as a noise removal. We will see that this error can be large for image classification applications because it corresponds to non-informative clutter removal.

Theorem 3.1.

Let α0 be the l0 sparse code of β with error β-Dα0σ. If its support s satisfies

sμ~<1/2 (6)

then soft-thresholding iterations (5) with thresholds

λn=λmaxγ-nλ*=Wt(β-Dα0)1-2γμ~s𝑤𝑖𝑡ℎ1<γ<(2μ~s)-1𝑎𝑛𝑑λmaxWtβ (7)

define a sparse code αn, whose support is included in the support of α0 and

αn-α02λmaxγ-n. (8)

The proof is in Appendix A of the supplementary material. It adapts the convergence proof of ISTC to the more general ALISTA framework. When W=D, we recover the convergence result of the homotopy ISTC, and when λ*=0 we recover the ALISTA exponential convergence result. However, one should not get too impressed by this exponential convergence rate because the condition sμ~<1/2 only applies to very sparse codes in highly incoherent dictionaries. ALISTA optimizes W in order to minimize the mutual coherence μ~, but it is usually not possible to reach sμ~<1/2. It thus restricts the set of possible signals and dictionaries, as opposed to ISTA and FISTA algorithms whose convergence is guaranteed for any signal and dictionary. However, the condition sμ~<1/2 is based on a brutal upper bound calculation in the proof, and it is not necessary for convergence. Next section shows that for image classification over ImageNet, by setting W=D we learn a dictionary where the homotopy ISTC algorithm converges exponentially although the theorem hypothesis is not satisfied. By learning simultaneously W and D, we shall see that we can reduce the classification loss but the resulting algorithm does not converge to a sparse 𝐥𝟏 code anymore.

4 Image Classification

The goal of this work is to construct a deep neural network model which is sufficiently simple to be interpreted mathematically, while reaching a level of accuracy of more complex deep convolutional networks on complex classification problems. This is why we concentrate on ImageNet as opposed to MNIST or CIFAR. Next section compares its performance to state of the art deep networks, and analyzes the influence of different architecture components. Section 4.2 studies the exponential convergence of the homotopy ISTC sparse coding network in comparison with ISTA, FISTA and a flexible ALISTA.

4.1 Image Classification on ImageNet

We show that a sparse dictionary learning on scattering coefficients considerably improves the classification performance on S(x) and can outperform AlexNet accuracy.

ImageNet ILSVRC2012 is a challenging color image dataset of 1.2 million training images and 50,000 validation images, divided into 1000 classes. Prior to convolutional networks, SIFT representations combined with Fisher vector encoding reached a Top 5 classification accuracy of 74.3% with multiple model averaging (S nchez and Perronnin, 2011). In their PyTorch implementation, the Top 5 accuracy of AlexNet and ResNet-152 is 79.1% and 94.1% respectively11 1 Accuracies from .

Figure 3: Two variants of the image classification architecture: ISTC α inputs the sparse code α into the classifier, ISTC Dα inputs the reconstruction Dα into the classifier.

The scattering transform S(x) at a scale 2J=16 of an ImageNet color image is a spatial array of 14×14 of 1539 channels. Applying to S(x) an MLP classifier with 2 hidden layers of size 4096, ReLU and dropout like in AlexNet gives a 60.7% Top 5 accuracy. Applying to S(x) a 3-layer SLE network of 1x1 convolutions with ReLU with the same MLP reaches AlexNet performance (Oyallon et al., 2017). However, there is no mathematical understanding of the operations performed by these three layers, and the origin of the improvements.

The sparse scattering architecture is described in Figure 3. The convolutional operator L is applied on a standardized scattering transform and reduces the number of scattering channels from 1539 to 256. The sparse code is calculated with a 1×1 convolutional dictionary D having 2048 vectors. It takes as input an array LS(x) of 14×14×256 which has been normalized and outputs a code α1 of size 14×14×2048 or a sparse approximation Dα1 of size 14×14×256. Either is provided as input to the MLP classifier. The ISTC network illustrated in Figure 2 has N=12 layers with softshrink non-linearities and no batch normalization. Before the classifier, there is a batch normalization and a 5×5 average pooling. The MLP classifier has 2 hidden layers of size 4096, ReLU and dropout rate of 0.3. The supervised learning jointly optimizes L, the dictionary D with the Lagrange multiplier λ* and the MLP classifier. It is done with a stochastic gradient descent during 120 epochs using an initial learning rate of 0.01 with a decay of 0.1 at epochs 50 and 100. With a sparse code in input of the MLP, it has a Top 5 accuracy of 80.9%, outperforming AlexNet. If we replace the ISTC network by an ALISTA network, the accuracy improves to 83.7%. However, next section shows that contrarily to ISTC, an ALISTA network optimized for classification does not compute a sparse 𝐥𝟏 code and is therefore not mathematically interpretable. In the following we thus concentrate on the homotopy ISTC network.

Table 1: Top 1 and Top 5 accuracy on ImageNet for Fisher Vectors (Perronnin and Larlus, 2015), AlexNet (Krizhevsky et al., 2012), ResNet 152 (He et al., 2016), Scattering with SLE (Oyallon et al., 2019), Scattering alone, with a sparse code α1 or with a classification on Dα1.
Fisher AlexNet ResNet Scat. + Scat. Scat.+ Scat.+ Scat.+
Vectors 152 SLE alone ISTC α ISTC Dα ALISTA α
Top1 55.6 56.5 78.3 57.0 37.5 59.0 54.8 62.6
Top5 78.4 79.1 94.1 79.6 60.7 80.9 77.8 83.7

The dimension reduction operator L has a marginal effect in terms of performance. If we eliminate it or if we replace it by an unsupervised PCA dimension reduction, the performance drops by less than 2%, whereas the accuracy drops by 20% if we eliminate the sparse coding. The considerable improvement brought by the sparse code is further amplified if the MLP classifier is replaced by a linear classifier. A linear classifier on a scattering vector has a (Top 1, Top 5) accuracy of (23.4%,41.8%). With an ISTC sparse code in a learned dictionary the accuracy jumps to (51.5%,73.4%) and hence improves by more than 30%.

If the MLP classification is applied to the sparse approximation Dα1 as opposed to the sparse code α1 then the accuracy drops only by 3%. The sparse approximation Dα1 of LS(x) has a small dimension 14×14×256 similar to AlexNet last convolutional layer output and is not sparse. This indicates that it is not the individual sparse outputs of the sparse code α1 which are important but the linear space defined by their support, which are mapped to other linear spaces by D.

The optimization learns a large factor λ* which yields a large approximation error LS(x)-Dα1/LS(x)0.5. The resulting code α1 is very sparse with about 3% non-zero coefficients. The sparse approximation Dα1 thus eliminates nearly half of the energy of LS(x) which can be interpreted as non-informative "clutter" removal. The sparse code α1 is a projection of LS(x) over a linear space defined by the support of α1. If a column Dm is interpreted as a "scattering space feature" then this linear space is a conjunction of a particular set of such features. The high classification accuracy indicates that different linear spaces correspond mostly to different classes. These linear spaces are mapped by D into lower dimensional linear spaces which remain separated. It thus indicates that D is optimized to preserves discriminative directions which transform a vector of one class into a vector of another one.

4.2 Convergence of Homotopy Algorithms

To guarantee that the network is mathematically interpretable we verify numerically that the homotopy ISTC algorithm computes an accurate approximation of the optimal 𝐥𝟏 sparse code in (1), with a small number of iterations (typically 12).

Figure 4: Value of (αn)=12Dαn-β2+λ*αn1 versus the number of iterations n, for ISTC, ISTA and FISTA on the left, and for ALISTA, ISTA and FISTA on the right.

The Theorem 3.1 guarantees an exponential convergence if sμ(D)<1/2. In our classification setting, the theorem hypothesis is clearly not satisfied : sμ(D)60, which is well above 1/2. However, this condition is not necessary and based on a relatively crude upper bound.

Figure 4 left shows numerically that ISTC algorithm minimizes the 𝐥𝟏 Lagrangian (α)=12Dα-β2+λ*α1, with an exponential convergence which is faster than ISTA and FISTA over the dictionary that it learns. On the contrary, Figure 4 right shows that ALISTA does not minimize the 𝐥𝟏 Lagrangian at all. This comes from the fact that contrarily to standard ALISTA (Liu et al., 2019), we do not impose that the auxiliary matrix W has a minimum joint coherence with the dictionary D. It would require too much computation and the matrix W is rather optimized to minimize the classification loss. This is why it improves the classification accuracy but does not compute a sparse 𝐥𝟏 code.

To further compare the convergence speed of ISTC versus ISTA and FISTA, we compute the relative mean square error MSE(x,y)=x-y2/x2 between the optimal sparse code α1 and the sparse code output of 12 iterations of each of these three algorithms. The MSE is 0.02 for ISTC, 0.25 for FISTA and 0.46 for ISTA, which shows that ISTC reduces the error by a factor 10 compared to ISTA and FISTA after 12 iterations.

5 Conclusion

The first goal of this work is to define a deep neural network having a good accuracy for complex image classification and which can be analyzed mathematically. This sparse scattering network learns the representation by optimizing a sparse code computed with a dictionary learned over scattering coefficients. The dictionary learning is implemented with a new homotopy ISTC network having an exponential convergence. The sparse dictionary learning improves accuracy by more than 20% over a scattering representation alone, and has a higher accuracy than AlexNet. The dictionary seems to be optimized in order to build separated sparse codes for each class, which belong to unions of linear spaces. Because the network operators are mathematically well specified, the analysis of its properties is simpler than for standard deep convolutional networks. However, more work is needed to understand the dictionary optimization and how it relates to image and class properties.


This work was supported by the ERC InvariantClass 320959 and grants from Région Ile-de-France. We thank the Scientific Computing Core at the Flatiron Institute for the use of their computing resources. We would like to thank Eugene Belilovsky for helpful discussions and comments.


  • M. Andreux, T. Angles, G. Exarchakis, R. Leonarduzzi, G. Rochette, L. Thiry, J. Zarka, S. Mallat, J. Andén, E. Belilovsky, J. Bruna, V. Lostanlen, M. J. Hirn, E. Oyallon, S. Zhang, C. E. Cella, and M. Eickenberg (2018) Kymatio: scattering transforms in python. CoRR. External Links: Link Cited by: §2.
  • A. Beck and M. Teboulle (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences 2 (1), pp. 183–202. Cited by: §3.2, §3.2.
  • J. Bruna and S. Mallat (2013) Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), pp. 1872–1886. Cited by: §1, §2, §2, §2.
  • E.J. Candes, J. Romberg, and T. Tao (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52 (2), pp. 489–509. Cited by: §3.1.
  • I. Daubechies, M. Defrise, and C. De Mol (2004) An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics 57 (11), pp. 1413–1457. Cited by: §3.2.
  • G. Davis, S. Mallat, and M. Avellaneda (1997) Adaptive greedy approximations. Constr. Approx. 13 (1), pp. 57–98. Cited by: §3.2.
  • D.L. Donoho and M. Elad (2006) On the stability of the basis pursuit in the presence of noise. Signal Processing 86 (3), pp. 511–532. Cited by: §3.1, §3.1.
  • D.L. Donoho and Y. Tsaig (2008) Fast solution of l1-norm minimization problems when the solution may be sparse. IEEE Trans. Information Theory 54 (11), pp. 4789–4812. Cited by: §3.2.
  • K. Gregor and Y. LeCun (2010) Learning fast approximations of sparse coding. In ICML, pp. 399–406. Cited by: §3.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Table 1.
  • Y. Jiao, B. Jin, and X. Lu (2017) Iterative soft/hard thresholding with homotopy continuation for sparse recovery. IEEE Signal Processing Letters 24 (6), pp. 784–788. Cited by: §3.2, §3.2.
  • A. Krizhevsky, I. Sutskever, and G.E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1106–1114. Cited by: §1, §2, Table 1.
  • Y. LeCun, Y. Bengio, and G.E. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • J. Liu, X. Chen, Z. Wang, and W. Yin (2019) ALISTA: analytic weights are as good as learned weights in LISTA. In International Conference on Learning Representations, Cited by: 3rd item, §1, §3.2, §3, §4.2.
  • S. Mahdizadehaghdam, A. Panahi, H. Krim, and L. Dai (2019) Deep dictionary learning: a parametric network approach. IEEE Transactions on Image Processing 28 (10), pp. 4790–4802. Cited by: §1.
  • J. Mairal, F. Bach, and J. Ponce (2011) Task-driven dictionary learning. IEEE transactions on pattern analysis and machine intelligence 34 (4), pp. 791–804. Cited by: §3.1, §3.
  • J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. Bach (2009) Supervised dictionary learning. In Advances in neural information processing systems, pp. 1033–1040. Cited by: §1.
  • S. Mallat (2012) Group invariant scattering. Comm. Pure Appl. Math. 65 (10), pp. 1331–1398. Cited by: §2.
  • S. Mallat (2016) Understanding deep convolutional networks. Phil. Trans. of Royal Society A 374 (2065). Cited by: §2.
  • M.R. Osborne, B. Presnell, and B.A. Turlach (2000) A new approach to variable selection in least squares problems. IMA journal of numerical analysis 20 (3), pp. 389. Cited by: §3.2.
  • E. Oyallon, E. Belilovsky, and S. Zagoruyko (2017) Scaling the scattering transform: deep hybrid networks. In Proceedings of the IEEE international conference on computer vision, pp. 5618–5627. Cited by: §4.1.
  • E. Oyallon and S. Mallat (2014) Deep roto-translation scattering for object classification. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2865–2873. Cited by: §1.
  • E. Oyallon, S. Zagoruyko, G. Huang, N. Komodakis, S. Lacoste-Julien, M. Blaschko, and E. Belilovsky (2019) Scattering networks for hybrid representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9), pp. 2208–2221. Cited by: §2, Table 1.
  • V. Papyan, Y. Romano, and M. Elad (2017) Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research 18, pp. 83:1–83:52. Cited by: §3.2.
  • F. Perronnin and D. Larlus (2015) Fisher vectors meet neural networks: a hybrid classification architecture. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3743–3752. Cited by: Table 1.
  • J. S nchez and F. Perronnin (2011) High-dimensional signature compression for large-scale image classification.. In CVPR, pp. 1665–1672. Cited by: §4.1.
  • J. Sulam, V. Papyan, Y. Romano, and M. Elad (2018) Multilayer convolutional sparse modeling: pursuit and dictionary learning. IEEE Transactions on Signal Processing 66 (15), pp. 4090–4104. Cited by: §1.
  • X. Sun, N. M. Nasrabadi, and T. D. Tran (2018) Supervised deep sparse coding networks. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 346–350. Cited by: §1.

Appendix A Appendix

A.1 Proof of Theorem 3.1

Let α0 be the optimal 𝐥𝟎 sparse code. We denote by 𝒮(α) the support of any α. We are going to prove by induction on n that for any n0 we have 𝒮(αn)𝒮(α0) and αn-α02λn if λnλ*.

For n=0, α0=0 so 𝒮(α0)= is indeed included in the support of α0 and α0-α0=α0. To verify the induction hypothesis for λ0=λmaxλ*, we shall prove that α02λmax.

Let us write the error w=β-Dα0. For all m


Since the support of α0 is smaller than s, WmtDm=1 and μ~=maxmm|WmtDm|


so taking the max on m gives:


But given the inequalities

Wtβ λmax
Wtw λmax(1-2γμ~s)
(1-γμ~s)(1-μ~s) 1 since γ1 and (1-μ~s)>0

we get


Let us now suppose that the property is valid for n and let us prove it for n+1. We denote by D𝒜 the restriction of D to vectors indexed by 𝒜. We begin by showing that 𝒮(αn+1)𝒮(α0). For any m𝒮(αn+1), since β=Dα0+w and WmtDm=1 we have

αn+1(m) = Tλn+1(αn(m)+Wmt(β-Dαn))
= Tλn+1(α0(m)+Wmt(D𝒮(α0)𝒮(αn)-{m}(α0-αn)𝒮(α0)𝒮(αn)-{m}+w))

For any m not in 𝒮(α0), let us prove that αn+1(m)=0. The induction hypothesis assumes that 𝒮(αn)𝒮(α0) and α0-αn2λn with λnλ* so:

I = |α0(m)+Wmt(D𝒮(α0)𝒮(αn)-{m}(α0-αn)𝒮(α0)𝒮(αn)-{m}+w)|
|Wmt(D𝒮(α0)(α0-αn)𝒮(α0))|+|Wmtw| since 𝒮(αn)𝒮(α0) and α0(m)=0 by assumption.

Since we assume that λn+1λ*, we have


and thus


since λn=γλn+1.

Because of the thresholding Tλn+1, it proves that αn+1(m)=0 and hence that 𝒮(αn+1)𝒮(α0).

Let us now evaluate α0-αn+1. For any (α1,α2,λ), a soft thresholding satisfies



|αn+1(m)-α0(m)| λn+1+|Wmt(D𝒮(α0)𝒮(αn)-{m}(α0-αn)𝒮(α0)𝒮(αn)-{m})|+|Wmtw|

Taking a max over m proves the induction hypothesis.