Deep Network Classification by Scattering and Homotopy Dictionary Learning
Abstract
We introduce a sparse scattering deep convolutional neural network, which provides a simple model to analyze properties of deep representation learning for classification. Learning a single dictionary matrix with a classifier yields a higher classification accuracy than AlexNet over the ImageNet ILSVRC2012 dataset. The network first applies a scattering transform which linearizes variabilities due to geometric transformations such as translations and small deformations. A sparse ${\mathbf{l}}^{\mathbf{1}}$ dictionary coding reduces intraclass variability while preserving class separation through projections over unions of linear spaces. It is implemented in a deep convolutional network with a homotopy algorithm having an exponential convergence. A convergence proof is given in a general framework including ALISTA. Classification results are analyzed over ImageNet.
John Zarka, Louis Thiry, Tomás Angles 

Département d’informatique, ENS, CNRS, PSL University, Paris, France 
{john.zarka,louis.thiry,tomas.angles}@ens.fr 
Stéphane Mallat 

Collège de France, Paris, France 
ENS, PSL University, Paris, France 
Flatiron Institute, New York, USA 
[email protected] 
1 Introduction
Deep convolutional networks have spectacular applications to classification and regression (LeCun et al., 2015), but they are black boxes which are hard to analyze mathematically because of their architectural complexity. We introduce a simplified convolutional neural network illustrated in Figure 1, whose learning can be reduced to a single dictionary matrix and a classifier. Despite its simplicity, it applies to complex image classification and reaches a higher accuracy than AlexNet (Krizhevsky et al., 2012) over ImageNet ILSVRC2012. It is a cascade of well understood mathematical operators, and thus provides a simplified mathematical framework to analyze classification performance.
Intraclass variabilities due to geometric image transformations such as translations or small deformations are linearized by a scattering transform (Bruna and Mallat, 2013) which is invertible. Scattering transforms include no learning. They are effective representations to classify relatively simple images such as digits in MNIST, textures (Bruna and Mallat, 2013) or small CIFAR images (Oyallon and Mallat, 2014). Learning deep convolutional networks however gives a much higher accuracy over complex databases such as ImageNet. A fundamental issue is to understand the source of this improvement. This paper shows that it can be captured by a sparse ${\mathbf{l}}^{\mathbf{1}}$ code in a dictionary $D$ optimized by supervised learning. It is implemented with a deep convolutional network architecture. The sparse code eliminates noninformative image components and projects each class in unions of linear spaces. The classification accuracy is considerably improved and goes beyond AlexNet over ImageNet 2012.
Dictionary learning for classification was introduced in Mairal et al. (2009) and implemented with deep convolutional neural network architectures by several authors (Sulam et al., 2018; Mahdizadehaghdam et al., 2019; Sun et al., 2018). These algorithms have been applied to simpler image classification problems such as MNIST or CIFAR but no results were published on large datasets such as ImageNet on which they do not seem to scale. This is due to their complexity and the need to cascade several sparse codes, which leads to complex structures. We show that a single dictionary learning is sufficient if applied to scattering coefficients as opposed to raw data. A major issue is to compute the sparse code with a small network. We introduce a new architecture based on homotopy continuation, which leads to exponential convergence. It is thus implemented in a small convolutional network. The ALISTA (Liu et al., 2019) sparse code is incorporated in this framework. The main contributions of the paper are summarized below:

• A Sparse Scattering network architecture, illustrated in Figure 1, where the classification is performed over a sparse code in a learned dictionary of scattering coefficients. It outperforms AlexNet over ImageNet 2012.

• A new dictionary learning algorithm with homotopy sparse coding, optimized by gradient descent in a deep convolutional network.

• A proof of exponential convergence of ALISTA (Liu et al., 2019) in presence of noise.
We explain the implementation and mathematical properties of each element of the sparse scattering network. Section 2 briefly reviews multiscale scattering transforms. Section 3 introduces homotopy dictionary learning for classification, with a proof of exponential convergence under appropriate assumptions. Section 4 analyzes image classification results of sparse scattering networks on ImageNet 2012.
2 Scattering Transform
A scattering transform is a cascade of wavelet transforms and ReLU or modulus nonlinearities. It can be interpreted as a deep convolutional network with predefined wavelet filters (Mallat, 2016). For images, wavelet filters are calculated from a mother complex wavelet $\psi $ whose average is zero. It is rotated by ${r}_{\theta}$, dilated by ${2}^{j}$ and its phase is shifted by $\alpha $:
$${\psi}_{j,\theta}(u)={2}^{-2j}\psi ({2}^{-j}{r}_{\theta}u)\quad \text{and}\quad {\psi}_{j,\theta ,\alpha}(u)=\mathrm{Real}({e}^{i\alpha}{\psi}_{j,\theta}(u)).$$
We choose a Morlet wavelet as in Bruna and Mallat (2013) to produce a sparse set of nonnegligible wavelet coefficients. A ReLU is written $\rho (a)=\mathrm{max}(a,0)$.
Scattering coefficients of order $m=1$ are computed by averaging rectified wavelet coefficients with a subsampling stride of ${2}^{J}$:
$$Sx(u,k,\alpha )=\rho (x\star {\psi}_{j,\theta ,\alpha})\star {\varphi}_{J}({2}^{J}u)\text{with}k=(j,\theta ),$$ 
where ${\varphi}_{J}$ is a Gaussian dilated by ${2}^{J}$ (Bruna and Mallat, 2013).
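The first-order coefficients can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the Kymatio implementation used in the paper: the Morlet parameters `sigma0` and `xi0`, the circular (FFT) boundary handling, the Gaussian width, the rotation convention and the choice of angles in $[0,\pi)$ are all assumptions, and `morlet`, `gaussian`, `circ_conv` and `first_order_scattering` are hypothetical helper names.

```python
import numpy as np

def _grid(size):
    # Integer pixel coordinates centered on index size // 2.
    half = size // 2
    return np.mgrid[-half:size - half, -half:size - half].astype(float)

def morlet(size, j, theta, sigma0=0.8, xi0=3 * np.pi / 4):
    """Zero-mean complex Morlet-like wavelet dilated by 2**j (assumed parameters)."""
    y, x = _grid(size)
    # Rotate the coordinates (sign convention is an assumption) and dilate by 2**j.
    xr = (np.cos(theta) * x + np.sin(theta) * y) / 2 ** j
    yr = (-np.sin(theta) * x + np.cos(theta) * y) / 2 ** j
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma0 ** 2))
    gabor = envelope * np.exp(1j * xi0 * xr)
    # Subtract a multiple of the envelope so that the filter sums to zero.
    wavelet = gabor - (gabor.sum() / envelope.sum()) * envelope
    return wavelet / 2 ** (2 * j)

def gaussian(size, J, sigma0=0.8):
    """Low-pass Gaussian dilated by 2**J, normalized to sum to one."""
    y, x = _grid(size)
    g = np.exp(-(x ** 2 + y ** 2) / (2 * (sigma0 * 2 ** J) ** 2))
    return g / g.sum()

def circ_conv(x_hat, filt):
    # Circular convolution: x_hat is the FFT of the signal, filt a centered filter.
    return np.fft.ifft2(x_hat * np.fft.fft2(np.fft.ifftshift(filt)))

def first_order_scattering(x, J=2, n_angles=4, n_phases=4):
    """Order-1 coefficients rho(x * psi_{j,theta,alpha}) * phi_J, subsampled by 2**J."""
    size = x.shape[0]  # square real-valued image assumed
    x_hat = np.fft.fft2(x)
    phi = gaussian(size, J)
    channels = []
    for j in range(1, J + 1):
        for t in range(n_angles):
            w = circ_conv(x_hat, morlet(size, j, np.pi * t / n_angles))
            for p in range(n_phases):
                alpha = 2 * np.pi * p / n_phases
                # For real x, x * Real(e^{i alpha} psi) = Real(e^{i alpha} (x * psi)).
                rectified = np.maximum(np.real(np.exp(1j * alpha) * w), 0.0)
                averaged = np.real(circ_conv(np.fft.fft2(rectified), phi))
                channels.append(averaged[::2 ** J, ::2 ** J])
    return np.stack(channels)
```

With a $32\times 32$ input and the default arguments, the output has $2\times 4\times 4=32$ channels on an $8\times 8$ grid. A constant image gives (numerically) zero coefficients because the wavelets have a zero average.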
The averaging by ${\varphi}_{J}$ eliminates the variations of $\rho (x\star {\psi}_{j,\theta ,\alpha})$ at scales smaller than ${2}^{J}$. This information is recovered by computing their variations at all scales ${2}^{{j}^{\prime}}<{2}^{J}$, with a second wavelet transform. Scattering coefficients of order two are:
$$Sx(u,k,{k}^{\prime},\alpha ,{\alpha}^{\prime})=\rho (\rho (x\star {\psi}_{j,\theta ,\alpha})\star {\psi}_{{j}^{\prime},{\theta}^{\prime},{\alpha}^{\prime}})\star {\varphi}_{J}({2}^{J}u)\quad \text{with}\quad k,{k}^{\prime}=(j,\theta ),({j}^{\prime},{\theta}^{\prime}).$$
To reduce the dimension of scattering vectors, we define phase invariant second order scattering coefficients with a complex modulus instead of a phase sensitive ReLU:
$$Sx(u,k,{k}^{\prime})=\left|\,|x\star {\psi}_{j,\theta}|\star {\psi}_{{j}^{\prime},{\theta}^{\prime}}\right|\star {\varphi}_{J}({2}^{J}u)\quad \text{for}\quad {j}^{\prime}>j.$$
The scattering representation includes order $1$ coefficients and order $2$ phase invariant coefficients. In this paper, we choose $J=4$ and hence 4 scales $1\le j\le J$, $8$ angles $\theta $ and 4 phases $\alpha $ on $[0,2\pi ]$. Scattering coefficients are computed with the software package Kymatio (Andreux et al., 2018). They preserve the image information and $x$ can be recovered from $Sx$ (Oyallon et al., 2019). For computational efficiency, the dimension of scattering vectors can be reduced by a factor of $6$ with a linear operator $L$ which preserves the ability to recover a close approximation of $x$ from $LS(x)$. The dimension reduction operator $L$ of Figure 1 is computed by preserving the principal directions of a PCA calculated on the training image database, or is optimized by gradient descent together with the other network parameters.
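The PCA variant of the operator $L$ can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: scattering vectors are stored as rows of a matrix `X`, the mean is removed before projection, and the helper names `fit_pca_projection`, `project` and `reconstruct` are hypothetical.

```python
import numpy as np

def fit_pca_projection(X, k):
    """X has shape (n_samples, n_channels). Returns the mean and a (k, n_channels) matrix L."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def project(X, mean, L):
    # Dimension reduction: keep the coordinates along the k principal directions.
    return (X - mean) @ L.T

def reconstruct(Z, mean, L):
    # Approximate inverse: map reduced vectors back to the original space.
    return Z @ L + mean
```

When the data lie exactly in a $k$-dimensional affine subspace, `reconstruct(project(X, mean, L), mean, L)` returns `X`; otherwise it returns the closest approximation within the span of the retained directions.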
The scattering transform is Lipschitz continuous to translations and deformations (Mallat, 2012). Intraclass variabilities due to translations and deformations smaller than ${2}^{J}$ are linearized. Good classification accuracies are obtained with a linear classifier over scattering coefficients in image databases where intraclass variabilities are dominated by translations and deformations. This is the case for digits in MNIST or texture images (Bruna and Mallat, 2013). However, it does not take into account variabilities of pattern structures and clutter which dominate complex image databases. Removing this clutter while preserving class separation requires some form of supervised learning as in deep convolutional networks. When applied to raw image data, dictionary learning often computes wavelet-like filters as in the first layer of deep neural networks (Krizhevsky et al., 2012). This is not sufficient to obtain high classification accuracy over complex image databases. The sparse scattering network of Figure 1 computes a sparse code of the scattering representation $\beta =LS(x)$, in a dictionary $D$ optimized by minimizing the classification loss. For this purpose, the next section introduces a homotopy dictionary learning algorithm, implemented in a small convolutional network.
3 Homotopy Dictionary Learning for Classification
Task-driven dictionary learning for classification with sparse coding was proposed in Mairal et al. (2011). We introduce a small convolutional network architecture to implement a sparse ${\mathbf{l}}^{\mathbf{1}}$ code and learn the dictionary with a homotopy continuation on thresholds. ALISTA (Liu et al., 2019) is also shown to be a homotopy sparse coding whose exponential convergence is proved under more general conditions. The next section reviews dictionary learning for classification. Homotopy sparse coding algorithms are studied in Section 3.2.
3.1 Dictionary Learning
Unless specified, all norms are Euclidean norms. A sparse code approximates a vector $\beta $ with a linear combination of a minimum number of columns ${D}_{m}$ of a dictionary matrix $D$, which are normalized $\parallel {D}_{m}\parallel =1$. A sparse code is a vector ${\alpha}^{0}$ of minimum support which has a bounded error $\parallel D{\alpha}^{0}-\beta \parallel \le \sigma $. Such sparse codes have been used to optimize signal compression and to remove noise, to solve inverse problems in compressive sensing (Candes et al., 2006), and for classification (Mairal et al., 2011).
Minimizing the support of a code $\alpha $ amounts to minimizing its ${\mathbf{l}}^{\mathbf{0}}$ “norm”, which is not convex. This nonconvex optimization is convexified by replacing the ${\mathbf{l}}^{\mathbf{0}}$ norm by an ${\mathbf{l}}^{\mathbf{1}}$ norm ${\parallel \alpha \parallel}_{1}={\sum}_{m}|\alpha (m)|$. It is solved by minimizing a convex Lagrangian with a multiplier ${\lambda}_{*}$ which depends on the error bound $\parallel D\alpha -\beta \parallel \le \sigma $:
$${\alpha}^{1}(\beta )=\mathrm{arg}\underset{\alpha}{\mathrm{min}}\frac{1}{2}{\parallel D\alpha -\beta \parallel}^{2}+{\lambda}_{*}{\parallel \alpha \parallel}_{1}.$$  (1)
The sparse code ${\alpha}^{1}(\beta ,D,{\lambda}_{*})$ also depends upon the dictionary $D$ and ${\lambda}_{*}$; we omit these last two variables in the equation above for readability. One can prove (Donoho and Elad, 2006) that ${\alpha}^{1}$ has the same support as the minimum support sparse code ${\alpha}^{0}$ if the support size $s$ and the dictionary coherence satisfy:
$$s\,\mu (D)<1/2\quad \text{where}\quad \mu (D)=\underset{m\ne {m}^{\prime}}{\mathrm{max}}|{D}_{m}^{t}{D}_{{m}^{\prime}}|.$$  (2)
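The two quantities used in this section, the ${\mathbf{l}}^{\mathbf{1}}$ Lagrangian of (1) and the dictionary coherence, are easy to evaluate numerically. The sketch below assumes the usual definition of coherence as the largest absolute inner product between distinct unit-norm columns; the function names are hypothetical.

```python
import numpy as np

def normalize_columns(D):
    # Enforce ||D_m|| = 1 for every column of the dictionary.
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def l1_lagrangian(D, alpha, beta, lam):
    """0.5 * ||D alpha - beta||^2 + lam * ||alpha||_1, as in equation (1)."""
    residual = D @ alpha - beta
    return 0.5 * residual @ residual + lam * np.abs(alpha).sum()

def coherence(D):
    """Largest |D_m^t D_m'| over distinct columns (assumed definition)."""
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)  # ignore the diagonal terms m = m'
    return G.max()
```

For an orthonormal dictionary the coherence is $0$, whereas two columns at a $45$ degree angle give $1/\sqrt{2}$.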
Sparse approximation versus sparse code
Sparse coding was first introduced for denoising (Donoho and Elad, 2006). The sparse approximation $D{\alpha}^{1}(\beta )$ is a nonlinear filtering which preserves the “signal” components of $\beta $ represented by few large amplitude coefficients. It eliminates the “noise” corresponding to incoherent components of $\beta $ whose correlations with all dictionary vectors ${D}_{m}$ are below ${\lambda}_{*}$. It can also be interpreted as a projection in a union of linear spaces, each corresponding to a sparse code support.
For classification, we need to reduce intraclass variabilities and preserve or increase class separability. Intraclass variabilities may be interpreted as “noise” for the classification whereas image transformations from one class to another correspond to the “signal” we want to preserve. By defining sparse representations of training vectors ${\beta}_{i}=LS({x}_{i})$ with different supports for different classes, the sparse code projects each class in different unions of linear spaces, which reduces intraclass variabilities while preserving separation. The dictionary learning optimizes the choice of $D$ to obtain sparse codes with discriminative supports.
The classification is usually performed from the sparse code ${\alpha}^{1}(\beta )$. We will see that a classification applied on the reconstructed sparse approximation $D{\alpha}^{1}(\beta )$ has nearly the same accuracy. Indeed, the linear operator $D$ can preserve separated linear spaces.
Dictionary learning by gradient descent
Given a set of inputs and labels $\{{x}_{i},{y}_{i}\}$, task-driven dictionary learning minimizes a classification loss $\mathrm{\ell}({\alpha}^{1}({x}_{i},D,{\lambda}_{*}),{y}_{i},W)$ that takes as input the sparse code ${\alpha}^{1}({x}_{i},D,{\lambda}_{*})$ of the input ${x}_{i}$, the label ${y}_{i}$ and the classification parameters $W$. Thus, the loss depends upon the dictionary $D$, the Lagrange multiplier ${\lambda}_{*}$ which adjusts the sparsity level, and the classification parameters $W$. All these parameters can be jointly optimized by stochastic gradient descent to minimize the loss. This requires computing the sparse code ${\alpha}^{1}({x}_{i},D,{\lambda}_{*})$ and its derivatives w.r.t. $D$ and ${\lambda}_{*}$, which can be done by implementing the sparse coding in a deep convolutional network where the sparse code ${\alpha}^{1}$ is computed in the forward pass and the derivatives of ${\alpha}^{1}$ w.r.t. $D$ and ${\lambda}_{*}$ are computed in the backward pass. For this purpose, the next section introduces a homotopy iterated soft thresholding network architecture.
3.2 Homotopy Iterated Soft Thresholding Network
This section introduces an efficient convolutional network architecture to compute sparse codes and learn dictionaries. Iterative Soft-Thresholding Algorithms (ISTA) (Daubechies et al., 2004), and FISTA (Beck and Teboulle, 2009) can be implemented with deep neural networks but they require many layers because of their slow convergence. The LISTA algorithm (Gregor and LeCun, 2010) and its more recent version ALISTA (Liu et al., 2019) accelerate this convergence by introducing an auxiliary matrix which is adapted to the statistics of the input and to the properties of the dictionary. For ALISTA, it leads to exponential convergence under appropriate hypotheses. However, we shall see that this auxiliary matrix prevents this approach from being used to learn a dictionary which minimizes a classification loss with a sparse ${\mathbf{l}}^{\mathbf{1}}$ code. We introduce a dictionary learning based on a homotopy Iterated Soft Thresholding Continuation (Jiao et al., 2017), which has the same exponential convergence without an auxiliary matrix. We shall see that ALISTA can also be considered as a homotopy continuation algorithm. We give a proof of exponential convergence for nonzero Lagrange multipliers ${\lambda}_{*}$ in this general framework.
Iterated Soft Thresholding
ISTA alternates a gradient step on the quadratic term of the ${\mathbf{l}}^{\mathbf{1}}$ Lagrangian (1) and a soft-thresholding ${T}_{\lambda}(a)=\mathrm{sign}(a)\mathrm{max}(|a|-\lambda ,0)$:
$${\alpha}_{n+1}={T}_{\epsilon {\lambda}_{*}}({\alpha}_{n}+\epsilon {D}^{t}(\beta -D{\alpha}_{n}))\quad \text{with}\quad \epsilon <\frac{1}{{\parallel {D}^{t}D\parallel}_{2,2}},$$  (3)
where $\parallel .{\parallel}_{2,2}$ is the spectral norm and ${\alpha}_{0}=0$. The first iteration computes a non-sparse code ${\alpha}_{1}={T}_{\epsilon {\lambda}_{*}}(\epsilon {D}^{t}\beta )=\epsilon {T}_{{\lambda}_{*}}({D}^{t}\beta )$ which is progressively sparsified through iterated thresholdings. After $n$ iterations, the sparse code ${\alpha}_{n}$ has an error in $O({n}^{-1})$. FISTA (Beck and Teboulle, 2009) accelerates the error decay to $O({n}^{-2})$, which remains slow. Each iteration of ISTA and FISTA is computed with linear operators and a soft thresholding and can thus be implemented with one layer in a deep network (Papyan et al., 2017). However, the total number $N$ of layers must be large to achieve a small error, and it requires computing spectral norms during training, which is slow.
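The ISTA iteration can be sketched in NumPy as follows. This is an illustrative sketch, not the network implementation: the step size $\epsilon $ is taken as a fraction `safety` of the inverse spectral norm of ${D}^{t}D$ (an assumption), and the names `soft_threshold` and `ista` are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    # T_lam(a) = sign(a) * max(|a| - lam, 0), applied entrywise.
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def ista(D, beta, lam_star, n_iter=100, safety=0.9):
    """Iterated soft thresholding for 0.5 * ||D a - beta||^2 + lam_star * ||a||_1."""
    # Step size strictly below 1 / ||D^t D||_{2,2}; the factor 0.9 is an assumption.
    eps = safety / np.linalg.norm(D.T @ D, 2)
    alpha = np.zeros(D.shape[1])  # alpha_0 = 0
    for _ in range(n_iter):
        gradient_step = alpha + eps * (D.T @ (beta - D @ alpha))
        alpha = soft_threshold(gradient_step, eps * lam_star)
    return alpha
```

For an orthonormal dictionary the minimizer is known in closed form, ${T}_{{\lambda}_{*}}({D}^{t}\beta )$, which gives a simple sanity check: `ista(np.eye(3), np.array([3.0, -0.5, -2.0]), 1.0)` converges to `[2, 0, -1]`.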
Homotopy Iterated Thresholding and ALISTA
Homotopy continuation algorithms, introduced in Osborne et al. (2000), minimize the ${\mathbf{l}}^{\mathbf{1}}$ Lagrangian (1) by progressively decreasing the Lagrange multiplier. This optimization path is opposite to ISTA and FISTA since it goes from a very sparse initial solution towards a less sparse but optimal one, similarly to matching pursuit algorithms (Davis et al., 1997; Donoho and Tsaig, 2008). Homotopy algorithms are particularly efficient if the final Lagrange multiplier ${\lambda}_{*}$ is large so that the optimal solution is very sparse. We shall see that it is the case for classification.
The homotopy Iterative Soft-Thresholding Continuation (ISTC) algorithm of Jiao, Jin and Lu (Jiao et al., 2017) adjusts the decay rate of an exponentially decreasing sequence of Lagrange multipliers ${\lambda}_{n}$ for $n\le N$:
$${\alpha}_{n+1}={T}_{{\lambda}_{n+1}}({\alpha}_{n}+{D}^{t}(\beta -D{\alpha}_{n}))\quad \text{with}\quad {\lambda}_{n}={\lambda}_{\mathrm{max}}{\gamma}^{-n}\quad \text{and}\quad \gamma ={\left(\frac{{\lambda}_{\mathrm{max}}}{{\lambda}_{*}}\right)}^{1/N}.$$  (4)
After $N$ iterations, they prove that ${\alpha}_{N}$ has the same support as the optimal sparse code ${\alpha}^{0}$, if ${\lambda}_{\mathrm{max}}\ge {\parallel {D}^{t}\beta \parallel}_{\mathrm{\infty}}$, if the dictionary coherence condition (2) is satisfied, and if $\gamma $ is sufficiently close to $1$. Figure 2 illustrates the implementation of this sparse coding algorithm in a deep network of depth $N$, with side connections. For image classification we use a convolutional translation invariant dictionary, which defines a deep convolutional network. This convolutional network is used to compute the sparse code of scattering coefficients $LS(x)$ in Figure 1.
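The ISTC iteration (4) can be sketched for a single vector $\beta $ as below. It is a minimal NumPy sketch, not the convolutional network of Figure 2: the default ${\lambda}_{\mathrm{max}}={\parallel {D}^{t}\beta \parallel}_{\mathrm{\infty}}$ (raised to ${\lambda}_{*}$ if smaller) is an assumption chosen to satisfy the condition stated above, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    # T_lam(a) = sign(a) * max(|a| - lam, 0), applied entrywise.
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def threshold_schedule(lam_max, lam_star, N):
    """Thresholds lam_1, ..., lam_N decreasing geometrically from lam_max to lam_star."""
    gamma = (lam_max / lam_star) ** (1.0 / N)
    return lam_max * gamma ** (-np.arange(1, N + 1, dtype=float))

def istc(D, beta, lam_star, N=12, lam_max=None):
    """Homotopy iterated soft thresholding with N iterations (one per network layer)."""
    if lam_max is None:
        # Assumed default: the smallest lam_max allowed by lam_max >= ||D^t beta||_inf.
        lam_max = max(np.abs(D.T @ beta).max(), lam_star)
    alpha = np.zeros(D.shape[1])  # alpha_0 = 0
    for lam in threshold_schedule(lam_max, lam_star, N):
        alpha = soft_threshold(alpha + D.T @ (beta - D @ alpha), lam)
    return alpha
```

With ${\lambda}_{\mathrm{max}}=8$, ${\lambda}_{*}=1$ and $N=3$, the decay factor is $\gamma =2$ and the thresholds are $4,2,1$. Unlike ISTA, the gradient step has no step size, so no spectral norm has to be computed.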
ALISTA can be considered as a generalization of the homotopy ISTC algorithm, which replaces ${D}^{t}$ by an auxiliary matrix ${W}^{t}$. We shall also study whether this flexibility can improve results. Each column ${W}_{m}$ of $W$ is normalized by ${W}_{m}^{t}{D}_{m}=1$. The iteration (4) is thus rewritten
$${\alpha}_{n+1}={T}_{{\lambda}_{n+1}}({\alpha}_{n}+{W}^{t}(\beta -D{\alpha}_{n}))\quad \text{with}\quad {\lambda}_{n}={\lambda}_{\mathrm{max}}{\gamma}^{-n}\quad \text{and}\quad \gamma ={\left(\frac{{\lambda}_{\mathrm{max}}}{{\lambda}_{*}}\right)}^{1/N}.$$  (5)
The following theorem extends the convergence result of homotopy ISTC algorithm, by replacing the coherence of $D$ by the mutual coherence of $W$ and $D$
$$\stackrel{~}{\mu}=\underset{m\ne {m}^{\prime}}{\mathrm{max}}|{W}_{{m}^{\prime}}^{t}{D}_{m}|.$$
This theorem also extends the ALISTA exponential convergence result in the general setting where the sparse code introduces a reconstruction error, which may be interpreted as a noise removal. We will see that this error can be large for image classification applications because it corresponds to noninformative clutter removal.
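The normalization of $W$ and the mutual coherence can be checked numerically. The sketch below is an assumption-laden illustration: it takes the mutual coherence as the largest absolute off-diagonal entry of ${W}^{t}D$, and the names `normalize_auxiliary`, `mutual_coherence` and `homotopy_step` are hypothetical.

```python
import numpy as np

def normalize_auxiliary(W, D):
    """Rescale each column W_m so that W_m^t D_m = 1."""
    return W / np.sum(W * D, axis=0, keepdims=True)

def mutual_coherence(W, D):
    """Largest |W_m'^t D_m| over m != m' (absolute values assumed)."""
    G = np.abs(W.T @ D)
    np.fill_diagonal(G, 0.0)  # ignore the diagonal terms m = m'
    return G.max()

def homotopy_step(alpha, beta, D, W, lam):
    """One iteration of (5): soft thresholding of alpha + W^t (beta - D alpha)."""
    a = alpha + W.T @ (beta - D @ alpha)
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)
```

Setting `W = D` recovers the ISTC step, and `mutual_coherence(D, D)` is then the coherence of the dictionary itself.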
Theorem 3.1.
Let ${\alpha}^{0}$ be the ${\mathbf{l}}^{\mathbf{0}}$ sparse code of $\beta $ with error $\parallel \beta -D{\alpha}^{0}\parallel \le \sigma $. If its support size $s$ satisfies
$$s\,\stackrel{~}{\mu}<1/2$$  (6)
then soft-thresholding iterations (5) with thresholds
$$  (7) 
define a sparse code ${\alpha}_{n}$, whose support is included in the support of ${\alpha}^{\mathrm{0}}$ and
$${\parallel {\alpha}_{n}-{\alpha}^{0}\parallel}_{\mathrm{\infty}}\le 2{\lambda}_{\mathrm{max}}{\gamma}^{-n}.$$  (8)
The proof is in Appendix A of the supplementary material. It adapts the convergence proof of ISTC to the more general ALISTA framework. When $W=D$, we recover the convergence result of the homotopy ISTC, and when ${\lambda}_{*}=0$ we recover the ALISTA exponential convergence result. However, one should not be too impressed by this exponential convergence rate because the condition $s\,\stackrel{~}{\mu}<1/2$ only applies to very sparse codes in highly incoherent dictionaries. ALISTA optimizes $W$ in order to minimize the mutual coherence $\stackrel{~}{\mu}$, but it is usually not possible to reach $s\,\stackrel{~}{\mu}<1/2$. It thus restricts the set of possible signals and dictionaries, as opposed to ISTA and FISTA algorithms whose convergence is guaranteed for any signal and dictionary. However, the condition $s\,\stackrel{~}{\mu}<1/2$ is based on a crude upper bound calculation in the proof, and it is not necessary for convergence. The next section shows that for image classification over ImageNet, by setting $W=D$ we learn a dictionary where the homotopy ISTC algorithm converges exponentially although the theorem hypothesis is not satisfied. By learning simultaneously $W$ and $D$, we shall see that we can reduce the classification loss but the resulting algorithm does not converge to a sparse ${\mathbf{l}}^{\mathbf{1}}$ code anymore.
4 Image Classification
The goal of this work is to construct a deep neural network model which is sufficiently simple to be interpreted mathematically, while reaching the accuracy level of more complex deep convolutional networks on complex classification problems. This is why we concentrate on ImageNet as opposed to MNIST or CIFAR. The next section compares its performance to state-of-the-art deep networks, and analyzes the influence of different architecture components. Section 4.2 studies the exponential convergence of the homotopy ISTC sparse coding network in comparison with ISTA, FISTA and a flexible ALISTA.
4.1 Image Classification on ImageNet
We show that a sparse dictionary learning on scattering coefficients considerably improves the classification performance on $S(x)$ and can outperform AlexNet accuracy.
ImageNet ILSVRC2012 is a challenging color image dataset of 1.2 million training images and 50,000 validation images, divided into 1000 classes. Prior to convolutional networks, SIFT representations combined with Fisher vector encoding reached a Top 5 classification accuracy of 74.3% with multiple model averaging (Sánchez and Perronnin, 2011). In their PyTorch implementation, the Top 5 accuracy of AlexNet and ResNet-152 is 79.1% and 94.1% respectively (accuracies from https://pytorch.org/docs/master/torchvision/models.html).
The scattering transform $S(x)$ at a scale ${2}^{J}=16$ of an ImageNet color image is a $14\times 14$ spatial array with $1539$ channels. Applying to $S(x)$ an MLP classifier with 2 hidden layers of size 4096, ReLU and dropout as in AlexNet gives a 60.7% Top 5 accuracy. Applying to $S(x)$ a 3-layer SLE network of $1\times 1$ convolutions with ReLU followed by the same MLP reaches AlexNet performance (Oyallon et al., 2017). However, there is no mathematical understanding of the operations performed by these three layers, and the origin of the improvements.
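The figure of $1539$ channels can be recovered by counting coefficients. The count below is a sketch under assumptions that the text does not state explicitly: one order-0 low-pass channel per color channel, second-order coefficients for every pair of angles with ${j}^{\prime}>j$, and a $224\times 224$ input image for the $14\times 14$ grid. The function name is hypothetical.

```python
def scattering_channels(J=4, n_angles=8, n_phases=4, n_colors=3, order0=1):
    """Number of scattering channels per spatial position (assumed breakdown)."""
    # Order 1: one channel per scale, angle and phase.
    order1 = J * n_angles * n_phases
    # Order 2 (phase invariant): pairs of scales with j' > j, and all angle pairs.
    n_scale_pairs = J * (J - 1) // 2
    order2 = n_scale_pairs * n_angles ** 2
    return n_colors * (order0 + order1 + order2)

channels = scattering_channels()   # 3 * (1 + 128 + 384)
grid = 224 // 2 ** 4               # assumed 224 x 224 input, subsampled by 2**J
reduction = channels / 256         # reduction factor when L keeps 256 channels
```

Under these assumptions `channels` is $1539$ and `grid` is $14$, and reducing to $256$ channels corresponds to a factor of about $6$.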
The sparse scattering architecture is described in Figure 3. The convolutional operator $L$ is applied on a standardized scattering transform and reduces the number of scattering channels from $1539$ to $256$. The sparse code is calculated with a $1\times 1$ convolutional dictionary $D$ having $2048$ vectors. It takes as input an array $LS(x)$ of $14\times 14\times 256$ which has been normalized and outputs a code ${\alpha}^{1}$ of size $14\times 14\times 2048$ or a sparse approximation $D{\alpha}^{1}$ of size $14\times 14\times 256$. Either is provided as input to the MLP classifier. The ISTC network illustrated in Figure 2 has $N=12$ layers with softshrink nonlinearities and no batch normalization. Before the classifier, there is a batch normalization and a $5\times 5$ average pooling. The MLP classifier has 2 hidden layers of size 4096, ReLU and a dropout rate of 0.3. The supervised learning jointly optimizes $L$, the dictionary $D$ with the Lagrange multiplier ${\lambda}_{*}$ and the MLP classifier. It is done with a stochastic gradient descent during 120 epochs using an initial learning rate of 0.01 with a decay of 0.1 at epochs 50 and 100. With a sparse code as input to the MLP, it has a Top 5 accuracy of 80.9%, outperforming AlexNet. If we replace the ISTC network by an ALISTA network, the accuracy improves to $83.7\%$. However, the next section shows that, contrary to ISTC, an ALISTA network optimized for classification does not compute a sparse ${\mathbf{l}}^{\mathbf{1}}$ code and is therefore not mathematically interpretable. In the following we thus concentrate on the homotopy ISTC network.
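A $1\times 1$ convolutional dictionary applies the same matrix $D$ independently at every spatial position, so the sparse coding layer reduces to a batched matrix computation. The sketch below illustrates only this shape logic in NumPy; it is not the trained PyTorch network. The per-position choice of ${\lambda}_{\mathrm{max}}$ and the function name `istc_1x1` are assumptions, and pooling, normalization and the classifier are omitted.

```python
import numpy as np

def istc_1x1(x, D, lam_star, N=12):
    """x: (H, W, C) input array, D: (C, M) dictionary with unit-norm columns.

    Returns the code of shape (H, W, M) and the approximation of shape (H, W, C).
    """
    H, Wd, C = x.shape
    beta = x.reshape(-1, C)        # one vector beta per spatial position
    corr = beta @ D                # each row holds D^t beta for one position
    # Assumed choice: lam_max = ||D^t beta||_inf per position, at least lam_star.
    lam_max = np.maximum(np.abs(corr).max(axis=1, keepdims=True), lam_star)
    gamma = (lam_max / lam_star) ** (1.0 / N)
    alpha = np.zeros_like(corr)
    for n in range(1, N + 1):
        # Same update at all positions: alpha + D^t (beta - D alpha), then threshold.
        a = alpha + (beta - alpha @ D.T) @ D
        alpha = np.sign(a) * np.maximum(np.abs(a) - lam_max * gamma ** (-n), 0.0)
    code = alpha.reshape(H, Wd, -1)
    approx = (alpha @ D.T).reshape(H, Wd, C)
    return code, approx
```

With the sizes quoted above, `x` would have shape `(14, 14, 256)` and `D` shape `(256, 2048)`, giving a code of shape `(14, 14, 2048)` and an approximation of shape `(14, 14, 256)`. For an arbitrary, untrained dictionary the iteration is not guaranteed to converge.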
|  | Fisher Vectors | AlexNet | ResNet-152 | Scat. + SLE | Scat. alone | Scat. + ISTC $\alpha $ | Scat. + ISTC $D\alpha $ | Scat. + ALISTA $\alpha $ |
|---|---|---|---|---|---|---|---|---|
| Top1 | 55.6 | 56.5 | 78.3 | 57.0 | 37.5 | 59.0 | 54.8 | 62.6 |
| Top5 | 78.4 | 79.1 | 94.1 | 79.6 | 60.7 | 80.9 | 77.8 | 83.7 |
The dimension reduction operator $L$ has a marginal effect in terms of performance. If we eliminate it or if we replace it by an unsupervised PCA dimension reduction, the performance drops by less than $2\%$, whereas the accuracy drops by $20\%$ if we eliminate the sparse coding. The considerable improvement brought by the sparse code is further amplified if the MLP classifier is replaced by a linear classifier. A linear classifier on a scattering vector has a (Top 1, Top 5) accuracy of $(23.4\%,41.8\%)$. With an ISTC sparse code in a learned dictionary the accuracy jumps to $(51.5\%,73.4\%)$ and hence improves by more than $30\%$.
If the MLP classification is applied to the sparse approximation $D{\alpha}^{1}$ as opposed to the sparse code ${\alpha}^{1}$ then the accuracy drops only by $3\%$. The sparse approximation $D{\alpha}^{1}$ of $LS(x)$ has a small dimension $14\times 14\times 256$ similar to the output of AlexNet's last convolutional layer and is not sparse. This indicates that it is not the individual sparse outputs of the sparse code ${\alpha}^{1}$ which are important but the linear spaces defined by their supports, which are mapped to other linear spaces by $D$.
The optimization learns a large factor ${\lambda}_{*}$ which yields a large approximation error $\parallel LS(x)-D{\alpha}^{1}\parallel /\parallel LS(x)\parallel \approx 0.5$. The resulting code ${\alpha}^{1}$ is very sparse with about $3\%$ nonzero coefficients. The sparse approximation $D{\alpha}^{1}$ thus eliminates nearly half of the energy of $LS(x)$ which can be interpreted as noninformative “clutter” removal. The sparse code ${\alpha}^{1}$ is a projection of $LS(x)$ over a linear space defined by the support of ${\alpha}^{1}$. If a column ${D}_{m}$ is interpreted as a “scattering space feature” then this linear space is a conjunction of a particular set of such features. The high classification accuracy indicates that different linear spaces correspond mostly to different classes. These linear spaces are mapped by $D$ into lower dimensional linear spaces which remain separated. It thus indicates that $D$ is optimized to preserve discriminative directions which transform a vector of one class into a vector of another one.
4.2 Convergence of Homotopy Algorithms
To guarantee that the network is mathematically interpretable we verify numerically that the homotopy ISTC algorithm computes an accurate approximation of the optimal ${\mathbf{l}}^{\mathbf{1}}$ sparse code in (1), with a small number of iterations (typically 12).
Theorem 3.1 guarantees an exponential convergence if $s\,\stackrel{~}{\mu}<1/2$. In our classification setting, the theorem hypothesis is clearly not satisfied: $s\mu (D)\approx 60$, which is well above $1/2$. However, this condition is not necessary and is based on a relatively crude upper bound.
Figure 4 left shows numerically that the ISTC algorithm minimizes the ${\mathbf{l}}^{\mathbf{1}}$ Lagrangian $\mathcal{L}(\alpha )=\frac{1}{2}{\parallel D\alpha -\beta \parallel}^{2}+{\lambda}_{*}{\parallel \alpha \parallel}_{1}$, with an exponential convergence which is faster than ISTA and FISTA over the dictionary that it learns. On the contrary, Figure 4 right shows that ALISTA does not minimize the ${\mathbf{l}}^{\mathbf{1}}$ Lagrangian at all. This comes from the fact that, contrary to standard ALISTA (Liu et al., 2019), we do not impose that the auxiliary matrix $W$ has a minimum joint coherence with the dictionary $D$. It would require too much computation and the matrix $W$ is rather optimized to minimize the classification loss. This is why it improves the classification accuracy but does not compute a sparse ${\mathbf{l}}^{\mathbf{1}}$ code.
To further compare the convergence speed of ISTC versus ISTA and FISTA, we compute the relative mean square error $\mathrm{MSE}(x,y)={\parallel x-y\parallel}^{2}/{\parallel x\parallel}^{2}$ between the optimal sparse code ${\alpha}^{1}$ and the sparse code output of 12 iterations of each of these three algorithms. The $\mathrm{MSE}$ is 0.02 for ISTC, 0.25 for FISTA and 0.46 for ISTA, which shows that ISTC reduces the error by a factor of more than $10$ compared to ISTA and FISTA after 12 iterations.
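The relative error and the ratios between the reported values can be reproduced with a few lines of NumPy. The function name `relative_mse` is hypothetical; the three MSE values are the ones quoted above.

```python
import numpy as np

def relative_mse(x, y):
    """Relative mean square error ||x - y||^2 / ||x||^2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum((x - y) ** 2) / np.sum(x ** 2)

# MSE values reported in the text after 12 iterations.
reported = {"ISTC": 0.02, "FISTA": 0.25, "ISTA": 0.46}
# Error ratio of each baseline relative to ISTC.
ratios = {name: mse / reported["ISTC"]
          for name, mse in reported.items() if name != "ISTC"}
```

The ratios are about $12.5$ for FISTA and $23$ for ISTA, both above the factor of $10$ mentioned in the text.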
5 Conclusion
The first goal of this work is to define a deep neural network having a good accuracy for complex image classification and which can be analyzed mathematically. This sparse scattering network learns the representation by optimizing a sparse code computed with a dictionary learned over scattering coefficients. The dictionary learning is implemented with a new homotopy ISTC network having an exponential convergence. The sparse dictionary learning improves accuracy by more than 20% over a scattering representation alone, and has a higher accuracy than AlexNet. The dictionary seems to be optimized in order to build separated sparse codes for each class, which belong to unions of linear spaces. Because the network operators are mathematically well specified, the analysis of its properties is simpler than for standard deep convolutional networks. However, more work is needed to understand the dictionary optimization and how it relates to image and class properties.
Acknowledgments
This work was supported by the ERC InvariantClass 320959 and grants from Région Ile-de-France. We thank the Scientific Computing Core at the Flatiron Institute for the use of their computing resources. We would like to thank Eugene Belilovsky for helpful discussions and comments.
Appendix A Appendix
A.1 Proof of Theorem 3.1
Let $\alpha^0$ be the optimal $\mathbf{l}^{\mathbf{0}}$ sparse code. We denote by $\mathcal{S}(\alpha)$ the support of any $\alpha$. We are going to prove by induction on $n$ that for any $n \ge 0$ we have $\mathcal{S}(\alpha_n) \subset \mathcal{S}(\alpha^0)$ and $\|\alpha_n - \alpha^0\|_\infty \le 2\lambda_n$ if $\lambda_n \ge \lambda_*$.
For $n=0$, $\alpha_0 = 0$ so $\mathcal{S}(\alpha_0) = \varnothing$ is indeed included in the support of $\alpha^0$ and $\|\alpha_0 - \alpha^0\|_\infty = \|\alpha^0\|_\infty$. To verify the induction hypothesis for $\lambda_0 = \lambda_{\max} \ge \lambda_*$, we shall prove that $\|\alpha^0\|_\infty \le 2\lambda_{\max}$.
Let us write the error $w = \beta - D\alpha^0$. For all $m$
$$\alpha^0(m)\, W_m^t D_m = W_m^t \beta - W_m^t w - \sum_{m' \ne m} \alpha^0(m')\, W_m^t D_{m'}.$$
Since the support of $\alpha^0$ has at most $s$ elements, $W_m^t D_m = 1$ and $\tilde{\mu} = \max_{m \ne m'} |W_m^t D_{m'}|$, we have
$$|\alpha^0(m)| \le |W_m^t \beta| + |W_m^t w| + s \tilde{\mu} \|\alpha^0\|_\infty$$
so taking the max on $m$ gives:
$$\|\alpha^0\|_\infty (1 - \tilde{\mu} s) \le \|W^t \beta\|_\infty + \|W^t w\|_\infty$$
But given the inequalities
$$\begin{aligned}
\|W^t \beta\|_\infty &\le \lambda_{\max} \\
\|W^t w\|_\infty &\le \lambda_{\max} (1 - 2\gamma \tilde{\mu} s) \\
\frac{1 - \gamma \tilde{\mu} s}{1 - \tilde{\mu} s} &\le 1 \quad \text{since } \gamma \ge 1 \text{ and } (1 - \tilde{\mu} s) > 0
\end{aligned}$$
the right-hand side is at most $2\lambda_{\max}(1 - \gamma \tilde{\mu} s)$, and dividing by $(1 - \tilde{\mu} s)$ we get
$$\|\alpha^0\|_\infty \le 2\lambda_{\max} = 2\lambda_0$$
Let us now suppose that the property is valid for $n$ and let us prove it for $n+1$. We denote by ${D}_{\mathcal{A}}$ the restriction of $D$ to vectors indexed by $\mathcal{A}$. We begin by showing that $\mathcal{S}({\alpha}_{n+1})\subset \mathcal{S}({\alpha}^{0})$. For any $m\in \mathcal{S}({\alpha}_{n+1})$, since $\beta =D{\alpha}^{0}+w$ and ${W}_{m}^{t}{D}_{m}=1$ we have
$$\begin{aligned}
\alpha_{n+1}(m) &= T_{\lambda_{n+1}}\big(\alpha_n(m) + W_m^t(\beta - D\alpha_n)\big) \\
&= T_{\lambda_{n+1}}\big(\alpha^0(m) + W_m^t(D_{\mathcal{S}(\alpha^0) \cup \mathcal{S}(\alpha_n) \setminus \{m\}} (\alpha^0 - \alpha_n)_{\mathcal{S}(\alpha^0) \cup \mathcal{S}(\alpha_n) \setminus \{m\}} + w)\big)
\end{aligned}$$
For any $m$ not in $\mathcal{S}(\alpha^0)$, let us prove that $\alpha_{n+1}(m) = 0$. The induction hypothesis assumes that $\mathcal{S}(\alpha_n) \subset \mathcal{S}(\alpha^0)$ and $\|\alpha^0 - \alpha_n\|_\infty \le 2\lambda_n$ with $\lambda_n \ge \lambda_*$ so:
$$\begin{aligned}
I &= \big|\alpha^0(m) + W_m^t(D_{\mathcal{S}(\alpha^0) \cup \mathcal{S}(\alpha_n) \setminus \{m\}} (\alpha^0 - \alpha_n)_{\mathcal{S}(\alpha^0) \cup \mathcal{S}(\alpha_n) \setminus \{m\}} + w)\big| \\
&\le \big|W_m^t(D_{\mathcal{S}(\alpha^0)} (\alpha^0 - \alpha_n)_{\mathcal{S}(\alpha^0)})\big| + |W_m^t w| \quad \text{since } \mathcal{S}(\alpha_n) \subset \mathcal{S}(\alpha^0) \text{ and } \alpha^0(m) = 0 \text{ by assumption.} \\
&\le \tilde{\mu} s \|\alpha^0 - \alpha_n\|_\infty + \|W^t w\|_\infty
\end{aligned}$$
Since we assume that ${\lambda}_{n+1}\ge {\lambda}_{*}$, we have
$$\|W^t w\|_\infty \le (1 - 2\gamma \tilde{\mu} s) \lambda_{n+1}$$
and thus
$$I \le \tilde{\mu} s \|\alpha^0 - \alpha_n\|_\infty + \|W^t w\|_\infty \le \tilde{\mu} s\, 2\lambda_n + \lambda_{n+1} (1 - 2\gamma \tilde{\mu} s) \le \lambda_{n+1}$$
since ${\lambda}_{n}=\gamma {\lambda}_{n+1}$.
Because of the thresholding ${T}_{{\lambda}_{n+1}}$, it proves that ${\alpha}_{n+1}(m)=0$ and hence that $\mathcal{S}({\alpha}_{n+1})\subset \mathcal{S}({\alpha}^{0})$.
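The iteration analyzed in this induction can be sketched numerically. The block below is a minimal pure-Python illustration under stated assumptions, not the paper's implementation: it assumes $W = D$, a two-sided soft thresholding for $T_\lambda$, and a hypothetical toy dictionary (canonical atoms plus one normalized constant atom) for which $\tilde{\mu} = 1/6$, with $s = 2$ and $\gamma = 1.2$ chosen so that $\tilde{\mu} s < 1/2$ and $\gamma < (2\tilde{\mu} s)^{-1}$. All function names and numerical values are illustrative. It checks the two properties of the induction hypothesis at every step with $\lambda_n \ge \lambda_*$.

```python
import math

def soft_threshold(x, lam):
    """Two-sided soft thresholding T_lam (assumed form of the operator T)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def build_toy_dictionary(d):
    """Hypothetical dictionary: d canonical atoms plus one normalized constant atom."""
    atoms = [[1.0 if i == j else 0.0 for i in range(d)] for j in range(d)]
    atoms.append([1.0 / math.sqrt(d)] * d)
    return atoms

def synthesize(atoms, alpha):
    """Compute D alpha, where atoms[m] is the m-th column of D."""
    out = [0.0] * len(atoms[0])
    for coeff, atom in zip(alpha, atoms):
        for i, v in enumerate(atom):
            out[i] += coeff * v
    return out

def correlate(atoms, x):
    """Compute W^t x, with W = D assumed."""
    return [sum(v * xi for v, xi in zip(atom, x)) for atom in atoms]

def coherence(atoms):
    """mu_tilde = max over m != m' of |W_m^t D_m'|, with W = D assumed."""
    return max(abs(sum(p * q for p, q in zip(atoms[m], atoms[k])))
               for m in range(len(atoms)) for k in range(len(atoms)) if m != k)

def homotopy_thresholding(atoms, beta, gamma, lam_star, max_iter=200):
    """Return [(lambda_n, alpha_n)] for all lambda_n = lambda_max / gamma**n >= lam_star."""
    lam = max(abs(c) for c in correlate(atoms, beta))  # lambda_0 = lambda_max
    alpha = [0.0] * len(atoms)
    history = [(lam, alpha)]
    for _ in range(max_iter):
        if lam / gamma < lam_star:
            break
        lam /= gamma
        residual = [b - r for b, r in zip(beta, synthesize(atoms, alpha))]
        alpha = [soft_threshold(a + c, lam)
                 for a, c in zip(alpha, correlate(atoms, residual))]
        history.append((lam, alpha))
    return history

# Toy problem: a 2-sparse code alpha0 and a small error w (illustrative values).
d, s, gamma = 36, 2, 1.2
atoms = build_toy_dictionary(d)
mu = coherence(atoms)
alpha0 = [0.0] * len(atoms)
alpha0[3], alpha0[d] = 1.0, -0.7
w = [0.0] * d
w[0] = 0.002
beta = [x + e for x, e in zip(synthesize(atoms, alpha0), w)]
lam_star = max(abs(c) for c in correlate(atoms, w)) / (1 - 2 * gamma * mu * s)

history = homotopy_thresholding(atoms, beta, gamma, lam_star)
support0 = {m for m, a in enumerate(alpha0) if a != 0.0}
# Support inclusion: S(alpha_n) is a subset of S(alpha0) at every step.
support_ok = all({m for m, a in enumerate(alpha) if a != 0.0} <= support0
                 for _, alpha in history)
# Error bound: ||alpha_n - alpha0||_inf <= 2 lambda_n (small float tolerance).
error_ok = all(max(abs(a - a0) for a, a0 in zip(alpha, alpha0)) <= 2 * lam + 1e-12
               for lam, alpha in history)
print(support_ok, error_ok)
```

The loop stops as soon as the next threshold would fall below $\lambda_*$, which mirrors the condition $\lambda_n \ge \lambda_*$ under which the induction holds.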
Let us now evaluate $\|\alpha^0 - \alpha_{n+1}\|_\infty$. For any $(\alpha_1, \alpha_2, \lambda)$, a soft thresholding satisfies
$$|T_\lambda(\alpha_1 + \alpha_2) - \alpha_1| \le \lambda + |\alpha_2|$$
so:
$$\begin{aligned}
|\alpha_{n+1}(m) - \alpha^0(m)| &\le \lambda_{n+1} + \big|W_m^t(D_{\mathcal{S}(\alpha^0) \cup \mathcal{S}(\alpha_n) \setminus \{m\}} (\alpha^0 - \alpha_n)_{\mathcal{S}(\alpha^0) \cup \mathcal{S}(\alpha_n) \setminus \{m\}})\big| + |W_m^t w| \\
&\le \lambda_{n+1} + \tilde{\mu} s \|\alpha^0 - \alpha_n\|_\infty + \|W^t w\|_\infty \\
&\le \lambda_{n+1} + \tilde{\mu} s\, 2\lambda_n + \lambda_{n+1} (1 - 2\gamma \tilde{\mu} s) = 2\lambda_{n+1}
\end{aligned}$$
Taking a max over $m$ proves the induction hypothesis.
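The soft-thresholding inequality used in the last step, $|T_\lambda(\alpha_1 + \alpha_2) - \alpha_1| \le \lambda + |\alpha_2|$, follows from $|T_\lambda(x) - x| \le \lambda$ and the triangle inequality. The short sketch below checks it on random scalar inputs. It assumes the standard two-sided soft thresholding, and the function names, sampling ranges and seed are illustrative choices rather than anything taken from the paper. The inequality would not hold for a one-sided (ReLU-like) thresholding when $\alpha_1$ is strongly negative.

```python
import random

def soft_threshold(x, lam):
    """Two-sided soft thresholding: sign(x) * max(|x| - lam, 0)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def max_violation(trials=10_000, seed=0):
    """Largest observed value of |T_lam(a1 + a2) - a1| - (lam + |a2|)."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        a1 = rng.uniform(-5.0, 5.0)
        a2 = rng.uniform(-5.0, 5.0)
        lam = rng.uniform(0.0, 3.0)
        gap = abs(soft_threshold(a1 + a2, lam) - a1) - (lam + abs(a2))
        worst = max(worst, gap)
    return worst

# A non-positive value (up to rounding) means no violation was observed.
print(max_violation() <= 1e-12)
```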