An Improved Analysis of Training Over-parameterized Deep Neural Networks

  • 2019-06-11 16:40:46
  • Difan Zou, Quanquan Gu
  • 4

Abstract

A recent line of research has shown that gradient-based algorithms withrandom initialization can converge to the global minima of the training lossfor over-parameterized (i.e., sufficiently wide) deep neural networks. However,the condition on the width of the neural network to ensure the globalconvergence is very stringent, which is often a high-degree polynomial in thetraining sample size $n$ (e.g., $O(n^{24})$). In this paper, we provide animproved analysis of the global convergence of (stochastic) gradient descentfor training deep neural networks, which only requires a milderover-parameterization condition than previous work in terms of the trainingsample size and other problem-dependent parameters. The main technicalcontributions of our analysis include (a) a tighter gradient lower bound thatleads to a faster convergence of the algorithm, and (b) a sharpercharacterization of the trajectory length of the algorithm. By specializing ourresult to two-layer (i.e., one-hidden-layer) neural networks, it also providesa milder over-parameterization condition than the best-known result in priorwork.

 

Quick Read (beta)

An Improved Analysis of Training Over-parameterized
Deep Neural Networks

Difan Zou    and    Quanquan Gu Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
Abstract

A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the condition on the width of the neural network to ensure the global convergence is very stringent, which is often a high-degree polynomial in the training sample size n (e.g., O(n24)). In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which only requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. The main technical contributions of our analysis include (a) a tighter gradient lower bound that leads to a faster convergence of the algorithm, and (b) a sharper characterization of the trajectory length of the algorithm. By specializing our result to two-layer (i.e., one-hidden-layer) neural networks, it also provides a milder over-parameterization condition than the best-known result in prior work.

1 Introduction

Recent study (Zhang et al., 2016) has revealed that deep neural networks trained by gradient-based algorithms can fit training data with random labels and achieve zero training error. Since the loss landscape of training deep neural network is highly nonconvex or even nonsmooth, conventional optimization theory cannot explain why gradient descent (GD) and stochastic gradient descent (SGD) can find the global minimum of the loss function (i.e., achieving zero training error). To better understand the training of neural networks, there is a line of research (Tian, 2017; Brutzkus and Globerson, 2017; Du et al., 2017; Li and Yuan, 2017; Zhong et al., 2017; Du and Lee, 2018; Zhang et al., 2018) studying two-layer (i.e., one-hidden-layer) neural networks, where it assumes there exists a teacher network (i.e., an underlying ground-truth network) generating the output given the input, and casts neural network learning as weight matrix recovery for the teacher network. However, these studies not only make strong assumptions on the training data, but also need special initialization methods that are very different from the commonly used initialization method (He et al., 2015) in practice. Li and Liang (2018); Du et al. (2018b) advanced this line of research by proving that under much milder assumptions on the training data, (stochastic) gradient descent can attain a global convergence for training over-parameterized (i.e.,sufficiently wide) two-layer ReLU network with widely used random initialization method (He et al., 2015). More recently, Allen-Zhu et al. (2018b); Du et al. (2018a); Zou et al. (2018) generalized the global convergence results from two-layer networks to deep neural networks. However, there is a huge gap between the theory and practice since all these work Li and Liang (2018); Du et al. (2018b); Allen-Zhu et al. (2018b); Du et al. (2018a); Zou et al. (2018) require unrealistic over-parameterization conditions on the width of neural networks, especially for deep networks. In specific, in order to establish the global convergence for training two-layer ReLU networks, Du et al. (2018b) requires the network width, i.e., number of hidden nodes, to be at least Ω(n6/λ04), where n is the training sample size and λ0 is the smallest eigenvalue of the so-called Gram matrix defined in Du et al. (2018b), which is essentially the neural tangent kernel (Jacot et al., 2018; Chizat and Bach, 2018) on the training data. Under the same assumption on the training data, Wu et al. (2019) improved the iteration complexity of GD in Du et al. (2018b) from O(n2log(1/ϵ)/λ02) to O(nlog(1/ϵ)/λ0) and Oymak and Soltanolkotabi (2019) improved the over-parameterization condition to Ω(n𝐗26/λ04), where ϵ is the target error and 𝐗n×d is the input data matrix. For deep ReLU networks, the best known result was established in Allen-Zhu et al. (2018b), which requires the network width to be at least Ω~(kn24L12ϕ-8)11 1 Here Ω~() hides constants and the logarithmic dependencies on problem dependent parameters except ϵ. to ensure the global convergence of GD and SGD, where L is the number of hidden layers, ϕ is the minimum data separation distance and k is the output dimension.

This paper continues the line of research, and improves the over-parameterization condition and the global convergence rate of (stochastic) gradient descent for training deep neural networks. In specific, under the same setting as in Allen-Zhu et al. (2018b), we prove faster global convergence rates for both GD and SGD under a significantly milder condition on the neural network width. Furthermore, when specializing our result to two-layer ReLU networks, it also outperforms the best-known result proved in Oymak and Soltanolkotabi (2019). The improvement in our result is due to the following two innovative proof techniques: (a) a tighter gradient lower bound, which leads to a faster rate of convergence for GD/SGD; and (b) a sharper characterization of the trajectory length for GD/SGD until convergence.

We highlight our main contributions as follows:

  • We show that, with Gaussian random initialization (He et al., 2015) on each layer, when the number of hidden nodes per layer is Ω~(kn8L12ϕ-4), GD can achieve ϵ training loss within O~(n2L2log(1/ϵ)ϕ-1) iterations, where L is the number of hidden layers, ϕ is the minimum data separation distance, n is the number of training examples, and k is the output dimension. Compared with the state-of-the-art result (Allen-Zhu et al., 2018b), our over-parameterization condition is milder by a factor of Ω~(n16ϕ-4), and our iteration complexity is better by a factor of O~(n4ϕ-1).

  • We also prove a similar convergence result for SGD. We show that with Gaussian random initialization (He et al., 2015) on each layer, when the number of hidden nodes per layer is Ω~(kn17L12B-4ϕ-8), SGD can achieve ϵ expected training loss within O~(n5log(1/ϵ)B-1ϕ-2) iterations, where B is the minibatch size of SGD. Compared with the corresponding results in Allen-Zhu et al. (2018b), our results are strictly better by a factor of Ω~(n7B5) and O~(n2) respectively regarding over-parameterization condition and iteration complexity.

  • When specializing our results of training deep ReLU networks with GD to two-layer ReLU networks, it also outperforms the corresponding results (Du et al., 2018b; Wu et al., 2019; Oymak and Soltanolkotabi, 2019). In addition, for training two-layer ReLU networks with SGD, we are able to show much better result than training deep ReLU networks with SGD.

For the ease of comparison, we summarize the best-known results (Du et al., 2018b; Allen-Zhu et al., 2018b; Du et al., 2018a; Wu et al., 2019; Oymak and Soltanolkotabi, 2019) of training overparameterized neural networks with GD and compare with them in terms of over-parameterization condition and iteration complexity in Table 1. We will show in Section 3 that, under the assumption that all training data points have unit 2 norm, which is the common assumption made in all these work (Du et al., 2018b; Allen-Zhu et al., 2018b; Du et al., 2018a; Wu et al., 2019; Oymak and Soltanolkotabi, 2019), λ0>0 is equivalent to the fact that all training data are separated by some distance ϕ, and we have λ0=O(n-2ϕ) (Oymak and Soltanolkotabi, 2019). Substituting λ0=Ω(n-2ϕ) into Table 1, it is evident that our result outperforms all the other results under the same assumptions.

Table 1: Over-parameterization conditions and iteration complexities of GD for training overparamterized neural networks. 𝐊(L) is the Gram matrix for L-hidden-layer neural network (Du et al., 2018a). Note that the dimension of the output is k=1 in Du et al. (2018b, a); Wu et al. (2019); Oymak and Soltanolkotabi (2019).
Over-para. condition Iteration complexity Deep? ReLU?
Du et al. (2018b) Ω(n6λ04) O(n2log(1/ϵ)λ02) no yes
Wu et al. (2019) Ω(n6λ04) O(nlog(1/ϵ)λ02) no yes
Oymak and Soltanolkotabi (2019) Ω(n𝐗26λ04) O(𝐗22log(1/ϵ)λ0) no yes
Du et al. (2018a) Ω(2O(L)n4λmin4(𝐊(L))) O(2O(L)n2log(1/ϵ)λmin2(𝐊(L))) yes no
Allen-Zhu et al. (2018b) Ω~(kn24L12ϕ8) O(n6L2log(1/ϵ)ϕ2) yes yes
This paper Ω~(kn8L12ϕ4) O(n2L2log(1/ϵ)ϕ) yes yes

Notation For scalars, vectors and matrices, we use lower case, lower case bold face, and upper case bold face letters to denote them respectively. For a positive integer, we denote by [k] the set {1,,k}. For a vector 𝐱=(x1,,xd) and a positive integer p, we denote by 𝐱p=(i=1d|xi|p)1/p the p norm of 𝐱. In addition, we denote by 𝐱=maxi=1,,d|xi| the norm of 𝐱, and 𝐱0=|{xi:xi0,i=1,,d}| the 0 norm of 𝐱. For a matrix 𝐀m×n, we denote by 𝐀F the Frobenius norm of 𝐀, 𝐀2 the spectral norm (maximum singular value), λmin(𝐀) the smallest singular value, 𝐀0 the number of nonzero entries, and 𝐀2, the maximum 2 norm over all row vectors, i.e., 𝐀2,=maxi=1,,m𝐀i*2. For a collection of matrices 𝐖={𝐖1,,𝐖L}, we denote 𝐖F=l=1L𝐖lF2, 𝐖2=maxl[L]𝐖l2 and 𝐖2,=maxl[L]𝐖l2,. Given two collections of matrices 𝐖~={𝐖~1,,𝐖~L} and 𝐖^={𝐖^1,,𝐖^L}, we define their inner product as 𝐖~,𝐖^=l=1L𝐖~l,𝐖^l. For two sequences {an} and {bn}, we use an=O(bn) to denote that anC1bn for some absolute constant C1>0, and use an=Ω(bn) to denote that anC2bn for some absolute constant C2>0. In addition, we use O~() and Ω~() to hide logarithmic factors.

2 Problem setup and algorithms

In this section, we introduce the problem setup and the training algorithms.

Following Allen-Zhu et al. (2018b), we consider the training of an L-hidden layer fully connected neural network, which takes 𝐱d as input, and outputs 𝐲k. In specific, the neural network is a vector-valued function 𝐟𝐖:dm, which is defined as

𝐟𝐖(𝐱)=𝐕σ(𝐖Lσ(𝐖L-1σ(𝐖1𝐱))),

where 𝐖1m×d, 𝐖2,,𝐖Lm×m denote the weight matrices for the hidden layers, and 𝐕k×m denotes the weight matrix in the output layer, σ(x)=max{0,x} is the entry-wise ReLU activation function. In addition, we denote by σ(x)=𝟙(x) the derivative of ReLU activation function and 𝐰l,j the weight vector of the j-th node in the l-th layer.

Given a training set {(𝐱i,𝐲i)}i=1,,n where 𝐱id and 𝐲ik, the empirical loss function for training the neural network is defined as

L(𝐖):=1ni=1n(𝐲^i,𝐲i), (2.1)

where (,) is the loss function, and 𝐲^i=𝐟𝐖(𝐱i). In this paper, for the ease of exposition, we follow Allen-Zhu et al. (2018b); Du et al. (2018b, a); Oymak and Soltanolkotabi (2019) and consider square loss as follows

(𝐲^i,𝐲i)=12𝐲i-𝐲^i22,

where 𝐲^i=𝐟𝐖(𝐱i)k denotes the output of the neural network given input 𝐱i. It is worth noting that our result can be easily extended to other loss functions such as cross entropy loss (Zou et al., 2018) as well.

We will study both gradient descent and stochastic gradient descent as training algorithms, which are displayed in Algorithm 2. For gradient descent, we update the weight matrix 𝐖l(t) using full partial gradient 𝐖lL(𝐖(t)). For stochastic gradient descent, we update the weight matrix 𝐖l(t) using stochastic partial gradient 1/Bs(t)𝐖l(𝐟𝐖(t)(𝐱s),𝐲s), where (t) with |(t)|=B denotes the minibatch of training examples at the t-th iteration. Both algorithms are initialized in the same way as Allen-Zhu et al. (2018b), which is essentially the initialization method (He et al., 2015) widely used in practice. In the remaining of this paper, we denote by

L(𝐖(t))={𝐖lL(𝐖(t))}l[L]and(𝐟𝐖(t)(𝐱i),𝐲i)={𝐖l(𝐟𝐖(t)(𝐱i),𝐲i)}l[L]

the collections of all partial gradients of L(𝐖(t)) and (𝐟𝐖(t)(𝐱i),𝐲i).

{algorithm}

[t] (Stochastic) Gradient descent with Gaussian random initialization \[email protected]@algorithmic[1] \STATEinput: Training data {𝐱i,𝐲i}i[n], step size η, total number of iterations T, minibatch size B. \STATEinitialization: For all l[L], each row of weight matrix 𝐖l(0) is independently generated from 𝒩(0,2/m𝐈), each row of 𝐕 is independently generated from 𝒩(0,𝐈/d)

Gradient Descent 

\FOR

t=0,,T \STATE𝐖l(t+1)=𝐖l(t)-η𝐖lL(𝐖(t)) for all l[L] \ENDFOR\STATEoutput: {𝐖l(T)}l[L]

Stochastic Gradient Descent 

\FOR

t=0,,T \STATEUniformly sample a minibatch of training data (t)[n] \STATE𝐖l(t+1)=𝐖l(t)-ηBs(t)𝐖l(𝐟𝐖(t)(𝐱s),𝐲s) for all l[L] \ENDFOR\STATEoutput: {𝐖l(T)}l[L]

3 Main theory

In this section, we present our main theoretical results. We make the following assumptions on the training data.

Assumption 3.1.

For any 𝐱i, it holds that 𝐱i2=1 and (𝐱i)d=μ, where μ is an positive constant.

The same assumption has been made in all previous work along this line (Du et al., 2018a; Allen-Zhu et al., 2018b; Zou et al., 2018; Oymak and Soltanolkotabi, 2019). Note that requiring the norm of all training examples to be 1 is not essential, and this assumption can be relaxed to be 𝐱i2 is lower and upper bounded by some constants.

Assumption 3.2.

For any two different training data points 𝐱i and 𝐱j, there exists a positive constant ϕ>0 such that 𝐱i-𝐱j2ϕ.

This assumption has also been made in Allen-Zhu et al. (2018c, b), which is essential to guarantee zero training error for deep neural networks. It is a quite mild assumption for the regression problem as studied in this paper. Note that Du et al. (2018a) made a different assumption on training data, which requires the Gram matrix 𝐊(L) (See their paper for details) defined on the L-hidden-layer networks is positive definite. However, their assumption is not easy to verify for neural networks with more than two layers.

Based on Assumptions 3.1 and 3.2, we are able to establish the global convergence rates of GD and SGD for training deep ReLU networks. We start with the result of GD for L-hidden-layer networks.

3.1 Training L-hidden-layer ReLU networks with GD

The global convergence of GD for training deep neural networks is stated in the following theorem.

Theorem 3.3.

Under Assumptions 3.1 and 3.2, and suppose the number of hidden nodes per layer satisfies

m=Ω(kn8L12log3(m)/ϕ4). (3.1)

Then if set the step size η=O(k/(L2m)), with probability at least 1-O(n-1), gradient descent is able to find a point that achieves ϵ training loss within

T=O(n2L2log(1/ϵ)/ϕ)

iterations.

Remark 3.4.

The state-of-the-art results for training deep ReLU network are provided by Allen-Zhu et al. (2018b), where the authors showed that GD can achieve ϵ-training loss within O(n6L2log(1/ϵ)/ϕ2) iterations if the neural network width satisfies m=Ω~(kn24L12/ϕ8). As a clear comparison, our result on the iteration complexity is better than theirs by a factor of O(n4/ϕ), and our over-parameterization condition is milder than theirs by a factor of Ω~(n16/ϕ4). Du et al. (2018a) also proved the global convergence of GD for training deep neural network with smooth activation functions. As shown in Table 1, the over-parameterization condition and iteration complexity in Du et al. (2018a) have an exponential dependency on L, which is much worse than the polynomial dependency on L as in Allen-Zhu et al. (2018b) and our result.

We now specialize our results in Theorem 3.3 to two-layer networks by removing the dependency on the number of hidden layers, i.e., L. We state this result in the following corollary.

Corollary 3.5.

Under the same assumptions made in Theorem 3.3. For training two-layer ReLU networks, if set the number of hidden nodes m=Ω(kn8log3(m)/ϕ4) and step size η=O(k/m), then with probability at least 1-O(n-1), GD is able to find a point that achieves ϵ-training loss within T=O(n2log(1/ϵ)/ϕ) iterations.

For training two-layer ReLU networks, Du et al. (2018b) made a different assumption on the training data to establish the global convergence of GD. Specifically, Du et al. (2018b) defined a Gram matrix, which is also known as neural tangent kernel (Jacot et al., 2018), based on the training data {𝐱i}i=1,,n and assumed that the smallest eigenvalue of such Gram matrix is strictly positive. In fact, for two-layer neural networks, their assumption is equivalent to Assumption 3.2, as shown in the following proposition.

Proposition 3.6.

Under Assumption 3.1, define the Gram matrix 𝐇n×n as follows

𝐇ij=𝔼𝐰𝒩(0,𝐈)[𝐱i𝐱jσ(𝐰𝐱i)σ(𝐰𝐱j)],

then the assumption λ0=λmin(𝐇)>0 is equivalent to Assumption 3.2. In addition, there exists a sufficiently small constant C such that λ0Cϕn-2.

Remark 3.7.

According to Proposition 3.6, we can make a direct comparison between our convergence results for two-layer ReLU networks in Corollary 3.5 with those in Du et al. (2018b); Oymak and Soltanolkotabi (2019). In specific, as shown in Table 1, the iteration complexity and over-parameterization condition proved in Du et al. (2018b) can be translated to O(n6log(1/ϵ)/ϕ2) and Ω(n14/ϕ4) respectively under Assumption 3.2. Oymak and Soltanolkotabi (2019) improved the result in Du et al. (2018b) and the improved iteration complexity and over-parameterization condition can be translated to O(n2𝐗22log(1/ϵ)/ϕ) 22 2 It is worth noting that 𝐗22=O(1) if dn, 𝐗22=O(n/d) if 𝐗 is randomly generated, and 𝐗22=O(n) in the worst case. and Ω(n9𝐗26/ϕ4) respectively, where 𝐗=[𝐱1,,𝐱n]d×n is the input data matrix. Our iteration complexity for two-layer ReLU networks is better than that in Oymak and Soltanolkotabi (2019) by a factor of O(𝐗22) 33 3 Here we set k=1 in order to match the problem setting in Du et al. (2018b); Oymak and Soltanolkotabi (2019)., and the over-parameterization condition is also strictly milder than the that in Oymak and Soltanolkotabi (2019) by a factor of O(n𝐗26).

3.2 Extension to training L-hidden-layer ReLU networks with SGD

Then we extend the convergence results of GD to SGD in the following theorem.

Theorem 3.8.

Under Assumptions 3.1 and 3.2, and suppose the number of hidden nodes per layer satisfies

m=Ω(kn17L12log3(m)/(B4ϕ8)). (3.2)

Then if set the step size as η=O(kBϕ/(n3mlog(m))), with probability at least 1-O(n-1), SGD is able to achieve ϵ expected training loss within

T=O(n5log(m)log2(1/ϵ)/(Bϕ2))

iterations.

Remark 3.9.

We first compare our result with the state-of-the-art proved in Allen-Zhu et al. (2018b), where they showed that SGD can converge to a point with ϵ-training loss within O~(n7log(1/ϵ)/(Bϕ2)) iterations if m=Ω~(n24L12Bk/ϕ8). In stark contrast, our result on the over-parameterization condition is strictly better than it by a factor of Ω~(n7B5), and our result on the iteration complexity is also faster by a factor of O(n2).

Moreover, we also characterize the convergence rate and over-parameterization condition of SGD for training two-layer networks. Unlike the gradient descent, which has the same convergence rate and over-parameterization condition for training both deep and two-layer networks in terms of training data size n, we find that the over-parameterization condition of SGD can be further improved for training two-layer neural networks. We state this improved result in the following theorem.

Theorem 3.10.

Under the same assumptions made in Theorem 3.8. For two-layer ReLU networks, if set the number of hidden nodes and step size as

m=Ω(k5/2n11log3(m)/(ϕ5B)),η=O(kBϕ/(n3mlog(m))),

then with probability at least 1-O(n-1), stochastic gradient descent is able to achieve ϵ training loss within T=O(n5log(m)log(1/ϵ)/(Bϕ2)) iterations.

Remark 3.11.

From Theorem 3.8, we can also obtain the convergence results of SGD for two-layer ReLU networks by choosing L=1. However, the resulting over-parameterization condition is m=Ω(kn17log3(m)B-4ϕ-8), which is much worse than that in Theorem 3.10. This is because for two-layer networks, the training loss enjoys nicer local properties around the initialization, which can be leveraged to improve the convergence of SGD. Due to space limit, we defer more details to Appendix A.

4 Proof sketch of the main theory

In this section, we provide the proof sketch for Theorems 3.3, and highlight our technical contributions and innovative proof techniques.

4.1 Overview of the technical contributions

The improvements in our result are mainly attributed to the following two aspects: (1) a tighter gradient lower bound leading to faster convergence; and (2) a sharper characterization of the trajectory length of the algorithm.

We first define the following perturbation region based on the initialization,

(𝐖(0),τ)={𝐖:𝐖l-𝐖l(0)2τ for all l[L]},

where τ>0 is the preset perturbation radius for each weight matrix 𝐖l.

Tighter gradient lower bound. By the definition of L(𝐖), we have L(𝐖)F2=l=1L𝐖lL(𝐖)F2𝐖LL(𝐖)F2. Therefore, we can focus on the partial gradient of L(𝐖) with respect to the weight matrix at the last hidden layer. Note that we further have 𝐖LL(𝐖)F2=j=1m𝐰L,jL(𝐖)22, where

𝐰L,jL(𝐖)=1ni=1n𝐟𝐖(𝐱i)-𝐲i,𝐯jσ(𝐰L,j,𝐱L-1,i)𝐱L-1,i,

and 𝐱L-1,i denotes the output of the (L-1)-th hidden layer with input 𝐱i. In order to prove the gradient lower bound, for each 𝐱L-1,i, we introduce a region namely “gradient region”, denoted by 𝒲j, which is almost orthogonal to 𝐱L-1,i. Then we prove two major properties of these n regions {𝒲1,,𝒲n}: (1) 𝒲i𝒲j= if ij, and (2) if 𝐰L,j𝒲i for any i, with probability at least 1/2, 𝐰L,jL(𝐖)2 is sufficiently large. We visualize these “gradient regions” in Figure 1(a). Since {𝐰L,j}j[m] are randomly generated at the initialization, in order to get a larger bound of 𝐖LL(𝐖)F2, we hope the size of these “gradient regions” to be as large as possible. We take the union of the “gradient regions” for all training data, i.e., i=1n𝒲i, which is shown in Figure 1(a). As a comparison, Allen-Zhu et al. (2018b); Zou et al. (2018) only leveraged the “gradient region” for one training data point to establish the gradient lower bound, which is shown in Figure 1(b). Roughly speaking, the size of “gradient regions” utilized in our proof is n times larger than those used in Allen-Zhu et al. (2018b); Zou et al. (2018), which consequently leads to an O(n) improvement on the gradient lower bound. The improved gradient lower bound is formally stated in the following lemma.

Lemma 4.1 (Gradient lower bound).

Let τ=O(ϕ3/2n-3L-6log-3/2(m)), then for all 𝐖(𝐖(0),τ), with probability at least 1-exp(O(mϕ/(dn))), it holds that

L(𝐖)F2O(mϕL(𝐖)/(kn2)).
(a) “gradient region” for {𝐱L-1,i}i[n]
(b) “gradient region” for 𝐱L-1,1
Figure 1: (a): “gradient region” for all training data (b): “gradient region” for one training example.

Sharper characterization of the trajectory length. The improved analysis of the trajectory length is motivated by the following observation: at the t-th iteration, the decrease of the training loss after one-step gradient descent is proportional to the gradient norm, i.e., L(𝐖(t))-L(𝐖(t+1))L(𝐖(t))F2. In addition, the gradient norm L(𝐖(t))F determines the trajectory length in the t-th iteration. Putting them together, we can obtain

𝐖l(t+1)-𝐖l(t)2=η𝐖lL(𝐖(t))2Ckn2/(mϕ)(L(𝐖(t))-L(𝐖(t+1))), (4.1)

where C is an absolute constant. (4.1) enables the use of telescope sum, which yields 𝐖l(t)-𝐖l(0)2Ckn2L(𝐖(0))/mϕ. In stark contrast, Allen-Zhu et al. (2018b) bounds the trajectory length as

𝐖l(t+1)-𝐖l(t)2=η𝐖lL(𝐖(t))2ηCmL(𝐖(t))/k,

and further prove that 𝐖l(t)-𝐖l(0)2Ckn6L2(𝐖(0))/(mϕ2) by taking summation over t, where C is an absolute constant. Our sharp characterization of the trajectory length is formally summarized in the following lemma.

Lemma 4.2.

Assuming all iterates are staying inside the region (𝐖(0),τ) with τ=O(ϕ3/2n-3L-6log-3/2(m)), if set the step size η=O(k/(L2m)), with probability least 1-O(n-1), the following holds for all t0 and l[L],

𝐖l(t)-𝐖l(0)2O(kn2log(n)/(mϕ)).

4.2 Proof of Theorem 3.3

Our proof road map can be organized in three steps: (i) prove that the training loss enjoys good curvature properties within the perturbation region (𝐖(0),τ); (ii) show that gradient descent is able to converge to global minima based on such good curvature properties; and (iii) ensure all iterates stay inside the perturbation region until convergence.

Step (i) Training loss properties. We first show some key properties of the training loss within (𝐖(0),τ), which are essential to establish the convergence guarantees of gradient descent.

Lemma 4.3.

If mO(Llog(nL)), with probability at least 1-O(n-1) it holds that L(𝐖(0))O~(1).

Lemma 4.3 suggests that the training loss L(𝐖) at the initial point does not depend on the number of hidden nodes per layer, i.e., m.

Moreover, the training loss L(𝐖) is nonsmooth due to the non-differentiable ReLU activation function. Generally speaking, smoothness is essential to achieve linear rate of convergence for gradient-based algorithms. Fortunately, Allen-Zhu et al. (2018b) showed that the training loss satisfies locally semi-smoothness property, which is summarized in the following lemma.

Lemma 4.4 (Semi-smoothness (Allen-Zhu et al., 2018b)).

Let

τ[Ω(1/(k3/2m3/2L3/2log3/2(m))),O(1/(L4.5log3/2(m)))].

Then for any two collections 𝐖^={𝐖^l}l[L] and 𝐖~={𝐖~l}l[L] satisfying 𝐖^,𝐖~(𝐖(0),τ), with probability at least 1-exp(-Ω(-mτ3/2L)), there exist two constants C and C′′ such that

L(𝐖~) L(𝐖^)+L(𝐖^),𝐖~-𝐖^
+CL(𝐖^)τ1/3L2mlog(m)k𝐖~-𝐖^2+C′′L2mk𝐖~-𝐖^22. (4.2)

Lemma 4.4 is a rescaled version of Theorem 4 in Allen-Zhu et al. (2018b), since the training loss L(𝐖) in (2.1) is divided by the training sample size n, as opposed to the training loss in Allen-Zhu et al. (2018b). This lemma suggests that if the perturbation region is small, i.e., τ1, the non-smooth term (third term on the R.H.S. of (4.4)) is small and dominated by the gradient term (the second term on the the R.H.S. of (4.4)). Therefore, the training loss behaves like a smooth function in the perturbation region and the linear rate of convergence can be proved.

Step (ii) Convergence rate of GD. Now we are going to establish the convergence rate for gradient descent under the assumption that all iterates stay inside the region (𝐖(0),τ), where τ will be specified later.

Lemma 4.5.

Assume all iterates stay inside the region (𝐖(0),τ), where τ=O(ϕ3/2n-3L-6log-3/2(m)). Then under Assumptions 3.1 and 3.2, if set the step size η=O(k/(L2m)), with probability least 1-exp(-O(mτ3/2L)), it holds that

L(𝐖(t))(1-O(mϕηkn2))tL(𝐖(0)).

Lemma 4.5 suggests that gradient descent is able to decrease the training loss to zero at a linear rate.

Step (iii) Verifying all iterates of GD stay inside the perturbation region. Then we are going to ensure that all iterates of GD are staying inside the required region (𝐖(0),τ). Note that we have proved the distance 𝐖l(t)-𝐖l(0)2 in Lemma 4.2. Therefore, it suffices to verify that such distance is smaller than the preset value τ. Thus, we can complete the proof of Theorem 3.3 by verifying the conditions based on our choice of m. Note that we have set the required number of m in (3.1), plugging (3.1) into the result of Lemma 4.2, we have with probability at least 1-O(n-1), the following holds for all tT and l[L]

𝐖l(t)-𝐖l(0)2O(ϕ3/2n-3L-6log-3/2(m)),

which is exactly in the same order of τ in Lemma 4.5. Therefore, our choice of m guarantees that all iterates are inside the required perturbation region. In addition, by Lemma 4.5, in order to achieve ϵ accuracy, we require

Tη=O(kn2log(1/ϵ)m-1ϕ-1). (4.3)

Then substituting our choice of step size η=O(k/(L2m)) into (4.3) and applying Lemma 4.3, we can get the desired result for T.

5 Conclusions and future work

In this paper, we studied the global convergence of (stochastic) gradient descent for training over-parameterized ReLU networks, and improved the state-of-the-art results. Our proof technique can be also applied to prove similar results for other loss functions such as cross-entropy loss and other neural network architectures such as convolutional neural networks (CNN) (Allen-Zhu et al., 2018b; Du et al., 2018b) and ResNet (Allen-Zhu et al., 2018b; Du et al., 2018b; Zhang et al., 2019). One important future work is to investigate whether the over-parameterization condition and the convergence rate can be further improved. Another interesting future direction is to explore the use of our proof technique to improve the generalization analysis of overparameterized neural networks trained by gradient-based algorithms (Allen-Zhu et al., 2018a; Cao and Gu, 2019; Arora et al., 2019).

Appendix A Proof of the Main Theory

A.1 Proof of Proposition 3.6

We prove this proposition by two steps: (1) we prove that if there is no duplicate training data, it must hold that λmin(𝐇)>0; (2) we prove that if there exists at least one duplicate training data, we have λmin(𝐇)=0.

The first step can be done by applying Theorem 3 in Du et al. (2018b), where the author showed that if for any ij, 𝐱i𝐱j, then it holds that λmin(𝐇)>0. Since under Assumption 3.1, we have 𝐱i2=𝐱j2. Then it can be shown that 𝐱i𝐱j for all ij is an sufficient condition to λmin(𝐇).

Then we conduct the second step. Clearly, if we have two training data with 𝐱i=𝐱j, it can be shown that 𝐇ik=𝐇jk for all k=1,,n. This immediately implies that there exist two identical rows in 𝐇, which further suggests that λmin(𝐇)=0.

The last argument can be directly proved by Lemma I.1 in Oymak and Soltanolkotabi (2019), where the authors showed that λ0=λmin(𝐇)ϕ/(100n2).

By combining the above discussions, we are able to complete the proof.

A.2 Proof of Theorem 3.8

Now we sketch the proof of Theorem 3.8. Following the same idea of proving Theorem 3.3, we split the whole proof into three steps.

Step (i) Initialization and perturbation region characterization. Unlike the proof for GD, in addition to the crucial gradient lower bound specified in Lemma 4.1, we also require the gradient upper bound, which is stated in the following lemma.

Lemma A.1 (Gradient upper bounds (Allen-Zhu et al., 2018b)).

Let τ=O(ϕ3/2n-3L-6log-3/2(m)), then for all 𝐖(𝐖(0),τ), with probability at least 1-exp(O(mϕ/(dn))), it holds that

L(𝐖)F2O(mL(𝐖)k),(𝐟𝐖(𝐱i),𝐲i)F2O(m(𝐟𝐖(𝐱i),𝐲i)k).

In later analysis, we show that the gradient upper bound will be exploited to bound the distance between iterates of SGD and its initialization. Besides, note that Lemmas 4.3 and 4.4 hold for both GD and SGD, we do not state them again in this part.

Step (ii) Convergence rate of SGD. Analogous to the proof for GD, the following lemma shows that SGD is able to converge to the global minima at a linear rate.

Lemma A.2.

Assume all iterates stay inside the region (𝐖(0),τ), where τ=O(ϕ3B3/2n-6L-6log-3/2(m)). Then under Assumptions 3.1 and 3.2, if set the step size η=O(Bϕ/(L2mn2)), with probability least 1-exp(-O(mτ3/2L)), it holds that

𝔼[L(𝐖(t))](1-O(mϕηkn2))tL(𝐖(0)).

Step (iii) Verifying all iterates of SGD stay inside the perturbation region. Similar to the proof for GD, the following lemma characterizes the distance from each iterate to the initial point for SGD.

Lemma A.3.

Under the same assumptions made in Lemma A.2, if set the step size η=O(kBϕ/(n3mlog(m))), suppose mO(Tn), with probability at least 1-O(n-1), the following holds for all tT and l[L],

𝐖l(t)-𝐖l(0)2O(k1/2n5/2B-1/2m-1/2ϕ-1).
Proof of Theorem 3.8.

Compared with Lemma 4.2, the trajectory length of SGD is much larger than that of GD. In addition, we require a much smaller step size to guarantee that the iterates do not go too far away from the initial point. This makes over-parameterization condition of SGD worse than that of GD.

We complete the proof of Theorem 3.8 by verifying our choice of m in (3.2). By substituting (3.2) into Lemma A.3, we have with probability at least 1-O(n-1), the following holds for all tT and l[L]

𝐖l(t)-𝐖l(0)2=O(ϕ3/2B3/2n-6L-6log-3/2(m)),

which is exactly in the same order of τ in Lemma A.2. Then by Lemma A.2, we know that in order to achieve ϵ expected training loss, it suffices to set

Tη=O(kn2m-1ϕ-1log(1/ϵ)).

Then applying our choice of step size, i.e., η=O(kBϕ/(n3mlog(m))), we can get the desired result for T. This completes the proof. ∎

A.3 Proof of Theorem 3.10

Before proving Theorem 3.10, we first deliver the following two lemmas. The first lemma states the upper bound of stochastic gradient in 2, norm.

Lemma A.4.

With probability at least 1-O(m-1), it holds that

(𝐟𝐖(𝐱i),𝐲i)2,2O((𝐟𝐖(𝐱i),𝐲i)log(m))

for all 𝐖m×d and i[n].

The following lemma gives a different version of semi-smoothness for two-layer ReLU network.

Lemma A.5 (Semi-smoothness for two-layer ReLU network).

For any two collections 𝐖^={𝐖^l}l[L] and 𝐖~={𝐖~l}l[L] satisfying 𝐖^,𝐖~(𝐖(0),τ), with probability at least 1-exp(-O(-mτ2/3)), there exist two constants C and C′′ such that

L(𝐖~) L(𝐖^)+L(𝐖^),𝐖~-𝐖^
+CL(𝐖^)τ2/3mlog(m)k𝐖~-𝐖^2,+C′′mk𝐖~-𝐖^22.

It is worth noting that Lemma 4.4 can also imply a 2, norm based semi-smoothness result by applying the inequality 𝐖~-𝐖^2𝐖~-𝐖^Fm𝐖~-𝐖^2,. However, this operation will maintain the dependency on τ, i.e., τ1/3, which is worse than that in Lemma A.5 (e.g. τ2/3) since typically we have τ1. Therefore, Lemma A.5 is crucial to establish a better convergence guarantee for SGD in training two-layer ReLU network.

Proof of Theorem 3.10.

To simplify the proof, we use the following short-hand notation to define mini-batch stochastic gradient at the t-th iteration

𝐆(t)=1|(t)|s(t)(𝐟𝐖(t)(𝐱s),𝐲s),

where (t) is the minibatch of data indices with |(t)|=B. Then we bound its variance as follows,

𝔼[𝐆(t)-L(𝐖(t))F2] 1B𝔼s[(𝐟𝐖(t)(𝐱s),𝐲s)-L(𝐖(t))F2]
2B[𝔼s[(𝐟𝐖(t)(𝐱s),𝐲i)F2]+L(𝐖(t))F2]
4L(𝐖(t))Bk,

where the expectation is taken over the random choice of training data and the second inequality follows from Young’s inequality and the last inequality is by Lemma A.1. Moreover, we can further bound the expectation 𝔼[𝐆(t)F2] as follows,

𝔼[𝐆(t)F2]2𝔼[𝐆(t)-L(𝐖(t))F2]+2L(𝐖(t))F28mL(𝐖(t))Bk+2L(𝐖(t))F2. (A.1)

By Lemma A.5, we have the following for one-step stochastic gradient descent

L(𝐖(t+1)) L(𝐖(t))-ηL(𝐖(t)),𝐆(t)
+CηL(𝐖(t))τ2/3mlog(m)k𝐆(t)2,+C′′mη2k𝐆(t)22.

Taking expectation conditioned on 𝐖(t), we obtain

𝔼[L(𝐖(t+1))|𝐖(t)] L(𝐖(t))-ηL(𝐖(t)),𝐆(t)
+CηL(𝐖(t))τ2/3mlog(m)k𝔼[𝐆(t)2,|𝐖(t)]
+C′′mη2k𝔼[𝐆(t)22|𝐖(t)]. (A.2)

By Lemma A.4, with probability at least 1-O(m-1) we have the following upper bound on the quantity 𝔼[𝐆(t)2,|𝐖(t)] for all t=1,,T,

𝔼[𝐆(t)2,|𝐖(t)]𝔼[(𝐟𝐖(t)(𝐱i),𝐲i)2,|𝐖(t)]O(L(𝐖(t))log(m)).

Then based on Lemma B.2, plugging (A.1) and the above inequality into (A.3), and set

η=O(kmn2)andτ=O(ϕ3n3k3/4log3/2(m)).

Then with proper adjustment of constants we can obtain

𝔼[L(𝐖(t+1))|𝐖(t)]L(𝐖(t))-η2L(𝐖(t))F2(1-mϕη2kn2)L(𝐖(t)),

where the last inequality follows from Lemma 4.1. Then taking expectation on 𝐖(t), we have with probability 1-O(m-1),

𝔼[L(𝐖(t+1))](1-mϕη2kn2)𝔼[L(𝐖(t))](1-mϕη2kn2)t+1𝔼[L(𝐖(0))] (A.3)

holds for all t>0. Then by Lemma A.3, we know that if set η=O(kBϕ/(n3mlog(m))), with probability at least 1-O(n-1), it holds that

𝐖l(t)-𝐖l(0)2O(k1/2n5/2B1/2m1/2ϕ),

for all tT. Then by our choice of m, it is easy to verify that with probability at least 1-O(n-1)-O(m-1)=1-O(n-1),

𝐖l(t)-𝐖l(0)2O(k1/2n5/2B1/2ϕϕ4B1/2k5/4n11/2log3/2(m))=τ.

Moreover, note that in Lemma A.3 we set the step size as η=O(kBϕ/(n3mlog(m))) and (A.3) suggests that we need

Tη=O(kn2mϕ)

to achieve ϵ expected training loss. Therefore we can derive the number of iteration as

T=O(n5log(m)log(1/ϵ)Bϕ2).

This completes the proof. ∎

Appendix B Proof of Lemmas in Section 4 and Appendix A

B.1 Proof of Lemma 4.1

We first provide the following useful lemmas before starting the proof of Lemma 4.1.

The following lemma states that with high probability the norm of the output of each hidden layer is bounded by constants.

Lemma B.1 ((Zou et al., 2018)).

If mO(Llog(nL)), with probability at least 1-exp(-O(m/L)), it holds that 1/2𝐱l,i22 and 𝐱l,i/𝐱l,i2-𝐱l,j/𝐱l,j22ϕ/2 for all i,j[n] and l[L], where 𝐱l,i denotes the output of the l-th hidden layer given the input 𝐱i.

Lemma B.2.

Assume mO~(n2k2ϕ-1), then there exist an absolute constant C>0 such that with probability at least 1-exp(-O(mϕ/(kn))), it holds that

j=1m1ni=1n𝐮i,𝐯jσ(𝐰L,j(0),𝐱L-1,i)𝐱L-1,i22Cϕmi=1n𝐮i22kn3.

If we set 𝐮i=𝐟𝐖(0)(𝐱i)-𝐲i, Lemma B.2 corresponds to the gradient lower bound at the initialization. Then the next step is to prove the bounds for all 𝐖 in the required perturbation region. Before proceeding to our final proof, we present the following lemma that provides useful results regarding the neural network within the perturbation region.

Lemma B.3 ((Allen-Zhu et al., 2018b)).

Consider a collection of weight matrices 𝐖~={𝐖~l}l=1,,L such that 𝐖~(𝐖(0),τ), with probability at least 1-exp(-O(mτ2/3L)), there exists constants C, C′′ and C′′′ such that

  • 𝚺~L,i-𝚺L,i0Cτ2/3L

  • 𝐕(𝚺~L,i-𝚺L,i)2C′′τ1/3L2mlog(m)/k

  • 𝐱~L-1,i-𝐱L-1,i2C′′′τL5/2log(m),

for all i=1,,n, where 𝐱L-1,i and 𝐱~L-1,i denote the outputs of the L-1-th layer of the neural network with weight matrices 𝐖(0) and 𝐖~, and 𝚺L,i and 𝚺~L,i are diagonal matrices with (𝚺L,i)jj=σ(𝐰L,j(0),𝐱L-1) and (𝚺~L,i)jj=σ(𝐰~L,j,𝐱~L-1) respectively.

Now we are ready to prove the lower and upper bounds of the Frobenious norm of the gradient.

Proof of Lemma 4.1.

The upper bound of the gradient norm can be proved according to Theorem 3 in Allen-Zhu et al. (2018b). We slightly modify their result since we consider average loss over all training examples while Allen-Zhu et al. (2018b) considers summation.

Then we focus on proving the lower bound. Note that the gradient 𝐖LL(𝐖~) takes form

𝐖LL(𝐖~)=1ni=1n((𝐟𝐖~(𝐱i)-𝐲i)𝐕𝚺~L,i)𝐱~L-1,i,

where 𝚺~L,i is a diagonal matrix with (𝚺~L,i)jj=σ(𝐰~L-1,j,𝐱~L-1,i) and 𝐱~l-1,i denotes the output of the l-th hidden layer with input 𝐱i and model weight matrices 𝐖~. Let 𝐯j denote the j-th row of matrix 𝐕, and define

𝐆~=1ni=1n((𝐟𝐖~(𝐱i)-𝐲i)𝐕𝚺L,i)𝐱L-1,i,

where 𝚺L,i is a diagonal matrix with (𝚺~L,i)jj=σ(𝐰L-1,j(0),𝐱L-1,i) Then by Lemma B.2, we have with probability at least 1-exp(-O(mϕ/(kn))), the following holds for any 𝐖~,

𝐆~F2 =1n2j=1mi=1n𝐟𝐖~(𝐱i)-𝐲i,𝐯jσ(𝐰L,j,𝐱L-1,i)𝐱L-1,i22
C0ϕmi=1n𝐟𝐖~(𝐱i)-𝐲i22kn3,

where C0 is an absolute constant. Then we have

𝐆~-𝐖LL(𝐖~)F
=1n2i=1n((𝐟𝐖~(𝐱i)-𝐲i)𝐕𝚺L,i)𝐱L-1,i-i=1n((𝐟𝐖~(𝐱i)-𝐲i)𝐕𝚺~L,i)𝐱~L-1,iF
1n2[i=1n((𝐟𝐖~(𝐱i)-𝐲i)𝐕(𝚺L,i-𝚺~L,i))𝐱L-1,iF
+i=1n((𝐟𝐖~(𝐱i)-𝐲i)𝐕𝚺~L,i)(𝐱L-1,i-𝐱~L-1,i)F].

By Lemmas B.1 and B.3, we have

i=1n((𝐟𝐖~(𝐱i)-𝐲i)𝐕(𝚺L,i-𝚺~L,i))𝐱L-1,iF
i=1n𝐟𝐖~(𝐱i)-𝐲i2𝐕(𝚺L,i-𝚺~L,i)2𝐱L-1,i2
C1τ1/3L2mlog(m)ki=1n𝐟𝐖~(𝐱i)-𝐲i2,

where the second inequality follows from Lemma B.3 and C1 is an absolute constant. In addition, we also have

i=1n((𝐟𝐖~(𝐱i)-𝐲i)𝐕𝚺~L,i)(𝐱L-1,i-𝐱~L-1,i)F
i=1n𝐟𝐖~(𝐱i)-𝐲i2𝐕2𝐱L-1,i-𝐱~L-1,i2
C2τL5/2mlog(m)ki=1n𝐟𝐖~(𝐱i)-𝐲i2,

where the second inequality follows from Lemma B.3 and C2 is an absolute constant. Combining the above bounds we have

𝐆-𝐖LL(𝐖~)F i=1n𝐟𝐖~(𝐱i)-𝐲i2n(C1τ1/3L2mlog(m)k+C2τL5/2mlog(m)k)
i=1n𝐟𝐖~(𝐱i)-𝐲i2nC3τ1/3L2mlog(m)k,

where the second inequality follows from the fact that τO(L-4/3). Then by triangle inequality, we have the following lower bound of 𝐖LL(𝐖~)F

𝐖LL(𝐖~)F 𝐆F-𝐆-𝐖LL(𝐖~)F
C0ϕ1/2m1/2ni=1n𝐟𝐖~(𝐱i)-𝐲i22kn2
-i=1n𝐟𝐖~(𝐱i)-𝐲i2nC3τ1/3L2mlog(m)k.

By Jensen’s inequality we know that ni=1n𝐟𝐖~(𝐱i)-𝐲i22(i=1n𝐟𝐖~(𝐱i)-𝐲i2)2. Then we set

τ=C3ϕ3/22C0n3L6log3/2(m)=O(ϕ3/2n3L6log3/2(m)),

and obtain

𝐖LL(𝐖~)FC0ϕ1/2m1/2ni=1n𝐟𝐖~(𝐱i)-𝐲i222kn2.

Then plugging the fact that 1/ni=1n𝐟𝐖~(𝐱i)-𝐲i22=L(𝐖~), we are able to complete the proof. ∎

B.2 Proof of Lemma 4.2

Proof of Lemma 4.2.

Note that we assume that all iterate are staying inside the region (𝐖(0),τ), then by Lemma 4.4, with probability at least 1-exp(-O(mτ2/3L)), we have the following after one-step gradient descent

L(𝐖(t+1)) L(𝐖(t))-ηL(𝐖(t))F2
+CηL(𝐖(t))τ1/3L2mlog(m)kL(𝐖(t))2+C′′L2mη2kL(𝐖(t))22. (B.1)

We first choose the step size

η=k4C′′L2m=O(kL2m),

then (B.2) yields

L(𝐖(t+1)) L(𝐖(t))-3η4L(𝐖(t))F2+CηL(𝐖(t))τ1/3L2mlog(m)kL(𝐖(t))2
L(𝐖(t))-ηL(𝐖(t))F(L(𝐖(t))F2-CL(𝐖(t))τ1/3L2mlog(m)k),

where we use the fact that L(𝐖(t))2L(𝐖(t))F. Then by Lemma 4.1, we know that with probability at least 1-exp(-O(mϕ/(kn)))

𝐖L(𝐖(t))F2𝐖LL(𝐖(t))F2Cmϕkn2L(𝐖(t)), (B.2)

where C is an absolute constant. Thus, we can choose the radius τ as

τ=C3/2ϕ3/264n3C3L6log3/2(m)=O(ϕ3/2n3L6log3/2(m)), (B.3)

and thus the following holds with probability at least 1-exp(-O(mτ2/3L))-exp(-O(mϕ/(kn)))=1-exp(-O(mτ2/3L)),

L(𝐖(t+1)) L(𝐖(t))-η2L(𝐖(t))F2, (B.4)

where the second inequality follows from (B.2). This completes the proof. By triangle inequality, we have

𝐖l(t)-𝐖l(0)2 s=0t-1η𝐖lL(𝐖(s))2s=0t-1ηL(𝐖(s))F. (B.5)

Moreover, we have

L(𝐖(s))-L(𝐖(s+1)) =L(𝐖(s))-L(𝐖(s+1))L(𝐖(s))+L(𝐖(s+1))
ηL(𝐖(s))F24L(𝐖(s))
Cmϕkn2ηL(𝐖(s))F4,

where the second inequality is by (B.4) and the fact that L(𝐖(s+1))L(𝐖(s)), and the last inequality follows from (B.2). Plugging the above result into (B.5), we have with probability at least 1-exp(-O(mτ2/3L)),

𝐖l(t)-𝐖l(0)2 s=0t-1ηL(𝐖(s))F
4kn2Cmϕs=0t-1[L(𝐖(s))-L(𝐖(s+1))]
4kn2CmϕL(𝐖(0)). (B.6)

Note that (B.2) holds for all l and t. Then apply Lemma 4.3, we are able to complete the proof.

B.3 Proof of Lemma 4.3

Proof of Lemma 4.3.

Note that the output of the neural network can be formulated as

f𝐖(0)(𝐱i)=𝐕𝐱L,i,

where 𝐱L,i denotes the output of the last hidden layer with input 𝐱i. Note that each entry in 𝐕 is i.i.d. generated from Gaussian distribution 𝒩(0,1/k). Thus, we know that with probability at least 1-δ, it holds that 𝐕𝐱L,i2log(1/δ)𝐱L,i2. Then by Lemma B.1 and union bound, we have 𝐕𝐱L,i22log(1/δ) for all i[n] with probability at least 1-exp(-O(m/L))-nδ. Then we set δ=O(n-2) and use the fact that mO(Llog(nL)), we have

f𝐖(0)(𝐱i)=𝐕𝐱L,i22O(log(n))

for all i[n] with probability at least 1-O(n-1). Then by our definition of training loss, it follows that

L(𝐖(0)) =12ni=1𝐟𝐖(0)(𝐱i)-𝐲i22
1ni=1[𝐟𝐖(0)(𝐱i)22+𝐲i22]
O(log(n))

with probability at least 1-O(n-1), where the first inequality is by Young’s inequality and we assume that 𝐲i2=O(1) for all i[n] in the second inequality. This completes the proof. ∎

B.4 Proof of Lemma 4.5

Proof of Lemma 4.5.

By (B.4), we have

L(𝐖(t+1)) L(𝐖(t))-η2L(𝐖(t))F2
(1-Cmϕη2kn2)L(𝐖(t))
(1-Cmϕη2kn2)t+1L(𝐖(0)), (B.7)

where the second inequality follows from (B.2). This completes the proof. ∎

B.5 Proof of Lemma A.2

Proof of Lemma A.2.

Let 𝐆(t) denote the stochastic gradient leveraged in the t-th iteration, where the corresponding minibatch is defined as (t). By Lemma 4.4, we have the following inequality regarding one-step stochastic gradient descent

L(𝐖(t+1)) L(𝐖(t))-ηL(𝐖(t)),𝐆(t)
+CηL(𝐖(t))τ1/3L2mlog(m)k𝐆(t)2+C′′L2mη2k𝐆(t)22.

Then taking expectation on both sides conditioned on 𝐖(t) gives

𝔼[L(𝐖(t+1))|𝐖(t)]
L(𝐖(t))-ηL(𝐖(t))F2+CηL(𝐖(t))τ1/3L2mlog(m)k𝔼[𝐆(t)2|𝐖(t)]
+C′′L2mη2k𝔼[𝐆(t)22|𝐖(t)]. (B.8)

Note that given 𝐖(t), the expectations on 𝐆(t)2 and 𝐆(t)22 are only taken over the random minibatch (t). Then by (A.1), we have

𝔼[𝐆(t)2|𝐖(t)]2𝔼[𝐆(t)F2|𝐖(t)]8mL(𝐖(t))Bk+2L(𝐖(t))F2.

By (B.2), we know that there is a constant C such that L(𝐖(t))F2CmϕL(𝐖(t))/(kn2). Then we set the step size η and radius τ as follows

η =Cd64C′′L2mn2=O(kL2mn2)
τ =C3ϕ3/2B3642n6C3L6log3/2(m)=O(ϕ3B3/2n6L6log3/2(m))

Then (B.5) yields that

𝔼[L(𝐖(t+1))|𝐖(t)] L(𝐖(t))-ηL(𝐖(t))F2-C′′L2mη2k(8n2cϕB+2)L(𝐖(t))F2
-CηL(𝐖(t))τ1/3L2mlog(m)k8n2cϕB+2L(𝐖(t))F
L(𝐖(t))-η2L(𝐖(t))F2. (B.9)

Then applying (B.2) again and taking expectation over 𝐖(t) on both sides of (B.5), we obtain

𝔼[L(𝐖(t+1))](1-Cmϕη2kn2)𝔼[L(𝐖(t))](1-Cmϕη2kn2)t+1L(𝐖(0)).

This completes the proof.

B.6 Proof of Lemma A.3

Proof of Lemma A.3.

We prove this by standard martingale inequality. By Lemma 4.4 and our choice of η and τ, we have

L(𝐖(t+1))L(𝐖(t))+2ηL(𝐖(t))F𝐆(t)2+η2𝐆(t)22. (B.10)

By Lemma A.1, we know that there exists an absolute constant C such that

L(𝐖(t))F2CmL(𝐖(t))k and 𝐆(t)F2CmnL(𝐖(t))Bk,

where B denotes the minibatch size. Then note that ηO(B/n), we have the following according to (B.10)

L(𝐖(t+1))(1+Cmn1/2ηB1/2d)L(𝐖(t)),

where C is an absolute constant. Taking logarithm on both sides further leads to

log(L(𝐖(t+1)))log(L(𝐖(t)))+Cmn1/2ηB1/2d,

where we use the inequality log(1+x)x. By (B.2) and (B.5), we know that

𝔼[L(𝐖(t+1))|𝐖(t)]L(𝐖(t))-η4L(𝐖(t))F2(1-C′′mϕη2kn2)L(𝐖(t)).

Then by Jensen’s inequality and the inequality log(1+x)x, we have

𝔼[log(L(𝐖(t+1)))|𝐖(t)]log(𝔼[L(𝐖(t+1))|𝐖(t)])log(L(𝐖(t)))-C′′mϕη2kn2,

which further yields the following by taking expectation on 𝐖(t) and telescope sum over t,

𝔼[log(L(𝐖(t)))]log(L(𝐖(t)))-C′′mϕη2kn2. (B.11)

Therefore {L(𝐖(t))}t=0,1, is a supermartingale. By one-side Azuma’s inequality, we know that with probability at least 1-δ, the following holds for any t

log(L(𝐖(t))) 𝔼[log(L(𝐖(t)))]+Cmn1/2ηB1/2d2tlog(1/δ)
log(L(𝐖(0)))-tC′′mϕη2kn2+Cmn1/2ηB1/2d2tlog(1/δ)
log(L(𝐖(0)))-tC′′mϕη4kn2+C2mn3log(1/δ)ηC′′kBϕ, (B.12)

where the second inequality is by (B.11) and we use the fact that -at+btb2/(4a) in the last inequality. Then we chose δ=O(m-1) and

η=log(2)C′′kBϕC2mn3log(1/δ)=O(kBϕn3mlog(m)).

Plugging these into (B.6) gives

log(L(𝐖(t)))log(2L(𝐖(0)))-tC′′mϕη4kn2,

which implies that

L(𝐖(t))2L(𝐖(0))exp(-tC′′mϕη4kn2). (B.13)

By Lemma A.1, we have

𝐆(t)2𝐆(t)FO(m1/2n1/2L(𝐖(t))B1/2k1/2) (B.14)

for all tT. Therefore, plugging (B.14) into (B.13) and taking union bound over all tT, and apply the result in Lemma 4.3, the following holds for all tT with probability at least 1-O(Tm-1)-O(n-1)=1-O(n-1),

𝐖l(t)-𝐖l(0)2s=0t-1η𝐆(t)2O(m1/2n1/2B1/2k1/2)s=0t-1ηL(𝐖(s))O~(k1/2n5/2B1/2m1/2ϕ),

where the first inequality is by triangle inequality, the second inequality follows from (B.14) and the last inequality is by (B.13) and Lemma 4.3. This completes the proof.

B.7 Proof of Lemma A.4

Proof of Lemma A.4.

We first write the formula of (f𝐖(𝐱i),𝐲i) as follows

(f𝐖(𝐱i),𝐲i)=(f𝐖(𝐱i)-yi)𝐕𝚺i)𝐱i.

Since 𝚺i is an diagonal matrix with (𝚺i)jj=σ(𝐰j,𝐱i). Therefore, it holds that

(f𝐖(𝐱i),𝐲i)2, =maxj[m]f𝐖~(𝐱i)-𝐲i,𝐯j𝐱i2maxj[m]f𝐖(𝐱i)-𝐲i2𝐯j2, (B.15)

where 𝐯j denotes the j-th column of 𝐕 and we use the fact that 𝐱i2=1. Note that 𝐯j𝒩(0,𝐈/k), we have

(𝐯j22O(log(m)))O(m-1).

Applying union bound over 𝐯1,,𝐯m, we have with probability at least 1-O(m-1),

maxj[m]𝐯j2O(log1/2(m)).

Plugging this into (B.15) and applying the fact that f𝐖(𝐱i)-𝐲i2=(𝐟𝐖(𝐱i),𝐲i), we are able to complete the proof. ∎

B.8 Proof of Lemma A.5

Recall that the output of two-layer ReLU network can be formulated as

𝐟𝐖(𝐱i)=𝐕𝚺i𝐖𝐱i,

where 𝚺i is a diagonal matrix with only non-zero diagonal entry (𝚺i)jj=σ(𝐰j𝐱i). Then based on the definition of L(𝐖), we have

L(𝐖~)-L(𝐖^)
=12ni=1n𝐕𝚺~i𝐖~𝐱i-𝐲i22-1ni=1n𝐕𝚺^i𝐖^𝐱i-𝐲i22
=12ni=1n𝐕𝚺^i𝐖^𝐱i-𝐲i,𝐕𝚺~i𝐖~𝐱i-𝐕𝚺^i𝐖^𝐱iI1+12ni=1n𝐕𝚺~i𝐖~𝐱i-𝐕𝚺^i𝐖^𝐱i22I2.

Then we tackle the two terms on the R.H.S. of the above equation separately. Regarding the first term, i.e., I1, we have

I1 =12ni=1n𝐕𝚺^i𝐖^𝐱i-𝐲i,𝐕𝚺^i(𝐖~-𝐖^)𝐱i
+12ni=1n𝐕𝚺^i𝐖^𝐱i-𝐲i,𝐕(𝚺~i-𝚺^i)𝐖~𝐱i
L(𝐖^),𝐖~-𝐖^+12ni=1n(f𝐖^(𝐱i),𝐲i)𝐕(𝚺~i-𝚺^i)𝐖~𝐱i2
L(𝐖^),𝐖~-𝐖^+L(𝐖^)2𝐕(𝚺~i-𝚺^i)𝐖~𝐱i2,

where the last inequality follows from Jensen’s inequality. Note that the non-zero entries in 𝚺~i-𝚺^i represent the nodes, say j, satisfying sign(𝐰~j𝐱i)sign(𝐰^j𝐱i), which implies |𝐰~j𝐱i||(𝐰~j-𝐰^j)𝐱i|. Therefore, we have

𝐕(𝚺~i-𝚺^i)𝐖~𝐱i22𝐕(𝚺~i-𝚺^i)(𝐖~-𝐖^)𝐱i22.

By Lemma B.3, we have 𝚺~i-𝚺^i0𝚺~i-𝚺i0+𝚺^i-𝚺i0=O(mτ2/3). Then we define 𝚺¯i as

(𝚺¯i)jk=|(𝚺~i-𝚺^i)jk| for all j,k.

Then we have

𝐕(𝚺~i-𝚺^i)𝐖~𝐱i2 𝐕(𝚺~i-𝚺^i)𝚺¯i(𝐖~-𝐖^)𝐱i2
𝐕(𝚺~i-𝚺^i)2𝚺¯i(𝐖~-𝐖^)F
𝐕(𝚺~i-𝚺^i)2𝚺¯i01/2𝐖~-𝐖^2,.

By Lemma B.3, we have with probability 1-O(mτ2/3)

𝐕(𝚺~i-𝚺^i)𝐖~𝐱i2O(mlog(m)τ2/3k-1)𝐖~-𝐖^2,.

In what follows we are going to tackle the term I2. Note that for each i, we have

𝐕𝚺~i𝐖~𝐱i-𝐕𝚺^i𝐖^𝐱i2 =𝐕𝚺^i(𝐖~-𝐖^)𝐱i2+𝐕(𝚺~i-𝚺^i)𝐖~𝐱i2
𝐕2𝐖~-𝐖^2+𝐕(𝚺~i-𝚺^i)2𝐖~-𝐖^2
=O(m1/2/k1/2)𝐖~-𝐖^2,

where the last inequality holds due to the fact that 𝐕2=O(m1/2/k1/2) with probability at least 1-exp(-O(m)). This leads to I2O(m/k)𝐖~-𝐖^22. Now we can put everything together, and obtain

L(𝐖~)-L(𝐖^) =I1+I2
L(𝐖^),𝐖~-𝐖^+O(mlog(m)τ2/3k-1/2)L(𝐖^)𝐖~-𝐖^2,
+O(m/k)𝐖~-𝐖^22.

Then applying union bound on the inequality for I1 and I2, we are able to complete the proof.

Appendix C Proof of Technical Lemmas in Appendix B

C.1 Proof of Lemma B.2

Let 𝐳1,,𝐳nd be n vectors with 1/2mini{𝐳i2}maxi{𝐳i2}2. Let 𝐳¯i=𝐳i/𝐳i2 and assume mini,j𝐳¯i-𝐳¯j2ϕ~. Then for each 𝐳i, we construct an orthonormal matrix 𝐐i=[𝐳¯i,𝐐i]d×d. Then consider a random vector 𝐰𝒩(0,𝐈), it follows that 𝐮i:=𝐐i𝐰𝒩(0,𝐈). Then we can decompose 𝐰 as

𝐰=𝐐i𝐮i=𝐮i(1)𝐳¯i+𝐐i𝐮i, (C.1)

where 𝐮i(1) denotes the first coordinate of 𝐮i and 𝐮i:=(𝐮i(2),,𝐮i(d)). Then let γ=πϕ~/(8n), we define the following set of 𝐰 based on 𝐳i,

𝒲i={𝐰:|𝐮i(1)|γ,|𝐐i𝐮i,𝐳¯i|2γ for all 𝐳¯j such that ji}.

Regarding the class of sets {𝒲1,,𝒲n}, we have the following lemmas.

Lemma C.1.

For each 𝒲i and 𝒲j, we have

(𝐰𝒲i)ϕ~n128eand𝒲i𝒲j=.

Then we deliver the following two lemmas which are useful to establish the required lower bound.

Lemma C.2.

For any 𝐚=(a1,,an)n, let 𝐡(𝐰)=i=1naiσ(𝐰,𝐳i)𝐳i where 𝐰N(𝟎,𝐈) is a Gaussian random vector. Then it holds that

[𝐡(𝐰)2|ai|/4|𝐰𝒲i]1/2.

Now we are able to prove Lemma B.2.

Proof of Lemma B.2.

We first prove the result for any fixed 𝐮1,,𝐮n. Then we define ai(𝐯j)=𝐮i,𝐯j, 𝐰j=m/2𝐰L,j and

𝐡(𝐯j,𝐰j)=i=1nai(𝐯j)σ(𝐰j,𝐱L-1,i)𝐱L-1,i.

Then we define the event

i={j[m]:𝐰j𝒲i,𝐡(𝐯j,𝐰j)2|ai(𝐯j)|/4,|ai(𝐯j)|𝐮i2/k}.

By Lemma B.1, we know that with high probability 1/2𝐱L-1,i22 for all i and 𝐱L-1,i/𝐱L-1,i2-𝐱L-1,j/𝐱L-1,j2ϕ/2 for all ij. Then by Lemma C.1 we know that ij= if ij and

(ji) =[𝐡(𝐯j,𝐰j)2|ai(𝐯j)|/4|𝐰j𝒲i][𝐰j𝒲i][|ai(𝐯j)|𝐮i2/k]
ϕ642en, (C.2)

where the first equality holds because 𝐰j and 𝐯j are independent, and the second inequality follows from Lemmas C.1, C.2 and the fact that (|ai(𝐯j)|𝐮i2/k)1/2. Then we have

𝐖LL(𝐖)F2 =1n2j=1m𝐡(𝐯j,𝐰j)22
1n2j=1m𝐡(𝐯j,𝐰j)22s=1n𝟙(js)
1n2j=1ms=1n𝐮s2216k𝟙(js),

where the second inequality holds due to the fact that

𝐡(𝐯j,𝐰j)22𝟙(js) as2(𝐯j)16𝟙(|as(𝐯j)|𝐮s2/k)𝟙(js)
𝐮s2216k𝟙(js),

where the first inequality follows from the definition of s. Then we further define

Zj=s=1n𝐮s2216k𝟙(js),

and provide the following results for 𝔼[Z(𝐰j)] and var[Z(𝐰j)]

𝔼[Zj] =s=1n𝐮s2216k(js),var[Z(𝐰)]=s=1n𝐮s24256k2(js)[1-(js)].

Then by Bernstein inequality, with probability at least 1-exp(-O(m𝔼[Z(𝐰)]/maxi[n]𝐮i22)), it holds that

j=1mZjm2𝔼[Zj]i=1n𝐮i2232dmϕ642en=Cϕmi=1n𝐮i22kn,

where the second inequality follows from (C.1) and C=1/(20962e) is an absolute constant. Therefore, with probability at least 1-exp(-O(mϕ/(kn))) we have

j=1m1ni=1n𝐮i,𝐯jσ(𝐰L,j,𝐱L-1,i)𝐱L-1,i221n2j=1mZ(𝐰j)Cϕmi=1n𝐮i22kn3.

Till now we have completed the proof for one particular vector collection {𝐮i}i=1,,n. Then we are going to prove that the above inequality holds for arbitrary {𝐮i}i=1,,n with high probability. Taking ϵ-net over all possible vectors {𝐮1,,𝐮n}(d)n and applying union bound, the above inequality holds with probability at least 1-exp(-O(mϕ/(kn))+nklog(nk)). Since we have mO~(ϕ-1n2k2), the desired result holds for all choices of {𝐮1,,𝐮n}.

Appendix D Proof of Auxiliary Lemmas in Appendix C

Proof of Lemma C.1.

We first prove that any two sets 𝒲i and 𝒲j have not overlap region. Consider an vector 𝐰𝒲i with the decomposition

𝐰=𝐮i(1)𝐳¯i+𝐐i𝐮i.

Then based on the definition of 𝒲i we have,

𝐰,𝐳¯j=𝐮i(1)𝐳¯i+𝐐i𝐮i,𝐳¯j=𝐮i(1)𝐳¯i,𝐳¯j+𝐐i𝐮i,𝐳¯j.

Since 𝐰𝒲i, we have |𝐮i(1)|γ and |𝐐𝐮i,𝐳¯j|2γ. Therefore, note that |𝐳¯i,𝐳¯j|1, it holds that

|𝐰,𝐳¯j|||𝐐i𝐮i,𝐳¯j|-|𝐮i(1)||>γ. (D.1)

Note that set 𝒲j requires |𝐮j(1)|=𝐰,𝐳¯jγ, which conflicts with (D.1). This immediately implies that 𝒲i𝒲j=.

Then we are going to compute the probability (𝐰𝒲i). Based on the parameter γ, we define the following two events

1(γ)={|𝐮i(1)|γ},2(γ)={|𝐐i𝐮i,𝐳¯j|2γ for all 𝐳¯j,ji}.

Evidently, we have (𝐰𝒲i)=(1)(2). Since 𝐮i(1) is a standard Gaussian random variable, we have

(1)=12π-γγexp(-12x2)dx2πeγ.

Moreover, by definition, for any j=1,,n we have

𝐐i𝐮i,𝐳¯jN[0,1-(𝐳¯i𝐳¯j)2].

Note that for any ji we have 𝐳¯i-𝐳¯j2ϕ~, then it follows that

|𝐳i,𝐳j|1-ϕ~2/2,

and if ϕ~22, then

1-(𝐳¯i𝐳¯j)2ϕ~2-ϕ~4/4ϕ~2/2.

Therefore for any ji,

[|𝐐i𝐮i,𝐳¯j|<2γ]=12π-2[1-(𝐳¯i𝐳¯j)2]-1/2γ2[1-(𝐳¯i𝐳¯j)2]-1/2γexp(-12x2)dx8πγ[1-(𝐳¯i𝐳¯j)2]1/24πγϕ~-1.

By union bound over [n], we have

(2)=[|𝐐i𝐮i,𝐳¯j|2γ,j]1-4πnγϕ~-1.

Therefore we have

(𝐰𝒲i)2πeγ(1-4πnγϕ~-1).

Plugging γ=πϕ~/(8n), it holds that ()ϕ~/(128en). This completes the proof. ∎

Proof of Lemma C.2.

Recall the decomposition of 𝐰 in (C.1),

𝐰=𝐮i(1)𝐳¯i+𝐐i𝐮i.

Define the event i:={𝐰𝒲i}. Then conditioning on i, we have

𝐡(𝐰) =i=1naiσ(𝐰,𝐳i)𝐳r
=aiσ(𝐮i(1))𝐳i+jiajσ(𝐮i(1)𝐳¯i,𝐳j+𝐐i𝐮i,𝐳j)𝐳j
=aiσ(𝐮i(1))𝐳i+jiajσ(𝐐i𝐮i,𝐳j)𝐳j (D.2)

where the last equality follows from the fact that conditioning on event i, for all ji, it holds that |𝐐i𝐮i,𝐳j|2γ𝐳j2|𝐮i(1)|𝐳j2|𝐮i(1)𝐳i,𝐳j|. We then consider two cases: 𝐮i(1)>0 and 𝐮i(1)<0, which occur equally likely conditioning on the event i. Let u1>0 and u2<0 denote 𝐮i(1) in these two cases, we have

[𝐡(𝐰)2infu1>0,u2<0max{𝐡(u1𝐳i+𝐐i𝐮i)2,𝐡(u2𝐳i+𝐐i𝐮i)2}|i]1/2.

By the inequality max{𝐚2,𝐛2}𝐚-𝐛2/2, we have

[𝐡(𝐰)2infu1>0,u1<0𝐡(u1𝐳i+𝐐i𝐮i)-𝐡(u2𝐳i+𝐐i𝐮i)2/2|i]1/2. (D.3)

For any u1>0 and u2<0, denote 𝐰1=u1𝐳i+𝐐i𝐮i, 𝐰2=u2𝐳i+𝐐i𝐮i. We now proceed to give lower bound for 𝐡(𝐰1)-𝐡(𝐰2)2. By (D), we have

𝐡(𝐰1)-𝐡(𝐰2)2=ai𝐳i2ai/2, (D.4)

where we use the fact that 𝐳i21/2. Plugging this back into (D.3), we have

[𝐡(𝐰)2|ai|/4|i]1/2.

This completes the proof. ∎

References

  • Allen-Zhu et al. (2018a) Allen-Zhu, Z., Li, Y. and Liang, Y. (2018a). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918 .
  • Allen-Zhu et al. (2018b) Allen-Zhu, Z., Li, Y. and Song, Z. (2018b). A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962 .
  • Allen-Zhu et al. (2018c) Allen-Zhu, Z., Li, Y. and Song, Z. (2018c). On the convergence rate of training recurrent neural networks. arXiv preprint arXiv:1810.12065 .
  • Arora et al. (2019) Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584 .
  • Brutzkus and Globerson (2017) Brutzkus, A. and Globerson, A. (2017). Globally optimal gradient descent for a convnet with gaussian inputs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org.
  • Cao and Gu (2019) Cao, Y. and Gu, Q. (2019). A generalization theory of gradient descent for learning over-parameterized deep relu networks. arXiv preprint arXiv:1902.01384 .
  • Chizat and Bach (2018) Chizat, L. and Bach, F. (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956 .
  • Du and Lee (2018) Du, S. S. and Lee, J. D. (2018). On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206 .
  • Du et al. (2018a) Du, S. S., Lee, J. D., Li, H., Wang, L. and Zhai, X. (2018a). Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804 .
  • Du et al. (2017) Du, S. S., Lee, J. D. and Tian, Y. (2017). When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129 .
  • Du et al. (2018b) Du, S. S., Zhai, X., Poczos, B. and Singh, A. (2018b). Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 .
  • He et al. (2015) He, K., Zhang, X., Ren, S. and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision.
  • Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems.
  • Li and Liang (2018) Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204 .
  • Li and Yuan (2017) Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886 .
  • Oymak and Soltanolkotabi (2019) Oymak, S. and Soltanolkotabi, M. (2019). Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674 .
  • Tian (2017) Tian, Y. (2017). An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560 .
  • Wu et al. (2019) Wu, X., Du, S. S. and Ward, R. (2019). Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint arXiv:1902.07111 .
  • Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 .
  • Zhang et al. (2019) Zhang, H., Yu, D., Chen, W. and Liu, T.-Y. (2019). Training over-parameterized deep resnet is almost as easy as training a two-layer network. arXiv preprint arXiv:1903.07120 .
  • Zhang et al. (2018) Zhang, X., Yu, Y., Wang, L. and Gu, Q. (2018). Learning one-hidden-layer ReLU networks via gradient descent. arXiv preprint arXiv:1806.07808 .
  • Zhong et al. (2017) Zhong, K., Song, Z., Jain, P., Bartlett, P. L. and Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175 .
  • Zou et al. (2018) Zou, D., Cao, Y., Zhou, D. and Gu, Q. (2018). Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888 .