Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

  • 2020-02-13 17:18:19
  • Kaifeng Lyu, Jian Li
  • 0

Abstract

In this paper, we study the implicit regularization of the gradient descentalgorithm in homogeneous neural networks, including fully-connected andconvolutional neural networks with ReLU or LeakyReLU activations. Inparticular, we study the gradient descent or gradient flow (i.e., gradientdescent with infinitesimal step size) optimizing the logistic loss orcross-entropy loss of any homogeneous model (possibly non-smooth), and showthat if the training loss decreases below a certain threshold, then we candefine a smoothed version of the normalized margin which increases over time.We also formulate a natural constrained optimization problem related to marginmaximization, and prove that both the normalized margin and its smoothedversion converge to the objective value at a KKT point of the optimizationproblem. Our results generalize the previous results for logistic regressionwith one-layer or multi-layer linear networks, and provide more quantitativeconvergence results with weaker assumptions than previous results forhomogeneous smooth neural networks. We conduct several experiments to justifyour theoretical finding on MNIST and CIFAR-10 datasets. Finally, as margin isclosely related to robustness, we discuss potential benefits of training longerfor improving the robustness of the model.

 

Quick Read (beta)

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

Kaifeng Lyu & Jian Li
Institute for Interdisciplinary Information Sciences
Tsinghua University
Beijing, China
[email protected],[email protected]
Abstract

In this paper, we study the implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations. In particular, we study the gradient descent or gradient flow (i.e., gradient descent with infinitesimal step size) optimizing the logistic loss or cross-entropy loss of any homogeneous model (possibly non-smooth), and show that if the training loss decreases below a certain threshold, then we can define a smoothed version of the normalized margin which increases over time. We also formulate a natural constrained optimization problem related to margin maximization, and prove that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the optimization problem. Our results generalize the previous results for logistic regression with one-layer or multi-layer linear networks, and provide more quantitative convergence results with weaker assumptions than previous results for homogeneous smooth neural networks. We conduct several experiments to justify our theoretical finding on MNIST and CIFAR-10 datasets. Finally, as margin is closely related to robustness, we discuss potential benefits of training longer for improving the robustness of the model.

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

Kaifeng Lyu & Jian Li
Institute for Interdisciplinary Information Sciences
Tsinghua University
Beijing, China
[email protected],[email protected]

1 Introduction

A major open question in deep learning is why gradient descent or its variants, are biased towards solutions with good generalization performance on the test set. To achieve a better understanding, previous works have studied the implicit bias of gradient descent in different settings. One simple but insightful setting is linear logistic regression on linearly separable data. In this setting, the model is parameterized by a weight vector 𝒘, and the class prediction for any data point 𝒙 is determined by the sign of 𝒘𝒙. Therefore, only the direction 𝒘/𝒘2 is important for making prediction. Soudry et al. (2018a; b); Ji & Telgarsky (2018; 2019c); Nacson et al. (2019c) investigated this problem and proved that the direction of 𝒘 converges to the direction that maximizes the L2-margin while the norm of 𝒘 diverges to +, if we train 𝒘 with (stochastic) gradient descent on logistic loss. Interestingly, this convergent direction is the same as that of any regularization path: any sequence of weight vectors {𝒘t} such that every 𝒘t is a global minimum of the L2-regularized loss (𝒘)+λt2𝒘22 with λt0 (Rosset et al., 2004). Indeed, the trajectory of gradient descent is also pointwise close to a regularization path (Suggala et al., 2018).

The aforementioned linear logistic regression can be viewed as a single-layer neural network. A natural and important question is to what extent gradient descent has similiar implicit bias for modern deep neural networks. For theoretical analysis, a natural candidate is to consider homogeneous neural networks. Here a neural network Φ is said to be (positively) homogeneous if there is a number L>0 (called the order) such that the network output Φ(𝜽;𝒙), where 𝜽 stands for the parameter and 𝒙 stands for the input, satisfies the following:

c>0:Φ(c𝜽;𝒙)=cLΦ(𝜽;𝒙) for all 𝜽 and 𝒙. (1)

It is important to note that many neural networks are homogeneous (Neyshabur et al., 2015a; Du et al., 2018). For example, deep fully-connected neural networks or deep CNNs with ReLU or LeakyReLU activations can be made homogeneous if we remove all the bias terms, and the order L is exactly equal to the number of layers.

In (Wei et al., 2019), it is shown that the regularization path does converge to the max-margin direction for homogeneous neural networks with cross-entropy or logistic loss. This result suggests that gradient descent or gradient flow may also converges to the max-margin direction by assuming homogeneity, and this is indeed true for some sub-classes of homogeneous neural networks. For gradient flow, this convergent direction is proven for linear fully-connected networks (Ji & Telgarsky, 2019a). For gradient descent on linear fully-connected and convolutional networks, (Gunasekar et al., 2018b) formulate a constrained optimization problem related to margin maximization and prove that gradient descent converges to the direction of a KKT point or even the max-margin direction, under various assumptions including the convergence of loss and gradient directions. In an independent work, (Nacson et al., 2019a) generalize the result in (Gunasekar et al., 2018b) to smooth homogeneous models (we will discuss this work in more details in Section 2).

1.1 Main Results

In this paper, we identify a minimal set of assumptions for proving our theoretical results for homogeneous neural networks on classification tasks. Besides homogeneity, we make two additional assumptions:

  1. 1.

    Exponential-type Loss Function. We require the loss function to have certain exponential tail (see Appendix A for the details). This assumption is not restrictive as it includes the most popular classfication losses: exponential loss, logistic loss and cross-entropy loss.

  2. 2.

    Separability. The neural network can separate the training data during training (i.e., the neural network can achieve 100% training accuracy)11 1 Note that this does NOT mean the training loss is 0..

While the first assumption is natural, the second requires some explanation. In fact, we assume that at some time t0, the training loss is smaller than a threshold, and the threshold here is chosen to be so small that the training accuracy is guaranteed to be 100% (e.g., for the logistic loss and cross-entropy loss, the threshold can be set to ln2). Empirically, state-of-the-art CNNs for image classification can even fit randomly labeled data easily (Zhang et al., 2017). Recent theoretical work on over-parameterized neural networks (Allen-Zhu et al., 2019; Zou et al., 2018) show that gradient descent can fit the training data if the width is large enough. Furthermore, in order to study the margin, ensuring the training data can be separated is inevitable; otherwise, there is no positive margin between the data and decision boundary.

Our Contribution.   Similar to linear models, for homogeneous models, only the direction of parameter 𝜽 is important for making predictions, and one can see that the margin γ(𝜽) scales linearly with 𝜽2L, when fixing the direction of 𝜽. To compare margins among 𝜽 in different directions, it makes sense to study the normalized margin, γ¯(𝜽):=γ(𝜽)/𝜽2L.

In this paper, we focus on the training dynamics of the network after t0 (recall that t0 is a time that the training loss is less than the threshold). Our theoretical results can answer the following questions regarding the normalized margin.

First, how does the normalized margin change during training? The answer may seem complicated since one can easily come up with examples in which γ¯ increases or decreases in a short time interval. However, we can show that the overall trend of the normalized margin is to increase in the following sense: there exists a smoothed version of the normalized margin, denoted as γ~, such that (1) |γ~-γ¯|0 as t; and (2) γ~ is non-decreasing for t>t0.

Second, how large is the normalized margin at convergence? To answer this question, we formulate a natural constrained optimization problem which aims to directly maximize the margin. We show that every limit point of {𝜽(t)/𝜽(t)2:t>0} is along the direction of a KKT point of the max-margin problem. This indicates that gradient descent/gradient flow performs margin maximization implicitly in deep homogeneous networks. This result can be seen as a significant generalization of previous works (Soudry et al., 2018a; b; Ji & Telgarsky, 2019a; Gunasekar et al., 2018b) from linear classifiers to homogeneous classifiers.

As by-products of the above results, we derive tight asymptotic convergence/growth rates of the loss and weights. It is shown in (Soudry et al., 2018a; b; Ji & Telgarsky, 2018; 2019c) that the loss decreases at the rate of O(1/t), the weight norm grows as O(logt) for linear logistic regression. In this work, we generalize the result by showing that the loss decreases at the rate of O(1/(t(logt)2-2/L)) and the weight norm grows as O((logt)1/L) for homogeneous neural networks with exponential loss, logistic loss, or cross-entropy loss.

Experiments.22 2 Code available: https://github.com/vfleaking/max-margin   The main practical implication of our theoretical result is that training longer can enlarge the normalized margin. To justify this claim empiricaly, we train CNNs on MNIST and CIFAR-10 with SGD (see Section K.1). Results on MNIST are presented in Figure 1. For constant step size, we can see that the normalized margin keeps increasing, but the growth rate is rather slow (because the gradient gets smaller and smaller). Inspired by our convergence results for gradient descent, we use a learning rate scheduling method which enlarges the learning rate according to the current training loss, then the training loss decreases exponentially faster and the normalized margin increases significantly faster as well.

For feedforward neural networks with ReLU activation, the normalized margin on a training sample is closely related to the L2-robustness (the L2-distance from the training sample to the decision boundary). Indeed, the former divided by a Lipschitz constant is a lower bound for the latter. For example, the normalized margin is a lower bound for the L2-robustness on fully-connected networks with ReLU activation (see, e.g., Theorem 4 in (Sokolic et al., 2017)). This fact suggests that training longer may have potential benefits on improving the robustness of the model. In our experiments, we observe noticeable improvements of L2-robustness on both training and test sets (see Section K.2).

\includegraphics [width=]fig/a.pdf
\includegraphics [width=]fig/b.pdf
Figure 1: (a) Training CNNs with and without bias on MNIST, using SGD with learning rate 0.01. The training loss (left) decreases over time, and the normalized margin (right) keeps increasing after the model is fitted, but the growth rate is slow (1.8×10-4 after 10000 epochs). (b) Training CNNs with and without bias on MNIST, using SGD with the loss-based learning rate scheduler. The training loss (left) decreases exponentially over time (<10-800 after 9000 epochs), and the normalized margin (right) increases rapidly after the model is fitted (1.2×10-3 after 10000 epochs, 10× larger than that of SGD with learning rate 0.01). Experimental details are in Appendix K.

2 Related Work

Implicit Bias in Training Linear Classifiers.   For linear logistic regression on linearly separable data, Soudry et al. (2018a; b) showed that full-batch gradient descent converges in the direction of the max L2-margin solution of the corresponding hard-margin Support Vector Machine (SVM). Subsequent works extended this result in several ways: Nacson et al. (2019c) extended the results to the case of stochastic gradient descent; Gunasekar et al. (2018a) considered other optimization methods; Nacson et al. (2019b) considered other loss functions including those with poly-exponential tails; Ji & Telgarsky (2018; 2019c) characterized the convergence of weight direction without assuming separability; Ji & Telgarsky (2019b) proved a tighter convergence rate for the weight direction.

Those results on linear logistic regression have been generalized to deep linear networks. Ji & Telgarsky (2019a) showed that the product of weights in a deep linear network with strictly decreasing loss converges in the direction of the max L2-margin solution. Gunasekar et al. (2018b) showed more general results for gradient descent on linear fully-connected and convolutional networks with exponential loss, under various assumptions on the convergence of the loss and gradient direction.

Margin maximization phenomenon is also studied for boosting methods (Schapire et al., 1998; Schapire & Freund, 2012; Shalev-Shwartz & Singer, 2010; Telgarsky, 2013) and Normalized Perceptron (Ramdas & Pena, 2016).

Implicit Bias in Training Nonlinear Classifiers.   Soudry et al. (2018a) analyzed the case where there is only one trainable layer of a ReLU network. Xu et al. (2018) characterized the implicit bias for the model consisting of one single ReLU unit. Our work is closely related to a recent independent work by (Nacson et al., 2019a) which we discuss in details below.

Comparison with (Nacson et al., 2019a).   Very recently, (Nacson et al., 2019a) analyzed gradient descent for smooth homogeneous models and proved the convergence of parameter direction to a KKT point of the aforementioned max-margin problem. Compared with their work, our work adopt much weaker assumptions: (1) They assume the training loss converges to 0, but in our work we only require that the training loss is lower than a small threshold value at some time t0 (and we prove the exact convergence rate of the loss after t0); (2) They assume the convergence of parameter direction33 3 Assuming the convergence of the parameter direction may seem quite reasonable, however, the problem here can be quite subtle in theory. In Appendix J, we present a smooth homogeneuous function f, based on the Mexican hat function (Absil et al. (2005)), such that even the direction of the parameter does not converge along gradient flow (it moves around a cirle when t increases). , while we prove that KKT conditions hold for all limit points of {𝜽(t)/𝜽(t)2:t>0}, without requiring any convergence assumption; (3) They assume the convergence of the direction of losses (the direction of the vector whose entries are loss values on every data point) and Linear Independence Constraint Qualification (LICQ) for the max-margin problem, while we do not need such assumptions. Besides the above differences in assumptions, we also prove the monotonicity of the normalized margin and provide tight convergence rate for training loss. We believe both results are interesting in their own right.

Another technical difference is that their work analyzes discrete gradient descent on smooth homogeneous models (which fails to capture ReLU networks). In our work, we analyze both gradient descent on smooth homogeneous models and also gradient flow on homogeneous models which could be non-smooth.

Other Works on Implicit Bias.   Banburski et al. (2019) also studied the dynamics of gradient flow and among other things, provided mathematical insights to the implicit bias towards max margin solution for homogeneous networks. We note that their analysis of gradient flow decomposes the dynamics to the tangent component and radial component, which is similar to our proof of Theorem 4.1 in spirit. Wilson et al. (2017); Ali et al. (2019); Gunasekar et al. (2018a) showed that for the linear least-square problem gradient-based methods converge to the unique global minimum that is closest to the initialization in L2 distance. Du et al. (2019); Jacot et al. (2018); Lee et al. (2019); Arora et al. (2019b) showed that over-parameterized neural networks of sufficient width (or infinite width) behave as linear models with Neural Tangent Kernel (NTK) with proper initialization and gradient descent converges linearly to a global minimum near the initial point. Other related works include (Ma et al., 2019; Gidel et al., 2019; Arora et al., 2019a; Suggala et al., 2018; Blanc et al., 2019; Neyshabur et al., 2015b; a).

3 Preliminaries

Basic Notations.   For any N, let [N]={1,,N}. 𝒗2 denotes the L2-norm of a vector 𝒗. The default base of log is e. For a function f:d, f(𝒙) stands for the gradient at 𝒙 if it exists. A function f:Xd is 𝒞k-smooth if f is k times continuously differentiable. A function f:X is locally Lipschitz if for every 𝒙X there exists a neighborhood U of 𝒙 such that the restriction of f on U is Lipschitz continuous.

Non-smooth Analysis.   For a locally Lipschitz function f:X, the Clarke’s subdifferential (Clarke, 1975; Clarke et al., 2008; Davis et al., 2020) at 𝒙X is the convex set

f(𝒙):=conv{limkf(𝒙k):𝒙k𝒙,fis differentiable at𝒙k}.

For brevity, we say that a function 𝒛:Id on the interval I is an arc if 𝒛 is absolutely continuous for any compact sub-interval of I. For an arc 𝒛, 𝒛(t) (or d𝒛dt(t)) stands for the derivative at t if it exists. Following the terminology in (Davis et al., 2020), we say that a locally Lipschitz function f:d admits a chain rule if for any arc 𝒛:[0,+)d, 𝒉f(z(t)):(f𝒛)(t)=𝒉,𝒛(t) holds for a.e. t>0 (see also Appendix I).

Binary Classification.   Let Φ be a neural network, assumed to be parameterized by 𝜽. The output of Φ on an input 𝒙dx is a real number Φ(𝜽;𝒙), and the sign of Φ(𝜽;𝒙) stands for the classification result. A dataset is denoted by 𝒟={(𝒙n,yn):n[N]}, where 𝒙ndx stands for a data input and yn{±1} stands for the corresponding label. For a loss function :, we define the training loss of Φ on the dataset 𝒟 to be (𝜽):=n=1N(ynΦ(𝜽;𝒙n)).

Gradient Descent.   We consider the process of training this neural network Φ with either gradient descent or gradient flow. For gradient descent, we assume the training loss (𝜽) is 𝒞2-smooth and describe the gradient descnet process as 𝜽(t+1)=𝜽(t)-η(t)(𝜽(t)), where η(t) is the learning rate at time t and (𝜽(t)) is the gradient of at 𝜽(t).

Gradient Flow.   For gradient flow, we do not assume the differentibility but only some regularity assumptions including locally Lipschitz. Gradient flow can be seen as gradient descent with infinitesimal step size. In this model, 𝜽 changes continuously with time, and the trajectory of parameter 𝜽 during training is an arc 𝜽:[0,+)d,t𝜽(t) that satisfies the differential inclusion d𝜽(t)dt-(𝜽(t)) for a.e. t0. The Clarke’s subdifferential is a natural generalization of the usual differential to non-differentiable functions. If (𝜽) is actually a 𝒞1-smooth function, the above differential inclusion reduces to d𝜽(t)dt=-(𝜽(t)) for all t0, which corresponds to the gradient flow with differential in the usual sense.

4 Gradient Descent / Gradient Flow on Homogeneous Model

In this section, we first state our results for gradient flow and gradient descent on homogeneous models with exponential loss (q):=e-q for simplicity of presentation. Due to space limit, we defer the more general results which hold for a large family of loss functions (including logistic loss and cross-entropy loss) to Appendix A, F and G.

4.1 Assumptions

Gradient Flow.   For gradient flow, we assume the following:

  1. (A1).

    (Regularity). For any fixed 𝒙, Φ(;𝒙) is locally Lipschitz and admits a chain rule;

  2. (A2).

    (Homogeneity). There exists L>0 such that α>0:Φ(α𝜽;𝒙)=αLΦ(𝜽;𝒙);

  3. (A3).

    (Exponential Loss). (q)=e-q;

  4. (A4).

    (Separability). There exists a time t0 such that (𝜽(t0))<1.

(A1) is a technical assumption about the regularity of the network output. As shown in (Davis et al., 2020), the output of almost every neural network admits a chain rule (as long as the neural network is composed by definable pieces in an o-minimal structure, e.g., ReLU, sigmoid, LeakyReLU).

(A2) assumes the homogeneity, the main property we assume in this work. (A3), (A4) correspond to the two conditions introduced in Section 1. The exponential loss in (A3) is main focus of this section. (A4) is a separability assumption: the condition (𝜽(t0))<1 ensures that (ynΦ(𝜽(t0);𝒙n))<1 for all n[N], and thus ynΦ(𝜽(t0);𝒙n)>0, meaning that Φ classifies every 𝒙n correctly.

Gradient Descent.   For gradient descent, we assume (A2), (A3), (A4) similarly as for gradient flow, and the following two assumptions (S1) and (S5).

  1. (S1).

    (Smoothness). For any fixed 𝒙, Φ(;𝒙) is 𝒞2-smooth on d{𝟎}.

  2. (S5).

    (Learning rate condition, Informal). η(t)=η0 for a sufficiently small constant η0. In fact, η(t) is even allowed to be as large as O((t)-1polylog1(t)). See Appendix E.1 for the details.

(S5) is natural since deep neural networks are usually trained with constant learning rates. (S1) ensures the smoothness of Φ, which is often assumed in the optimization literature in order to analyze gradient descent. While (S1) does not hold for neural networks with ReLU, it does hold for neural networks with smooth homogeneous activation such as the quadratic activation ϕ(x):=x2 (Li et al., 2018b; Du & Lee, 2018) or powers of ReLU ϕ(x):=ReLU(x)α for α>2 (Zhong et al., 2017; Klusowski & Barron, 2018; Li et al., 2019).

4.2 Main Theorem: Monotonicity of Normalized Margins

The margin for a single data point (𝒙n,yn) is defined to be qn(𝜽):=ynΦ(𝜽;𝒙n), and the margin for the entire dataset is defined to be qmin(𝜽):=minn[N]qn(𝜽). By homogenity, the margin qmin(𝜽) scales linearly with 𝜽2L for any fixed direction since qmin(c𝜽)=cLqmin(𝜽). So we consider the normalized margin defined as below:

γ¯(𝜽):=qmin(𝜽𝜽2)=qmin(𝜽)𝜽2L. (2)

We say f is an ϵ-additive approximation for the normalized margin if γ¯-ϵfγ¯, and c-multiplicative approximation if cγ¯fγ¯.

Gradient Flow.   Our first result is on the overall trend of the normalized margin γ¯(𝜽(t)). For both gradient flow and gradient descent, we identify a smoothed version of the normalized margin, and show that it is non-decreasing during training. More specifically, we have the following theorem for gradient flow.

Theorem 4.1 (Corollary of Theorem A.7).

Under assumptions (A1) - (A4), there exists an O(𝛉2-L)-additive approximation function γ~(𝛉) for the normalized margin such that the following statements are true for gradient flow:

  1. 1.

    For a.e. t>t0, ddtγ~(𝜽(t))0;

  2. 2.

    For a.e. t>t0, either ddtγ~(𝜽(t))>0 or ddt𝜽(t)𝜽(t)2=0;

  3. 3.

    (𝜽(t))0 and 𝜽(t)2 as t+; therefore, |γ¯(𝜽(t))-γ~(𝜽(t))|0.

More concretely, the function γ~(𝜽) in Theorem 4.1 is defined as

γ~(𝜽):=log1(𝜽)𝜽2L=-log(n=1Ne-qn(𝜽))𝜽2L. (3)

Note that the only difference between γ¯(𝜽) and γ~(𝜽) is that the margin qmin(𝜽) in γ¯(𝜽) is replaced by log1(𝜽)=-LSE(-q1(𝜽),,-qN(𝜽)), where LSE(a1,,aN)=log(exp(a1)++exp(aN)) is the LogSumExp function. This is indeed a very natural idea, and previous works on linear models (e.g., (Telgarsky, 2013; Nacson et al., 2019b)) also approximate qmin with LogSumExp in the analysis of margin. It is easy to see why γ~(𝜽) is an O(𝜽2-L)-additive approximation for γ¯(𝜽): eamaxn=1NeanNeamax holds for amax=max{a1,,aN}, so amaxLSE(a1,,aN)amax+logN; combining this with the definition of γ~(𝜽) gives γ¯(𝜽)-𝜽2-LlogNγ~(𝜽)γ¯(𝜽).

Gradient Descent.   For gradient descent, Theorem 4.1 holds similarly with a slightly different function γ^(𝜽) that approximates γ¯(𝜽) multiplicatively rather than additively.

Theorem 4.2 (Corollary of Theorem E.2).

Under assumptions (S1), (A2) - (A4), (S5), there exists an (1-O(1/(log1L)))-multiplicative approximation function γ^(𝛉) for the normalized margin such that the following statements are true for gradient descent:

  1. 1.

    For all t>t0, γ^(𝜽(t+1))γ^(𝜽(t));

  2. 2.

    For all t>t0, either γ^(𝜽(t+1))>γ^(𝜽(t)) or 𝜽(t+1)𝜽(t+1)2=𝜽(t)𝜽(t)2;

  3. 3.

    (𝜽(t))0 and 𝜽(t)2 as t+; therefore, |γ¯(𝜽(t))-γ^(𝜽(t))|0.

Due to the discreteness of gradient descent, the explicit formula for γ^(𝜽) is somewhat technical, and we refer the readers to Appendix E for full details.

Convergence Rates.

It is shown in Theorem 4.1, 4.2 that (𝜽(t))0 and 𝜽(t)2. In fact, with a more refined analysis, we can prove tight loss convergence and weight growth rates using the monotonicity of normalized margins.

Theorem 4.3 (Corollary of Theorem A.10 and E.5).

For gradient flow under assumptions (A1) - (A4) or gradient descent under assumptions (S1), (A2) - (A4), (S5), we have the following tight bounds for training loss and weight norm:

(𝜽(t))=Θ(1T(logT)2-2/L)  𝑎𝑛𝑑  𝜽(t)2=Θ((logT)1/L),

where T=t for gradient flow and T=τ=t0t-1η(τ) for gradient descent.

4.3 Main Theorem: Convergence to KKT Points

For gradient flow, γ~ is upper-bounded by γ~γ¯sup{qn(𝜽):𝜽2=1}. Combining this with Theorem 4.1 and the monotone convergence theorem, it is not hard to see that limt+γ¯(𝜽(t)) and limt+γ~(𝜽(t)) exist and equal to the same value. Using a similar argument, we can draw the same conclusion for gradient descent.

To understand the implicit regularization effect, a natural question arises: what optimality property does the limit of normalized margin have? To this end, we identify a natural constrained optimization problem related to margin maximization, and prove that 𝜽(t) directionally converges to its KKT points, as shown below. We note that we can extend this result to the finite time case, and show that gradient flow or gradient descent passes through an approximate KKT point after a certain amount of time. See Theorem A.9 in Appendix A and Theorem E.4 in Appendix E for the details. We will briefly review the definition of KKT points and approximate KKT points for a constraint optimization problem in Appendix C.1.

Theorem 4.4 (Corollary of Theorem A.8 and E.3).

For gradient flow under assumptions (A1) - (A4) or gradient descent under assumptions (S1), (A2) - (A4), (S5), any limit point 𝛉¯ of {𝛉(t)𝛉(t)2:t0} is along the direction of a KKT point of the following constrained optimization problem (P):

min12𝜽22   s.t.qn(𝜽)1  n[N]

That is, for any limit point 𝛉¯, there exists a scaling factor α>0 such that α𝛉¯ satisfies Karush-Kuhn-Tucker (KKT) conditions of (P).

Minimizing (P) over its feasible region is equivalent to maximizing the normalized margin over all possible directions. The proof is as follows. Note that we only need to consider all feasible points 𝜽 with qmin(𝜽)>0. For a fixed 𝜽, α𝜽 is a feasible point of (P) iff αqmin(𝜽)-1/L. Thus, the minimum objective value over all feasible points of (P) in the direction of 𝜽 is 12𝜽/qmin(𝜽)1/L22=12γ¯(𝜽)-2/L. Taking minimum over all possible directions, we can conclude that if the maximum normalized margin is γ¯*, then the minimum objective of (P) is 12γ¯*-2/L.

It can be proved that (P) satisfies the Mangasarian-Fromovitz Constraint Qualification (MFCQ) (See Lemma C.7). Thus, KKT conditions are first-order necessary conditions for global optimality. For linear models, KKT conditions are also sufficient for ensuring global optimality; however, for deep homogeneous networks, qn(𝜽) can be highly non-convex. Indeed, as gradient descent is a first-order optimization method, if we do not make further assumptions on qn(𝜽), then it is easy to construct examples that gradient descent does not lead to a normalized margin that is globally optimal. Thus, proving the convergence to KKT points is perhaps the best we can hope for in our setting, and it is an interesting future work to prove stronger convergence results with further natural assumptions.

Moreover, we can prove the following corollary, which characterizes the optimality of the normalized margin using SVM with Neural Tangent Kernel (NTK, introduced in (Jacot et al., 2018)) defined at limit points. The proof is deferred to Appendix C.6.

Corollary 4.5 (Corollary of Theorem 4.4).

Assume (S1). Then for gradient flow under assumptions (A2) - (A4) or gradient descent under assumptions (A2) - (A4), (S5), any limit point 𝛉¯ of {𝛉(t)/𝛉(t)2:t0} is along the max-margin direction for the hard-margin SVM with kernel K𝛉¯(𝐱,𝐱)=Φ𝐱(𝛉¯),Φ𝐱(𝛉¯), where Φ𝐱(𝛉):=Φ(𝛉;𝐱). That is, for some α>0, α𝛉¯ is the optimal solution for the following constrained optimization problem:

min12𝜽22   s.t.yn𝜽,Φ𝒙n(𝜽¯)1  n[N]

If we assume (A1) instead of (S1) for gradient flow, then there exists a mapping 𝐡(𝐱)Φ𝐱(𝛉¯) such that the same conclusion holds for K𝛉¯(𝐱,𝐱)=𝐡(𝐱),𝐡(𝐱).

4.4 Other Main Results

The above results can be extended to other settings as shown below.

Other Binary Classification Loss.   The results on exponential loss can be generalized to a much broader class of binary classification loss. The class includes the logistic loss which is one of the most popular loss functions, (q)=log(1+e-q). The function class also includes other losses with exponential tail, e.g., (q)=e-q3,(q)=log(1+e-q3). For all those loss functions, we can use its inverse function -1 to define the smoothed normalized margin as follows

γ~(𝜽)=-1((𝜽))𝜽2L.

Then all our results for gradient flow continue to hold (Appendix A). Using a similar modification, we can also extend it to gradient descent (Appendix F).

Cross-entropy Loss.   In multi-class classification, we can define qn to be the difference between the classification score for the true label and the maximum score for the other labels, then the margin qmin:=minn[N]qn and the normalized margin γ¯(𝜽):=qmin(𝜽)𝜽2L can be similarly defined as before. In Appendix G, we define the smoothed normalized margin for cross-entropy loss to be the same as that for logistic loss (See Remark A.4). Then we show that Theorem 4.1 and Theorem 4.4 still hold (but with a slightly different definition of (P)) for gradient flow, and we also extend the results to gradient descent.

Multi-homogeneous Models.   Some neural networks indeed possess a stronger property than homogeneity, which we call multi-homogeneity. For example, the output of a CNN (without bias terms) is 1-homogeneous with respect to the weights of each layer. In general, we say that a neural network Φ(𝜽;𝒙) with 𝜽=(𝒘1,,𝒘m) is (k1,,km)-homogeneous if for any 𝒙 and any c1,,cm>0, we have Φ(c1𝒘1,,cm𝒘m;𝒙)=i=1mcikiΦ(𝒘1,,𝒘m;𝒙). In the previous example, an L-layer CNN with layer weights 𝜽=(𝒘1,,𝒘L) is (1,,1)-homogeneous.

One can easily see that that (k1,,km)-homogeneity implies L-homogeneity, where L=i=1mki, so our previous analysis for homogeneous models still applies to multi-homogeneous models. But it would be better to define the normalized margin for multi-homogeneous model as

γ¯(𝒘1,,𝒘m):=qmin(𝒘1𝒘12,,𝒘m𝒘m2)=qmini=1m𝒘i2ki. (4)

In this case, the smoothed approximation of γ¯ for general binary classification loss (under some conditions) can be similarly defined for gradient flow:

γ~(𝒘1,,𝒘m):=-1()i=1m𝒘i2ki, (5)

It can be shown that γ~ is also non-decreasing during training when the loss is small enough (Appendix H). In the case of cross-entropy loss, we can still define γ~ by (5) while () is set to the logistic loss in the formula.

5 Proof Sketch: Gradient Flow on Homogeneous Model with Exponential Loss

In this section, we present a proof sketch in the case of gradient flow on homogeneous model with exponential loss to illustrate our proof ideas. Due to space limit, the proof for the main theorems on gradient flow and gradient descent in Section 4 are deferred to Appendix A and E respectively.

For convenience, we introduce a few more notations for a L-homogeneous neural network Φ(𝜽;𝒙). Let 𝒮d-1={𝜽d:𝜽2=1} be the set of L2-normalized parameters. Define ρ:=𝜽2 and 𝜽^:=𝜽𝜽2𝒮d-1 to be the length and direction of 𝜽. For both gradient descent and gradient flow, 𝜽 is a function of time t. For convenience, we also view the functions of 𝜽, including (𝜽),qn(𝜽),qmin(𝜽), as functions of t. So we can write (t):=(𝜽(t)),qn(t):=qn(𝜽(t)),qmin(t):=qmin(𝜽(t)).

Lemma 5.1 below is the key lemma in our proof. It decomposes the growth of the smoothed normalized margin into the ratio of two quantities related to the radial and tangential velocity components of 𝜽 respectively. We will give a proof sketch for this later in this section. We believe that this lemma is of independent interest.

Lemma 5.1 (Corollary of Lemma B.1).

For a.e. t>t0,

ddtlogρ>0𝑎𝑛𝑑ddtlogγ~L(ddtlogρ)-1d𝜽^dt22.

Using Lemma 5.1, the first two claims in Theorem 4.1 can be directly proved. For the third claim, we make use of the monotonicity of the margin to lower bound the gradient, and then show 0 and ρ+. Recall that γ~ is an O(ρ-L)-additive approximation for γ¯. So this proves the third claim. We defer the detailed proof to Appendix B.

To show Theorem 4.4, we first change the time measure to logρ, i.e., now we see t as a function of logρ. So the second inequality in Lemma 5.1 can be rewritten as dlogγ~dlogρLd𝜽^dlogρ22. Integrating on both sides and noting that γ~ is upper-bounded, we know that there must be many instant logρ such that d𝜽^dlogρ2 is small. By analyzing the landscape of training loss, we show that these points are “approximate” KKT points. Then we show that every convergent sub-sequence of {𝜽^(t):t0} can be modified to be a sequence of “approximate” KKT points which converges to the same limit. Then we conclude the proof by applying a theorem from (Dutta et al., 2013) to show that the limit of this convergent sequence of “approximate” KKT points is a KKT point. We defer the detailed proof to Appendix C.

Now we give a proof sketch for Lemma 5.1, in which we derive the formula of γ~ step by step. In the proof, we obtain several clean close form formulas for several relevant quantities, by using the chain rule and Euler’s theorem for homogeneous functions extensively.

Proof Sketch of Lemma 5.1.

For ease of presentation, we ignore the regularity issues of taking derivatives in this proof sketch. We start from the equation ddt=-(𝜽(t)),d𝜽dt=-d𝜽dt22 which follows from the chain rule (see also Lemma I.3). Then we note that d𝜽dt can be decomposed into two parts: the radial component 𝒗:=𝜽^𝜽^d𝜽dt and the tangent component 𝒖:=(𝑰-𝜽^𝜽^)d𝜽dt.

The radial component is easier to analyze. By the chain rule, 𝒗2=𝜽^d𝜽dt=1ρ𝜽,d𝜽dt=1ρ12dρ2dt. For 12dρ2dt, we have an exact formula:

12dρ2dt=𝜽,d𝜽dt=n=1Ne-qnqn,𝜽=Ln=1Ne-qnqn. (6)

The last equality is due to qn,𝜽=Lqn by homogeneity of qn. This is sometimes called Euler’s theorem for homogeneous functions (see Theorem B.2). For differentiable qn, it can be easily proved by taking the derivative over c on both sides of qn(c𝜽)=cLqn(𝜽) and letting c=1.

With (6), we can lower bound 12dρ2dt by

12dρ2dt=Ln=1Ne-qnqnLn=1Ne-qnqminLlog1, (7)

where the last inequality uses the fact that e-qmin. (7) also implies that 12dρ2dt>0 for t>t0 since (t0)<1 and is non-increasing. As ddtlogρ=12ρ2dρ2dt, this also proves the first inequality of Lemma 5.1.

Now, we have 𝒗22=1ρ2(12dρ2dt)2=12dρ2dtddtlogρ on the one hand; on the other hand, by the chain rule we have d𝜽^dt=1ρ2(ρd𝜽dt-dρdt𝜽)=1ρ2(ρd𝜽dt-(𝜽^d𝜽dt)𝜽)=𝒖ρ. So we have

-ddt=d𝜽dt22=𝒗22+𝒖22=12dρ2dtddtlogρ+ρ2d𝜽^dt22

Dividing 12dρ2dt on the leftmost and rightmost sides, we have

-ddt(12dρ2dt)-1=ddtlogρ+(ddtlogρ)-1d𝜽^dt22.

By -ddt0 and (7), the LHS is no greater than -ddt(Llog1)-1=1Lloglog1. Thus we have ddtloglog1-LddtlogρL(ddtlogρ)-1d𝜽^dt22, where the LHS is exactly ddtlogγ~. ∎

6 Discussion and Future Directions

In this paper, we analyze the dynamics of gradient flow/descent of homogeneous neural networks under a minimal set of assumptions. The main technical contribution of our work is to prove rigorously that for gradient flow/descent, the normalized margin is increasing and converges to a KKT point of a natural max-margin problem. Our results leads to some natural further questions:

  • Can we generalize our results for gradient descent on smooth neural networks to non-smooth ones? In the smooth case, we can lower bound the decrement of training loss by the gradient norm squared, multiplied by a factor related to learning rate. However, in the non-smooth case, no such inequality is known in the optimization literature.

  • Can we make more structural assumptions on the neural network to prove stronger results? In this work, we use a minimal set of assumptions to show that the convergent direction of parameters is a KKT point. A potential research direction is to identify more key properties of modern neural networks and show that the normalized margin at convergence is locally or globally optimal (in terms of optimizing (P)).

  • Can we extend our results to neural networks with bias terms? In our experiments, the normalized margin of the CNN with bias also increases during training despite that its output is non-homogeneous. It is very interesting (and technically challenging) to provide a rigorous proof for this fact.

Acknowledgments

The research is supported in part by the National Natural Science Foundation of China Grant 61822203, 61772297, 61632016, 61761146003,and the Zhongguancun Haihua Institute for Frontier Information Technology and Turing AI Institute of Nanjing. We thank Liwei Wang for helpful suggestions on the connection between margin and robustness. We thank Sanjeev Arora, Tianle Cai, Simon Du, Jason D. Lee, Zhiyuan Li, Tengyu Ma, Ruosong Wang for helpful discussions.

References

  • Absil et al. (2005) Pierre-Antoine Absil, Robert Mahony, and Benjamin Andrews. Convergence of the iterates of descent methods for analytic cost functions. SIAM Journal on Optimization, 16(2):531–547, 2005.
  • Ali et al. (2019) Alnur Ali, J. Zico Kolter, and Ryan J. Tibshirani. A continuous-time view of early stopping for least squares regression. In Kamalika Chaudhuri and Masashi Sugiyama (eds.), Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pp. 1370–1378. PMLR, 16–18 Apr 2019.
  • Allen-Zhu et al. (2019) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 242–252, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • Anil et al. (2019) Cem Anil, James Lucas, and Roger Grosse. Sorting out Lipschitz function approximation. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 291–301, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • Arora et al. (2019a) Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d\textquotesingleAlché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 7411–7422. Curran Associates, Inc., 2019a.
  • Arora et al. (2019b) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d\textquotesingleAlché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8139–8148. Curran Associates, Inc., 2019b.
  • Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 274–283, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • Banburski et al. (2019) Andrzej Banburski, Qianli Liao, Brando Miranda, Tomaso Poggio, Lorenzo Rosasco, and Jack Hidary. Theory III: Dynamics and generalization in deep networks. CBMM Memo No: 090, version 20, 2019.
  • Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 6240–6249. Curran Associates, Inc., 2017.
  • Biggio et al. (2013) Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, and Filip Železný (eds.), Machine Learning and Knowledge Discovery in Databases, pp. 387–402, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
  • Blanc et al. (2019) Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. arXiv preprint arXiv:1904.09080, 2019.
  • Carlini & Wagner (2017) Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57, May 2017. doi: 10.1109/SP.2017.49.
  • Cisse et al. (2017) Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 854–863, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Clarke et al. (2008) Francis H. Clarke, Yuri S. Ledyaev, Ronald J. Stern, and Peter R. Wolenski. Nonsmooth analysis and control theory, volume 178. Springer Science & Business Media, 2008.
  • Clarke (1975) Frank H. Clarke. Generalized gradients and applications. Transactions of the American Mathematical Society, 205:247–262, 1975.
  • Clarke (1990) Frank H Clarke. Optimization and Nonsmooth Analysis. Society for Industrial and Applied Mathematics, 1990. doi: 10.1137/1.9781611971309.
  • Coste (2002) Michel Coste. An Introduction to O-minimal Geometry. 2002.
  • Curry (1944) Haskell B Curry. The method of steepest descent for non-linear minimization problems. Quarterly of Applied Mathematics, 2(3):258–261, 1944.
  • Davis et al. (2020) Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D. Lee. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, 20(1):119–154, Feb 2020.
  • Drusvyatskiy et al. (2015) Dmitriy Drusvyatskiy, Alexander D Ioffe, and Adrian S Lewis. Curves of descent. SIAM Journal on Control and Optimization, 53(1):114–138, 2015.
  • Du & Lee (2018) Simon Du and Jason Lee. On the power of over-parametrization in neural networks with quadratic activation. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1329–1338, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • Du et al. (2019) Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1675–1685, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • Du et al. (2018) Simon S. Du, Wei Hu, and Jason D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 382–393. Curran Associates, Inc., 2018.
  • Dutta et al. (2013) Joydeep Dutta, Kalyanmoy Deb, Rupesh Tulshyan, and Ramnik Arora. Approximate KKT points and a proximity measure for termination. Journal of Global Optimization, 56(4):1463–1499, 2013.
  • Gidel et al. (2019) Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d\textquotesingleAlché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 3196–3206. Curran Associates, Inc., 2019.
  • Giorgi et al. (2004) Giorgio Giorgi, Angelo Guerraggio, and Jörg Thierfelder. Chapter IV - Nonsmooth Optimization Problems. In Mathematics of Optimization, pp. 359 – 457. Elsevier Science, Amsterdam, 2004. ISBN 978-0-444-50550-7.
  • Golowich et al. (2018) Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet (eds.), Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 297–299. PMLR, 06–09 Jul 2018.
  • Gunasekar et al. (2018a) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1832–1841, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018a. PMLR.
  • Gunasekar et al. (2018b) Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 9482–9491. Curran Associates, Inc., 2018b.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 8571–8580. Curran Associates, Inc., 2018.
  • Ji & Telgarsky (2018) Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
  • Ji & Telgarsky (2019a) Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. In International Conference on Learning Representations, 2019a.
  • Ji & Telgarsky (2019b) Ziwei Ji and Matus Telgarsky. A refined primal-dual analysis of the implicit bias. arXiv preprint arXiv:1906.04540, 2019b.
  • Ji & Telgarsky (2019c) Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In Alina Beygelzimer and Daniel Hsu (eds.), Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 1772–1798, Phoenix, USA, 25–28 Jun 2019c. PMLR.
  • Klusowski & Barron (2018) Jason M Klusowski and Andrew R Barron. Approximation by combinations of relu and squared relu ridge functions with 1 and 0 controls. IEEE Transactions on Information Theory, 64(12):7649–7656, 2018.
  • Lee et al. (2019) Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d\textquotesingleAlché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8570–8581. Curran Associates, Inc., 2019.
  • Li et al. (2019) Bo Li, Shanshan Tang, and Haijun Yu. Better approximations of high dimensional smooth functions by deep neural networks with rectified power units. arXiv preprint arXiv:1903.05858, 2019.
  • Li et al. (2018a) Xingguo Li, Junwei Lu, Zhaoran Wang, Jarvis Haupt, and Tuo Zhao. On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond. arXiv preprint arXiv:1806.05159, 2018a.
  • Li et al. (2018b) Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet (eds.), Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 2–47. PMLR, 06–09 Jul 2018b.
  • Ma et al. (2019) Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Foundations of Computational Mathematics, Aug 2019.
  • Nacson et al. (2019a) Mor Shpigel Nacson, Suriya Gunasekar, Jason Lee, Nathan Srebro, and Daniel Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4683–4692, Long Beach, California, USA, 09–15 Jun 2019a. PMLR.
  • Nacson et al. (2019b) Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. In Kamalika Chaudhuri and Masashi Sugiyama (eds.), Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pp. 3420–3428. PMLR, 16–18 Apr 2019b.
  • Nacson et al. (2019c) Mor Shpigel Nacson, Nathan Srebro, and Daniel Soudry. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. In Kamalika Chaudhuri and Masashi Sugiyama (eds.), Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pp. 3051–3059. PMLR, 16–18 Apr 2019c.
  • Neyshabur et al. (2015a) Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems 28, pp. 2422–2430. Curran Associates, Inc., 2015a.
  • Neyshabur et al. (2015b) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, 2015b.
  • Neyshabur et al. (2018) Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
  • Palis & De Melo (2012) J Jr Palis and Welington De Melo. Geometric theory of dynamical systems: an introduction. Springer Science & Business Media, 2012.
  • Ramdas & Pena (2016) Aaditya Ramdas and Javier Pena. Towards a deeper geometric, analytic and algorithmic understanding of margins. Optimization Methods and Software, 31(2):377–391, 2016.
  • Rosset et al. (2004) Saharon Rosset, Ji Zhu, and Trevor J. Hastie. Margin maximizing loss functions. In S. Thrun, L. K. Saul, and B. Schölkopf (eds.), Advances in Neural Information Processing Systems 16, pp. 1237–1244. MIT Press, 2004.
  • Schapire & Freund (2012) Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012. ISBN 0262017180, 9780262017183.
  • Schapire et al. (1998) Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist., 26(5):1651–1686, 10 1998. doi: 10.1214/aos/1024691352.
  • Shalev-Shwartz & Singer (2010) Shai Shalev-Shwartz and Yoram Singer. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. Machine learning, 80(2-3):141–163, 2010.
  • Sokolic et al. (2017) Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel R. D. Rodrigues. Robust large margin deep neural networks. IEEE Trans. Signal Processing, 65(16):4265–4280, 2017. doi: 10.1109/TSP.2017.2708039.
  • Soudry et al. (2018a) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018a.
  • Soudry et al. (2018b) Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations, 2018b.
  • Suggala et al. (2018) Arun Suggala, Adarsh Prasad, and Pradeep K Ravikumar. Connecting optimization and regularization paths. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 10608–10619. Curran Associates, Inc., 2018.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2013.
  • Telgarsky (2013) Matus Telgarsky. Margins, shrinkage, and boosting. In Sanjoy Dasgupta and David McAllester (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 307–315, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
  • van den Dries & Miller (1996) Lou van den Dries and Chris Miller. Geometric categories and o-minimal structures. Duke Mathematical Journal, 84(2):497–540, 1996.
  • Wei & Ma (2020) Colin Wei and Tengyu Ma. Improved sample complexities for deep neural networks and robust classification via an all-layer margin. In International Conference on Learning Representations, 2020.
  • Wei et al. (2019) Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets v.s. their induced kernel. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d\textquotesingleAlché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 9709–9721. Curran Associates, Inc., 2019.
  • Wilson et al. (2017) Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 4148–4158. Curran Associates, Inc., 2017.
  • Xu et al. (2018) Tengyu Xu, Yi Zhou, Kaiyi Ji, and Yingbin Liang. When will gradient methods converge to max-margin classifier under relu models? arXiv preprint arXiv:1806.04339, 2018.
  • Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
  • Zhang et al. (2019) Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. In International Conference on Learning Representations, 2019.
  • Zhong et al. (2017) Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 4140–4149, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Zou et al. (2018) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.
  • Zoutendijk (1976) Guus Zoutendijk. Mathematical programming methods. 1976.

Appendix A Results for General Loss

In this section, we state our results for a broad class of binary classification loss. A major consequence of this generalization is that the logistic loss, one of the most popular loss functions, (q)=log(1+e-q) is included. The function class also includes other losses with exponential tail, e.g., (q)=e-q3,(q)=log(1+e-q3).

A.1 Assumptions

We first focus on gradient flow. We assume (A1), (A2) as we do for exponential loss. For (A3), (A4), we replace them with two weaker assumptions (B3), (B4). All the assumptions are listed below:

  1. (A1).

    (Regularity). For any fixed 𝒙, Φ(;𝒙) is locally Lipschitz and admits a chain rule;

  2. (A2).

    (Homogeneity). There exists L>0 such that α>0:Φ(α𝜽;𝒙)=αLΦ(𝜽;𝒙);

  3. (B3).

    The loss function (q) can be expressed as (q)=e-f(q) such that

    1. (B3.1).

      f: is 𝒞1-smooth.

    2. (B3.2).

      f(q)>0 for all q.

    3. (B3.3).

      There exists bf0 such that f(q)q is non-decreasing for q(bf,+), and f(q)q+ as q+.

    4. (B3.4).

      Let g:[f(bf),+)[bf,+) be the inverse function of f on the domain [bf,+). There exists bgmax{2f(bf),f(2bf)},K1 such that g(x)Kg(θx) and f(y)Kf(θy) for all x(bg,+),y(g(bg),+) and θ[1/2,1).

  4. (B4).

    (Separability). There exists a time t0 such that (t0)<e-f(bf)=(bf).

(A1) and (A2) remain unchanged. (B3) is satisfied by exponential loss (q)=e-q (with f(q)=q) and logistic loss (q)=log(1+e-q) (with f(q)=-loglog(1+e-q)). (B4) are essentially the same as (A4) but (B4) uses a threshold value that depends on the loss function. Assuming (B3), it is easy to see that (B4) ensures the separability of data since (qn)<e-f(bf) implies qn>bf0. For logistic loss, we can set bf=0 (see Remark A.2). So the corresponding threshold value in (B4) is (0)=log2.

Now we discuss each of the assumptions in (B3). (B3.1) is a natural assumption on smoothness. (B3.2) requires () to be monotone decreasing, which is also natural since () is used for binary classification. The rest of two assumptions in (B3) characterize the properties of (q) when q is large enough. (B3.3) is an assumption that appears naturally from the proof. For exponential loss, f(q)q=q is always non-decreasing, so we can set bf=0. In (B3.4), the inverse function g is defined. It is guaranteed by (B3.1) and (B3.2) that g always exists and g is also 𝒞1-smooth. Though (B3.4) looks very complicated, it essentially says that f(Θ(q))=Θ(f(q)),g(Θ(q))=Θ(g(q)) as q. (B3.4) is indeed a technical assumption that enables us to asymptotically compare the loss or the length of gradient at different data points. It is possible to base our results on weaker assumptions than (B3.4), but we use (B3.4) for simplicity since it has already been satisfied by many loss functions such as the aforementioned examples.

We summarize the corresponding f,g and bf for exponential loss and logistic loss below:

Remark A.1.

Exponential loss (q)=e-q satisfies (B3) with

f(q)=q  f(q)=1  g(q)=q  g(q)=1  bf=0  (bf)=1.
Remark A.2.

Logistic loss (q)=log(1+e-q) satisfies (B3) with

f(q)=-loglog(1+e-q)=Θ(q) f(q)=e-q(1+e-q)log(1+e-q)=Θ(1)
g(q)=-log(ee-q-1)=Θ(q) g(q)=e-qee-q-1=Θ(1)
bf=0 (bf)=log2.

The proof for Remark A.1 is trivial. For Remark A.2, we give a proof below.

Proof for Remark A.2.

By simple calculations, the formulas for f(q),f(q),g(q),g(q) are correct. (B3.1) is trivial. f(q)=e-q(1+e-q)log(1+e-q)>0, so (B3.2) is satisfied. For (B3.3), note that f(q)q=q(1+eq)log(1+e-q). The denominator is a decreasing function since

ddq((1+eq)log(1+e-q))=eqlog(1+e-q)-1<eqe-q-1=0.

Thus, f(q)q is a strictly increasing function on . As bf is required to be non-negative, we set bf=0. For proving that f(q)q+ and (B4), we only need to notice that f(q)e-q1e-q=1 and g(x)=1/f(g(x))1. ∎

A.2 Smoothed Normalized Margin

For a loss function () satisfying (B3), it is easy to see from (B3.2) that its inverse function -1() must exist. For this kind of loss functions, we define the smoothed normalized margin as follows:

Definition A.3.

For a loss function () satisfying (B3), the smoothed normalized margin γ~(𝜽) of 𝜽 is defined as

γ~(𝜽):=-1()ρL=g(log1)ρL=g(-log(n=1Ne-f(qn(𝜽))))ρL,

where -1() is the inverse function of () and ρ:=𝜽2.

Remark A.4.

For logistic loss (q)=log(1+e-q), γ~(𝛉)=ρ-Llog1exp(L)-1; for exponential loss (q)=e-q, γ~(𝛉)=ρ-Llog1L, which is the same as (3).

Now we give some insights on how well γ~(𝜽) approximates γ¯(𝜽) using a similar argument as in Section 4.2. Using the LogSumExp function, the smoothed normalized margin γ~(𝜽) can also be written as

γ~(𝜽)=g(-LSE(-f(q1),,-f(qN)))ρL.

LSE is a (logN)-additive approximation for max. So we can roughly approximate γ~(𝜽) by

γ~(𝜽)g(-max{-f(q1),,-f(qN)})ρL=g(f(qmin))ρL=γ¯(𝜽).

Note that (B3.3) is crucial to make the above approximation reasonable. Similar to exponential loss, we can show the following lemma asserting that γ~ is a good approximation of γ¯.

Lemma A.5.

Assuming (B3)44 4 Indeed, (B3.4) is not needed for showing Lemma A.5 and Theorem A.7., we have the following properties about the margin:

  1. (a)

    f(qmin)-logNlog1f(qmin).

  2. (b)

    If log1>f(bf), then there exists ξ(f(qmin)-logN,f(qmin))(bf,+) such that

    γ¯-g(ξ)logNρLγ~γ¯.
  3. (c)

    For a seuqnce of parameters {𝜽md:m}, if (𝜽m)0, then |γ~(𝜽m)-γ¯(𝜽m)|0.

Proof.

(a) can be easily deduced from e-f(qmin)Ne-f(qmin). Combining (a) and the monotonicity of g(), we further have g(s)g(log1)qmin for s:=max{f(bf),f(qmin)-logN}. By the mean value Theorem, there exists ξ(s,f(qmin)) such that g(s)=g(f(qmin))-g(ξ)(f(qmin)-s)qmin-g(ξ)logN. Dividing ρL on each side of qmin-g(ξ)logNg(log1)qmin proves (b).

Now we prove (c). Without loss of generality, we assume log1(𝜽m)>f(bf) for all 𝜽m. It follows from (b) that for every 𝜽m there exists ξm(f(qmin(𝜽m))-logN,f(qmin(𝜽m)))(bf,+) such that

γ¯(𝜽m)-(g(ξm)logN)/ρ(𝜽m)Lγ~(𝜽m)γ¯(𝜽m). (8)

Note that ξmf(qmin(𝜽m))-logNlog1(𝜽m)-logN+. So g(ξm)g(ξm)=f(g(ξm))g(ξm)+ by (B3.3). Also note that there exists a constant B0 such that γ¯(𝜽m)B0 for all m since γ¯ is continuous on the unit sphere 𝒮d-1. So we have

g(ξm)ρ(𝜽m)L=g(ξm)g(ξm)g(ξm)ρ(𝜽m)Lg(ξm)g(ξm)qmin(𝜽m)ρ(𝜽m)L=g(ξm)g(ξm)γ¯(𝜽m)g(ξm)g(ξm)B0,

where the first inequality follows since ξmf(qmin(𝜽m)). Together with (8), we have |γ~(𝜽m)-γ¯(𝜽m)|0. ∎

Remark A.6.

For exponential loss, we have already shown in Section 4.2 that γ~(𝛉) is an O(ρ-L)-additive approximation for γ¯(𝛉). For logistic loss, it follows easily from g(q)=Θ(1) and (b) of Lemma A.5 that γ~(𝛉) is an O(ρ-L)-additive approximation for γ¯(𝛉) if L is sufficiently small.

A.3 Theorems

Now we state our main theorems. For the monotonicity of the normalized margin, we have the following theorem. The proof is provided in Appendix B.

Theorem A.7.

Under assumptions (A1), (A2), (B3)footnotemark: , (B4), the following statements are true for gradient flow:

  1. 1.

    For a.e. t>t0, ddtγ~(𝜽(t))0;

  2. 2.

    For a.e. t>t0, either ddtγ~(𝜽(t))>0 or ddt𝜽(t)𝜽(t)2=0;

  3. 3.

    (𝜽(t))0 and 𝜽(t)2 as t+; therefore, |γ¯(𝜽(t))-γ~(𝜽(t))|0.

For the normalized margin at convergence, we have two theorems, one for infinite-time limiting case, and the other being a finite-time quantitative result. Their proofs can be found in Appendix C. As in the exponential loss case, we define the constrained optimization problem (P) as follows:

min 12𝜽22
s.t. qn(𝜽)1  n[N]

First, we show the directional convergence of 𝜽(t) to a KKT point of (P).

Theorem A.8.

Consider gradient flow under assumptions (A1), (A2), (B3), (B4). For every limit point 𝛉¯ of {𝛉^(t):t0}, 𝛉¯/qmin(𝛉¯)1/L is a KKT point of (P).

Second, we show that after finite time, gradient flow can pass through an approximate KKT point.

Theorem A.9.

Consider gradient flow under assumptions (A1), (A2), (B3), (B4). For any ϵ,δ>0, there exists r:=Θ(logδ-1) and Δ:=Θ(ϵ-2) such that 𝛉/qmin(𝛉)1/L is an (ϵ,δ)-KKT point at some time t* satisfying log𝛉(t*)2(r,r+Δ).

For the definitions for KKT points and approximate KKT points, we refer the readers to Appendix C.1 for more details.

With a refined analysis, we can also provide tight rates for loss convergence and weight growth. The proof is given in Appendix D.

Theorem A.10.

Under assumptions (A1), (A2), (B3), (B4), we have the following tight rates for loss convergence and weight growth:

(𝜽(t))=Θ(g(logt)2/Lt(logt)2)𝑎𝑛𝑑𝜽(t)2=Θ(g(logt)1/L).

Applying Theorem A.10 to exponential loss and logistic loss, in which g(x)=Θ(x), we have the following corollary:

Corollary A.11.

If () is the exponential or logistic loss, then,

(𝜽(t))=Θ(1t(logt)2-2/L)𝑎𝑛𝑑𝜽(t)2=Θ((logt)1/L).

Appendix B Margin Monotonicity for General Loss

In this section, we consider gradient flow and prove Theorem A.7. We assume (A1), (A2), (B3), (B4) as mentioned in Appendix A.

We follow the notations in Section 5 to define ρ:=𝜽2 and 𝜽^:=𝜽𝜽2𝒮d-1, and sometimes we view the functions of 𝜽 as functions of t.

B.1 Proof for Proposition 1 and 2

To prove the first two propositions, we generalize our key lemma (Lemma 5.1) to general loss.

Lemma B.1.

For γ~ defined in Definition A.3, the following holds for all t>t0,

ddtlogρ>0𝑎𝑛𝑑ddtlogγ~L(ddtlogρ)-1d𝜽^dt22. (9)

Before proving Lemma B.1, we review two important properties of homogeneous functions. Note that these two properties are usually shown for smooth functions. By considering Clarke’s subdifferential, we can generalize it to locally Lipschitz functions that admit chain rules:

Theorem B.2.

Let F:RdR be a locally Lipschitz function that admits a chain rule. If F is k-homogeneous, then

  1. (a)

    For all 𝒙d and α>0,

    F(αx)=αk-1F(x)

    That is, F(αx)={αk-1𝒉:𝒉F(x)}.

  2. (b)

    (Euler’s Theorem for Homogeneous Functions). For all 𝒙d,

    𝒙,F(𝒙)=kF(𝒙)

    That is, 𝒙,𝒉=kF(𝒙) for all 𝒉F(𝒙).

Proof.

Let D be the set of points 𝒙 such that F is differentiable at 𝒙. According to the definition of Clarke’s subdifferential, for proving (a), it is sufficient to show that

{limkF(α𝒙k):𝒙k𝒙,α𝒙kD}={αk-1limkF(𝒙k):𝒙k𝒙,𝒙kD} (10)

Fix 𝒙kD. Let U be a neighborhood of 𝒙k. By definition of homogeneity, for any 𝒉d and any 𝒚U{𝒙k},

F(α𝒚)-F(α𝒙k)-α𝒚-α𝒙k,αk-1𝒉α𝒚-α𝒙k2=αk-1F(𝒚)-F(𝒙k)-𝒚-𝒙k,𝒉𝒚-𝒙k2.

Taking limits 𝒚𝒙k on both sides, we know that the LHS converges to 0 iff the RHS converges to 0. Then by definition of differetiability and gradient, F is differentiable at α𝒙k iff it is differentiable at 𝒙k, and F(α𝒙k)=αk-1𝒉 iff F(𝒙k)=𝒉. This proves (10).

To prove (b), we fix 𝒙d. Let 𝒛:[0,+)d,αα𝒙 be an arc. By definition of homogeneity, (F𝒛)(α)=αkF(𝒙) for α>0. Taking derivative with respect to α on both sides (for differentiable points), we have

𝒉F(α𝒙):𝒙,𝒉=kαk-1F(𝒙) (11)

holds for a.e. α>0. Pick an arbitrary α>0 making (11) hold. Then by (a), (11) is equivalent to 𝒉F(𝒙):𝒙,αk-1𝒉=kαk-1F(𝒙), which proves (b). ∎

Applying Theorem B.2 to homogeneous neural networks, we have the following corollary:

Corollary B.3.

Under the assumptions (A1) and (A2), for any 𝛉Rd and 𝐱Rdx,

𝜽,Φ𝒙(𝜽)=LΦ𝒙(𝜽),

where Φ𝐱(𝛉)=Φ(𝛉;𝐱) is the network output for a fixed input 𝐱.

Corollary B.3 can be used to derive an exact formula for the weight growth during training.

Theorem B.4.

For a.e. t0,

12dρ2dt=Ln=1Ne-f(qn)f(qn)qn.
Proof.

The proof idea is to use Corollary B.3 and chain rules (See Appendix I for chain rules in Clarke’s sense). Applying the chain rule on tρ2=𝜽22 yields 12dρ2dt=-𝜽,𝒉 for all 𝒉 and a.e.t>0. Then applying the chain rule on 𝜽, we have

-n=1Ne-f(qn)f(qn)qn={n=1Ne-f(qn)f(qn)𝒉n:𝒉nqn}.

By Corollary B.3, 𝜽,𝒉n=Lqn, and thus 12dρ2dt=Ln=1Ne-f(qn)f(qn)qn. ∎

For convenience, we define ν(t):=n=1Ne-f(qn)f(qn)qn for all t0. Then Theorem B.4 can be rephrased as 12dρ2dt=Lν(t) for a.e. t0.

Lemma B.5.

For all t>t0,

ν(t)g(log1)g(log1).
Proof.

By Lemma A.5, qng(log1) for all n[N]. Then by Assumption (B3),

f(qn)qnf(g(log1))g(log1)=g(log1)g(log1).

Combining this with the definitions of ν(t) and gives

ν(t)=n=1Ne-f(qn)f(qn)qnn=1Ne-f(qn)g(log1)g(log1)=g(log1)g(log1).

Proof for Lemma B.1.

Note that ddtlogρ=12ρ2dρ2dt=Lν(t)ρ2 by Theorem B.4. Then it simply follows from Lemma B.5 that ddtlogρ>0 for a.e. t>t0. For the second inequality, we first prove that logγ~=log(-1()/ρL)=log(g(log1)/ρL) exists for all tt0. (t) is non-increasing with t. So (t)<e-f(bf) for all tt0. This implies that (1) log1 is always in the domain of g; (2) ρ>0 (otherwise Ne-f(0)>e-f(bf), contradicting (B4)). Therefore, γ~:=g(log1)/ρL exists and is always positive for all tt0, which proves the existence of logγ~.

By the chain rule and Lemma B.5, we have

ddtlogγ~=ddt(log(g(log1))-Llogρ) =g(log1)g(log1)1(-ddt)-L2ν(t)ρ2
1ν(t)(-ddt)-L2ν(t)ρ2
1ν(t)(-ddt-L2ν(t)2ρ2).

On the one hand, -ddt=d𝜽dt22 for a.e. t>0 by Lemma I.3; on the other hand, Lν(t)=𝜽,d𝜽dt by Theorem B.4. Combining these together yields

ddtlogγ~ 1ν(t)(d𝜽dt22-𝜽^,d𝜽dt2)=1ν(t)(𝑰-𝜽^𝜽^)d𝜽dt22.

By the chain rule, d𝜽^dt=1ρ(𝑰-𝜽^𝜽^)d𝜽dt for a.e. t>0. So we have

ddtlogγ~ρ2ν(t)d𝜽^dt22=L(ddtlogρ)-1d𝜽^dt22.

B.2 Proof for Proposition 3

To prove the third proposition, we prove the following lemma to show that 0 by giving an upper bound for . Since can never be 0 for bounded ρ, 0 directly implies ρ+. For showing |γ¯-γ~|0, we only need to apply (c) in Lemma A.5, which shows this when 0.

Lemma B.6.

For all t>t0,

G(1/(t))L2γ~(t0)2/L(t-t0)  for G(x):=1/(t0)xg(logu)2g(logu)2-2/L𝑑u.

Therefore, L(t)0 and ρ(t)+ as t.

Proof for Lemma B.6.

By Lemma I.3 and Theorem B.4,

-ddt=d𝜽dt22𝜽^,d𝜽dt2=L2ν(t)2ρ2

Using Lemma B.5 to lower bound ν and replacing ρ with (g(log1)/γ~)1/L by the definition of γ~, we have

-ddtL2(g(log1)g(log1))2(γ~(t)g(log1))2/LL2γ~(t0)2/Lg(log1)2-2/Lg(log1)2,

where the last inequality uses the monotonicity of γ~. So the following holds for a.e. tt0,

g(log1)2g(log1)2-2/Lddt1L2γ~(t0)2/L.

Integrating on both sides from t0 to t, we can conclude that

G(1/)L2γ~(t0)2/L(t-t0).

Note that 1/ is non-decreasing. If 1/ does not grow to +, then neither does G(1/). But the RHS grows to +, which leads to a contradiction. So 0.

To make 0, qmin must converge to +. So ρ+. ∎

Appendix C Convergence to the Max-margin Solution

In this section, we analyze the convergent direction of 𝜽 and prove Theorem A.8 and A.9, assuming (A1), (A2), (B3), (B4) as mentioned in Section A.

We follow the notations in Section 5 to define ρ:=𝜽2 and 𝜽^:=𝜽𝜽2𝒮d-1, and sometimes we view the functions of 𝜽 as functions of t.

C.1 Preliminaries for KKT Conditions

We first review the definition of Karush-Kuhn-Tucker (KKT) conditions for non-smooth optimization problems following from (Dutta et al., 2013).

Consider the following optimization problem (P) for 𝒙d:

min f(𝒙)
s.t. gn(𝒙)0  n[N]

where f,g1,,gn:d are locally Lipschitz functions. We say that 𝒙d is a feasible point of (P) if 𝒙 satisfies gn(𝒙)0 for all n[N].

Definition C.1 (KKT Point).

A feasible point 𝒙 of (P) is a KKT point if 𝒙 satisfies KKT conditions: there exists λ1,,λN0 such that

  1. 1.

    0f(𝒙)+n[N]λngn(𝒙);

  2. 2.

    n[N]:λngn(𝒙)=0.

It is important to note that a global minimum of (P) may not be a KKT point, but under some regularity assumptions, the KKT conditions become a necessary condition for global optimality. The regularity condition we shall use in this paper is the non-smooth version of Mangasarian-Fromovitz Constraint Qualification (MFCQ) (see, e.g., the constraint qualification (C.Q.5) in (Giorgi et al., 2004)):

Definition C.2 (MFCQ).

For a feasible point 𝒙 of (P), (P) is said to satisfy MFCQ at 𝒙 if there exists 𝒗d such that for all n[N] with gn(𝒙)=0,

𝒉gn(𝒙):𝒉,𝒗>0.

Following from (Dutta et al., 2013), we define an approximate version of KKT point, as shown below. Note that this definition is essentially the modified ϵ-KKT point defined in their paper, but these two definitions differ in the following two ways: (1) First, in their paper, the subdifferential is allowed to be evaluated in a neighborhood of 𝒙, so our definition is slightly stronger; (2) Second, their paper fixes δ=ϵ2, but in our definition we make them independent.

Definition C.3 (Approximate KKT Point).

For ϵ,δ>0, a feasible point 𝒙 of (P) is an (ϵ,δ)-KKT point if there exists λn0,𝒌f(𝒙),𝒉ngn(𝒙) for all n[N] such that

  1. 1.

    𝒌+n[N]λn𝒉n(𝒙)2ϵ;

  2. 2.

    n[N]:λngn(𝒙)-δ.

As shown in (Dutta et al., 2013), (ϵ,δ)-KKT point is an approximate version of KKT point in the sense that a series of (ϵ,δ)-KKT points can converge to a KKT point. We restate their theorem in our setting:

Theorem C.4 (Corollary of Theorem 3.6 in (Dutta et al., 2013)).

Let {𝐱kRd:kN} be a sequence of feasible points of (P), {ϵk>0:kN} and {δk>0:kN} be two sequences. 𝐱k is an (ϵk,δk)-KKT point for every k, and ϵk0,δk0. If 𝐱k𝐱 as k+ and MFCQ holds at 𝐱, then 𝐱 is a KKT point of (P).

C.2 KKT Conditions for (P)

Recall that for a homogeneous neural network, the optimization problem (P) is defined as follows:

min 12𝜽22
s.t. qn(𝜽)1  n[N]

Using the terminologies and notations in Appendix C.1, the objective and constraints are f(𝒙)=12𝒙22 and gn(𝒙)=1-qn(𝒙). The KKT points and approximate KKT points for (P) are defined as follows:

Definition C.5 (KKT Point of (P)).

A feasible point 𝜽 of (P) is a KKT point if there exist λ1,,λN0 such that

  1. 1.

    𝜽-n=1Nλn𝒉n=𝟎 for some 𝒉1,,𝒉N satisfying 𝒉nqn(𝜽);

  2. 2.

    n[N]:λn(qn(𝜽)-1)=0.

Definition C.6 (Approximate KKT Point of (P)).

A feasible point 𝜽 of (P) is an (ϵ,δ)-KKT point of (P) if there exists λ1,,λN0 such that

  1. 1.

    𝜽-n=1Nλn𝒉n2ϵ for some 𝒉1,,𝒉N satisfying 𝒉nqn(𝜽);

  2. 2.

    n[N]:λn(qn(𝜽)-1)δ.

By the homogeneity of qn, it is easy to see that (P) satisfies MFCQ, and thus KKT conditions are first-order necessary condition for global optimality.

Lemma C.7.

(P) satisfies MFCQ at every feasible point 𝛉.

Proof.

Take 𝒗:=𝜽. For all n[N] satisfying qn=1, by homogeneity of qn,

𝒗,𝒉=Lqn(𝜽)=L>0

holds for any 𝒉-qn(𝜽)=(1-qn(𝜽)). ∎

C.3 Key Lemmas

Define β(t):=1d𝜽dt2𝜽^,d𝜽dt to be the cosine of the angle between 𝜽 and d𝜽dt. Here β(t) is only defined for a.e. t>0. Since qn is locally Lipschitz, it can be shown that qn is (globally) Lipschitz on the compact set 𝒮d-1, which is the unit sphere in d. Define

B0:= sup{qnρL:𝜽d{𝟎}}
= sup{qn:𝜽𝒮d-1}<.
B1:= sup{𝒉2ρL-1:𝜽d{𝟎},𝒉qn,n[N]}
= sup{𝒉2:𝜽𝒮d-1,𝒉qn,n[N]}<.

For showing Theorem A.8 and Theorem A.9, we first prove Lemma C.8. In light of this lemma, if we aim to show that 𝜽 is along the direction of an approximate KKT point, we only need to show β1 (which makes ϵ0) and 0 (which makes δ0).

Lemma C.8.

Let C1,C2 be two constants defined as

C1:=2γ~(t0)1/L,C2:=2eNK2Lγ~(t0)2/L(B1γ~(t0))log2K,

where K,bg are constants specified in (B3.4). If log1Lbg at time t>t0, then 𝛉~:=𝛉/qmin(𝛉)1/L is an (ϵ,δ)-KKT point of (P), where ϵ:=C11-β and δ:=C2/(log1L).

Proof.

Let 𝒉(t):=d𝜽dt(t) for a.e. t>0. By the chain rule, there exist 𝒉1,,𝒉N such that 𝒉nqn and 𝒉=n[N]e-f(qn)f(qn)𝒉n. Let 𝒉~n:=𝒉n/qmin1-1/Lqn(𝜽~) (Recall that 𝜽~:=𝜽/qmin(𝜽)1/L). Construct λn:=qmin1-2/Lρe-f(qn)f(qn)/𝒉2. Now we only need to show

𝜽~-n=1Nλn𝒉~n22 2γ~2/L(1-β) (12)
n=1Nλn(qn(𝜽~)-1) (2eNK2Lγ~2/L(B1γ~)log2K)1f(γ~ρL). (13)

Then 𝜽~ can be shown to be an (ϵ,δ)-KKT point by the monotonicity γ~(t)γ~(t0) for t>t0.

Proof of (12).

From our construction, n=1Nλn𝒉~n=qmin-1/Lρ𝒉/𝒉2. So

𝜽~-n=1Nλn𝒉~n22=qmin-2/Lρ2𝜽^-𝒉𝒉222=qmin-2/Lρ2(2-2β)2γ~2/L(1-β),

where the last equality is by Lemma A.5.

Proof for (13).

According to our construction,

n=1Nλn(qn(𝜽~)-1)=qmin-2/Lρ𝒉2n=1Ne-f(qn)f(qn)(qn-qmin).

Note that 𝒉2𝒉,𝜽^=Lν/ρ. By Lemma B.5 and Lemma D.1, we have

νg(log1)g(log1)12Klog112Kf(γ~ρL)e-f(qmin),

where the last inequality uses f(γ~ρL)=log1 and e-f(qmin). Combining these gives

n=1Nλn(qn(𝜽~)-1)2Kqmin-2/Lρ2Lf(γ~ρL)n=1Nef(qmin)-f(qn)f(qn)(qn-qmin).

If qn>qmin, then by the mean value theorem there exists ξn(qmin,qn) such that f(qn)-f(qmin)=f(ξn)(qn-qmin). By Assumption (B3.4), we know that f(qn)Klog2(qn/ξn)f(ξn). Note that log2(qn/ξn)log2(2B1ρL/qmin)log2(2B1/γ~). Then we have

n=1Nλn(qn(𝜽~)-1) 2Kqmin-2/Lρ2Lf(γ~ρL)Klog2(2B1/γ~)n:qnqmine-f(ξn)(qn-qmin)f(ξn)(qn-qmin)
2Kγ~-2/Lρ2Lf(γ~ρL)Klog2(2B1/γ~)Ne

where the second inequality uses qmin-2/Lρ2γ~-2/L by Lemma A.5 and the fact that the function xe-xx on (0,+) attains the maximum value e at x=1. ∎

By Theorem A.7, we have already known that 0. So it remains to bound β(t). For this, we first prove the following lemma to bound the integral of β(t).

Lemma C.9.

For all t2>t1t0,

t1t2(β(τ)-2-1)ddτlogρ(τ)𝑑τ1Llogγ~(t2)γ~(t1).
Proof.

By Lemma B.4, ddtlogρ=12ρ2dρ2dt=Lνρ2 for a.e. t>0. By Lemma B.1, for a.e. t(t1,t2),

ddtlogγ~Lρ2Lνd𝜽^dt22ddtlogρ. (14)

By the chain rule, d𝜽^dt=1ρ(𝑰-𝜽^𝜽^)d𝜽dt. So we have

ρ2Lνd𝜽^dt22=ρLν(t)(𝑰-𝜽^𝜽^)d𝜽dt22=d𝜽dt22-d𝜽dt,𝜽^2d𝜽dt,𝜽^2=β-2-1. (15)

where the last equality follows from the definition of β. Combining 14 and 15, we have

ddtlogγ~L(β-2-1)ddtlogρ.

Integrating on both sides from t1 to t2 proves the lemma. ∎

A direct corollary of Lemma C.9 is the upper bound for the minimum β2-1 within a time interval:

Corollary C.10.

For all t2>t1t0, then there exists t*(t1,t2) such that

β(t*)-2-11Llogγ~(t2)-logγ~(t1)logρ(t2)-logρ(t1).
Proof.

Denote the RHS as C. Assume to the contrary that β(τ)-2-1>C for a.e. τ(t1,t2). By Lemma B.1, logρ(τ)>0 for a.e. τ(t1,t2). Then by Lemma C.9, we have

1Llogγ~(t2)γ~(t1)>t1t2Cddτlogρ(τ)𝑑τ=C(logρ(t2)-logρ(t1))=1Llogγ~(t2)γ~(t1),

which leads to a contradiction. ∎

In the rest of this section, we present both asymptotic and non-asymptotic analyses for the directional convergence by using Corollary C.10 to bound β(t).

C.4 Asymptotic Analysis

We first prove an auxiliary lemma which gives an upper bound for the change of 𝜽^.

Lemma C.11.

For a.e. t>t0,

d𝜽^dt2B1Lγ~ddtlogρ.
Proof.

Observe that d𝜽^dt2=1ρ(𝑰-𝜽^𝜽^)d𝜽dt21ρd𝜽dt2. It is sufficient to bound d𝜽dt2. By the chain rule, there exists 𝒉1,,𝒉N:[0,+)d satisfying that for a.e. t>0, 𝒉n(t)qn and d𝜽dt=n[N]e-f(qn)f(qn)𝒉n(t). By definition of B1, 𝒉n2B1ρL-1 for a.e. t>0. So we have

d𝜽dt2 n[N]e-f(qn)f(qn)𝒉n2
n[N]e-f(qn)f(qn)qn1qnB1ρL-1.

Note that every summand is positive. By Lemma A.5, qn is lower-bounded by qnqming(log1), so we can replace qn with g(log1) in the above inequality. Combining with the fact that n[N]e-f(qn)f(qn)qn is just ν, we have

d𝜽dt2νg(log1)B1ρL-1=B1νγ~ρ. (16)

So we have d𝜽^dt21ρd𝜽dt2B1Lγ~Lνρ2=B1Lγ~ddtlogρ. ∎

To prove Theorem A.8, we consider each limit point 𝜽¯/qmin(𝜽¯)1/L, and construct a series of approximate KKT points converging to it. Then 𝜽¯/qmin(𝜽¯)1/L can be shown to be a KKT point by Theorem C.4. The following lemma ensures that such construction exists.

Lemma C.12.

For every limit point 𝛉¯ of {𝛉^(t):t0}, there exists a sequence of {tm:mN} such that tm+,𝛉^(tm)𝛉¯, and β(tm)1.

Proof.

Let {ϵm>0:m} be an arbitrary sequence with ϵm0. Now we construct {tm} by induction. Suppose t1<t2<<tm-1 have already been constructed. Since 𝜽¯ is a limit point and γ~(t)γ~ (recall that γ~:=limt+γ~(t)), there exists sm>tm-1 such that

𝜽^(sm)-𝜽¯2ϵmand1Llogγ~γ~(sm)ϵm3.

Let sm>sm be a time such that logρ(sm)=logρ(sm)+ϵm. According to Theorem A.7, logρ+, so sm must exist. We construct tm(sm,sm) to be a time that β(tm)-2-1ϵm2, where the existence can be shown by Corollary C.10.

Now we show that this construction meets our requirement. It follows from β(tm)-2-1ϵm2 that β(tm)1/1+ϵm21. By Lemma C.11, we also know that

𝜽^(tm)-𝜽¯2𝜽^(tm)-𝜽^(sm)2+𝜽^(sm)-𝜽¯2B1Lγ~(t0)ϵm+ϵm0.

This completes the proof. ∎

Proof of Theorem A.8.

Let 𝜽~:=𝜽¯/qmin(𝜽¯)1/L for short. Let {tm:m} be the sequence constructed as in Lemma C.12. For each tm, define ϵ(tm) and δ(tm) as in Lemma C.8. Then we know that 𝜽(tm)/qmin(tm)1/L is an (ϵ(tm),δ(tm))-KKT point and 𝜽(tm)/qmin(tm)1/L𝜽~,ϵ(tm)0,δ(tm)0. By Lemma C.7, (P) satisfies MFCQ. Applying Theorem C.4 proves the theorem. ∎

C.5 Non-asymptotic Analysis

Proof of Theorem A.9.

Let C0:=1Llogγ~γ~(t0). Without loss of generality, we assume ϵ<62C1 and δ<C2/f(bf).

Let t1 be the time such that logρ(t1)=1Llogg(C2δ-1)γ~(t0)=Θ(logδ-1) and t2 be the time such that logρ(t2)-logρ(t1)=12C0C12ϵ-2=Θ(ϵ-2). By Corollary C.10, there exists t*(t1,t2) such that β(t*)-2-12ϵ2C1-2.

Now we argue that 𝜽~(t*) is an (ϵ,δ)-KKT point. By Lemma C.8, we only need to show C11-β(t*)ϵ and C2/f(γ~(t*)ρ(t*)L)δ. For the first inequality, by assumption ϵ<62C1, we know that β(t*)-2-1<3, which implies |β(t*)|<12. Then we have 1-β(t*)2ϵ2C1-2β(t*)21+β(t*)ϵ2C1-2. Therefore, C11-β(t*)ϵ holds. For the second inequality, γ~(t*)ρ(t*)Lγ~(t*)g(C2δ-1)γ~(t0)g(C2δ-1). Therefore, C2/f(γ~(t*)ρ(t*)L)δ holds. ∎

C.6 Proof for Corollary 4.5

By the homogeneity of qn, we can characterize KKT points using kernel SVM.

Lemma C.13.

If 𝛉* is KKT point of (P), then there exists 𝐡nΦ𝐱n(𝛉*) for n[N] such that 1L𝛉* is an optimal solution for the following constrained optimization problem (Q):

min 12𝜽22
s.t. yn𝜽,𝒉n1  n[N]
Proof.

It is easy to see that (Q) is a convex optimization problem. For 𝜽=2L𝜽*, from Theorem B.2, we can see that yn𝜽,𝒉n=2qn(𝜽*)2>1, which implies Slater’s condition. Thus, we only need to show that 1L𝜽* satisfies KKT conditions for (Q).

By the KKT conditions for (P), we can construct 𝒉nqn(𝜽*) for n[N] such that 𝜽*-n=1Nλnyn𝒉n=𝟎 for some λ1,,λN0 and λn(qn(𝜽*)-1)=0. Thus, 1L𝜽* satisfies

  1. 1.

    1L𝜽*-n=1N1Lλnyn𝒉n=𝟎;

  2. 2.

    1Lλn(yn1L𝜽*,𝒉n-1)=1Lλn(qn(𝜽)-1)0.

So 1L𝜽* satisfies KKT conditions for (Q). ∎

Now we prove Corollary 4.5 in Section 4.3.

Proof.

By Theorem A.8, every limit point 𝜽¯ is along the direction of a KKT point of (P). Combining this with Lemma C.13, we know that every limit point 𝜽¯ is also along the max-margin direction of (Q).

For smooth models, 𝒉n in (Q) is exactly the gradient Φ𝒙n(𝜽¯). So, (Q) is the optimization problem for SVM with kernel K𝜽¯(𝒙,𝒙)=Φ𝒙(𝜽¯),Φ𝒙(𝜽¯). For non-smooth models, we can construct an arbitrary function 𝒉(𝒙)Φ𝒙(𝜽¯) that ensures 𝒉(𝒙n)=𝒉n. Then, (Q) is the optimization problem for SVM with kernel K𝜽¯(𝒙,𝒙)=𝒉(𝒙),𝒉(𝒙). ∎

Appendix D Tight Bounds for Loss Convergence and Weight Growth

In this section, we give proof for Theorem A.10, which gives tight bounds for loss convergence and weight growth under Assumption (A1), (A2), (B3), (B4).

D.1 Consequences of (B3.4)

Before proving Theorem A.10, we show some consequences of (B3.4).

Lemma D.1.

For f() and g(), we have

  1. 1.

    For all x[bg,+), g(x)g(x)[12Kx,2Kx];

  2. 2.

    For all y[g(bg),+), f(y)f(y)[12Ky,2Ky].

Thus, g(x)=Θ(xg(x)),f(y)=Θ(yf(y)).

Proof.

To prove Item 1, it is sufficient to show that

g(x)=bf+f(bf)xg(u)𝑑u x/2xg(u)𝑑u
(x/2)g(x)K=12Kxg(x).
x=f(bf)+f(bf)xg(u)f(g(u))𝑑u f(g(x))Kf(g(x)/2)xg(u)𝑑u
=(g(x)/2)f(g(x))K=12Kg(x)g(x).

To prove Item 2, we only need to notice that Item 1 implies yf(y)=g(f(y))g(f(y))[12Kf(y),2Kf(y)] for all y[g(bg),+). ∎

Recall that (B3.4) directly implies that f(Θ(x))=Θ(f(x)) and g(Θ(x))=Θ(g(x)). Combining this with Lemma D.1, we have the following corollary:

Corollary D.2.

f(Θ(x))=Θ(f(x)) and g(Θ(x))=Θ(g(x)).

Also, note that Lemma D.1 essentially shows that (logf(x))=Θ(1/x) and (logg(x))=Θ(1/x). So logf(x)=Θ(logx) and logg(x)=Θ(logx), which means that f and g grow at most polynomially.

Corollary D.3.

f(x)=xΘ(1) and g(x)=xΘ(1).

D.2 Proof for Theorem A.10

We follow the notations in Section 5 to define ρ:=𝜽2 and 𝜽^:=𝜽𝜽2𝒮d-1, and sometimes we view the functions of 𝜽 as functions of t. And we use the notations B0,B1 from Appendix C.3.

The key idea to prove Theorem A.10 is to utilize Lemma B.6, in which (t) is bounded from above by 1G-1(Ω(t)). So upper bounding (t) reduces to lower bounding G-1. In the following lemma, we obtain tight asymptotic bounds for G() and G-1():

Lemma D.4.

For function G() defined in Lemma B.6 and its inverse function G-1(), we have the following bounds:

G(x)=Θ(g(logx)2/L(logx)2x)𝑎𝑛𝑑G-1(y)=Θ((logy)2g(logy)2/Ly).
Proof.

We first prove the bounds for G(x), and then prove the bounds for G-1(y).

Bounding for G(x).

Let CG=1/(t0)exp(bg)g(logu)2g(logu)2-2/L𝑑u. For xexp(bg),

G(x) =CG+exp(bg)xg(logu)2g(logu)2-2/L𝑑u
=CG+exp(bg)x(g(logu)logug(logu))2g(logu)2/L(logu)2𝑑u,

By Lemma D.1,

G(x)CG+4K2g(logx)2/Lexp(bg)x1(logu)2𝑑u=O(g(logx)2/L(logx)2x).

On the other hand, for xexp(2bg), we have

G(x)14K2xxg(logu)2/L(logu)2𝑑ug((logx)/2)2/L4K2xx1(logu)2𝑑u=Ω(g(logx)2/L(logx)2x).

Bounding for G-1(y).

Let x=G-1(y) for y0. G(x) always has a finite value whenever x is finite. So x+ when y+. According to the first part of the proof, we know that y=Θ(g(logx)2/L(logx)2x). Taking logarithm on both sides and using Corollary D.3, we have logy=Θ(logx). By Corollary D.2, g(logy)=g(Θ(logx))=Θ(g(logx)). Therefore,

(logy)2g(logy)2/Ly=Θ((logx)2g(logx)2/Ly)=Θ(x).

This implies that x=Θ((logy)2g(logy)2/Ly). ∎

For other bounds, we derive them as follows. We first show that g(log1)=Θ(ρL). With this equivalence, we derive an upper bound for the gradient at each time t in terms of , and take an integration to bound (t) from below. Now we have both lower and upper bounds for (t). Plugging these two bounds to g(log1)=Θ(ρL) gives the lower and upper bounds for ρ(t).

Proof for Theorem A.10.

We first prove the upper bound for . Then we derive lower and upper bounds for ρ in terms of , and use these bounds to give a lower bound for . Finally, we plug in the tight bounds for to obtain the lower and upper bounds for ρ in terms of t.

Upper Bounding .

By Lemma B.6, we have 1G-1(Ω(t)). Using Lemma D.4, we have 1=Ω((logt)2g(logt)2/Lt), which completes the proof.

Bounding ρ in Terms of .

γ~(t)γ~(t0), so ρL1γ~(t0)g(log1)=O(g(log1)). On the other hand, g(log1)qminB0ρL. So ρL=Ω(g(log1)). Therefore, we have the following relationship between ρL and g(log1):

ρL=Θ(g(log1)). (17)

Lower Bounding .

Let 𝒉1,,𝒉N be a set of vectors such that 𝒉nqn𝜽 and

d𝜽dt=n=1Ne-f(qn)f(qn)𝒉n.

By (17) and Corollary D.2, f(qn)=f(O(ρL))=f(O(g(log1)))=O(f(g(log1)))=O(1/g(log1)). Again by (17), we have 𝒉n2B1ρL-1=O(g(log1)1-1/L). Combining these two bounds together, it follows from Corollary D.2 that

f(qn)𝒉n2=O(g(log1)1-1/Lg(log1))=O(log1g(log1)1/L).

Thus,

-ddt=d𝜽dt22 =n=1Ne-f(qn)f(qn)𝒉n22
(n=1Ne-f(qn)maxn[N]{f(qn)𝒉n2})22O((log1)2g(log1)2/L).

By definition of G(), this implies that there exists a constant c such that ddtG(1)c for any that is small enough. We can complete our proof by applying Lemma D.4.

Bounding ρ in Terms of t.

By (17) and the tight bound for (t), ρL=Θ(g(log1))=Θ(g(Θ(logt))). Using Corollary D.2, we can conclude that ρL=Θ(g(logt)). ∎

Appendix E Gradient Descent: Smooth Homogeneous Models with Exponential Loss

In this section, we discretize our proof to prove similar results for gradient descent on smooth homogeneous models. As usual, the update rule of gradient descent is defined as

𝜽(t+1)=𝜽(t)-η(t)(t) (18)

Here η(t) is the learning rate, and (t):=(𝜽(t)) is the gradient of at 𝜽(t).

We first focus on the exponential loss. At the end of this section (Appendix F), we discuss how to extend the proof to general loss functions with a similar assumption as (B3).

The main difficulty for discretizing our previous analysis comes from the fact that the original version of the smoothed normalized margin γ~(𝜽):=ρ-Llog1 becomes less smooth when ρ+. Thus, if we take a Taylor expansion for γ~(𝜽(t+1)) from the point 𝜽(t), although one can show that the first-order term is positive as in the gradient flow analysis, the second-order term is unlikely to be bounded during gradient descent with a constant step size. To get a smoothed version of the normalized margin that is monotone increasing, we need to define another one that is even smoother than γ~.

Technically, recall that ddt=-22 does not hold exactly for gradient descent. However, if the smoothness can be bounded by s(t), then it is well-known that

(t+1)-(t)-η(t)(1-s(t)η(t))22.

By analyzing the landscape of , one can easily find that the smoothness is bounded locally by O(polylog(1)). Thus, if we set η(t) to a constant or set it appropriately according to the loss, then this discretization error becomes negligible. Using this insight, we define the new smoothed normalized margin γ^ in a way that it increases slightly slower than γ~ during training to cancel the effect of discretization error.

E.1 Assumptions

As stated in Section 4.1, we assume (A2), (A3), (A4) similarly as for gradient flow, and two additional assumptions (S1) and (S5).

  1. (S1).

    (Smoothness). For any fixed 𝒙, Φ(;𝒙) is 𝒞2-smooth on d{𝟎}.

  2. (A2).

    (Homogeneity). There exists L>0 such that α>0:Φ(α𝜽;𝒙)=αLΦ(𝜽;𝒙);

  3. (A3).

    (Exponential Loss). (q)=e-q;

  4. (A4).

    (Separability). There exists a time t0 such that (𝜽(t0))<1.

  5. (S5).

    (Learing rate condition). t0η(t)=+ and η(t)H((𝜽(t))).

Here H() is a function of the current training loss. The explicit formula of H() is given below:

H(x):=μ(x)Cηκ(x)=Θ(1x(log1x)3-2/L),

where Cη is a constant, and κ(x),μ(x) are two non-decreasing functions. For constant learning rate η(t)=η0, (S5) is satisfied when η0 if sufficiently small.

Roughly speaking, Cηκ(x) is an upper bound for the smoothness of in a neighborhood of 𝜽 when x=(𝜽). And we set the learning rate η(t) to be the inverse of the smoothness multiplied by a factor μ(x)=o(1). In our analysis, μ(x) can be any non-decreasing function that maps (0,(t0)] to (0,1/2] and makes the integral 01/2μ(x)𝑑x exist. But for simplicity, we define μ(x) as

μ(x):=log1(t0)2log1x=Θ(1log1x).

The value of Cη will be specified later. The definition of κ(x) depends on L. For 0<L1, we define κ(x) as

κ(x):=x(log1x)2-2/L

For L>1, we define κ(x) as

κ(x):={x(log1x)2-2/Lx(0,e2/L-2]κmaxx(e2/L-2,1)

where κmax:=e(2-2/L)(ln(2-2/L)-1). The specific meaning of Cη,κ(x) and μ(x) will become clear in our analysis.

E.2 Smoothed Normalized Margin

Now we define the smoothed normalized margins. As usual, we define γ~(𝜽):=log1ρL. At the same time, we also define

γ^(𝜽):=eϕ()ρL.

Here ϕ:(0,(t0)](0,+) is constructed as follows. Construct the first-order derivative of ϕ(x) as

ϕ(x)=-sup{1+2(1+λ(w)/L)μ(w)wlog1w:w[x,(t0)]},

where λ(x):=(log1x)-1. And then we set ϕ(x) to be

ϕ(x)=loglog1x+0x(ϕ(w)+1wlog1w)𝑑w.

It can be verified that ϕ(x) is well-defined and ϕ(x) is indeed the first-order derivative of ϕ(x). Moreover, we have the following relationship among γ^,γ~ and γ¯.

Lemma E.1.

γ^(𝜽) is well-defined for L(𝛉)L(t0) and has the following properties:

  1. (a)

    If (𝜽)(t0), then γ^(𝜽)<γ~(𝜽)γ¯(𝜽).

  2. (b)

    Let {𝜽md:m} be a sequence of parameters. If (𝜽m)0, then

    γ^(𝜽m)γ¯(𝜽m)=1-O((log1(𝜽m))-1)1.
Proof.

First we verify that γ^ is well-defined. To see this, we only need to verify that

I(x):=0x(ϕ(w)+1wlog1w)𝑑w

exists for all x(0,(t0)], then it is trivial to see that ϕ(w) is indeed the derivative of ϕ(w) by I(x)=ϕ(x)+1xlog1x.

Note that I(x) exists for all x(0,(t0)] as long as I(x) exists for a small enough x>0. By definition, it is easy to verify that r(w):=1+2(1+λ(w)/L)μ(w)wlog1w is decreasing when w is small enough. Thus, for a small enough w>0, we have

-ϕ(w)=r(w)=1+2(1+1L(log1w)-1)log1(t0)2log1wwlog1w=1wlog1w+log1(t0)(1+1L(log1w)-1)w(log1w)2.

So we have the following for small enough x:

I(x)=0x-log1(t0)(1+1L(log1w)-1)w(log1w)2dw =-log1(t0)(1log1w+12L(log1w)2)|0x
=-log1(t0)(1log1x+12L(log1x)2). (19)

This proves the existence of I(x).

Now we prove (a). By Lemma A.5, γ~(𝜽)γ¯(𝜽), so we only need to prove that γ^(𝜽)<γ~(𝜽). To see this, note that for all w(t0), r(w)>1wlog1w. So we have -ϕ(w)>1wlog1w, and this implies I(x)<0 for all x(t0). Then it holds for all 𝜽 with (𝜽)(t0) that

γ^(𝜽)γ~(𝜽)=eϕ((𝜽))log1(𝜽)=eloglog1(𝜽)+I((𝜽))eloglog1(𝜽)=eI((𝜽))<0. (20)

To prove (b), we combine (19) and (20), then for small enough (𝜽m), we have

γ^(𝜽)γ¯(𝜽) =γ^(𝜽)γ~(𝜽)γ~(𝜽)γ¯(𝜽)
=exp(-log1(t0)(1log1(𝜽m)+12L(log1(𝜽m))2))log1(𝜽m)qmin(𝜽m)
=exp(-O(1log1(𝜽m)))log1(𝜽m)log1(𝜽m)+O(1)
=1-O((log1(𝜽m))-1).

So γ^(𝜽)γ¯(𝜽)=1-O((log1(𝜽m))-1)1. ∎

Now we specify the value of Cη. By (S1) and (S2), we can define B0,B1,B2 as follows:

B0 :=sup{qn:𝜽𝒮d-1,n[N]},
B1 :=sup{qn2:𝜽𝒮d-1,n[N]},
B2 :=sup{2qn2:𝜽𝒮d-1,n[N]}.

Then we set Cη:=12(B12+ρ(t0)-LB2)min{γ^(t0)-2+2/L,B0-2+2/L}.

E.3 Theorems

Now we state our main theorems for the monotonicity of the normalized margin and the convergence to KKT points. We will prove Theorem E.2 in Appendix E.4, and prove Theorem E.3 and E.4 in Appendix E.5.

Theorem E.2.

Under assumptions (S1), (A2) - (A4), (S5), the following are true for gradient descent:

  1. 1.

    For all tt0, γ^(t+1)γ^(t);

  2. 2.

    For all tt0, either γ^(t+1)>γ^(t) or 𝜽^(t+1)=𝜽^(t);

  3. 3.

    (t)0 and ρ(t) as t+; therefore, |γ¯(t)-γ~(t)|0.

Theorem E.3.

Consider gradient flow under assumptions (S1), (A2) - (A4), (S5). For every limit point 𝛉¯ of {𝛉^(t):t0}, 𝛉¯/qmin(𝛉¯)1/L is a KKT point of (P).

Theorem E.4.

Consider gradient descent under assumptions (S1), (A2) - (A4), (S5). For any ϵ,δ>0, there exists r:=Θ(logδ-1) and Δ:=Θ(ϵ-2) such that 𝛉/qmin(𝛉)1/L is an (ϵ,δ)-KKT point at some time t* satisfying logρ(t*)(r,r+Δ).

With a refined analysis, we can also derive tight rates for loss convergence and weight growth. We defer the proof to Appendix E.6.

Theorem E.5.

Under assumptions (S1), (A2) - (A4), (S5), we have the following tight rates for training loss and weight norm:

(t)=Θ(1T(logT)2-2/L)𝑎𝑛𝑑ρ(t)=Θ((logT)1/L),

where T=τ=t0t-1η(τ).

E.4 Proof for Theorem E.2

We define ν(t):=n=1Ne-qn(t)qn(t) as we do for gradient flow. Then we can get a closed form for 𝜽(t),-(t) easily from Corollary B.3. Also, we can get a lower bound for ν(t) using Lemma B.5 for exponential loss directly.

Corollary E.6.

𝜽(t),-(t)=Lν(t). If L(t)<1, then ν(t)L(t)λ(L(t)).

As we are analyzing gradient descent, the norm of and 2 appear naturally in discretization error terms. To bound them, we have the following lemma when γ~(𝜽) and ρ(𝜽) have lower bounds:

Lemma E.7.

For any 𝛉, if γ~(𝛉)γ^(t0),ρ(𝛉)ρ(t0), then

(𝜽)222Cηκ((𝜽))(𝜽)𝑎𝑛𝑑2(𝜽)22Cηκ((𝜽)).
Proof.

By the chain rule and the definitions of B1,B2, we have

2 =-n=1Ne-qnqn2ρL-1B1
22 =n=1Ne-qn(qnqn-2qn)2
n=1Ne-qn(B12ρ2L-2+B2ρL-2)ρ2L-2(B12+ρ-LB2).

Note that γ^(t0)ρLγ~ρLlog1B0ρL. So γ^(t0)-1log1ρLB0-1log1. Combining all these formulas together gives

22 B12min{γ^(t0)-2+2/L,B0-2+2/L}2(log1)2-2/L2Cηκ()
22 (B12+ρ-LB2)min{γ^(t0)-2+2/L,B0-2+2/L}(log1)2-2/L2Cηκ(),

which completes the proof. ∎

For proving the first two propositions in Theorem E.2, we only need to prove Lemma E.8. (P1) gives a lower bound for γ~. (P2) gives both lower and upper bounds for the weight growth using ν(t). (P3) gives a lower bound for the decrement of training loss. Finally, (P4) shows the monotonicity of γ^, and it is trivial to deduce the first two propositions in Theorem E.2 from (P4).

Lemma E.8.

For all t=t0,t0+1,, we interpolate between 𝛉(t) and 𝛉(t+1) by defining 𝛉(t+α)=𝛉(t)-αη(t)L(t) for α(0,1). Then for all integer tt0, ν(t)>0, and the following holds for all α[0,1]:

  1. (P1).

    γ~(t+α)>γ^(t0).

  2. (P2).

    2Lαη(t)ν(t)ρ(t+α)2-ρ(t)22Lαη(t)ν(t)(1+λ((t))μ((t))L).

  3. (P3).

    (t+α)-(t)-αη(t)(1-μ((t)))(t)22.

  4. (P4).

    logγ^(t+α)-logγ^(t)ρ(t)2Lν(t)2(𝑰-𝜽^(t)𝜽^(t))(t)22logρ(t+α)ρ(t).

To prove Lemma E.8, we only need to prove the following lemma and then use an induction:

Lemma E.9.

Fix an integer Tt0. Suppose that (P1), (P2), (P3), (P4) hold for any t+αT. Then if (P1) holds for (t,α){T}×[0,A) for some A(0,1], then all of (P1), (P2), (P3), (P4) hold for (t,α){T}×[0,A].

Proof for Lemma E.8.

We prove this lemma by induction. For t=t0,α=0, ν(t)>0 by (S4) and Corollary E.6. (P2), (P3), (P4) hold trivially since logγ^(t+α)=logγ^(t), (t+α)=(t) and logγ^(t+α)=logγ^(t). By Lemma E.1, (P1) also holds trivially.

Now we fix an integer Tt0 and assume that (P1), (P2), (P3), (P4) hold for any t+αT (where tt0 is an integer and α[0,1]). By (P3), (t)(t0)<1, so ν(t)>0. We only need to show that (P1), (P2), (P3), (P4) hold for t=T and α[0,1].

Let A:=inf{α[0,1]:α=1 or (P1) does not hold for (T,α)}. If A=0, then (P1) holds for (T,A) since (P1) holds for (T-1,1); if A>0, we can also know that (P1) holds for (T,A) by Lemma E.9. Suppose that A<1. Then by the continuity of γ~(T+α) (with respect to α), we know that there exists A>A such that γ~(T+α)>γ^(t0) for all α[A,A], which contradicts to the definition of A. Therefore, A=1. Using Lemma E.9 again, we can conclude that (P1), (P2), (P3), (P4) hold for t=T and α[0,1]. ∎

Now we turn to prove Lemma E.9.

Proof for Lemma E.9.

Applying (P3) on (t,α){t0,,T-1}×1, we have (t)(t0)<1. Then by Corollary E.6, we have ν(t)>0. Applying (P2) on (t,α){t0,,T-1}×1, we can get ρ(t)ρ(t0).

Fix t=T. By (P2) with α[0,A) and the continuity of γ~, we have γ~(t+A)γ^(t0). Thus, γ~(t+α)γ^(t0) for all α[0,A].

Proof for (P2).

By definition, ρ(t+α)2=𝜽(t)-αη(t)(t)22=ρ(t)2+α2η(t)2(t)22-2αη(t)𝜽(t),(t). By Corollary E.6, we have

ρ(t+α)2-ρ(t)2=α2η(t)2(t)22+2Lαη(t)ν(t).

So ρ(t+α)2-ρ(t)22Lαη(t)ν(t). For the other direction, we have the following using Corollary E.6 and Lemma E.7,

ρ(t+α)2-ρ(t)2 =2Lαη(t)ν(t)(1+αη(t)(t)222Lν(t))
2Lαη(t)ν(t)(1+Cη-1κ((t))-1μ((t))2Cηκ((t))(t)2L(t)λ((t))-1)
=2Lαη(t)ν(t)(1+λ((t))μ((t))/L).

Proof for (P3).

(P3) holds trivially for α=0 or (t)=𝟎. So now we assume that α0 and (t)𝟎. By the update rule (18) and Taylor expansion, there exists ξ(0,α) such that

(t+α) =(t)+(𝜽(t+α)-𝜽(t))(t)+12(𝜽(t+α)-𝜽(t))2(t+ξ)(𝜽(t+α)-𝜽(t))
(t)-αη(t)(t)22+12α2η(t)22(t+ξ)2(t)22
=(t)-αη(t)(1-12αη(t)2(t+ξ)2)(t)22.

By Lemma E.7, 2(t+ξ)22Cηκ((t+ξ)), so we have

(t+α)(t)-αη(t)(1-αCηη(t)κ((t+ξ)))(t)22. (21)

Now we only need to show that (t+α)<(t) for all α(0,A]. Assuming this, we can have κ((t+ξ))κ((t)) by the monotonicity of κ, and thus

(t+α) (t)-αη(t)(1-αCηη(t)κ((t)))(t)22
(t)-αη(t)(1-μ((t)))(t)22.

Now we show that (t+α)<(t) for all α(0,A]. Assume to the contrary that α0:=inf{α(0,A]:(t+α)(t)} exists. If α0=0, let p:[0,1],α(t+α), then p(0)=limα0(t+α)-(t)α0, but it contradicts to p(0)=-η(t)(t)22<0. If α0>0, then by the monotonicity of κ we have κ((t+ξ))<κ((t)) for all ξ(0,α0), and thus

(t+α)(t)-αη(t)(1-μ((t)))(t)22<(t),

which leads to a contradiction.

Proof for (P4).

We define 𝒗(t):=𝜽^(t)𝜽^(t)(-(t)) and 𝒖(t):=(𝑰-𝜽^(t)𝜽^(t))(-(t)) similarly as in the analysis for gradient flow. For 𝒗(t), we have

𝒗(t)2=1ρ(t)𝜽(t),-(t)=1ρ(t)Lν(t).

Decompose (t)22=𝒗(t)22+𝒖(t)22. Then by (P3), we have

(t+α)-(t)-αη(t)(1-μ((t)))(1ρ(t)2L2ν(t)2+𝒖22).

Multiplying 1+λ((t))μ((t))/L(1-μ((t)))ν(t) on both sides, we have

1+λ((t))μ((t))/L(1-μ((t)))ν(t)((t+α)-(t))-αη(t)(1+λ((t))μ((t))L)(L2ν(t)ρ(t)2+𝒖22ν(t))

By Corollary E.6, we can bound ν(t) by ν(t)(t)/λ((t)). By (P2), we have the inequality αη(t)ν(t)(1+λ((t))μ((t))L)ρ(t+α)2-ρ(t)22L. So we further have

1+λ((t))μ((t))/L(1-μ((t)))(t)/λ((t))((t+α)-(t))-1ρ(t)2(ρ(t+α)2-ρ(t)2)(L2+ρ(t)22Lν(t)2𝒖22)

From the definition ϕ, it is easy to see that -ϕ((t))1+λ((t))μ((t))/L(1-μ((t)))(t)/λ((t)). Let ψ(x)=-logx, then ψ(x)=-1x. Combining these together gives

ϕ((t))((t+α)-(t))+ψ(ρ(t)2)(ρ(t+α)2-ρ(t)2)(L2+ρ(t)22Lν(t)2𝒖22)0

Then by convexity of ϕ and ψ, we have

(ϕ((t+α))-ϕ((t)))+(log1ρ(t+α)2-log1ρ(t)2)(L2+ρ(t)22Lν(t)2𝒖22)0

And by definition of γ^, this can be re-written as

logγ^(t+α)-logγ^(t) =(ϕ((t+α))-ϕ((t)))+L(log1ρ(t+α)-log1ρ(t))
-(log1ρ(t+α)2-log1ρ(t)2)ρ(t)22Lν(t)2𝒖22
=ρ(t)2Lν(t)2𝒖22logρ(t+α)ρ(t).

Proof for (P1).

By (P4), logγ^(t+α)logγ^(t)logγ^(t0). Note that ϕ(x)loglog1x. So we have

γ~(t+α)>γ^(t+α)γ^(t0),

which completes the proof. ∎

For showing the third proposition in Theorem E.2, we use (P1) to give a lower bound for (t)2, and use (P3) to show the speed of loss decreasing. Then it can be seen that (t)0 and ρ(t)+. By Lemma E.1, we then have |γ¯-γ^|0.

Lemma E.10.

Let E0:=L(t0)2(log1L(t0))2-2/L. Then for all t>t0,

E((t))12L2γ^(t0)2/Lτ=t0t-1η(τ)  for E(x):=x(t0)1min{u2(log1u)2-2/L,E0}𝑑u.

Therefore, L(t)0 and ρ(t)+ as t.

Proof.

For any integer tt0, μ((t))12 and (t)2𝒗(t)2. Combining these with (P3), we have

(t+1)-(t) -12η(t)𝒗(t)22=-12η(t)L2ν(t)2ρ(t)2.

By (P1), ρ(t)-2γ^(t0)2/L(log1(t))-2/L. By Corollary E.6, ν(t)2(t)2(log1(t))2. Thus we have

(t+1)-(t)-12η(t)L2γ^(t0)2/L(t)2(log1(t))2-2/L.

It is easy to see that 1u2(log1u)2-2/L is unimodal in (0,1), so E(x) is non-decreasing and E(x) is convex. So we have

E((t+1))-E((t)) E((t))((t+1)-(t))
12η(t)L2γ^(t0)2/L(t)2(log1(t))2-2/Lmin{(t)2(log1(t))2-2/L,E0}
12η(t)L2γ^(t0)2/L,

which proves E((t))12L2γ^(t0)2/Lτ=t0t-1η(τ). Note that is non-decreasing. If does not decreases to 0, then neither does E(). But the RHS grows to +, which leads to a contradiction. So 0. To make 0, qmin must converge to +. So ρ+. ∎

E.5 Proof for Theorem E.3 and E.4

The proofs for Theorem E.3 and E.4 are similar as those for Theorem A.8 and A.9 in Appendix C.

Define β(t):=1(t)2𝜽^,-(t) as we do in Appendix C. It is easy to see that Lemma C.8 still holds if we replace γ~(t0) with γ^(t0). So we only need to show 0 and β1 for proving convergence to KKT points. 0 can be followed from Theorem E.2. Similar as the proof for Lemma C.9, it follows from Lemma E.8 and (15) that for all t2>t1t0,

t=t1t2-1(β(t)-2-1)logρ(t+1)ρ(t)1Llogγ^(t2)γ^(t1). (22)

Now we prove Theorem E.4.

Proof for Theorem E.4.

We make the following changes in the proof for Theorem A.9. First, we replace γ~(t0) with γ^(t0), since γ~(t) (tt0) is lower bounded by γ^(t0) rather than γ~(t0). Second, when choosing t1 and t2, we make logρ(t1) and logρ(t2) equal to the chosen values approximately with an additive error o(1), rather than make them equal exactly. This is possible because it can be shown from (P2) in Lemma E.8 that the following holds:

ρ(t+1)2-ρ(t)2 =O(η(t)ν(t))
=o(κ((t))-1(t)ρ(t)L)=o(ρ(t)L(log1(t))2-2/L)=o(ρ(t)-L+2).

Dividing ρ(t)2 on the leftmost and rightmost sides, we have

ρ(t+1)/ρ(t)=1+o(ρ(t)-L)=1+o(1),

which implies that logρ(t+1)-logρ(t)=o(1). Therefore, for any R, we can always find the minimum time t such that logρ(t)R, and it holds for sure that logρ(t)-R0 as R+. ∎

For proving Theorem E.3, we also need the following lemma as a variant of Lemma C.11.

Lemma E.11.

For all tt0,

𝜽^(t+1)-𝜽^(t)2(B1Lγ~(t)ρ(t+1)ρ(t)+1)logρ(t+1)ρ(t)
Proof.

According to the update rule, we have

𝜽^(t+1)-𝜽^(t)2 =1ρ(t+1)𝜽(t+1)-ρ(t+1)ρ(t)𝜽(t)2
1ρ(t+1)(𝜽(t+1)-𝜽(t)2+(ρ(t+1)ρ(t)-1)𝜽(t)2)
=η(t)ρ(t+1)(t)2+(1-ρ(t)ρ(t+1)).

By (16), (t)2B1ν(t)γ~(t)ρ(t). So we can bound the first term as

η(t)ρ(t+1)(t)2B1γ~(t)η(t)ν(t)ρ(t+1)ρ(t) B1γ~(t)ρ(t+1)2-ρ(t)22Lρ(t+1)ρ(t)
=B12Lγ~(t)ρ(t+1)2-ρ(t)2ρ(t+1)2ρ(t+1)ρ(t)
B1Lγ~(t)ρ(t+1)ρ(t)logρ(t+1)ρ(t)

where the last inequality uses the inequality a-balog(a/b). Using this inequality again, we can bound the second term by

1-ρ(t)ρ(t+1)logρ(t+1)ρ(t).

Combining these together gives 𝜽^(t+1)-𝜽^(t)2(B1Lγ~(t)ρ(t+1)ρ(t)+1)logρ(t+1)ρ(t). ∎

Now we are ready to prove Theorem E.3.

Proof for Theorem E.3.

As discussed above, we only need to show a variant of Lemma C.12 for gradient descent: for every limit point 𝜽¯ of {𝜽^(t):t0}, there exists a sequence of {tm:m} such that tm+,𝜽^(tm)𝜽¯, and β(tm)1.

We only need to change the choices of sm,sm,tm in the proof for Lemma C.12. We choose sm>tm-1 to be a time such that

𝜽^(sm)-𝜽¯2ϵmand1Llog(limt+γ^(t)γ^(sm))ϵm3.

Then we let sm>sm be the minimum time such that logρ(sm)logρ(sm)+ϵm. According to Theorem E.2, sm and sm must exist. Finally, we construct tm{sm,,sm-1} to be a time that β(tm)-2-1ϵm2, where the existence can be shown by (22).

To see that this construction meets our requirement, note that β(tm)-2-1ϵm20 and

𝜽^(tm)-𝜽¯2𝜽^(tm)-𝜽^(sm)2+𝜽^(sm)-𝜽¯2(B1Lγ^(t0)eϵm+1)ϵm+ϵm0,

where the last inequality is by Lemma E.11. ∎

E.6 Proof for Theorem E.5

Proof.

By a similar analysis as Lemma D.4, we have

E(x)=x(t0)Θ(1u2(log1u)2-2/L)𝑑u =1/(t0)1/xΘ(1(logu)2-2/L)𝑑u
=Θ(1x(log1x)2-2/L).

We can also bound the inverse function E-1(y) by Θ(1y(logy)2-2/L). With these, we can use a similar analysis as Theorem A.10 to prove Theorem E.5.

First, using a similar proof as for (17), we have ρL=Θ(log1). So we only need to show (t)=Θ(1T(logT)2-2/L). With a similar analysis as for (P3) in Lemma E.9, we have the following bound for (τ+1)-(τ):

(τ+1)-(τ)-η(τ)(1+μ((τ)))(τ)22.

Using the fact that μ1/2, we have (τ+1)-(τ)-32η(τ)(τ)22. By Lemma E.7, (τ)22Cηκ((τ))(τ). Using a similar proof as for Lemma E.10, we can show that E((t))O(T). Combining this with Lemma E.10, we have E((t))=Θ(T). Therefore, (t)=Θ(1T(logT)2-2/L). ∎

Appendix F Gradient Descent: General Loss Functions

It is worth to note that the above analysis can be extended to other loss functions. For this, we need to replace (B3) with a strong assumption (S3), which takes into account the second-order derivatives of f.

  1. (S3).

    The loss function (q) can be expressed as (q)=e-f(q) such that

    1. (S3.1).

      f: is 𝒞2-smooth.

    2. (S3.2).

      f(q)>0 for all q.

    3. (S3.3).

      There exists bf0 such that f(q)q is non-decreasing for q(bf,+), and f(q)q+ as q+.

    4. (S3.4).

      Let g:[f(bf),+)[bf,+) be the inverse function of f on the domain [bf,+). There exists p0 such that for all x>f(bf),y>bf,

      |g′′(x)g(x)|pxand|f′′(y)f(y)|py.

It can be verified that (S3) is satisfied by exponential loss and logistic loss. Now we explain each of the assumptions in (S3). (S3.2) and (S3.3) are essentially the same as (B3.2) and (B3.3). (S3.1) and (B3.1) are the same except that (S3.1) assumes f is 𝒞2-smooth rather than 𝒞1-smooth.

(S3.4) can also be written in the following form:

|ddxlogg(x)|pddxlogxand|ddylogf(y)|pddylogy. (23)

That is, logg(x) and logf(y) grow no faster than O(logx) and O(logy), respectively. In fact, (B3.4) can be deduced from (S3.4). Recall that (B3.4) ensures that Θ(g(x))=g(Θ(x)) and Θ(f(y))=f(Θ(y)). Thus, (S3.4) also gives us the interchangeability between f,g and Θ.

Lemma F.1.

(S3.4) implies (B3.4) with bg=max{2f(bf),f(2bf)} and K=2p.

Proof.

Fix x(bg,+),y(g(bg),+) and θ[1/2,1). Integrating (23) on both sides of the inequalities from θx to x and θy to y, we have

|logg(x)-logg(θx)|plog1θand|logf(y)-logf(θy)|plog1θ.

Therefore, we have g(x)θ-pg(θx)Kg(θx) and f(y)θ-pf(θy)Kf(θx). ∎

To extend our results to general loss functions, we need to redefine κ(x),λ(x),ϕ(x),Cη,H(x) in order. For κ(x) and λ(x), we redefine them as follows:

κ(x):=sup{w(log1w)2-2/Lg(log1w)2:w(0,x]}   λ(x):=g(log1x)g(log1x).

By Lemma D.1, w(log1w)2-2/Lg(log1w)2=O(w(log1w)4-2/L/g(log1w)2)0. So κ(x) is well-defined. Using λ(x), we can define ϕ(x) and γ^(𝜽) as follows.

ϕ(x) :=-sup{λ(w)w(1+2(1+λ(w)/L)μ(w)):w[x,(t0)]}
ϕ(x) :=log(g(log1x))+0x(ϕ(w)+λ(w)w)𝑑w
γ^(𝜽) :=eϕ((𝜽))ρL.

Using a similar argument as in Lemma E.1, we can show that γ^(𝜽) is well-defined and γ^(𝜽)<γ~(𝜽):=g(log1)/ρLγ¯(𝜽). When (𝜽)0, we also have γ^(𝜽)/γ¯(𝜽)1.

For Cη, we define it to be the following.

Cη:=12B1((B0γ^(t0))pB1+2p+1log1(t0)(pB1+B2))(B0γ^(t0))pmin{γ^(t0)-2+2/L,B0-2+2/L}.

Finally, the definitions for μ(x):=(log1(t0))/(2log1x) and H(x):=μ(x)/(Cηκ(x)) in (S5) remain unchanged except that κ(x) and Cη now use the new definitions.

Similar as gradient flow, we define ν(t):=n=1Ne-f(qn(t))f(qn(t))qn(t). The key idea behind the above definitions is that we can prove similar bounds for ν(t),2,22 as Corollary E.6 and Lemma E.7.

Lemma F.2.

𝜽(t),-(t)=Lν(t). If L(t)<e-f(bf), then ν(t)L(t)λ(L(t)) and λ(L(t)) has the lower bound λ(L(t))2p+1(log1L(t))-1.

Proof.

It can be easily proved by combining Theorem B.4, Lemma B.5 and Lemma D.1 together. ∎

Lemma F.3.

For any 𝛉, if L(𝛉)L(t0),γ~(𝛉)γ^(t0), then

(𝜽)222Cηκ((𝜽))(𝜽)𝑎𝑛𝑑2(𝜽)22Cηκ((𝜽)).
Proof.

Note that a direct corollary of (23) is that f(x1)f(x2)(x1/x2)p for x1x2>bf. So we have

f(qn)f(g(log1))(qng(log1))p=1g(log1)(qn/ρLγ~)pRg(log1),

where R:=(B0/γ^(t0))p. Applying (S3.4), we can also deduce that

|f′′(qn)|pqnf(qn)pRg(log1)g(log1).

Now we bound 2 and 22. By the chain rule, we have

2 =-n=1Ne-f(qn)f(qn)qn2B1RρL-1g(log1)
22 =n=1Ne-f(qn)((f(qn)2-f′′(qn))qnqn-f(qn)2qn)2
n=1Ne-f(qn)((R2g(log1)2+pRg(log1)g(log1))B12ρ2L-2+Rg(log1)B2ρL-2)
ρ2L-2g(log1)2((R+pλ())B12+g(log1)ρLB2)R.

For ρ we have γ^(t0)ρLlog1B0ρL, so we can bound ρ2L-2 by ρ2L-2M(log1)2L-2, where M:=min{γ^(t0)-2+2/L,B0-2+2/L}. By Lemma F.2, we have λ()2p+11log12p+1log1(t0), and also g(log1)ρL=λ()γ~2p+1B1log1(t0). Combining all these formulas together gives

22 B12R2M2g(log1)2(log1)2-2/L2Cηκ()
22 ((R+p2p+1log1(t0))B12+2p+1B1B2log1(t0))RMg(log1)2(log1)2-2/L2Cηκ(),

which completes the proof. ∎

With Corollary E.6 and Lemma E.7, we can prove Lemma E.8 with exactly the same argument. Then Theorem E.2, E.3, E.4 can also be proved similarly. For Theorem E.5, we can follow the argument for gradient flow to show that it holds with slightly different tight bounds:

Theorem F.4.

Under assumptions (S1), (A2), (S3), (A4), (S5), we have the following tight rates for training loss and weight norm:

(t)=Θ(g(logT)2/LT(logT)2)𝑎𝑛𝑑ρ(t)=Θ(g(logT)1/L),

where T=τ=t0t-1η(τ).

Appendix G Extension: Multi-class Classification

In this section, we generalize our results to multi-class classification with cross-entropy loss. This part of analysis is inspired by Theorem 1 in (Zhang et al., 2019), which gives a lower bound for the gradient in terms of the loss .

Since now a neural network has multiple outputs, we need to redefine our notations. Let C be the number of classes. The output of a neural network Φ is a vector 𝚽(𝜽;𝒙)C. We use Φj(𝜽;𝒙) to denote the j-th output of Φ on the input 𝒙dx. A dataset is denoted by 𝒟={𝒙n,yn}n=1N={(𝒙n,yn):n[N]}, where 𝒙ndx is a data input and yn[C] is the corresponding label. The loss function of Φ on the dataset 𝒟 is defined as

(𝜽):=