Why Does Stagewise Training Accelerate Convergence of Testing Error Over SGD?

Abstract

Stagewise training strategy is commonly used for learning neural networks,which uses a stochastic algorithm (e.g., SGD) starting with a relatively largestep size (aka learning rate) and geometrically decreasing the step size aftera number of iterations. It has been observed that the stagewise SGD has muchfaster convergence than the vanilla SGD with a continuously decreasing stepsize in terms of both training error and testing error. {\it But how to explainthis phenomenon has been largely ignored by existing studies.} This paperprovides theoretical evidence for explaining this faster convergence. Inparticular, we consider the stagewise training strategy for minimizingempirical risk that satisfies the Polyak-{\L}ojasiewicz condition, which hasbeen observed/proved for neural networks and also holds for a broad family ofconvex functions. For convex loss functions and "nice-behaviored" non-convexloss functions that are close to a convex function (namely weakly convexfunctions), we establish that the faster convergence of stagewise training thanthe vanilla SGD under the same condition on both training error and testingerror lies on better dependence on the condition number of the problem. Indeed,the proposed algorithm has additional favorable features that come withtheoretical guarantee for the considered non-convex optimization problems,including using explicit algorithmic regularization at each stage, usingstagewise averaged solution for restarting, and returning the last stagewiseaveraged solution as the final solution. To differentiate from commonly usedstagewise SGD, we refer to our algorithm as stagewise regularized trainingalgorithm or {\start}. Of independent interest, the proved testing error boundof {\start} for a family of non-convex loss functions is dimensionality andnorm independent.

Quick Read (beta)

loading the full paper ...