Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

Abstract

We show that any smooth bi-Lipschitz $h$ can be represented exactly as a composition $h_m \circ \cdots \circ h_1$ of functions $h_1,\ldots,h_m$ that are close to the identity in the sense that each $\left(h_i-\mathrm{Id}\right)$ is Lipschitz, and the Lipschitz constant decreases inversely with the number $m$ of functions composed. This implies that $h$ can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant. Next, we consider nonlinear regression with a composition of near-identity nonlinear maps. We show that, regarding Fr\'echet derivatives with respect to the $h_1,\ldots,h_m$, any critical point of a quadratic criterion in this near-identity region must be a global minimizer. In contrast, if we consider derivatives with respect to parameters of a fixed-size residual network with sigmoid activation functions, we show that there are near-identity critical points that are suboptimal, even in the realizable case. Informally, this means that functional gradient methods for residual networks cannot get stuck at suboptimal critical points corresponding to near-identity layers, whereas parametric gradient methods for sigmoidal residual networks suffer from suboptimal critical points in the near-identity region.
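The inverse dependence on $m$ can be checked by hand in a one-dimensional toy case. The sketch below is not the paper's construction. It assumes the simplest possible bi-Lipschitz map, $h(x) = a x$ with $a > 0$, and splits it into $m$ identical factors $h_i(x) = a^{1/m} x$. The function names are hypothetical and chosen only for this illustration.

```python
import math


def near_identity_factors(a: float, m: int) -> list[float]:
    """Slopes of m identical linear factors whose composition is x -> a * x.

    Toy assumption: h is linear with slope a > 0, so each factor is
    h_i(x) = a**(1/m) * x.
    """
    if a <= 0 or m < 1:
        raise ValueError("need a > 0 and m >= 1")
    return [a ** (1.0 / m)] * m


def compose(slopes: list[float], x: float) -> float:
    """Apply h_m o ... o h_1 to x, where h_i(x) = slopes[i] * x."""
    for s in slopes:
        x = s * x
    return x


def residual_lipschitz(slope: float) -> float:
    """Lipschitz constant of h_i - Id, i.e. of x -> (slope - 1) * x."""
    return abs(slope - 1.0)


if __name__ == "__main__":
    a = 3.0
    for m in (1, 10, 100, 1000):
        slopes = near_identity_factors(a, m)
        eps = residual_lipschitz(slopes[0])
        # m * eps approaches log(a) as m grows, so eps shrinks like 1/m.
        print(m, compose(slopes, 1.0), eps, m * eps)
```

In this toy case the Lipschitz constant of each $h_i - \mathrm{Id}$ is $a^{1/m} - 1 \approx \ln(a)/m$, which matches the $O(1/m)$ scaling stated in the abstract. The general nonlinear result requires the argument given in the paper.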
