SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures

Abstract

We study gradient flows for loss landscapes of fully connected feed forward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus or GELU function. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove the existence of a threshold $\varepsilon>0$ such that the loss value of any gradient flow initialized at most $\varepsilon$ above the optimal level converges to it. For polynomial target functions and sufficiently big architecture and data set, we prove that the optimal loss value is zero and can only be realized asymptotically. From this setting, we deduce our main result that any gradient flow with sufficiently good initialization diverges to infinity. Our proof heavily relies on the geometry of o-minimal structures. We confirm these theoretical findings with numerical experiments and extend our investigation to real-world scenarios, where we observe an analogous behavior.
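
The phenomenon of an optimal loss value that is only realized asymptotically can be illustrated with a minimal sketch. The example below is not the setting of the paper: it is a hypothetical one-parameter toy, assuming a single logistic unit $\sigma(w)$ fitted to the constant target $1$ with squared error, and it replaces the continuous gradient flow by an explicit Euler discretization. All names (`loss`, `grad`, `euler_gradient_flow`) and the step size are illustrative choices. Since $\sigma(w) < 1$ for every finite $w$, the infimum $0$ of the loss is never attained, and the iterates drift towards $+\infty$ while the loss decays to $0$.

```python
import math


def sigmoid(w):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-w))


def loss(w):
    """Squared error of a single logistic unit against the target 1 (toy assumption)."""
    return (sigmoid(w) - 1.0) ** 2


def grad(w):
    """Derivative of loss(w); equals -2 * s * (1 - s)^2 with s = sigmoid(w), so it is always negative."""
    s = sigmoid(w)
    return 2.0 * (s - 1.0) * s * (1.0 - s)


def euler_gradient_flow(w0, step=0.5, n_steps=20000):
    """Explicit Euler discretization of the gradient flow w'(t) = -loss'(w(t))."""
    w = w0
    trajectory = [(w, loss(w))]
    for _ in range(n_steps):
        w -= step * grad(w)
        trajectory.append((w, loss(w)))
    return trajectory


trajectory = euler_gradient_flow(w0=0.0)
w_final, loss_final = trajectory[-1]
print(f"final parameter: {w_final:.3f}, final loss: {loss_final:.2e}")
```

In this toy, the parameter grows without bound but only slowly, and the loss decreases monotonically towards its infimum without reaching it, which mirrors on a small scale the divergence with vanishing loss described above.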
