Gradient Starvation: A Learning Proclivity in Neural Networks

Abstract

We identify and formalize a fundamental gradient descent phenomenon resulting in a learning proclivity in over-parameterized neural networks. Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task, despite the presence of other predictive features that fail to be discovered. This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks. Using tools from Dynamical Systems theory, we identify simple properties of learning dynamics during gradient descent that lead to this imbalance, and prove that such a situation can be expected given certain statistical structure in training data. Based on our proposed formalism, we develop guarantees for a novel regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by gradient starvation. We illustrate our findings with simple and real-world out-of-distribution (OOD) generalization experiments.
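
The feature imbalance described above can be illustrated with a toy sketch. It is not one of the paper's experiments: it assumes a two-feature linear model trained by gradient descent on the logistic (binary cross-entropy) loss, where both features predict the label perfectly but differ in scale. The function name `train`, the scales, the step count, and the learning rate are all hypothetical choices made for illustration.

```python
import math

def train(scales=(1.0, 0.1), steps=500, lr=0.5):
    """Gradient descent on the logistic loss for a 2-feature linear model."""
    labels = [1, -1, 1, -1]
    # Each feature equals the label times a fixed scale, so both features
    # are perfectly predictive; they differ only in strength.
    data = [([y * s for s in scales], y) for y in labels]
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # d/dw log(1 + exp(-margin)) = -y * x * sigmoid(-margin)
            coeff = -y / (1.0 + math.exp(margin))
            for j in range(2):
                grad[j] += coeff * x[j] / len(data)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

scales = (1.0, 0.1)
w = train(scales)
# Contribution of each feature to the logit on a positive example.
logit_share = [wi * s for wi, s in zip(w, scales)]
print("weights:", w)
print("logit contributions:", logit_share)
```

In this sketch both weights receive gradients scaled by the same shrinking factor, so once the stronger feature has driven the margin up, the weaker feature's weight grows slowly as well and it contributes only a small fraction of the logit, even though it is equally predictive.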
