Minnorm training: an algorithm for training over-parameterized deep neural networks

Abstract

In this work, we propose a new training method for finding minimum weight norm solutions in over-parameterized neural networks (NNs). This method seeks to improve training speed and generalization performance by framing NN training as a constrained optimization problem wherein the sum of the norms of the weights in each layer of the network is minimized, under the constraint of exactly fitting training data. It draws inspiration from support vector machines (SVMs), which are able to generalize well despite often having an infinite number of free parameters in their primal form, and from recent theoretical generalization bounds on NNs which suggest that lower-norm solutions generalize better. To solve this constrained optimization problem, our method employs Lagrange multipliers that act as integrators of error over training and identify `support vector'-like examples. The method can be implemented as a wrapper around gradient-based methods and uses standard back-propagation of gradients from the NN for both regression and classification versions of the algorithm. We provide theoretical justifications for the effectiveness of this algorithm in comparison to early stopping and $L_2$-regularization using simple, analytically tractable settings. In particular, we show faster convergence to the max-margin hyperplane in a shallow network (compared to vanilla gradient descent); faster convergence to the minimum-norm solution in a linear chain (compared to $L_2$-regularization); and initialization-independent generalization performance in a deep linear network. Finally, using the MNIST dataset, we demonstrate that this algorithm can boost test accuracy and identify difficult examples in real-world datasets.
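
The idea of Lagrange multipliers acting as integrators of error can be illustrated on a toy problem. The sketch below is not the paper's algorithm for deep networks: it assumes a single linear layer, the squared weight norm as the objective, and plain simultaneous gradient descent-ascent on the Lagrangian $L(w, \lambda) = \frac{1}{2}\|w\|^2 + \lambda^\top (Xw - y)$. The function name `minnorm_linear_fit`, the step size, the step count, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def minnorm_linear_fit(X, y, lr=0.1, steps=10000):
    """Toy gradient descent-ascent on L(w, lam) = 0.5*||w||^2 + lam.(Xw - y).

    X is a list of n rows with d features each, y a list of n targets.
    Assumes lr is small relative to the squared largest singular value of X.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d      # primal variables: the weights
    lam = [0.0] * n    # dual variables: one multiplier per training example
    for _ in range(steps):
        # Signed fitting error (constraint violation) on each example.
        residual = [sum(X[i][j] * w[j] for j in range(d)) - y[i]
                    for i in range(n)]
        # Gradient of the Lagrangian with respect to the weights.
        grad_w = [w[j] + sum(lam[i] * X[i][j] for i in range(n))
                  for j in range(d)]
        # Descend on the weights, ascend on the multipliers. Each multiplier
        # is a running sum of its example's error: an integrator of error.
        w = [w[j] - lr * grad_w[j] for j in range(d)]
        lam = [lam[i] + lr * residual[i] for i in range(n)]
    return w, lam


# Synthetic over-parameterized problem (assumed data): 5 examples, 20 weights.
rng = random.Random(0)
n, d = 5, 20
X = [[rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)] for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]

w, lam = minnorm_linear_fit(X, y)

# Largest remaining fitting error; the constraint asks for an exact fit.
max_residual = max(abs(sum(X[i][j] * w[j] for j in range(d)) - y[i])
                   for i in range(n))
# Stationarity check: at the minimum-norm solution, w = -X^T lam.
max_stationarity = max(abs(w[j] + sum(lam[i] * X[i][j] for i in range(n)))
                       for j in range(d))
# Examples ranked by multiplier magnitude, largest first.
ranking = sorted(range(n), key=lambda i: -abs(lam[i]))

print("max |Xw - y|      :", max_residual)
print("max |w + X^T lam| :", max_stationarity)
print("examples by |lam| :", ranking)
```

At a fixed point of these updates the weights fit the data exactly and lie in the span of the training inputs, which together characterize the minimum-norm interpolating solution for this linear toy. The ranking by $|\lambda_i|$ is only a loose analogue of the `support vector'-like examples mentioned above, since with equality constraints every multiplier is generally nonzero.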
