Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization

Abstract

Despite deep neural networks' powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
