Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra

Abstract

Neural scaling laws describe how the performance of deep neural networks scales with key factors such as training data size, model complexity, and training time, often following power-law behavior over multiple orders of magnitude. Although these laws are well documented empirically, their theoretical understanding remains limited. In this work, we employ techniques from statistical mechanics to analyze one-pass stochastic gradient descent within a student-teacher framework, where both the student and the teacher are two-layer neural networks. Our study focuses primarily on the generalization error and its response to data covariance matrices that exhibit power-law spectra. For linear activation functions, we derive analytical expressions for the generalization error, explore different learning regimes, and identify the conditions under which power-law scaling emerges. We then extend the analysis to non-linear activation functions in the feature learning regime and investigate how power-law spectra in the data covariance matrix affect the learning dynamics. Importantly, we find that the length of the symmetric plateau depends on the number of distinct eigenvalues of the data covariance matrix and on the number of hidden units, and we show how these plateaus behave under various configurations. In addition, our results reveal a transition from exponential to power-law convergence in the specialized phase when the data covariance matrix possesses a power-law spectrum. This work contributes to the theoretical understanding of neural scaling laws and provides insights into optimizing learning performance in practical scenarios involving complex data structures.
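As a purely illustrative companion to the linear-activation setting described above, the following minimal sketch runs one-pass SGD for a linear student against a linear teacher on Gaussian inputs whose covariance has a power-law spectrum. It is not the paper's derivation or code. Everything specific here is an assumption made for the example: the eigenvalue convention lambda_k proportional to k^-(1+beta), the normalization of the spectrum to trace N, the 1/sqrt(N) output scaling, the all-ones teacher, the learning rate, and all function names. The two-layer linear networks are collapsed to single effective weight vectors for simplicity.

```python
import math
import random


def power_law_spectrum(n, beta):
    """Assumed convention: lambda_k proportional to k^-(1+beta), scaled so the trace equals n."""
    raw = [k ** -(1.0 + beta) for k in range(1, n + 1)]
    scale = n / sum(raw)
    return [scale * r for r in raw]


def generalization_error(student, teacher, spectrum):
    """Exact error 0.5 * E[(y_teacher - y_student)^2] for x ~ N(0, diag(spectrum)),
    with outputs y = w . x / sqrt(n)."""
    n = len(spectrum)
    return 0.5 * sum(
        lam * (j - b) ** 2 for lam, j, b in zip(spectrum, student, teacher)
    ) / n


def run_one_pass_sgd(n=50, beta=1.0, eta=0.1, steps=2000, seed=0):
    """One-pass (online) SGD: every step draws a fresh input, so no sample is reused."""
    rng = random.Random(seed)
    spectrum = power_law_spectrum(n, beta)
    std = [math.sqrt(lam) for lam in spectrum]
    teacher = [1.0] * n   # hypothetical teacher weights, chosen for simplicity
    student = [0.0] * n   # student starts at zero
    errors = [generalization_error(student, teacher, spectrum)]
    for _ in range(steps):
        # Fresh input in the eigenbasis of the covariance matrix.
        x = [s * rng.gauss(0.0, 1.0) for s in std]
        # Output mismatch between teacher and student on this input.
        delta = sum(
            (b - j) * xi for b, j, xi in zip(teacher, student, x)
        ) / math.sqrt(n)
        # Gradient step on the squared error 0.5 * delta^2.
        step = eta * delta / math.sqrt(n)
        student = [j + step * xi for j, xi in zip(student, x)]
        errors.append(generalization_error(student, teacher, spectrum))
    return spectrum, errors


if __name__ == "__main__":
    spectrum, errors = run_one_pass_sgd()
    for t in (0, 10, 100, 1000, 2000):
        print(t, errors[t])
```

Plotting the recorded errors on log-log axes for different values of `beta` is one way to look for the slow, non-exponential decay that a power-law spectrum is expected to produce. The toy parameters used here are far too small to reproduce the quantitative results of the paper.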
