Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking

Abstract

Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy. This paper studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases. Specifically, when training homogeneous neural nets with large initialization and small weight decay on both classification and regression tasks, we prove that the training process gets trapped at a solution corresponding to a kernel predictor for a long time, and then a very sharp transition to min-norm/max-margin predictors occurs, leading to a dramatic change in test accuracy.
