Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Abstract

In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset.
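
To make the setting concrete, the following is a minimal sketch of what a small algorithmically generated dataset and a training-fraction split might look like. The abstract does not specify the task, so the choice of modular addition, the modulus `p = 97`, and the function names `make_modular_addition_dataset` and `split_dataset` are illustrative assumptions rather than the authors' setup.

```python
import random


def make_modular_addition_dataset(p=97):
    """Enumerate every equation a + b = c (mod p) as an (a, b, c) triple.

    Assumption: modular addition stands in for an "algorithmic" task;
    the abstract itself does not name a specific operation.
    """
    return [(a, b, (a + b) % p) for a in range(p) for b in range(p)]


def split_dataset(examples, train_fraction, seed=0):
    """Shuffle the full table and split it into train and validation sets.

    A smaller train_fraction gives the network fewer equations to learn
    from, which is the knob for studying dataset size.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


examples = make_modular_addition_dataset(p=97)
train, val = split_dataset(examples, train_fraction=0.5)
```

Because the full table of equations is finite and known, validation accuracy on the held-out triples measures generalization directly, and varying `train_fraction` gives one way to probe how dataset size affects it.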
