Critical Data Size of Language Models from a Grokking Perspective

Abstract

We explore the critical data size in language models, a threshold that marks a fundamental shift from quick memorization to slow generalization. We formalize the phase transition under the grokking configuration into the Data Efficiency Hypothesis and identify data insufficiency, sufficiency, and surplus regimes in language model training dynamics. We develop a grokking configuration that stably reproduces grokking on simplistic language models by rescaling initialization and weight decay. We show that generalization occurs only when language models reach a critical size. We analyze grokking both sample-wise and model-wise, verifying the proposed Data Efficiency Hypothesis. Our experiments reveal smoother phase transitions occurring at the critical dataset size for language datasets. As the model size increases, this critical point also becomes larger, indicating that larger models require more data. Our results deepen the understanding of language model training, offering a novel perspective on the role of data in the learning mechanism of language models.
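
The two knobs named in the abstract, rescaled initialization and weight decay, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the scaling factor `alpha`, the uniform base initializer, and all numeric values are assumptions chosen only for illustration.

```python
import random


def init_weights(n_in, n_out, alpha=1.0, seed=0):
    """Draw a weight matrix from a uniform 1/sqrt(n_in) initializer,
    then multiply every entry by alpha (hypothetical rescaling factor)."""
    rng = random.Random(seed)
    bound = 1.0 / (n_in ** 0.5)
    return [[alpha * rng.uniform(-bound, bound) for _ in range(n_in)]
            for _ in range(n_out)]


def sgd_step(weights, grads, lr=0.1, weight_decay=0.01):
    """One plain SGD step with L2 weight decay:
    w <- w - lr * (g + weight_decay * w)."""
    return [[w - lr * (g + weight_decay * w) for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grads)]


# Same seed, different alpha: the rescaled matrix is an exact multiple
# of the base matrix, so only the weight norm changes.
base = init_weights(4, 2, alpha=1.0)
scaled = init_weights(4, 2, alpha=10.0)

# With zero gradients, weight decay alone shrinks each weight
# by a factor of (1 - lr * weight_decay) per step.
zero_grads = [[0.0] * 4 for _ in range(2)]
decayed = sgd_step(scaled, zero_grads, lr=0.1, weight_decay=0.01)
```

In this sketch a larger `alpha` starts training from a larger weight norm, and weight decay then pulls that norm back down over the course of training; which values of either setting produce grokking is not stated in the abstract and is left open here.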
