Abstract
Despite the recent observation that large language models (LLMs) can store substantial factual knowledge, there is a limited understanding of the mechanisms of how they acquire factual knowledge through pretraining. This work addresses this gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining. First, counterintuitively, we observe that pretraining on more data shows no significant improvement in the model's capability to acquire and maintain factual knowledge. Next, there is a power-law relationship between training steps and forgetting of memorization and generalization of factual knowledge, and LLMs trained with duplicated training data exhibit faster forgetting. Third, training LLMs with larger batch sizes can enhance the models' robustness to forgetting. Overall, our observations suggest that factual knowledge acquisition in LLM pretraining occurs by progressively increasing the probability of factual knowledge presented in the pretraining data at each step. However, this increase is diluted by subsequent forgetting. Based on this interpretation, we demonstrate that we can provide plausible explanations for recently observed behaviors of LLMs, such as the poor performance of LLMs on long-tail knowledge and the benefits of deduplicating the pretraining corpus.