Abstract
Large language models (LLMs) are trained on enormous document collections that contain extensive world knowledge. However, how knowledge is acquired through autoregressive pre-training is still not well understood. This lack of understanding greatly hinders effective knowledge learning, especially for continued pre-training on up-to-date information, because such evolving information, unlike foundational knowledge, often lacks diverse repetitions. In this paper, we focus on understanding and improving LLM knowledge learning. We find and verify that knowledge learning for LLMs can be viewed as an implicit supervised task hidden in the autoregressive pre-training objective. Our findings suggest that knowledge learning for LLMs would benefit from methods designed to improve generalization in supervised tasks. Based on our analysis, we propose formatting-based data augmentation to increase the number of in-distribution samples; unlike text paraphrasing, it does not risk altering the facts embedded in documents. We also introduce sharpness-aware minimization as an effective optimization algorithm to further improve generalization. Moreover, our analysis and methods can be readily extended to instruction tuning. Extensive experimental results validate our findings and demonstrate our methods' effectiveness in both continued pre-training and instruction tuning. This paper offers new perspectives and insights for interpreting and designing effective strategies for LLM knowledge learning.