Learning Dynamics in Continual Pre-Training for Large Language Models

Abstract

Continual Pre-Training (CPT) has become a popular and effective method for applying strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We observe that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and can be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training step and across learning rate schedules (LRS) in CPT. Our formulation presents a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, replay ratio, etc. Moreover, our approach can be adapted to customize training hyper-parameters to different CPT goals such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.
