Abstract
Large language models (LLMs) show best-in-class performance across a wide range of natural language processing applications. Training these models is an extremely computationally expensive task; frontier Artificial Intelligence (AI) research companies are investing billions of dollars into supercomputing infrastructure to train progressively larger models on increasingly massive datasets. Unfortunately, information about the scaling performance and training considerations of these large training pipelines is scarce in public literature. Working with large-scale datasets and models can be complex, and practical recommendations for tuning training performance when scaling up large language models are similarly hard to find. In this paper, we aim to help demystify the large language model pretraining pipeline, in particular with respect to distributed training, managing large datasets across hundreds of nodes, and scaling up data parallelism with an emphasis on fully leveraging available GPU compute capacity.