Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

Abstract

The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well established, yet their effective deployment requires careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: the optimal learning rate follows a power-law relationship with both model parameters and data size, while the optimal batch size scales primarily with data size. Our analysis reveals a convex optimization landscape for hyperparameters under fixed model and data size conditions. This convexity implies an optimal hyperparameter plateau. We contribute a universal, plug-and-play optimal hyperparameter tool for the community. Its estimated values on the test set are merely 0.09% away from the globally optimal LLM performance found via an exhaustive search. These laws demonstrate remarkable robustness across variations in model sparsity, training data distribution, and model shape. To the best of our knowledge, this is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, and establishes optimal hyperparameter scaling laws across diverse data distributions. This exhaustive optimization process demands substantial computational resources, utilizing nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch and consuming approximately 100 trillion tokens in total. To facilitate reproducibility and further research, we will progressively release all loss measurements and model checkpoints through our designated repository https://step-law.github.io/
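
The power-law form described in the abstract can be illustrated with a minimal sketch. It assumes the optimal learning rate takes the form lr = c * N^alpha * D^beta, where N is the parameter count and D is the number of training tokens, and fits the three constants by least squares in log space. The function names, the synthetic data, and the constants below are all hypothetical and chosen only for illustration; they are not the coefficients reported by the authors, and the abstract does not state that this is how the fit was performed.

```python
import numpy as np

def fit_power_law(n_params, n_tokens, best_lr):
    """Fit best_lr ~ c * N^alpha * D^beta by least squares in log space.

    Hypothetical helper: the log of a power law is linear in log N and log D,
    so an ordinary linear regression recovers (log c, alpha, beta).
    """
    n_params = np.asarray(n_params, dtype=float)
    n_tokens = np.asarray(n_tokens, dtype=float)
    best_lr = np.asarray(best_lr, dtype=float)
    design = np.column_stack(
        [np.ones_like(n_params), np.log(n_params), np.log(n_tokens)]
    )
    coef, *_ = np.linalg.lstsq(design, np.log(best_lr), rcond=None)
    log_c, alpha, beta = coef
    return float(np.exp(log_c)), float(alpha), float(beta)

def predict_lr(c, alpha, beta, n_params, n_tokens):
    """Evaluate the fitted power law at a new (N, D) point."""
    return c * n_params**alpha * n_tokens**beta

# Synthetic "grid search winners": made-up constants, NOT values from the paper.
TRUE_C, TRUE_ALPHA, TRUE_BETA = 1.0, -0.5, 0.25
N = np.array([6e7, 1.2e8, 2.5e8, 5e8, 1e9, 6e7, 1.2e8, 2.5e8, 5e8, 1e9])
D = np.array([2e9, 2e9, 8e9, 8e9, 2e10, 2e10, 5e10, 5e10, 1e11, 1e11])
lr = TRUE_C * N**TRUE_ALPHA * D**TRUE_BETA

c, alpha, beta = fit_power_law(N, D, lr)
print(f"c={c:.4g}, alpha={alpha:.4g}, beta={beta:.4g}")
```

In practice the inputs would be the best learning rate found by grid search at each (N, D) configuration, and the same approach would apply to batch size with D as the main regressor, which is the dependence the abstract describes.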
