Abstract
Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.
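
To make the idea behind Adam-atan2 concrete, the following is a minimal scalar sketch, not the paper's reference implementation. It assumes that the usual Adam ratio m_hat / (sqrt(v_hat) + epsilon) is replaced by atan2(m_hat, sqrt(v_hat)); the function name `adam_atan2_step`, its arguments, and the default values are hypothetical, and any additional scaling constants the full method may use are omitted.

```python
import math


def adam_atan2_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999):
    """One scalar optimizer step (illustrative sketch; names are hypothetical).

    Returns (new_param, new_m, new_v). `step` counts from 1.
    """
    # Standard Adam exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    # Standard Adam bias correction.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)

    # Assumption: atan2 takes the place of m_hat / (sqrt(v_hat) + eps).
    # atan2(c*y, c*x) == atan2(y, x) for c > 0, so the update does not depend
    # on the gradient scale, and atan2(0, 0) is 0, so no epsilon is needed.
    update = math.atan2(m_hat, math.sqrt(v_hat))

    return param - lr * update, m, v
```

In this sketch, a gradient of 1.0 and a gradient of 1e-30 produce essentially the same first step, which illustrates the scale invariance described above, and a zero gradient leaves the parameter unchanged without any division by zero.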