u-$\mu$P: The Unit-Scaled Maximal Update Parametrization

Abstract

The Maximal Update Parametrization ($\mu$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$\mu$P, which improves upon $\mu$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: $\mu$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$\mu$P models reaching a lower loss than comparable $\mu$P models and working out-of-the-box in FP8.
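As a rough illustration of the "scale of one" property, the sketch below draws unit-variance inputs and weights and divides a matrix product by $\sqrt{\text{fan-in}}$, so that the output standard deviation stays near one as the width grows. This is not code from the paper: the function names `unit_scaled_linear` and `output_std`, the chosen widths, and the use of Gaussian initialization are all assumptions made for illustration, and only the forward pass is shown (the full scheme also has to handle gradient scales).

```python
import math
import random
import statistics


def unit_scaled_linear(x, w):
    """Hypothetical helper: compute (x @ w) / sqrt(fan_in).

    x is a batch x fan_in list of lists, w is fan_in x fan_out.
    """
    fan_in = len(w)
    scale = 1.0 / math.sqrt(fan_in)
    columns = list(zip(*w))  # transpose w so each column is easy to iterate
    return [
        [scale * sum(a * b for a, b in zip(row, col)) for col in columns]
        for row in x
    ]


def output_std(fan_in, batch=16, fan_out=16, seed=0):
    """Standard deviation of the outputs for unit-variance inputs and weights."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(batch)]
    w = [[rng.gauss(0.0, 1.0) for _ in range(fan_out)] for _ in range(fan_in)]
    y = unit_scaled_linear(x, w)
    return statistics.pstdev([value for row in y for value in row])


# Assumed widths, chosen only to show that the output scale does not drift.
stds = {fan_in: output_std(fan_in) for fan_in in (64, 256, 1024)}
for fan_in, std in stds.items():
    print(f"fan_in={fan_in:5d}  output std={std:.3f}")
```

Without the $1/\sqrt{\text{fan-in}}$ factor the output standard deviation would grow roughly as $\sqrt{\text{fan-in}}$, which is the kind of width-dependent scale that makes low-precision formats harder to use.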
