Abstract
Preserving training dynamics across batch sizes is an important tool for practical machine learning, as it enables a trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule; for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important tool for practical machine learning is the model Exponential Moving Average (EMA), a copy of the model that does not receive gradient information but instead follows its target model with some momentum. This model EMA can improve the robustness and generalization properties of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior work has treated the model EMA separately from optimization, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of model EMAs and demonstrate its validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, which in the best case yields a 6$\times$ reduction in wall-clock time.
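To make the two ingredients concrete, the following is a minimal sketch of the linear learning-rate scaling rule for SGD and of a model EMA update. All function and variable names are hypothetical and chosen for illustration. The abstract does not spell out the EMA scaling rule itself; the sketch assumes one natural candidate, raising the momentum to the power of the batch-size ratio $\kappa$, so that one large-batch EMA step forgets as much of the old EMA weights as $\kappa$ small-batch steps would.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule for SGD: multiply the learning rate by kappa."""
    kappa = new_batch_size / base_batch_size
    return base_lr * kappa


def ema_update(ema_params, model_params, momentum):
    """One EMA step: the EMA copy follows the target model with some momentum.

    No gradients are involved; the EMA is a weighted average of its previous
    value and the current model parameters.
    """
    return [momentum * e + (1.0 - momentum) * p
            for e, p in zip(ema_params, model_params)]


def scale_ema_momentum(momentum, base_batch_size, new_batch_size):
    """ASSUMPTION (not stated in the abstract): scale momentum as momentum**kappa."""
    kappa = new_batch_size / base_batch_size
    return momentum ** kappa


# Hypothetical usage: move from batch size 256 to 1024 (kappa = 4).
new_lr = scale_learning_rate(base_lr=0.1, base_batch_size=256, new_batch_size=1024)
new_momentum = scale_ema_momentum(momentum=0.99, base_batch_size=256, new_batch_size=1024)
```

With a fixed target model, applying `ema_update` twice with momentum 0.9 gives the same result as applying it once with momentum 0.81, which is the intuition behind the assumed exponential form.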