Abstract
The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across depths. As a remedy, we study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit and show convergence of finite-size network dynamics towards this limit.
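To illustrate the role of the $1/\sqrt{\text{depth}}$ branch scale, the following is a minimal sketch under simplifying assumptions and is not the architecture used in the experiments: a toy stack of linear residual blocks $h \leftarrow h + s\,Wh$ with fresh Gaussian weights of variance $1/\text{width}$ at every layer. The function names (`residual_forward`, `growth`), the linear branch, the all-ones input, and the default width are hypothetical choices made for illustration, and the $\mu$P width scaling of learning rates and output layers is not modeled. In this toy setting the expected squared activation norm grows by a factor $(1 + s^2)$ per block, so $s = 1/\sqrt{\text{depth}}$ gives a total factor $(1 + 1/\text{depth})^{\text{depth}} \le e$, whereas $s = 1$ gives $2^{\text{depth}}$.

```python
import math
import random


def residual_forward(x, depth, branch_scale, seed=0):
    """Run a toy linear residual stack: h <- h + branch_scale * (W @ h)."""
    rng = random.Random(seed)
    width = len(x)
    h = list(x)
    for _ in range(depth):
        # Fresh Gaussian weights per layer with variance 1/width
        # (an illustrative initialization, assumed for this sketch).
        branch = [
            sum(rng.gauss(0.0, 1.0 / math.sqrt(width)) * h_j for h_j in h)
            for _ in range(width)
        ]
        h = [h_i + branch_scale * b_i for h_i, b_i in zip(h, branch)]
    return h


def mean_square(v):
    """Average squared entry of a vector."""
    return sum(a * a for a in v) / len(v)


def growth(depth, branch_scale, width=64, seed=0):
    """Ratio of output to input mean-square activation for an all-ones input."""
    x = [1.0] * width
    out = residual_forward(x, depth, branch_scale, seed)
    return mean_square(out) / mean_square(x)


if __name__ == "__main__":
    for depth in (4, 16, 64):
        scaled = growth(depth, 1.0 / math.sqrt(depth))
        unscaled = growth(depth, 1.0)
        print(f"depth={depth:3d}  1/sqrt(depth) scale: {scaled:8.3f}  "
              f"unit scale: {unscaled:.3e}")
```

Running the script prints the activation growth at several depths. With the $1/\sqrt{\text{depth}}$ scale the ratio should stay of order one as depth increases, while with a unit branch scale it should grow roughly exponentially in depth. This only concerns forward-pass stability at initialization; it does not by itself demonstrate hyperparameter transfer.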