Abstract
The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT) and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark. Finally, we release a benchmark dataset comprising 90 evaluation tasks to facilitate research in this domain.