Measuring the Effects of Data Parallelism on Neural Network Training

Abstract

Recent hardware developments have made unprecedented amounts of data parallelism available for accelerating neural network training. Among the simplest ways to harness next-generation accelerators is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured in the number of steps necessary to reach a goal out-of-sample error. Eventually, increasing the batch size will no longer reduce the number of training steps required, but the exact relationship between the batch size and how many training steps are necessary is of critical importance to practitioners, researchers, and hardware designers alike. We study how this relationship varies with the training algorithm, model, and dataset and find extremely large variation between workloads. Along the way, we reconcile disagreements in the literature on whether batch size affects model quality. Finally, we discuss the implications of our results for efforts to train neural networks much faster in the future.
