Abstract
Finishing 90-epoch ImageNet-1k training with ResNet-50 on an NVIDIA M40 GPU takes 14 days. This training requires 10^18 single-precision operations in total. On the other hand, the world's current fastest supercomputer can perform 2 * 10^17 single-precision operations per second (Dongarra et al. 2017, https://www.top500.org/lists/2017/06/). If we could make full use of the supercomputer for DNN training, we should be able to finish the 90-epoch ResNet-50 training in five seconds. However, the current bottleneck for fast DNN training is at the algorithm level. Specifically, the current batch size (e.g. 512) is too small to make efficient use of many processors. For large-scale DNN training, we focus on using large-batch data-parallel synchronous SGD without losing accuracy within a fixed number of epochs. The LARS algorithm (You, Gitman, Ginsburg, 2017, arXiv:1708.03888) enables us to scale the batch size to extremely large values (e.g. 32K). We finish the 100-epoch ImageNet training with AlexNet in 11 minutes on 1024 CPUs. Faster than Facebook's result (Goyal et al. 2017, arXiv:1706.02677), we finish the 90-epoch ImageNet training with ResNet-50 in 48 minutes on 1024 CPUs. Furthermore, when we increase the batch size to above 10K, our accuracy is much higher than Facebook's at the corresponding batch sizes. Our source code is available upon request. It will also be released in Intel Caffe.
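
The five-second figure follows from dividing the total operation count by the supercomputer's quoted rate. The sketch below reproduces that arithmetic using only the numbers stated in the abstract. The constant and function names are illustrative, and the derived M40 rate is merely what the 14-day figure implies, not a measured value.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# All inputs come from the text; names are illustrative assumptions.

TOTAL_OPS = 1e18                # single-precision ops for 90-epoch ResNet-50
SUPERCOMPUTER_OPS_PER_S = 2e17  # quoted rate of the fastest supercomputer
M40_DAYS = 14                   # quoted training time on one NVIDIA M40

SECONDS_PER_DAY = 24 * 60 * 60


def ideal_seconds(total_ops: float, ops_per_second: float) -> float:
    """Training time if every operation ran at the quoted rate."""
    return total_ops / ops_per_second


def implied_rate(total_ops: float, days: float) -> float:
    """Sustained ops/s implied by finishing total_ops in the given days."""
    return total_ops / (days * SECONDS_PER_DAY)


ideal = ideal_seconds(TOTAL_OPS, SUPERCOMPUTER_OPS_PER_S)
m40_rate = implied_rate(TOTAL_OPS, M40_DAYS)
ideal_speedup = (M40_DAYS * SECONDS_PER_DAY) / ideal

print(f"ideal supercomputer time: {ideal:.1f} s")
print(f"implied M40 sustained rate: {m40_rate:.3e} ops/s")
print(f"ideal speedup over one M40: {ideal_speedup:,.0f}x")
```

This is an upper bound that assumes perfect utilization. The reported 48-minute ResNet-50 run on 1024 CPUs is on different hardware, so it is not directly comparable to this ideal.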