Synchronized stochastic gradient descent (SGD) optimizers with data parallelism are widely used in training large-scale deep neural networks. Although larger mini-batch sizes can improve system scalability by reducing the communication-to-computation ratio, they may hurt the generalization ability of the models. To this end, we build a highly scalable deep learning training system for dense GPU clusters with three main contributions: (1) We propose a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy. (2) We propose an optimization approach for extremely large mini-batch sizes (up to 64k) that can train CNN models on the ImageNet dataset without losing accuracy. (3) We propose highly optimized all-reduce algorithms that achieve speedups of up to 3x on AlexNet and 11x on ResNet-50 over NCCL-based training on a cluster with 1024 Tesla P40 GPUs. When training ResNet-50 for 90 epochs, the state-of-the-art GPU-based system with 1024 Tesla P100 GPUs took 15 minutes and achieved 74.9\% top-1 test accuracy, and another KNL-based system with 2048 Intel KNLs took 20 minutes and achieved 75.4\% accuracy. Our training system achieves 75.8\% top-1 test accuracy in only 6.6 minutes using 2048 Tesla P40 GPUs. When training AlexNet for 95 epochs, our system achieves 58.7\% top-1 test accuracy within 4 minutes, which also outperforms all other existing systems.
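
The effect of mini-batch size on the communication-to-computation ratio can be illustrated with a rough back-of-the-envelope sketch. It is not the system described here: it assumes a simplified cost model in which each step exchanges one full set of gradients (so communication time does not depend on batch size) while compute time grows linearly with the per-GPU batch. The function names, the bandwidth, and the per-GPU throughput are hypothetical placeholders; the ImageNet training-set size (1,281,167 images) and the approximate ResNet-50 parameter count (about 25.6 million) are the only real figures used.

```python
import math

IMAGENET_TRAIN_IMAGES = 1_281_167   # ILSVRC-2012 training set size
RESNET50_PARAMS = 25_600_000        # approximate parameter count


def steps_per_epoch(global_batch, dataset_size=IMAGENET_TRAIN_IMAGES):
    """Number of synchronized SGD steps needed to cover the dataset once."""
    return math.ceil(dataset_size / global_batch)


def comm_to_comp_ratio(global_batch, num_gpus, grad_bytes,
                       bytes_per_sec, samples_per_sec_per_gpu):
    """Ratio of per-step communication time to per-step compute time.

    Simplifying assumptions (hypothetical cost model):
      - communication time = gradient size / effective bandwidth,
        independent of the batch size;
      - compute time = per-GPU batch / per-GPU throughput.
    """
    comm_time = grad_bytes / bytes_per_sec
    comp_time = (global_batch / num_gpus) / samples_per_sec_per_gpu
    return comm_time / comp_time


if __name__ == "__main__":
    grad_bytes = RESNET50_PARAMS * 4   # FP32 gradients, 4 bytes each
    bandwidth = 5e9                    # hypothetical effective bytes/s
    throughput = 200.0                 # hypothetical images/s per GPU

    for batch in (8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024):
        ratio = comm_to_comp_ratio(batch, 2048, grad_bytes,
                                   bandwidth, throughput)
        print(f"batch={batch:6d}  steps/epoch={steps_per_epoch(batch):4d}  "
              f"comm/comp={ratio:.3f}")
```

Under this model, doubling the global mini-batch on a fixed number of GPUs halves both the ratio and the number of synchronization steps per epoch, which is the scalability benefit that must be weighed against the generalization risk of large batches.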