Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

  • 2020-10-20 17:16:29
  • Shaohuai Shi, Xianhao Zhou, Shutao Song, Xingyao Wang, Zilin Zhu, Xue Huang, Xinan Jiang, Feihu Zhou, Zhenyu Guo, Liqiang Xie, Rui Lan, Xianbin Ouyang, Yan Zhang, Jieqian Wei, Jing Gong, Weiliang Lin, Ping Gao, Peng Meng, Xiaomin Xu, Chenyang Guo, Bo Yang, Zhibo Chen, Yongjian Wu, Xiaowen Chu

Abstract

Distributed training techniques have been widely deployed in training large-scale deep neural networks (DNNs) on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances, traditional state-of-the-art distributed training systems cannot scale well in training large-scale models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve the system scalability, we optimize I/O by proposing a simple yet efficient multi-level data caching mechanism and optimize the update operation by introducing a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer. Finally, we break the DAWNBench record for training ResNet-50 to 93% top-5 accuracy on ImageNet.
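To make the idea of top-k sparsification concrete, the following is a minimal NumPy sketch of the generic technique, not the library proposed in the paper (the abstract does not describe its implementation). The function names `topk_sparsify` and `densify` are hypothetical, and carrying the unsent entries forward as a residual is a common practice assumed here rather than a detail taken from the abstract.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad (illustrative helper).

    Returns (indices, values, residual): the flat indices and values to
    communicate, and the entries that were not selected.
    """
    flat = grad.ravel()
    if not 1 <= k <= flat.size:
        raise ValueError("k must be between 1 and the number of elements")
    # argpartition places the k largest |values| in the last k slots
    # without performing a full sort.
    idx = np.argpartition(np.abs(flat), flat.size - k)[-k:]
    values = flat[idx]
    residual = flat.copy()
    residual[idx] = 0.0  # selected entries are sent, so they leave the residual
    return idx, values, residual.reshape(grad.shape)

def densify(idx, values, shape):
    """Rebuild a dense tensor from sparse (indices, values) pairs."""
    out = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    np.add.at(out, idx, values)  # accumulate in case indices repeat
    return out.reshape(shape)

grad = np.array([0.1, -3.0, 0.5, 2.0])
idx, values, residual = topk_sparsify(grad, k=2)
restored = densify(idx, values, grad.shape)
```

In this sketch each worker would exchange only the `k` index/value pairs instead of the full tensor, which is where the communication saving comes from; under the residual assumption, the unsent entries would be added to the next iteration's gradient before selection.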

 
