Abstract
Distributed training techniques have been widely deployed for training large-scale deep neural networks (DNNs) on dense-GPU clusters. However, on public cloud clusters, the moderate interconnection bandwidth between instances prevents traditional state-of-the-art distributed training systems from scaling well when training large-scale models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve system scalability, we optimize I/O by proposing a simple yet efficient multi-level data caching mechanism, and we optimize the update operation by introducing a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer models. Finally, we break the DAWNBench record for training ResNet-50 to 93% top-5 accuracy on ImageNet.
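To make the idea of top-k sparsification concrete, the following is a minimal illustrative sketch in pure Python, not the library proposed in this paper. It assumes a flat gradient stored as a list of floats. The function names `topk_sparsify` and `densify` are hypothetical, and keeping the dropped entries as a local residual is a common practice assumed here rather than a detail stated in the abstract.

```python
import heapq


def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of a flat gradient.

    Returns (indices, values, residual). The (indices, values) pair is the
    sparse payload a worker would communicate. The residual holds the dropped
    entries, which we ASSUME are kept locally for later accumulation.
    """
    if not 0 < k <= len(grad):
        raise ValueError("k must be in [1, len(grad)]")
    # Indices of the k entries with the largest absolute value, in index order.
    idx = sorted(heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i])))
    values = [grad[i] for i in idx]
    kept = set(idx)
    residual = [0.0 if i in kept else g for i, g in enumerate(grad)]
    return idx, values, residual


def densify(indices, values, size):
    """Rebuild a dense vector from a sparse (indices, values) payload."""
    out = [0.0] * size
    for i, v in zip(indices, values):
        out[i] = v
    return out


grad = [0.1, -2.0, 0.03, 1.5, -0.2, 0.0]
idx, vals, residual = topk_sparsify(grad, k=2)
dense = densify(idx, vals, len(grad))
```

With k = 2, only two (index, value) pairs out of six entries would be sent, which is where the communication saving comes from. Selecting the top-k entries efficiently on a GPU is a separate cost that this sketch does not model.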