Integrated Model and Data Parallelism in Training Neural Networks

Abstract

We propose a new integrated method of exploiting both model and data parallelism for the training of deep neural networks (DNNs) on large distributed-memory computers using mini-batch stochastic gradient descent (SGD). Our goal is to find an efficient parallelization strategy for a fixed batch size using $P$ processes. Our method is inspired by the communication-avoiding algorithms in numerical linear algebra. We view the $P$ processes as logically divided into a $P_r \times P_c$ grid, where the $P_r$ dimension is implicitly responsible for model parallelism and the $P_c$ dimension is implicitly responsible for data parallelism. In practice, the integrated matrix-based parallel algorithm encapsulates both types of parallelism automatically. We analyze the communication complexity and analytically demonstrate that the lowest communication costs are often achieved neither with pure model parallelism nor with pure data parallelism. We also show the positive effect of our approach on the computational performance of SGD-based DNN training, where the reduced number of processes responsible for data parallelism results in "fatter" matrices that enable higher-throughput matrix multiplication.
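
The process grid described in the abstract can be illustrated with a minimal Python sketch. It enumerates the possible $P_r \times P_c$ factorizations of $P$ and reports the per-process matrix shapes for a single fully connected layer. The layout is an assumption made for illustration, not something the abstract specifies: the weight matrix (`d_out` by `d_in`) is taken to be split by rows across the $P_r$ dimension, and the activation matrix (`d_in` by batch size `B`) by columns across the $P_c$ dimension. The function names and the layer sizes are hypothetical.

```python
def grid_shapes(P):
    """Return every (P_r, P_c) pair with P_r * P_c == P."""
    return [(p_r, P // p_r) for p_r in range(1, P + 1) if P % p_r == 0]


def rank_to_coords(rank, P_r, P_c):
    """Map a process rank to (row, col) on the grid, row-major (an assumed convention)."""
    if not 0 <= rank < P_r * P_c:
        raise ValueError("rank outside the grid")
    return divmod(rank, P_c)


def local_shapes(d_out, d_in, B, P_r, P_c):
    """Per-process block shapes under the assumed layout.

    Weights (d_out x d_in) are split by rows across P_r (model parallelism);
    activations (d_in x B) are split by columns across P_c (data parallelism).
    Assumes d_out is divisible by P_r and B by P_c.
    """
    weight_block = (d_out // P_r, d_in)
    activation_block = (d_in, B // P_c)
    return weight_block, activation_block


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    P, d_out, d_in, B = 8, 4096, 4096, 512
    for P_r, P_c in grid_shapes(P):
        w, x = local_shapes(d_out, d_in, B, P_r, P_c)
        print(f"P_r={P_r} P_c={P_c}  local W: {w}  local X: {x}")
```

In this sketch, `P_r = 1` corresponds to pure data parallelism and `P_c = 1` to pure model parallelism. Lowering `P_c` raises the number of batch columns each process holds, which is one way to read the "fatter" matrices mentioned above.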
