HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments

Abstract

Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. Training a DNN model generally involves large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive. The training process generally exploits distributed computing resources to reduce training time. In addition, heterogeneous computing resources, e.g., CPUs and GPUs of multiple types, are available for distributed training. Thus, the scheduling of multiple layers to diverse computing resources is critical for the training process. To efficiently train a DNN model using heterogeneous computing resources, we propose a distributed framework, i.e., Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method. The advantages of Paddle-HeterPS are three-fold compared with existing frameworks. First, Paddle-HeterPS enables efficient training of diverse workloads with heterogeneous computing resources. Second, Paddle-HeterPS exploits an RL-based method to efficiently schedule the workload of each layer to appropriate computing resources to minimize the cost while satisfying throughput constraints. Third, Paddle-HeterPS manages data storage and data communication among distributed computing resources. We carry out extensive experiments to show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller). The code of the framework is publicly available at: https://github.com/PaddlePaddle/Paddle.
