TF-Replicator: Distributed Machine Learning for Researchers

Abstract

We describe TF-Replicator, a framework for distributed machine learning designed for DeepMind researchers and implemented as an abstraction over TensorFlow. TF-Replicator simplifies writing data-parallel and model-parallel research code. The same models can be effortlessly deployed to different cluster architectures (i.e. one or many machines containing CPUs, GPUs or TPU accelerators) using synchronous or asynchronous training regimes. To demonstrate the generality and scalability of TF-Replicator, we implement and benchmark three very different models: (1) a ResNet-50 for ImageNet classification, (2) an SN-GAN for class-conditional ImageNet image generation, and (3) a D4PG reinforcement learning agent for continuous control. Our results show strong scalability without demanding any distributed systems expertise of the user. The TF-Replicator programming model will be open-sourced as part of TensorFlow 2.0 (see https://github.com/tensorflow/community/pull/25).
