TF-Replicator: Distributed Machine Learning for Researchers

  • 2019-02-01 17:26:07
  • Peter Buchlovsky, David Budden, Dominik Grewe, Chris Jones, John Aslanides, Frederic Besse, Andy Brock, Aidan Clark, Sergio Gómez Colmenarejo, Aedan Pope, Fabio Viola, Dan Belov
  • 50

Abstract

We describe TF-Replicator, a framework for distributed machine learning designed for DeepMind researchers and implemented as an abstraction over TensorFlow. TF-Replicator simplifies writing data-parallel and model-parallel research code. The same models can be effortlessly deployed to different cluster architectures (i.e. one or many machines containing CPUs, GPUs or TPU accelerators) using synchronous or asynchronous training regimes. To demonstrate the generality and scalability of TF-Replicator, we implement and benchmark three very different models: (1) a ResNet-50 for ImageNet classification, (2) a SN-GAN for class-conditional ImageNet image generation, and (3) a D4PG reinforcement learning agent for continuous control. Our results show strong scalability performance without demanding any distributed systems expertise of the user. The TF-Replicator programming model will be open-sourced as part of TensorFlow 2.0 (see https://github.com/tensorflow/community/pull/25).

 
