Abstract
Interest in and demand for training deep neural networks have grown rapidly, spanning a wide range of applications in both academia and industry. However, distributed training at scale remains difficult due to the complex ecosystem of tools and hardware involved. One consequence is that the responsibility of orchestrating these complex components is often left to one-off scripts and glue code customized for specific problems. To address these limitations, we introduce \emph{Alchemist}, an internal service built at Apple from the ground up for \emph{easy}, \emph{fast}, and \emph{scalable} distributed training. We discuss its design and implementation, and give examples of running different flavors of distributed training. We also present case studies of its internal adoption in the development of autonomous systems, where training times have been reduced by 10x to keep up with ever-growing data collection.