Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

  • 2024-09-23 18:38:52
  • Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, Olatunji Ruwase

Abstract

Given the popularity of generative AI, Large Language Models (LLMs) often consume hundreds or thousands of GPUs to parallelize and accelerate the training process. Communication overhead becomes more pronounced when training LLMs at scale. To eliminate communication overhead in distributed LLM training, we propose Domino, which provides a generic scheme to hide communication behind computation. By breaking the data dependency of a single batch's training into smaller independent pieces, Domino pipelines the training of these independent pieces and provides a generic strategy for fine-grained overlapping of communication and computation. Extensive results show that, compared with Megatron-LM, Domino achieves up to 1.3x speedup for LLM training on Nvidia DGX-H100 GPUs.
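The slicing-and-overlapping idea in the abstract can be illustrated with a toy sketch. This is not Domino's implementation: it only assumes that the rows of a batch pass through a linear layer independently, so the batch can be cut into slices and the communication for one slice can proceed while the next slice is computed. The names `matmul`, `fake_all_reduce`, and `sliced_forward` are hypothetical, and the collective is simulated with a background thread and a short sleep rather than a real GPU all-reduce.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def matmul(x, w):
    """Plain-Python matrix product: (n x k) times (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def fake_all_reduce(partial, delay=0.01):
    """Hypothetical stand-in for a collective: sleeps to mimic network
    latency, then returns its input unchanged."""
    time.sleep(delay)
    return partial


def sliced_forward(batch, weight, num_slices):
    """Process the batch in independent row slices. The simulated
    communication for each slice is submitted to a background worker,
    so it can overlap with the computation of the following slice."""
    step = (len(batch) + num_slices - 1) // num_slices
    slices = [batch[i:i + step] for i in range(0, len(batch), step)]
    pending = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        for piece in slices:
            out = matmul(piece, weight)  # compute the current slice
            pending.append(comm.submit(fake_all_reduce, out))  # communicate in background
        results = [f.result() for f in pending]  # wait for all communication
    # Concatenate slice outputs back into the full batch order.
    return [row for part in results for row in part]


batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
full = matmul(batch, weight)
sliced = sliced_forward(batch, weight, num_slices=2)
```

Because each output row depends only on its own input row, `sliced` matches `full` exactly. That independence is what makes it safe to pipeline the slices.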

 
