Abstract
Given the popularity of generative AI, Large Language Models (LLMs) often consume hundreds or thousands of GPUs to parallelize and accelerate the training process. Communication overhead becomes more pronounced when training LLMs at scale. To eliminate communication overhead in distributed LLM training, we propose Domino, which provides a generic scheme to hide communication behind computation. By breaking the data dependency of a single batch's training into smaller independent pieces, Domino pipelines the training of these independent pieces and provides a generic strategy for fine-grained overlapping of communication and computation. Extensive results show that, compared with Megatron-LM, Domino achieves up to 1.3x speedup for LLM training on Nvidia DGX-H100 GPUs.