Optimized Network Architectures for Large Language Model Training with Billions of Parameters

Abstract

This paper challenges the well-established paradigm of building any-to-any networks for training Large Language Models (LLMs). We show that LLMs exhibit a unique communication pattern in which only small groups of GPUs require high-bandwidth any-to-any communication among themselves to achieve near-optimal training performance. Across these groups of GPUs, the communication is insignificant, sparse, and homogeneous. We propose a new network architecture that closely matches the communication requirements of LLMs. Our architecture partitions the cluster into sets of GPUs interconnected with non-blocking any-to-any high-bandwidth interconnects, which we call HB domains. Across the HB domains, the network only connects GPUs that have communication demands. We call this network a "rail-only" connection, and show that our proposed architecture reduces the network cost by up to 75% compared to state-of-the-art any-to-any Clos networks without compromising the performance of LLM training.
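
As a rough illustration of the connectivity described above, the following Python sketch enumerates which GPU pairs have a direct path in a rail-only layout. It rests on assumptions that the abstract does not state: GPUs are identified by a hypothetical (domain, local rank) pair, the GPUs "with communication demands" across HB domains are taken to be those sharing the same local rank, and the cluster size (4 domains of 8 GPUs) is arbitrary. It counts connected pairs only and is not a cost model.

```python
from itertools import combinations


def rail_only_pairs(num_domains: int, domain_size: int):
    """Enumerate directly connected GPU pairs in a toy rail-only layout.

    Assumption (not from the abstract): a GPU is named (domain, local_rank),
    and cross-domain links exist only between GPUs with equal local_rank.
    """
    gpus = [(d, r) for d in range(num_domains) for r in range(domain_size)]
    hb_pairs, rail_pairs = set(), set()
    for a, b in combinations(gpus, 2):
        if a[0] == b[0]:
            # Same HB domain: any-to-any high-bandwidth connectivity.
            hb_pairs.add((a, b))
        elif a[1] == b[1]:
            # Different domains, same local rank: connected by a "rail".
            rail_pairs.add((a, b))
        # All other pairs have no direct network path in this sketch.
    return gpus, hb_pairs, rail_pairs


gpus, hb_pairs, rail_pairs = rail_only_pairs(num_domains=4, domain_size=8)
total_pairs = len(gpus) * (len(gpus) - 1) // 2  # pairs in a full any-to-any network
print(f"HB-domain pairs: {len(hb_pairs)}")
print(f"rail pairs:      {len(rail_pairs)}")
print(f"any-to-any pairs: {total_pairs}")
```

In this toy configuration, only a fraction of all GPU pairs are directly connected, which is the intuition behind provisioning the network only where communication actually occurs.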
