Abstract
This paper presents a low-cost network architecture for training large language models (LLMs) at hyperscale. We study the optimal parallelization strategy of LLMs and propose a novel datacenter network design tailored to their unique communication pattern. We show that LLM training generates sparse communication patterns in the network and, therefore, does not require an any-to-any full-bisection network to complete efficiently. As a result, our design eliminates the spine layer found in traditional GPU clusters. We name this design a Rail-only network and demonstrate that it achieves the same training performance while reducing network cost by 38% to 77% and network power consumption by 37% to 75% compared to a conventional GPU datacenter. Our architecture also supports Mixture-of-Experts (MoE) models with all-to-all communication through forwarding, with only 4.1% to 5.6% completion time overhead for all-to-all traffic. We study the failure robustness of Rail-only networks and provide insights into the performance impact of different network and training parameters.