Abstract
Large Artificial Intelligence (AI) training workloads spanning several tens of thousands of GPUs present unique power management challenges. These arise from the high variability in power consumption during training. Because these jobs are synchronous, every iteration consists of a computation-heavy phase, in which each GPU works on its local data, and a communication-heavy phase, in which all the GPUs synchronize on the data. Compute-heavy phases require much more power than communication phases, so large power swings occur. The amplitude of these power swings keeps growing as training jobs grow in size. An even bigger challenge arises from the frequency spectrum of these power swings, which, if it coincides with critical frequencies of utilities, can cause physical damage to the power grid infrastructure. Therefore, to continue scaling AI training workloads safely, we need to stabilize the power of such workloads. This paper introduces the challenge with production data and explores innovative solutions across the stack: software, GPU hardware, and datacenter infrastructure. We present the pros and cons of each of these approaches and then present a multi-pronged approach to solving the challenge. The proposed solutions are rigorously tested using a combination of real hardware and Microsoft's in-house cloud power simulator, providing critical insights into the efficacy of these interventions under real-world conditions.
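To make the compute/communication power swing and its frequency content concrete, the following is a minimal illustrative sketch, not the method or simulator used in this paper. It assumes an idealized square-wave power trace in normalized units (the values 1.0 and 0.4, the 10 Hz sampling rate, and the 2-second iteration are arbitrary placeholders, not production measurements), and the function names `synthetic_power_trace` and `dominant_frequency_hz` are hypothetical. It uses a naive discrete Fourier transform from the Python standard library to locate the strongest oscillation frequency.

```python
import cmath
import math


def synthetic_power_trace(n_samples, samples_per_iteration, compute_samples,
                          compute_power, comm_power):
    """Idealized trace: high power while computing, low while communicating.

    All values are illustrative assumptions in normalized units.
    """
    trace = []
    for i in range(n_samples):
        position = i % samples_per_iteration  # sample index within the iteration
        trace.append(compute_power if position < compute_samples else comm_power)
    return trace


def dominant_frequency_hz(trace, sample_rate_hz):
    """Return the frequency (Hz) of the largest non-DC component (naive DFT)."""
    n = len(trace)
    mean = sum(trace) / n
    centered = [x - mean for x in trace]  # remove the DC offset
    best_bin, best_magnitude = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        if abs(coeff) > best_magnitude:
            best_bin, best_magnitude = k, abs(coeff)
    return best_bin * sample_rate_hz / n


if __name__ == "__main__":
    # Assumed setup: 10 Hz sampling, 2 s per iteration, 70% of it in compute.
    trace = synthetic_power_trace(200, 20, 14, compute_power=1.0, comm_power=0.4)
    print("swing amplitude:", max(trace) - min(trace))
    print("dominant frequency (Hz):", dominant_frequency_hz(trace, 10.0))
```

Under these assumptions the dominant frequency equals the iteration rate (one cycle per 2-second iteration), which is why the iteration time of a synchronous job determines where its power oscillation falls in the frequency spectrum. A real trace would be noisier and would typically be analyzed with an FFT library rather than this quadratic-time loop.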