Power Stabilization for AI Training Datacenters

  • 2025-08-21 17:25:27
  • Esha Choukse, Brijesh Warrier, Scot Heath, Luz Belmont, April Zhao, Hassan Ali Khan, Brian Harry, Matthew Kappel, Russell J. Hewett, Kushal Datta, Yu Pei, Caroline Lichtenberger, John Siegler, David Lukofsky, Zaid Kahn, Gurpreet Sahota, Andy Sullivan, Charles Frederick, Hien Thai, Rebecca Naughton, Daniel Jurnove, Justin Harp, Reid Carper, Nithish Mahalingam, Srini Varkala, Alok Gautam Kumbhare, Satyajit Desai, Venkatesh Ramamurthy, Praneeth Gottumukkala, Girish Bhatia, Kelsey Wildstone, Laurentiu Olariu, Ileana Incorvaia, Alex Wetmore, Prabhat Ram, Melur Raghuraman, Mohammed Ayna, Mike Kendrick, Ricardo Bianchini, Aaron Hurst, Reza Zamani, Xin Li, Michael Petrov, Gene Oden, Rory Carmichael, Tom Li, Apoorv Gupta, Pratikkumar Patel, Nilesh Dattani, Lawrence Marwong, Rob Nertney, Hirofumi Ko

Abstract

Large Artificial Intelligence (AI) training workloads spanning several tens of thousands of GPUs present unique power management challenges. These arise due to the high variability in power consumption during the training. Given the synchronous nature of these jobs, during every iteration there is a computation-heavy phase, where each GPU works on the local data, and a communication-heavy phase where all the GPUs synchronize on the data. Because compute-heavy phases require much more power than communication phases, large power swings occur. The amplitude of these power swings is ever increasing with the increase in the size of training jobs. An even bigger challenge arises from the frequency spectrum of these power swings which, if harmonized with critical frequencies of utilities, can cause physical damage to the power grid infrastructure. Therefore, to continue scaling AI training workloads safely, we need to stabilize the power of such workloads. This paper introduces the challenge with production data and explores innovative solutions across the stack: software, GPU hardware, and datacenter infrastructure. We present the pros and cons of each of these approaches and finally present a multi-pronged approach to solving the challenge. The proposed solutions are rigorously tested using a combination of real hardware and Microsoft's in-house cloud power simulator, providing critical insights into the efficacy of these interventions under real-world conditions.
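
To make the mechanism in the abstract concrete, the following is a minimal toy sketch, not the paper's model or simulator. It assumes every GPU switches between a compute phase and a communication phase in perfect lockstep, so the aggregate draw is a rectangular wave. All numbers (10,000 GPUs, 700 W and 300 W per GPU, a 1.5 s compute phase and a 0.5 s communication phase, 0.1 s sampling) and the function names `power_trace` and `dominant_frequency_hz` are hypothetical, chosen only for illustration.

```python
import cmath
import math


def power_trace(n_gpus, p_compute_w, p_comm_w, compute_steps, comm_steps, total_steps):
    """Aggregate power (watts) per sample for GPUs that change phase in lockstep.

    Phase lengths are given in whole samples to avoid floating-point drift.
    """
    period = compute_steps + comm_steps
    trace = []
    for i in range(total_steps):
        in_compute = (i % period) < compute_steps
        per_gpu_w = p_compute_w if in_compute else p_comm_w
        trace.append(n_gpus * per_gpu_w)
    return trace


def dominant_frequency_hz(trace, dt_s):
    """Frequency of the largest non-DC component, using a plain O(n^2) DFT."""
    n = len(trace)
    mean = sum(trace) / n
    centered = [x - mean for x in trace]  # remove the DC offset
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k / (n * dt_s)


if __name__ == "__main__":
    DT_S = 0.1  # hypothetical sampling interval in seconds
    # Hypothetical job: 15 samples (1.5 s) compute, 5 samples (0.5 s) communication.
    trace = power_trace(10_000, 700, 300, 15, 5, 200)
    swing_mw = (max(trace) - min(trace)) / 1e6
    print(f"peak-to-peak swing: {swing_mw:.1f} MW")
    print(f"dominant frequency: {dominant_frequency_hz(trace, DT_S):.2f} Hz")
```

Under these assumed values the swing scales linearly with the GPU count, and the dominant spectral component sits at the iteration rate (one cycle per 2 s iteration). This is the kind of frequency the abstract warns about when it coincides with a critical frequency of the utility. Real traces would be far less regular than this idealized square wave.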

 
