Abstract
Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where training can be stalled by a few slow workers. At ByteDance, we find that stragglers are not always trivially caused by hardware failures, but can arise from multiple complex factors. This work presents a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates the scenario without any stragglers and contrasts it with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes of stragglers?
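
As a rough illustration of the what-if idea, the sketch below compares the time of a synchronized job against a hypothetical straggler-free run. It is not the simulator used in this study. It assumes that each training step ends only when its slowest worker finishes, and that a straggler-free worker would take the per-step median duration. The function names and the toy trace are hypothetical.

```python
from statistics import median


def actual_job_time(step_durations):
    """Total time when every step waits for its slowest worker."""
    return sum(max(workers) for workers in step_durations)


def what_if_job_time(step_durations):
    """Hypothetical straggler-free time.

    Assumption: without stragglers, every worker would take the
    per-step median duration, so each step costs that median.
    """
    return sum(median(workers) for workers in step_durations)


def slowdown(step_durations):
    """Ratio of actual time to the simulated straggler-free time."""
    return actual_job_time(step_durations) / what_if_job_time(step_durations)


# Toy trace (made-up numbers): 3 steps x 4 workers, durations in seconds.
# One worker is slow in the second step.
trace = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

print(f"actual:   {actual_job_time(trace):.2f} s")
print(f"what-if:  {what_if_job_time(trace):.2f} s")
print(f"slowdown: {slowdown(trace):.2f}x")
```

In this toy trace a single slow worker in one step stretches the whole job from 3 s to 4 s, which is the kind of gap the what-if comparison is meant to expose.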