MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

  • 2025-10-21 17:25:32
  • Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu

Abstract

The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context processing. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts, especially in distributed settings, remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.

 
