Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs

  • 2025-04-11 08:47:04
  • Yichun Yin, Wenyong Huang, Kaikai Song, Yehui Tang, Xueyu Wu, Wei Guo, Peng Guo, Yaoyuan Wang, Xiaojun Meng, Yasheng Wang, Dong Li, Can Chen, Dandan Tu, Yin Li, Fisher Yu, Ruiming Tang, Yunhe Wang, Baojun Wang, Bin Wang, Bo Wang, Boxiao Liu, Changzheng Zhang, Duyu Tang, Fei Mi, Hui Jin, Jiansheng Wei, Jiarui Qin, Jinpeng Li, Jun Zhao, Liqun Deng, Lin Li, Minghui Xu, Naifu Zhang, Nianzu Zheng, Qiang Li, Rongju Ruan, Shengjun Cheng, Tianyu Guo, Wei He, Wei Li, Weiwen Liu, Wulong Liu, Xinyi Dai, Yonghan Dong, Yu Pan, Yue Li, Yufei Wang, Yujun Li, Yunsheng Ni, Zhe Liu, Zhenhe Zhang, Zhicheng Liu

Abstract

We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules trained on Ascend Neural Processing Units (NPUs). Although the field of LLMs has witnessed unprecedented advances in scale and capability in recent years, training such a large-scale model still involves significant optimization and system challenges. To stabilize the training process, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes during the training of deep models. We pre-train our model on 13.2 trillion diverse and high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves competitive results with DeepSeek-R1, whose sparse model structure contains many more parameters. Our exploration demonstrates that Ascend NPUs are capable of efficiently and effectively training dense models with more than 100 billion parameters. Our model and system will be available to our commercial customers.

 
