Abstract
We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules trained on Ascend Neural Processing Units (NPUs). Although the field of LLMs has witnessed unprecedented advances in scale and capability in recent years, training such a large-scale model still involves significant optimization and system challenges. To stabilize the training process, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes during the training of deep models. We pre-train our model on 13.2 trillion diverse and high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves competitive results with DeepSeek-R1, whose sparse model structure contains many more parameters. Our exploration demonstrates that Ascend NPUs are capable of efficiently and effectively training dense models with more than 100 billion parameters. Our model and system will be available to our commercial customers.