Serving Large Language Models on Huawei CloudMatrix384

  • 2025-06-18 11:04:59
  • Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu, Zhao Qiu, Peiyang Li, Xianyu Chang, Zhengzhong Yu, Fangzheng Miao, Jia Zheng, Ying Li, Yuan Feng, Bei Wang, Zaijian Zong, Mosong Zhou, Wenli Zhou, Houjiang Chen, Xingyu Liao, Yipeng Li, Wenxiao Zhang, Ping Zhu, Yinggang Wang, Chuanjie Xiao, Depeng Liang, Dong Cao, Juncheng Liu, Yongqiang Yang, Xiaolong Bai, Yi Li, Huaguo Xie, Huatao Wu, Zhibin Yu, Lv Chen, Hu Liu, Yujun Ding, Haipei Zhu, Jing Xia, Yi Xiong, Zhou Yu, Heng Liao

Abstract

The rapid evolution of large language models (LLMs), driven by growing parameter scales, adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910C NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows CloudMatrix-Infer achieves state-of-the-art efficiency: prefill throughput of 6,688 tokens/s per NPU and decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s per NPU even under stringent 15 ms latency constraints, while INT8 quantization maintains model accuracy across benchmarks.
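
The throughput/latency trade-off quoted above can be made concrete with a small calculation. The sketch below is not from the paper: it assumes steady-state decoding in which every in-flight sequence emits one token per TPOT interval, so that Little's law gives the number of concurrent sequences per NPU as throughput times TPOT. The function name `implied_concurrency` is hypothetical, and because the abstract states TPOT only as an upper bound, the results are rough upper estimates rather than reported batch sizes.

```python
def implied_concurrency(tokens_per_s: float, tpot_ms: float) -> float:
    """Little's law: sequences in flight = throughput x time per output token.

    Assumes steady state and one token per sequence per TPOT interval
    (an illustrative assumption, not a figure reported in the abstract).
    """
    return tokens_per_s * (tpot_ms / 1000.0)


# Decode figures from the abstract; TPOT values are the stated limits.
relaxed = implied_concurrency(1943, 50)  # decode at <50 ms TPOT
strict = implied_concurrency(538, 15)    # decode under a 15 ms TPOT limit

print(f"~{relaxed:.0f} sequences per NPU at 50 ms TPOT")
print(f"~{strict:.0f} sequences per NPU at 15 ms TPOT")
```

Under these assumptions, tightening the latency target from 50 ms to 15 ms shrinks the implied per-NPU concurrency by roughly an order of magnitude, which is consistent with the lower per-NPU throughput the abstract reports at the stricter limit.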

 
