Serving Large Language Models on Huawei CloudMatrix384

Abstract

The rapid evolution of large language models (LLMs), driven by growing parameter scales, adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910C NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows CloudMatrix-Infer achieves state-of-the-art efficiency: prefill throughput of 6,688 tokens/s per NPU and decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s per NPU even under stringent 15 ms latency constraints, while INT8 quantization maintains model accuracy across benchmarks.
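
The decode figures above pair a per-NPU throughput with a time-per-output-token (TPOT) bound, and the two together hint at how many sequences each NPU decodes concurrently. The following is a rough back-of-the-envelope sketch, not a calculation from the paper. It assumes steady-state decoding in which every active sequence emits exactly one token per TPOT interval and TPOT sits exactly at the stated bound; the helper name `implied_concurrency` is hypothetical. Because the abstract gives TPOT only as an upper bound, the results are upper estimates.

```python
def implied_concurrency(tokens_per_s: float, tpot_ms: float) -> float:
    """Estimate concurrent sequences per NPU from throughput and TPOT.

    Assumption (not from the paper): each active sequence produces one
    token every `tpot_ms` milliseconds, so
        throughput = concurrency * (1000 / tpot_ms)
    which rearranges to concurrency = throughput * tpot_ms / 1000.
    """
    return tokens_per_s * tpot_ms / 1000.0


# Figures quoted in the abstract (decode throughput per NPU, TPOT bound).
relaxed = implied_concurrency(tokens_per_s=1943, tpot_ms=50)
strict = implied_concurrency(tokens_per_s=538, tpot_ms=15)

print(f"<50 ms TPOT: up to ~{relaxed:.0f} concurrent sequences per NPU")
print(f"15 ms TPOT:  up to ~{strict:.0f} concurrent sequences per NPU")
```

Under these assumptions, tightening the latency target from 50 ms to 15 ms shrinks the implied per-NPU batch by roughly an order of magnitude, which is one way to read the throughput-latency trade-off the abstract describes.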
