Abstract
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware.

Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models significantly improve long-sequence training efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Training remains stable for weeks on hundreds of MetaX C550 GPUs, with the 7B model reaching a Model FLOPs Utilization of 23.4 percent. The proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
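
The contrast drawn above between computation that scales quadratically with sequence length and linear attention with constant inference memory can be illustrated with a minimal sketch. It uses a generic, unnormalized causal linear attention in plain Python and is not the SpikingBrain architecture itself; the function names, the identity feature map, and the toy dimensions are all assumptions made for illustration.

```python
import random

def quadratic_causal_attention(qs, ks, vs):
    """Unnormalized causal linear attention computed pairwise.

    Every query visits all earlier keys, so the work grows as O(n^2)
    and all past keys/values must be kept (memory grows with n).
    """
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):
            score = sum(qi * ki for qi, ki in zip(q, ks[s]))
            for j, vj in enumerate(vs[s]):
                out[j] += score * vj
        outputs.append(out)
    return outputs

def recurrent_linear_attention(qs, ks, vs):
    """Same outputs via a running state S += k v^T of fixed size d_k x d_v.

    Work is O(n) in sequence length and the state size does not depend on n.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs, state

def random_vectors(n, d, rng):
    """Hypothetical helper: n random vectors of dimension d."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

def max_abs_diff(a, b):
    """Largest elementwise difference between two lists of vectors."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

rng = random.Random(0)
n, d_k, d_v = 32, 4, 3  # toy sizes, chosen only for the demonstration
qs = random_vectors(n, d_k, rng)
ks = random_vectors(n, d_k, rng)
vs = random_vectors(n, d_v, rng)

ref = quadratic_causal_attention(qs, ks, vs)
out, state = recurrent_linear_attention(qs, ks, vs)

# The two formulations agree up to floating-point rounding.
print(max_abs_diff(ref, out) < 1e-9)
# The recurrent state stays d_k x d_v regardless of sequence length.
print(len(state), len(state[0]))
```

In this toy setting the recurrent form carries only a fixed `d_k x d_v` state from token to token, which is the property that keeps inference memory constant for purely linear layers. A hybrid model that also retains some softmax attention layers would, under the same reasoning, have only partially constant memory.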