Abstract
Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, i.e., DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach that is more robust to transformer model architectures and network hardware topology. It compares the communication and memory cost of SP with those of existing parallelism methods, including data/tensor/ZeRO/expert/pipeline parallelism, and discusses best practices for designing hybrid 4D parallelism involving SP. We achieved 86\% MFU on two 8xA800 nodes using SP at a sequence length of 208K for the LLAMA3-8B model. Our code is publicly available at \url{https://github.com/feifeibear/long-context-attention}.