PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Abstract

It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
