Dion: A Communication-Efficient Optimizer for Large Models

Abstract

Training large AI models efficiently requires distributing computation across multiple accelerators, but this often incurs significant communication overhead, especially during gradient synchronization. We introduce Dion, a communication-efficient optimizer that retains the synchronous semantics of standard distributed training (e.g., DDP, FSDP) while substantially reducing I/O costs. Unlike conventional optimizers that synchronize full gradient matrices, Dion leverages orthonormalized updates with device-local momentum buffers, eliminating the need for full gradient exchange. It further supports an efficient sharding strategy that avoids reconstructing large matrices during training.
