Abstract
Training large AI models efficiently requires distributing computation across multiple accelerators, but this often incurs significant communication overhead, especially during gradient synchronization. We introduce Dion, a communication-efficient optimizer that retains the synchronous semantics of standard distributed training (e.g., DDP, FSDP) while substantially reducing I/O costs. Unlike conventional optimizers that synchronize full gradient matrices, Dion leverages orthonormalized updates with device-local momentum buffers, eliminating the need for full gradient exchange. It further supports an efficient sharding strategy that avoids reconstructing large matrices during training.