COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs

Abstract

Large Language Models (LLMs) have demonstrated remarkable success across various domains, yet their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit. While adaptive optimizers such as AdamW are widely used, they suffer from critical limitations, including an inability to capture interdependencies between coordinates and high memory consumption. Subsequent research, exemplified by SOAP, attempts to better capture coordinate interdependence but incurs greater memory overhead, limiting scalability for massive LLMs. An alternative approach aims to reduce memory consumption through low-dimensional projection, but this leads to substantial approximation errors, resulting in less effective optimization (e.g., in terms of per-token efficiency). In this paper, we propose COSMOS, a novel hybrid optimizer that leverages the varying importance of eigensubspaces in the gradient matrix to achieve memory efficiency without compromising optimization performance. The design of COSMOS is motivated by our empirical insights and practical considerations. Specifically, COSMOS applies SOAP to the leading eigensubspace, which captures the primary optimization dynamics, and MUON to the remaining eigensubspace, which is less critical but computationally expensive to handle with SOAP. This hybrid strategy significantly reduces memory consumption while maintaining robust optimization performance, making it particularly suitable for massive LLMs. Numerical experiments on various datasets and transformer architectures are provided to demonstrate the effectiveness of COSMOS. Our code is available at https://github.com/lliu606/COSMOS.
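
To make the idea of splitting a gradient matrix by eigensubspace concrete, the following is a minimal NumPy sketch. It is not the COSMOS implementation: the function name `split_by_leading_subspace`, the choice of the eigenvectors of G^T G as the basis, and the fixed `rank` parameter are all assumptions made for illustration. The sketch only shows the decomposition into a leading-subspace component and a residual; the SOAP-style and MUON-style updates that would then be applied to each part are not shown.

```python
import numpy as np

def split_by_leading_subspace(grad, rank):
    """Split an (m x n) gradient into a leading-subspace part and a residual.

    Hypothetical helper for illustration only; the basis choice (top
    eigenvectors of grad^T grad) is an assumption, not the paper's recipe.
    """
    # grad^T grad is symmetric positive semidefinite, so eigh applies.
    # eigh returns eigenvalues in ascending order, so the leading
    # eigenvectors are the last `rank` columns.
    _, eigvecs = np.linalg.eigh(grad.T @ grad)
    basis = eigvecs[:, -rank:]              # (n x rank), orthonormal columns

    # Component of the gradient inside the leading eigensubspace.
    leading = grad @ basis @ basis.T
    # Everything else lives in the orthogonal complement.
    residual = grad - leading
    return basis, leading, residual

# Usage: a random 6 x 4 "gradient" split with a rank-2 leading subspace.
rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
U, G_lead, G_rest = split_by_leading_subspace(G, rank=2)
```

In a hybrid scheme of the kind the abstract describes, an expensive adaptive method would only need to keep state for the small `rank`-dimensional part, while a cheaper rule handles `G_rest`. This is where the memory saving would come from under these assumptions.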
