Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)

Abstract

Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as $\sf Muon$ and $\sf Scion$. After over a decade of $\sf Adam$'s dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise LMO application of these optimizers in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both, we propose a new LMO-based method called $\sf Gluon$, capturing prior theoretically analyzed methods as special cases, and introduce a new refined generalized smoothness model that captures the layer-wise geometry of neural networks, matches the layer-wise practical implementation of $\sf Muon$ and $\sf Scion$, and leads to convergence guarantees with strong practical predictive power. Unlike prior results, our theoretical stepsizes closely match the fine-tuned values reported by Pethick et al. (2025). Our experiments with NanoGPT and CNN confirm that our assumption holds along the optimization trajectory, ultimately closing the gap between theory and practice.
