Abstract
Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as $\sf Muon$ and $\sf Scion$. After over a decade of $\sf Adam$'s dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise LMO application of these optimizers in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both, we propose a new LMO-based method called $\sf Gluon$, capturing prior theoretically analyzed methods as special cases, and introduce a new refined generalized smoothness model that captures the layer-wise geometry of neural networks, matches the layer-wise practical implementation of $\sf Muon$ and $\sf Scion$, and leads to convergence guarantees with strong practical predictive power. Unlike prior results, our theoretical stepsizes closely match the fine-tuned values reported by Pethick et al. (2025). Our experiments with NanoGPT and CNN confirm that our assumption holds along the optimization trajectory, ultimately closing the gap between theory and practice.