Muon Optimizer Accelerates Grokking

Abstract

This paper investigates the impact of different optimizers on the grokking phenomenon, where models exhibit delayed generalization. We conducted experiments across seven numerical tasks (primarily modular arithmetic) using a modern Transformer architecture. The experimental configuration systematically varied the optimizer (Muon vs. AdamW) and the softmax activation function (standard softmax, stablemax, and sparsemax) to assess their combined effect on learning dynamics. Our empirical evaluation reveals that the Muon optimizer, characterized by its use of spectral norm constraints and second-order information, significantly accelerates the onset of grokking compared to the widely used AdamW optimizer. Specifically, Muon reduced the mean grokking epoch from 153.09 to 102.89 across all configurations, a statistically significant difference (t = 5.0175, p = 6.33e-08). This suggests that the optimizer choice plays a crucial role in facilitating the transition from memorization to generalization.
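
To make the "spectral norm constraint" concrete: Muon, as publicly described, replaces each weight matrix's momentum-averaged gradient with an approximately orthogonalized version, so every singular value of the update is close to 1. The sketch below illustrates only that orthogonalization step. It is a minimal sketch under stated assumptions, not the implementation used in this paper: it uses the classical cubic Newton-Schulz iteration rather than the tuned higher-order polynomial found in reference implementations, it is written in plain Python for small matrices, and the names `matmul`, `transpose`, and `orthogonalize` are hypothetical helpers introduced here.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g (hypothetical helper).

    Uses the classical cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X,
    an assumption made for illustration. Each singular value s is mapped to
    1.5 s - 0.5 s^3, which converges to 1 for s in (0, sqrt(3)).
    """
    # Divide by the Frobenius norm so all singular values start in (0, 1].
    fro = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, ry)] for rx, ry in zip(x, xxt_x)]
    return x


if __name__ == "__main__":
    # Toy input with singular values 2 and 1; the output has both near 1.
    for row in orthogonalize([[0.0, 2.0], [1.0, 0.0]]):
        print([round(v, 6) for v in row])
```

Because the iteration pushes all nonzero singular values toward 1, the resulting update has spectral norm close to 1 regardless of the scale of the raw gradient, which is the property the abstract refers to. Very small singular values would need more iterations than the default used here.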
