The Epochal Sawtooth Phenomenon: Unveiling Training Loss Oscillations in Adam and Other Optimizers

Abstract

In this paper, we identify and analyze a recurring training loss pattern, which we term the \textit{Epochal Sawtooth Phenomenon (ESP)}, commonly observed during training with adaptive gradient-based optimizers, particularly the Adam optimizer. This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve. Through empirical observations, we demonstrate that while this effect is most pronounced with Adam, it persists, albeit less severely, with other optimizers such as RMSProp. We empirically analyze the mechanisms underlying ESP, focusing on key factors such as Adam's $\beta$ parameters, batch size, data shuffling, and sample replacement. Our analysis shows that ESP arises from adaptive learning rate adjustments controlled by the second moment estimate. Additionally, we identify the ``immediate re-exposure to samples'' effect during data shuffling, which causes the model to learn or memorize more at the beginning of each epoch. We also find that smaller values of $\beta_2$ exacerbate ESP but can act as a form of regularization. While ESP is not necessarily indicative of overfitting, higher model capacity can amplify the phenomenon. To further support our analysis, we replicate ESP through a high-dimensional quadratic minimization task. We demonstrate that ESP can emerge even in simple optimization scenarios, reinforcing the generality of this pattern. The code for reproducing our experiments is available at https://github.com/qiliuchn/training-loss-pattern.
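To make the ingredients named in the abstract concrete, the following is a minimal sketch of mini-batch Adam on a toy quadratic objective with per-epoch shuffling. It is not the authors' code from the linked repository: the objective (mean of $\frac{1}{2}\|w - x_i\|^2$ over random samples $x_i$), the function names `adam_step` and `run`, and all default hyperparameters are assumptions chosen only for illustration.

```python
import math
import random


def adam_step(w, g, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one standard Adam update in place (t is the 1-based step count)."""
    for i in range(len(w)):
        # Exponential moving averages of the gradient and squared gradient.
        m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
        v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
        # Bias-corrected first and second moment estimates.
        m_hat = m[i] / (1.0 - beta1 ** t)
        v_hat = v[i] / (1.0 - beta2 ** t)
        # The second moment estimate rescales the effective step size.
        w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


def run(n_samples=64, dim=20, epochs=5, batch_size=8, beta2=0.999, seed=0):
    """Return per-step batch losses for an assumed toy quadratic task."""
    rng = random.Random(seed)
    # Hypothetical dataset: each sample x_i defines a term 0.5 * ||w - x_i||^2.
    data = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
    w, m, v = [0.0] * dim, [0.0] * dim, [0.0] * dim
    losses = []
    t = 0
    for _ in range(epochs):
        order = list(range(n_samples))
        rng.shuffle(order)  # reshuffle each epoch; no replacement within an epoch
        for start in range(0, n_samples, batch_size):
            batch = [data[j] for j in order[start:start + batch_size]]
            # Batch loss and gradient, both evaluated before the update.
            loss = sum(
                0.5 * sum((w[i] - x[i]) ** 2 for i in range(dim)) for x in batch
            ) / len(batch)
            grad = [sum(w[i] - x[i] for x in batch) / len(batch) for i in range(dim)]
            losses.append(loss)
            t += 1
            adam_step(w, grad, m, v, t, beta2=beta2)
    return losses
```

Plotting the lists returned by, say, `run(beta2=0.9)` and `run(beta2=0.999)` is one way to explore how $\beta_2$, batch size, and shuffling interact. Whether a visible sawtooth appears in this toy setting depends on the chosen hyperparameters and is not guaranteed by the sketch.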
