EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training

Abstract

The superior performance of modern visual backbones usually comes with a costly training procedure. We address this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models using easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize some 'easier-to-learn' discriminative patterns in the data. When observed in the frequency and spatial domains, these patterns consist of lower-frequency components and the natural image contents without distortion or data augmentation. Motivated by these findings, we propose a curriculum in which the model always leverages all the training data at every learning stage, yet is exposed to the 'easier-to-learn' patterns of each example first, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. We then show that exposing the natural image contents can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these aspects and design curriculum schedules with tailored search algorithms. The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. It reduces the training time of a wide variety of popular models by 1.5–3.0× on ImageNet-1K/22K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE).
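The Fourier-spectrum cropping described above can be illustrated with a minimal NumPy sketch. It keeps only a centred `bandwidth × bandwidth` window of each image's 2-D spectrum and maps it back to pixel space on the smaller grid, so the model would receive a smaller, low-pass input. The function names `low_frequency_crop` and `bandwidth_at`, the mean-preserving rescaling, and the linear bandwidth schedule with its default values are all hypothetical illustrations, not the paper's implementation. The paper's schedules are obtained with tailored search algorithms, and the accompanying ramp-up of data-augmentation intensity is not modeled here.

```python
import numpy as np


def low_frequency_crop(image: np.ndarray, bandwidth: int) -> np.ndarray:
    """Keep only the central bandwidth x bandwidth window of the 2-D spectrum.

    `image` has shape (H, W) or (H, W, C). The result has shape
    (bandwidth, bandwidth[, C]): a smaller image that contains only the
    lower-frequency components of the input.
    """
    h, w = image.shape[:2]
    if not 0 < bandwidth <= min(h, w):
        raise ValueError("bandwidth must be in (0, min(H, W)]")

    # 2-D FFT over the spatial axes; fftshift moves the zero frequency to the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))

    # Crop a centred window so the zero frequency lands at index bandwidth // 2.
    top = h // 2 - bandwidth // 2
    left = w // 2 - bandwidth // 2
    cropped = spectrum[top:top + bandwidth, left:left + bandwidth, ...]

    # Undo the shift and invert on the smaller grid.
    small = np.fft.ifft2(np.fft.ifftshift(cropped, axes=(0, 1)), axes=(0, 1))

    # ifft2 divides by bandwidth**2 rather than H*W, so rescale to preserve the
    # mean intensity. The imaginary part (non-zero for some even bandwidths) is dropped.
    return small.real * (bandwidth * bandwidth) / (h * w)


def bandwidth_at(epoch: int, total_epochs: int,
                 b_start: int = 160, b_end: int = 224, step: int = 32) -> int:
    """Hypothetical linear schedule: widen the kept band as training progresses."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    b = b_start + frac * (b_end - b_start)
    # Snap to a multiple of `step` so input sizes change only a few times.
    return int(round(b / step) * step)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((224, 224, 3))
    for epoch in (0, 150, 299):
        b = bandwidth_at(epoch, 300)
        print(epoch, b, low_frequency_crop(img, b).shape)
```

With `bandwidth` equal to the full image size the function returns the original image, so the last stage of such a schedule trains on unmodified inputs. The savings come from the earlier stages, where the inverse transform is taken on a smaller grid and the backbone processes fewer pixels.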
