Effective Vision Transformer Training: A Data-Centric Perspective

Abstract

Vision Transformers (ViTs) have shown promising performance compared with Convolutional Neural Networks (CNNs), but training ViTs is much harder than training CNNs. In this paper, we define several metrics, including Dynamic Data Proportion (DDP) and Knowledge Assimilation Rate (KAR), to investigate the training process, and divide it into three periods accordingly: formation, growth and exploration. In particular, at the last stage of training, we observe that only a tiny portion of training examples is used to optimize the model. Given the data-hungry nature of ViTs, we thus ask a simple but important question: is it possible to provide abundant "effective" training examples at every stage of training? To address this issue, we need to answer two critical questions, i.e., how to measure the "effectiveness" of individual training examples, and how to systematically generate enough "effective" examples when they are running out. To answer the first question, we find that the "difficulty" of training samples can be adopted as an indicator of their "effectiveness". To cope with the second question, we propose to dynamically adjust the "difficulty" distribution of the training data in these evolution stages. To achieve these two purposes, we propose a novel data-centric ViT training framework that dynamically measures the "difficulty" of training samples and generates "effective" samples for models at different training stages. Furthermore, to enlarge the number of "effective" samples and alleviate the overfitting problem in the late training stage of ViTs, we propose a patch-level erasing strategy dubbed PatchErasing. Extensive experiments demonstrate the effectiveness of the proposed data-centric ViT training framework and techniques.
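
To make the idea of patch-level erasing concrete, the following is a minimal sketch and not the PatchErasing procedure itself, whose details the abstract does not specify. It assumes that erasing means overwriting a random subset of non-overlapping, ViT-style patches with a constant value. The function name `erase_patches` and the parameters `patch_size`, `erase_ratio` and `fill` are hypothetical and chosen only for illustration.

```python
import numpy as np


def erase_patches(image, patch_size=16, erase_ratio=0.25, rng=None, fill=0.0):
    """Overwrite a random subset of non-overlapping patches in an (H, W, C) image.

    Illustrative assumption only: the abstract does not state how PatchErasing
    selects or fills patches, so uniform random selection and a constant fill
    value are used here.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")

    rows, cols = height // patch_size, width // patch_size
    num_patches = rows * cols
    num_erase = int(round(erase_ratio * num_patches))

    # Sample distinct patch indices on the flattened patch grid.
    erased = rng.choice(num_patches, size=num_erase, replace=False)

    out = image.copy()  # leave the caller's array untouched
    for idx in erased:
        r, c = divmod(int(idx), cols)
        out[r * patch_size:(r + 1) * patch_size,
            c * patch_size:(c + 1) * patch_size] = fill
    return out, np.sort(erased)


if __name__ == "__main__":
    img = np.ones((224, 224, 3), dtype=np.float32)
    augmented, erased_ids = erase_patches(img, patch_size=16, erase_ratio=0.25)
    print(augmented.shape, len(erased_ids))
```

In a training loop, a function of this kind would be applied per sample before patch embedding; how the erase ratio should vary across the formation, growth and exploration periods is left open here.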
