Training Vision Transformers with Only 2040 Images

Abstract

Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition. They achieve results competitive with CNNs, but the lack of the typical convolutional inductive bias makes them more data-hungry than common CNNs. They are often pretrained on JFT-300M or at least ImageNet, and few works study training ViTs with limited data. In this paper, we investigate how to train ViTs with limited data (e.g., 2040 images). We give theoretical analyses showing that our method (based on parametric instance discrimination) is superior to other methods in that it can capture both feature alignment and instance similarities. We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones. We also investigate the transfer ability of small datasets and find that representations learned from small datasets can even improve large-scale ImageNet training.
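
For readers unfamiliar with the term, parametric instance discrimination in its generic form treats every training image as its own class and trains a classifier with one output per image. The NumPy sketch below illustrates only that generic idea and is not the authors' implementation; the function name, the feature dimension, and the random features standing in for backbone outputs are all assumptions made for illustration, and the abstract does not specify further details of the method.

```python
import numpy as np

def instance_discrimination_loss(features, weights, instance_ids):
    """Cross-entropy loss where each training image is its own class.

    features:     (batch, dim) embeddings from any backbone (hypothetical here)
    weights:      (num_images, dim) one learnable classifier row per training image
    instance_ids: (batch,) index of each image in the training set, used as its label
    """
    logits = features @ weights.T                          # (batch, num_images)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(instance_ids)), instance_ids].mean()

rng = np.random.default_rng(0)
num_images, dim = 2040, 16        # 2040 echoes the abstract's example; dim is arbitrary
weights = rng.normal(size=(num_images, dim)) * 0.01
ids = np.array([0, 5, 17, 2039])  # dataset indices of the images in this batch
features = rng.normal(size=(len(ids), dim))  # stand-in for backbone outputs
loss = instance_discrimination_loss(features, weights, ids)
```

With an all-zero classifier the loss equals log(num_images), the value for a uniform guess over all instances, which is a convenient sanity check before training.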
