Semi-supervised Vision Transformers at Scale

Abstract

We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architectures to different tasks. To tackle this problem, we propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than the CNN counterparts in the semi-supervised classification setting. Semi-ViT also enjoys the scalability benefits of ViTs that can be readily scaled up to large-size models with increasing accuracies. For example, Semi-ViT-Huge achieves an impressive 80% top-1 accuracy on ImageNet using only 1% labels, which is comparable with Inception-v4 using 100% ImageNet labels.
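
The two ingredients named in the abstract, an EMA teacher and mixup over pseudo-labeled samples, can be illustrated with a minimal sketch. It is not the Semi-ViT implementation: the abstract does not give the update rule, the momentum value, or how the mixing ratio is chosen, so the code assumes the textbook EMA update and standard mixup with a Beta-distributed coefficient. The function names (`ema_update`, `mixup_pair`, `sample_lambda`) and the defaults `momentum=0.999` and `alpha=0.8` are hypothetical, and plain Python lists stand in for model weights and image tensors.

```python
import random


def ema_update(teacher, student, momentum=0.999):
    """Textbook EMA step: teacher <- m * teacher + (1 - m) * student.

    `teacher` and `student` are flat lists of floats standing in for
    model parameters. The default momentum is an assumed value.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


def sample_lambda(alpha=0.8, rng=random):
    """Draw a mixing coefficient from Beta(alpha, alpha), as in standard mixup.

    How Semi-ViT actually picks the ratio is not stated in the abstract;
    this is an assumption made for illustration.
    """
    return rng.betavariate(alpha, alpha)


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Interpolate two unlabeled inputs and their pseudo-label distributions.

    Returns lam * a + (1 - lam) * b for both the inputs and the labels.
    """
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


if __name__ == "__main__":
    # The teacher moves a small step toward the student's weights.
    print(ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9))

    # Two toy "images" with pseudo labels produced by a teacher model.
    lam = sample_lambda()
    mixed_x, mixed_y = mixup_pair([0.0, 4.0], [1.0, 0.0],
                                  [4.0, 0.0], [0.0, 1.0], lam)
    print(lam, mixed_x, mixed_y)
```

Because the mixed label is a convex combination of two probability vectors, it still sums to one, so it can be used directly as a soft target for the student.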
