Emerging Properties in Self-Supervised Vision Transformers

Abstract

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
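
To make the two ingredients named above more concrete, the following is a minimal sketch of self-distillation with a momentum encoder, written in plain Python on lists of floats. It is an illustration under stated assumptions, not the authors' implementation: the function names (`log_softmax`, `distillation_loss`, `ema_update`), the temperature values, and the momentum value are all hypothetical choices made for this example, and other components of the method (such as multi-crop training) are left out.

```python
import math


def log_softmax(logits, temperature):
    # Temperature-scaled log-softmax, computed with the log-sum-exp trick
    # so that large logits do not overflow.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]


def softmax(logits, temperature):
    # Probabilities obtained by exponentiating the log-softmax.
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    # Cross-entropy between the teacher distribution (treated as a fixed
    # target, no labels involved) and the student distribution.
    # The temperatures are illustrative assumptions.
    p_teacher = softmax(teacher_logits, teacher_temp)
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(t * ls for t, ls in zip(p_teacher, log_p_student))


def ema_update(teacher_params, student_params, momentum=0.996):
    # Momentum encoder: the teacher is an exponential moving average of
    # the student, teacher <- m * teacher + (1 - m) * student.
    # The momentum value is an illustrative assumption.
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In a training loop built on this sketch, only the student would be updated by gradient descent on `distillation_loss`, while the teacher would be refreshed with `ema_update` after each step.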
