SiT: Self-supervised vIsion Transformer

Abstract

Self-supervised learning methods are gaining increasing traction in computer vision due to their recent success in reducing the gap with supervised learning. In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice. The recent literature suggests that transformers are also becoming increasingly popular in computer vision. So far, vision transformers have been shown to work well when pretrained either using large-scale supervised data or with some kind of co-supervision, e.g. in terms of a teacher network. These supervised pretrained vision transformers achieve very good results in downstream tasks with minimal changes. In this work we investigate the merits of self-supervised learning for pretraining image/vision transformers and then using them for downstream classification tasks. We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model. The architectural flexibility of SiT allows us to use it as an autoencoder and work with multiple self-supervised tasks seamlessly. We show that a pretrained SiT can be finetuned for a downstream classification task on small-scale datasets, consisting of a few thousand images rather than several million. The proposed approach is evaluated on standard datasets using common protocols. The results demonstrate the strength of transformers and their suitability for self-supervised learning. We outperform existing self-supervised learning methods by a large margin. We also observe that SiT is good for few-shot learning, and we show that it learns useful representations by simply training a linear classifier on top of the features learned by SiT. Pretraining, finetuning, and evaluation code will be available at: https://github.com/Sara-Ahmed/SiT.
