Abstract
In computer vision, deep neural networks perform best in a supervised setting with a large amount of labeled data. Representations learned with supervision are of high quality and help the model improve its accuracy. However, collecting and annotating a large dataset is costly and time-consuming. To avoid this cost, there has been a lot of research on unsupervised visual representation learning, especially in the self-supervised setting. Among the recent advances in self-supervised methods for visual recognition, Chen et al. show with SimCLR that good-quality representations can indeed be learned without explicit supervision. In SimCLR, the authors maximize the similarity of augmentations of the same image and minimize the similarity of augmentations of different images. A linear classifier trained on the representations learned with this approach yields 76.5% top-1 accuracy on the ImageNet ILSVRC-2012 dataset. In this work, we propose that, with the normalized temperature-scaled cross-entropy (NT-Xent) loss function (as used in SimCLR), it is beneficial not to have images of the same category in the same batch. In an unsupervised setting, the information about which images belong to the same category is missing. We therefore take the latent-space representations of a denoising autoencoder trained on the unlabeled dataset and cluster them with k-means to obtain pseudo labels. With this a priori information, we form batches in which no two images come from the same category. We report comparable performance enhancements on the CIFAR10 dataset and a subset of the ImageNet dataset. We refer to our method as G-SimCLR.
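The batching step can be illustrated with a small sketch. NT-Xent treats every other image in a batch as a negative, so two images from the same category end up pushed apart; keeping pseudo labels distinct within a batch avoids that. The abstract does not say how the batches are assembled, so the greedy procedure below is only one plausible way to do it, and the function name `label_disjoint_batches` and its parameters are hypothetical. The pseudo labels are assumed to be cluster ids such as those produced by k-means on the autoencoder latents; here they are replaced by random integers.

```python
import random
from collections import defaultdict


def label_disjoint_batches(pseudo_labels, batch_size, seed=0):
    """Group sample indices into batches with no repeated pseudo label.

    pseudo_labels: sequence of cluster ids, one per image (assumed input).
    batch_size: maximum number of images per batch.
    Illustrative greedy scheme, not necessarily the authors' procedure.
    """
    rng = random.Random(seed)

    # Bucket the sample indices by pseudo label and shuffle each bucket.
    buckets = defaultdict(list)
    for index, label in enumerate(pseudo_labels):
        buckets[label].append(index)
    for members in buckets.values():
        rng.shuffle(members)

    batches = []
    while buckets:
        # Draw from the fullest buckets first so large clusters drain evenly.
        chosen = sorted(buckets, key=lambda lab: len(buckets[lab]),
                        reverse=True)[:batch_size]
        # One index per chosen label, so labels within a batch are distinct.
        batch = [buckets[lab].pop() for lab in chosen]
        # Drop buckets that have been emptied.
        for lab in chosen:
            if not buckets[lab]:
                del buckets[lab]
        rng.shuffle(batch)
        batches.append(batch)
    return batches


if __name__ == "__main__":
    rng = random.Random(1)
    labels = [rng.randrange(10) for _ in range(50)]  # stand-in for k-means output
    for batch in label_disjoint_batches(labels, batch_size=4)[:3]:
        print(batch, [labels[i] for i in batch])
```

With this scheme a batch can only be full if at least `batch_size` clusters still have images left, so the number of clusters bounds the usable batch size, and the last batches may be smaller when cluster sizes are uneven.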