Abstract
Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from Mitrovic et al., 2021, we propose ReLICv2 which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views. ReLICv2 achieves 77.1% top-1 classification accuracy on ImageNet using linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Most notably, ReLICv2 is the first representation learning method to consistently outperform the supervised baseline in a like-for-like comparison using a range of standard ResNet architectures. Finally, we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.