Abstract
Despite the success of a number of recent techniques for visual self-supervised deep learning, there remains limited investigation into the representations that are ultimately learned. Using recent advances in comparing neural representations, we explore this direction by comparing a contrastive self-supervised algorithm (SimCLR) to supervised learning on simple image data in a common architecture. We find that the methods learn similar intermediate representations through dissimilar means, and that the representations diverge rapidly in the final few layers. We investigate this divergence, finding that it is caused by these layers strongly fitting to the distinct learning objectives. We also find that SimCLR's objective implicitly fits the supervised objective in intermediate layers, but that the reverse is not true. Our work particularly highlights the importance of the learned intermediate representations, and raises important questions for auxiliary task design.