Understanding Self-supervised Learning Dynamics without Contrastive Pairs

Abstract

Contrastive approaches to self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point (positive pairs) and maximizing the distance between views of different data points (negative pairs). However, recent approaches like BYOL and SimSiam show remarkable performance without negative pairs, raising a fundamental theoretical question: how can SSL with only positive pairs avoid representational collapse? We study the nonlinear learning dynamics of non-contrastive SSL in simple linear networks. Our analysis yields conceptual insights into how non-contrastive SSL methods learn, how they avoid representational collapse, and how multiple factors, like predictor networks, stop-gradients, exponential moving averages, and weight decay, all come into play. Our simple theory recapitulates the results of real-world ablation studies in both STL-10 and ImageNet. Furthermore, motivated by our theory, we propose a novel approach that directly sets the predictor based on the statistics of its inputs. In the case of linear predictors, our approach outperforms gradient training of the predictor by 5%, and on ImageNet it performs comparably with more complex two-layer non-linear predictors that employ BatchNorm. Code is released at https://github.com/facebookresearch/luckmatters/tree/master/ssl.
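
To make the collapse question concrete, the following is a minimal illustrative sketch and not the paper's method or code. It assumes a squared-distance positive-pair loss and toy two-dimensional inputs. The function names (positive_pair_loss, ema_update, collapsed_encoder, identity_encoder) and all numbers are hypothetical. It shows that an encoder mapping every input to the same point trivially reaches zero positive-pair loss, which is why the absence of negative pairs raises the question, and it shows the form of an exponential moving average update of target weights.

```python
# Illustrative sketch only: names and numbers are assumptions, not taken from the paper.

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def positive_pair_loss(encoder, view_pairs):
    """Mean squared distance between the encodings of two views of each sample."""
    total = sum(squared_distance(encoder(x1), encoder(x2)) for x1, x2 in view_pairs)
    return total / len(view_pairs)


def ema_update(target_weights, online_weights, tau):
    """Exponential moving average: target <- tau * target + (1 - tau) * online."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_weights, online_weights)]


def collapsed_encoder(x):
    """Hypothetical degenerate encoder that maps every input to the same point."""
    return [0.0, 0.0]


def identity_encoder(x):
    """Hypothetical encoder that returns its input unchanged."""
    return list(x)


# Two toy pairs of "augmented views" (made-up numbers).
pairs = [([1.0, 2.0], [1.5, 2.0]), ([-3.0, 0.0], [-3.0, 0.5])]

# The collapsed encoder minimizes the positive-pair loss without learning anything.
collapsed_loss = positive_pair_loss(collapsed_encoder, pairs)
identity_loss = positive_pair_loss(identity_encoder, pairs)

# One EMA step moving hypothetical target weights toward the online weights.
new_target = ema_update([0.0, 0.0], [1.0, 2.0], tau=0.5)

print(collapsed_loss, identity_loss, new_target)
```

In this toy setting the collapsed encoder scores a lower loss than the informative one, so a positive-pair objective alone does not rule out collapse. The abstract attributes the avoidance of collapse in practice to the interplay of the predictor, stop-gradient, exponential moving average, and weight decay.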
