The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning

Abstract

Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows that the negative term in the contrastive loss can be removed if we add the so-called prediction head to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why, even when there exist trivial collapsed global optimal solutions, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood.

In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterize the substitution effect and the acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. The acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focus only on learning the stronger features; focusing only on the stronger features is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.
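The prediction-head parameterization described in the abstract (identity initialization, with only the off-diagonal entries trainable) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names `identity`, `masked_sgd_step`, and `apply_head`, the learning rate, and the gradient values are all hypothetical and chosen only to show how a diagonal mask keeps the diagonal fixed at 1 during a gradient step.

```python
def identity(n):
    """Return an n x n identity matrix as nested lists (the head's initial value)."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def masked_sgd_step(head, grad, lr):
    """One SGD step on the head that updates off-diagonal entries only.

    Diagonal entries receive a zero update, so they stay at their
    initial value of 1 (assumption: this mirrors "only off-diagonal
    entries being trainable").
    """
    n = len(head)
    return [
        [head[i][j] - (lr * grad[i][j] if i != j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def apply_head(head, z):
    """Apply the linear prediction head to a representation vector z."""
    return [sum(head[i][j] * z[j] for j in range(len(z))) for i in range(len(head))]


# Hypothetical gradient of some loss with respect to the head (made-up numbers).
grad = [
    [0.5, -1.0, 2.0],
    [1.0, 0.5, 0.0],
    [-2.0, 4.0, 0.5],
]

head = identity(3)
head = masked_sgd_step(head, grad, lr=0.1)

# The diagonal gradient entries (0.5) are ignored; only off-diagonals move.
for row in head:
    print(row)
```

At initialization the head is the identity map, so the prediction equals the representation itself; after the masked step, each output neuron mixes in contributions from the other neurons through the off-diagonal weights, which is the channel through which the substitution effect described above would act.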
