Understanding the Role of Nonlinearity in Training Dynamics of Contrastive Learning

Abstract

While the empirical success of self-supervised learning (SSL) heavily relies on the usage of deep nonlinear models, existing theoretical work on understanding SSL still focuses on linear ones. In this paper, we study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one- and two-layer nonlinear networks with homogeneous activation $h(x) = h'(x)x$. We have two major theoretical discoveries. First, the presence of nonlinearity can lead to many local optima even in the 1-layer setting, each corresponding to certain patterns from the data distribution, while with linear activation, only one major pattern can be learned. This suggests that models with lots of parameters can be regarded as a \emph{brute-force} way to find these local optima induced by nonlinearity. Second, in the 2-layer case, linear activation is proven not capable of learning specialized weights into diverse patterns, demonstrating the importance of nonlinearity. In addition, for the 2-layer setting, we also discover \emph{global modulation}: those local patterns discriminative from the perspective of global-level patterns are prioritized to learn, further characterizing the learning process. Simulation verifies our theoretical findings.
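
The homogeneity condition $h(x) = h'(x)x$ can be checked numerically for concrete activations. The sketch below is illustrative only and is not taken from the paper: the choice of ReLU and leaky ReLU as examples, the tanh counterexample, the helper name `is_homogeneous`, the slope value, and the test points are all assumptions made here. At $x = 0$ the ReLU derivative is taken to be 0 by convention.

```python
import math

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Convention (assumed here): derivative at x = 0 is taken to be 0.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, slope=0.1):
    return x if x > 0 else slope * x

def leaky_relu_grad(x, slope=0.1):
    return 1.0 if x > 0 else slope

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def is_homogeneous(h, h_grad, points, tol=1e-12):
    """Hypothetical helper: check h(x) == h'(x) * x on the given points."""
    return all(abs(h(x) - h_grad(x) * x) <= tol for x in points)

points = [-2.0, -0.5, 0.0, 0.5, 3.0]

relu_ok = is_homogeneous(relu, relu_grad, points)               # condition holds
leaky_ok = is_homogeneous(leaky_relu, leaky_relu_grad, points)  # condition holds
tanh_ok = is_homogeneous(math.tanh, tanh_grad, points)          # condition fails

print(relu_ok, leaky_ok, tanh_ok)
```

Piecewise-linear activations that pass through the origin satisfy the condition on each linear piece, whereas a saturating activation such as tanh does not, so it falls outside the class described by the formula.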
