Analyzing Data-Centric Properties for Contrastive Learning on Graphs

Abstract

Recent analyses of self-supervised learning (SSL) find the following data-centric properties to be critical for learning good representations: invariance to task-irrelevant semantics, separability of classes in some latent space, and recoverability of labels from augmented samples. However, given their discrete, non-Euclidean nature, graph datasets and graph SSL methods are unlikely to satisfy these properties. This raises the question: how do graph SSL methods, such as contrastive learning (CL), work well? To systematically probe this question, we perform a generalization analysis for CL when using generic graph augmentations (GGAs), with a focus on data-centric properties. Our analysis yields formal insights into the limitations of GGAs and the necessity of task-relevant augmentations. As we empirically show, GGAs do not induce task-relevant invariances on common benchmark datasets, leading to only marginal gains over naive, untrained baselines. Our theory motivates a synthetic data generation process that enables control over task-relevant information and boasts pre-defined optimal augmentations. This flexible benchmark helps us identify yet unrecognized limitations in advanced augmentation techniques (e.g., automated methods). Overall, our work rigorously contextualizes, both empirically and theoretically, the effects of data-centric properties on augmentation strategies and learning paradigms for graph SSL.
