Abstract
Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find that CLIP pre-trained on such data exhibits notable robustness to the imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. To investigate the reasons behind this finding, we conduct controlled experiments on various underlying factors, and reveal that CLIP's pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP's generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. Code will be available at: https://github.com/CVMI-Lab/clip-beyond-tail.
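
As a rough illustration of the "dynamic classification" idea, the toy sketch below compares a softmax cross-entropy over the full class vocabulary with one computed over the target plus a random subset of the other classes. It is not the authors' implementation: the function names (`cross_entropy`, `subsampled_cross_entropy`), the toy logits, and the uniform sampling of negatives are all assumptions made for illustration. It only shows that a dominant class with an inflated logit contributes to the loss when it happens to fall in the sampled subset.

```python
import math
import random


def cross_entropy(logits, target):
    """Softmax cross-entropy over a dict {class_id: logit} for the given target."""
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_z - logits[target]


def subsampled_cross_entropy(logits, target, num_negatives, rng):
    """Cross-entropy over the target plus a random subset of the other classes.

    Hypothetical helper: uniform sampling of negatives is an assumption here,
    standing in for whichever classes happen to appear in a training batch.
    """
    negatives = [c for c in logits if c != target]
    kept = rng.sample(negatives, num_negatives)
    subset = {c: logits[c] for c in [target] + kept}
    return cross_entropy(subset, target), kept


# Toy logits: class 0 plays the role of a dominant (head) class with an
# inflated logit; classes 1-4 are tail classes with equal logits.
logits = {0: 5.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
rng = random.Random(0)

full = cross_entropy(logits, target=1)
sub, kept = subsampled_cross_entropy(logits, target=1, num_negatives=2, rng=rng)

print("full-vocabulary loss:", full)
print("subset loss:", sub, "with sampled negatives", kept)
```

In this toy setting the subset loss never exceeds the full-vocabulary loss, and whenever class 0 is absent from the sampled subset its inflated logit has no effect on the loss for the tail-class target.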