Improved baselines for vision-language pre-training

Abstract

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work claims improvements over CLIP using additional non-contrastive losses inspired by self-supervised learning. However, it is sometimes hard to disentangle the contribution of these additional losses from other implementation details, e.g., data augmentation or regularization techniques, used to train the model. To shed light on this matter, in this paper, we first propose, implement and evaluate several baselines obtained by combining contrastive learning with recent advances in self-supervised learning. In particular, we use the loss functions that have proven successful for visual self-supervised learning to align image and text modalities. We find that these baselines outperform a basic implementation of CLIP. However, when a stronger training recipe is employed, the advantage disappears. Indeed, we find that a simple CLIP baseline can also be improved substantially, up to a 25% relative improvement on downstream zero-shot tasks, by using well-known training techniques that are popular in other subfields. Moreover, we discover that it is enough to apply image and text augmentations to make up for most of the improvement attained by prior works. With our improved training recipe for CLIP, we obtain state-of-the-art performance on four standard datasets, and consistently outperform prior work (up to +4% on the largest dataset), while being substantially simpler.
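
For readers unfamiliar with the contrastive loss mentioned above, the following is a minimal sketch of the symmetric image-text objective commonly associated with CLIP. It is not the paper's implementation: the function names, the pure-Python list representation, and the fixed temperature of 0.07 are assumptions chosen for illustration (in practice the temperature is usually a learned parameter and the computation runs on batched tensors).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to come from the same pair;
    every other combination in the batch is treated as a negative.
    The temperature value is an illustrative assumption.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)
```

Under these assumptions, a batch whose image and text embeddings are aligned pair by pair yields a loss close to zero, while a batch whose pairs are shuffled yields a large loss. The additional non-contrastive losses discussed in the abstract would be added on top of an objective of this kind.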
