Abstract
Contrastively trained image-text models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these image-text models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments of language-image training.