Self Supervision Does Not Help Natural Language Supervision at Scale

Abstract

Self supervision and natural language supervision have emerged as two exciting ways to train general purpose image encoders which excel at a variety of downstream tasks. Recent works such as M3AE and SLIP have suggested that these approaches can be effectively combined, but most notably their results use small pre-training datasets (<50M samples) and don't effectively reflect the large-scale regime (>100M examples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state of the art approaches, masked auto-encoders (MAE) and contrastive language image pre-training (CLIP), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much needed clarity into the effectiveness (or lack thereof) of self supervision for large-scale image-text training.
