Self-supervision and natural language supervision have emerged as two exciting ways to train general-purpose image encoders that excel at a variety of downstream tasks. Recent works such as M3AE and SLIP have suggested that these approaches can be effectively combined, but their results notably use small pre-training datasets (<50M samples) and do not effectively reflect the large-scale regime (>100M examples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much-needed clarity into the effectiveness (or lack thereof) of self-supervision for large-scale image-text training.