Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Abstract

Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations also set new state-of-the-art results on Flickr30K and MSCOCO benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
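
To make the dual-encoder idea concrete, the following is a minimal NumPy sketch of a contrastive objective over a batch of paired image and text embeddings. The abstract does not specify the exact loss, so this assumes a symmetric in-batch softmax formulation over L2-normalized embeddings, in which each image's matching text is the positive and every other text in the batch is a negative (and vice versa). The function names, the temperature value of 0.1, and the use of precomputed embeddings in place of real encoders are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric in-batch contrastive loss (assumed form, not the paper's exact loss).

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    temperature: hypothetical fixed value; it could equally be a learned parameter.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)

    # logits[i, j] is the scaled cosine similarity between image i and text j.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-text: normalize over texts (columns), pick the matched text.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: normalize over images (rows), pick the matched image.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return loss_i2t + loss_t2i


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))  # stand-in for encoder outputs
    print("aligned pairs:   ", contrastive_loss(emb, emb))
    print("misaligned pairs:", contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

In this sketch, correctly paired embeddings yield a lower loss than pairs shifted by one position, which is the signal that would pull matching image and text representations together during training.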
