Abstract
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method requires neither bounding box annotations nor high-resolution images. To improve learning from noisy web data, we propose momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR$^2$, ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-of-the-art, while enjoying faster inference speed. Code and pre-trained models are available at https://github.com/salesforce/ALBEF/.
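
To make the "align before fuse" idea concrete, the following is a minimal sketch of a symmetric image-text contrastive loss of the general InfoNCE kind. It is an illustration under stated assumptions, not the released ALBEF implementation: the function names, the fixed temperature of 0.07, and the use of in-batch negatives only are all hypothetical choices made for this example, and momentum distillation is not modeled.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    """Negative log-softmax of logits[target], computed stably via log-sum-exp."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumption: image_embs[k] and text_embs[k] form the k-th matched pair,
    and every other item in the batch serves as a negative.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)

    # sim[i][j] = cosine similarity of image i and text j, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
           for img in imgs]

    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = sum(_cross_entropy(sim[k], k) for k in range(n)) / n
    # Text-to-image: same objective over the columns.
    t2i = sum(_cross_entropy([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

In this sketch the loss approaches zero when each image embedding is closest to its own caption and grows when pairs are mismatched, which is the alignment signal the abstract describes as being applied before cross-modal fusion.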