Abstract
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.