Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Abstract

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language through pre-training. Borrowing ideas from cross-lingual pre-trained models such as XLM and Unicoder, we feed both visual and linguistic content into a multi-layer Transformer for cross-modal pre-training, using three pre-training tasks: masked language modeling, masked object label prediction, and visual-linguistic matching. The first two tasks learn context-aware representations for input tokens based on linguistic and visual content jointly. The last task predicts whether an image and a text describe each other. After pre-training on a large number of image-caption pairs, we transfer Unicoder-VL to image-text retrieval tasks with just one additional output layer and achieve state-of-the-art performance on both MSCOCO and Flickr30K.
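
As a rough illustration of how the three pre-training tasks could supply targets for a single image-caption pair, the following minimal sketch builds one training example. It is not the paper's implementation: the names `mask_sequence` and `build_example`, the `[MASK]` placeholder, and the default masking probability of 0.15 are all assumptions made for illustration, and detected object labels stand in for the region features a real model would consume.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked position


def mask_sequence(items, mask_prob, rng):
    """Mask each item independently with probability mask_prob.

    Returns (masked_items, targets); targets[i] holds the original item
    at masked positions and None everywhere else.
    """
    masked, targets = [], []
    for item in items:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(item)
        else:
            masked.append(item)
            targets.append(None)
    return masked, targets


def build_example(caption_tokens, object_labels, matched, mask_prob=0.15, seed=0):
    """Assemble inputs and targets for the three tasks (illustrative only).

    mask_prob=0.15 is an assumed value, not taken from the abstract.
    """
    rng = random.Random(seed)
    text_in, mlm_targets = mask_sequence(caption_tokens, mask_prob, rng)
    obj_in, mol_targets = mask_sequence(object_labels, mask_prob, rng)
    return {
        "text_input": text_in,             # caption tokens, some masked
        "mlm_targets": mlm_targets,        # masked language modeling
        "object_input": obj_in,            # object labels standing in for regions
        "mol_targets": mol_targets,        # masked object label prediction
        "vlm_target": 1 if matched else 0, # visual-linguistic matching
    }


if __name__ == "__main__":
    example = build_example(
        ["a", "dog", "runs", "on", "grass"],
        ["dog", "grass"],
        matched=True,
    )
    for key, value in example.items():
        print(key, value)
```

In this sketch, a mismatched pair for the matching task would be formed by pairing a caption with object labels from a different image and passing `matched=False`; how negatives are actually sampled is not stated in the abstract.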
