Abstract
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by using specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
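
To illustrate how a self-attention mask alone can switch a shared transformer between the two objectives, the following is a minimal sketch and not the released implementation. It assumes an input sequence made of image regions followed by text tokens, omits special tokens, and assumes that in the seq2seq setting image regions attend only to each other while each text token attends to all image regions and to text up to and including itself. The function name `build_attention_mask` and its parameters are hypothetical and chosen for illustration only.

```python
def build_attention_mask(num_regions, num_tokens, mode):
    """Build a square 0/1 self-attention mask (illustrative sketch).

    mask[i][j] == 1 means position i may attend to position j.
    Positions 0..num_regions-1 are image regions; the rest are text tokens.
    """
    total = num_regions + num_tokens

    if mode == "bidirectional":
        # Every position conditions on the full image-text context.
        return [[1] * total for _ in range(total)]

    if mode == "seq2seq":
        mask = [[0] * total for _ in range(total)]
        for i in range(total):
            for j in range(total):
                if j < num_regions:
                    # Image regions are visible to every position.
                    mask[i][j] = 1
                elif i >= num_regions and j <= i:
                    # A text token sees itself and earlier text only.
                    mask[i][j] = 1
        return mask

    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Print both masks for 2 image regions and 3 text tokens.
    for mode in ("bidirectional", "seq2seq"):
        print(mode)
        for row in build_attention_mask(2, 3, mode):
            print(" ".join(str(v) for v in row))
```

Under these assumptions the network weights are identical in both settings; only the mask passed to self-attention changes, which is what allows a single model to serve both understanding and generation tasks.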