Abstract
Vision-language pre-training (VLP) on large-scale image-text pairs has achieved great success on cross-modal downstream tasks. Most existing pre-training methods adopt a two-step training procedure, which first employs a pre-trained object detector to extract region-based visual features, then concatenates the image representation and text embedding as the input to a Transformer for training. However, these methods face two problems: they rely on the task-specific visual representation of a particular object detector for generic cross-modal understanding, and the two-stage pipeline is computationally inefficient. In this paper, we propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, in which we build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text. We incorporate the tasks of object detection and image captioning into pre-training with a unified Transformer encoder-decoder architecture to enhance visual learning. An extensive set of experiments has been conducted on well-established vision-language downstream tasks to demonstrate the effectiveness of this novel VLP paradigm.