E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning

  • 2021-06-03 12:50:26
  • Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, Fei Huang

Abstract

Vision-language pre-training (VLP) on large-scale image-text pairs has achieved great success on cross-modal downstream tasks. Most existing pre-training methods adopt a two-step training procedure, which first employs a pre-trained object detector to extract region-based visual features, then concatenates the image representation and text embedding as the input to a Transformer for training. However, these methods face two problems: they rely on the task-specific visual representation of a particular object detector for generic cross-modal understanding, and the two-stage pipeline is computationally inefficient. In this paper, we propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where we build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text. We incorporate the tasks of object detection and image captioning into pre-training with a unified Transformer encoder-decoder architecture to enhance visual learning. An extensive set of experiments has been conducted on well-established vision-language downstream tasks to demonstrate the effectiveness of this novel VLP paradigm.

 
