Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Abstract

Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representations of each image can span a hierarchy of granularities, from simple to comprehensive: an individual label, a phrase, and a natural sentence, we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification (MOC), Masked Region Phrase Generation (MRPG), Image-Sentence Matching (ISM), and Masked Sentence Generation (MSG). In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it to four vision-language perception and generation downstream tasks.
