KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

Abstract

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performance on a broad range of vision-language tasks after finetuning. Previous mainstream VLP approaches typically adopt a two-step strategy that relies on external object detectors to encode images in a multi-modal Transformer framework, which suffers from a restrictive object concept space, limited image context, and inefficient computation. In this paper, we propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly. More importantly, we propose to perform object knowledge distillation to facilitate learning cross-modal alignment at different semantic levels. To achieve that, we design two novel pretext tasks that take object features and their semantic labels from external detectors as supervision: (1) an object-guided masked vision modeling task, which enforces object-aware representation learning in the multi-modal Transformer; and (2) a phrase-region alignment task, which improves cross-modal alignment by utilizing the similarities between noun phrases and object labels in the linguistic space. Extensive experiments on a wide range of vision-language tasks demonstrate the efficacy of our proposed framework, and we achieve competitive or superior performance compared with existing pretraining strategies.
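
The phrase-region alignment idea can be illustrated with a small sketch. The abstract only states that similarities between noun phrases and object labels in the linguistic space are used as supervision; everything else below is an assumption made for illustration: the use of cosine similarity, the softmax with a temperature, the function names (`cosine`, `softmax`, `phrase_region_targets`), and the toy two-dimensional embeddings. It is not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def phrase_region_targets(phrase_vec, label_vecs, temperature=0.1):
    """Soft alignment targets for one noun phrase over detected regions.

    Assumption: each region is represented by the text embedding of its
    detector label, and the target distribution is a softmax over the
    phrase-to-label cosine similarities.
    """
    sims = [cosine(phrase_vec, lv) for lv in label_vecs]
    return softmax(sims, temperature)

# Hypothetical toy embeddings: one noun phrase and three object labels.
phrase = [1.0, 0.0]
labels = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
targets = phrase_region_targets(phrase, labels)
```

Under these assumptions, `targets` is a probability distribution over regions that puts the most mass on the region whose label is closest to the phrase; a model's predicted phrase-region alignment could then be trained against it, for example with a cross-entropy or KL-divergence loss.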
