Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

Abstract

We propose Pixel-BERT to align image pixels with text by deep multi-modaltransformers that jointly learn visual and language embedding in a unifiedend-to-end framework. We aim to build a more accurate and thorough connectionbetween image pixels and language semantics directly from image and sentencepairs instead of using region-based image features as the most recent visionand language tasks. Our Pixel-BERT which aligns semantic connection in pixeland text level solves the limitation of task-specific visual representation forvision and language tasks. It also relieves the cost of bounding boxannotations and overcomes the unbalance between semantic labels in visual taskand language semantic. To provide a better representation for down-streamtasks, we pre-train a universal end-to-end model with image and sentence pairsfrom Visual Genome dataset and MS-COCO dataset. We propose to use a randompixel sampling mechanism to enhance the robustness of visual representation andto apply the Masked Language Model and Image-Text Matching as pre-trainingtasks. Extensive experiments on downstream tasks with our pre-trained modelshow that our approach makes the most state-of-the-arts in downstream tasks,including Visual Question Answering (VQA), image-text retrieval, NaturalLanguage for Visual Reasoning for Real (NLVR). Particularly, we boost theperformance of a single model in VQA task by 2.17 points compared with SOTAunder fair comparison.

Quick Read (beta)

loading the full paper ...