"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning

Abstract

Well-formed, context-aware image captions and tags in enterprise content such as marketing material are critical to ensuring brand presence and content recall. Creating and updating them manually is non-trivial given the scale and tedium of the task. We propose a new unified Vision-Language (VL) model based on the One For All (OFA) model, with a focus on context-assisted image captioning, where the caption is generated based on both the image and its context. Our approach aims to overcome the context-independent nature of existing approaches, in which image and text are treated independently. We exploit context by pretraining our model with datasets from three tasks: news image captioning, where the news article is the context; contextual visual entailment; and keyword extraction from the context. The second pretraining task is a new VL task, and we construct and release two datasets for it with 1.1M and 2.2K data instances. Our system achieves state-of-the-art results, with an improvement of up to 8.34 in CIDEr score on the benchmark news image captioning datasets. To the best of our knowledge, ours is the first effort at incorporating contextual information in pretraining models for VL tasks.
