CapWAP: Captioning with a Purpose

Abstract

The traditional image captioning task uses generic reference captions to provide textual information about images. Different user populations, however, will care about different visual aspects of images. In this paper, we propose a new task, Captioning with a Purpose (CapWAP). Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population, rather than merely provide generic information about an image. In this task, we use question-answer (QA) pairs from users, a natural expression of information need, instead of reference captions, for both training and post-inference evaluation. We show that it is possible to use reinforcement learning to directly optimize for the intended information need, by rewarding outputs that allow a question answering model to provide correct answers to sampled user questions. We convert several visual question answering datasets into CapWAP datasets, and demonstrate that under a variety of scenarios our purposeful captioning system learns to anticipate and fulfill specific information needs better than its generic counterparts, as measured by QA performance on user questions from unseen images, when using the caption alone as context.
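
The reward described above can be illustrated with a minimal sketch. It is not the paper's implementation: the token-level F1 metric, the function names (`token_f1`, `caption_reward`), and the rule-based `toy_qa_model` are all assumptions made for illustration. The only idea taken from the abstract is that a caption is scored by how well a question answering model can answer sampled user questions when given the caption alone as context.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between two answers (an assumed scoring metric)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def caption_reward(
    caption: str,
    qa_pairs: List[Tuple[str, str]],
    qa_model: Callable[[str, str], str],
) -> float:
    """Average answer quality when the QA model sees only the caption."""
    if not qa_pairs:
        return 0.0
    scores = [token_f1(qa_model(q, caption), a) for q, a in qa_pairs]
    return sum(scores) / len(scores)


# Hypothetical stand-in for a trained reading-comprehension model:
# it answers colour questions by returning the first colour word in the caption.
COLORS = {"red", "blue", "green"}


def toy_qa_model(question: str, caption: str) -> str:
    if "color" in question.lower():
        for token in caption.lower().split():
            if token in COLORS:
                return token
    return ""


qa_pairs = [("What color is the bus?", "red")]
purposeful = caption_reward("a red bus on a street", qa_pairs, toy_qa_model)
generic = caption_reward("a bus on a street", qa_pairs, toy_qa_model)
```

In this toy setting the caption that mentions the colour earns a higher reward than the generic one, which is the kind of signal a reinforcement learning objective could use to steer a captioner toward a population's information needs.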
