Abstract
Image captioning aims to describe an image with a natural language sentence, allowing powerful language models to understand images. The framework of combining image captioning with language models has been successful on various vision-language tasks. However, an image contains much more information than a single sentence can convey, which leaves underspecified which visual entities the caption should describe. For example, when performing visual question answering (VQA), generic image captions often miss visual details that are essential for the language model to answer correctly. To address this challenge, we propose PromptCap, a captioning model that takes a natural-language prompt to control the contents of the generated caption. The prompt contains a question that the caption should help to answer, and it also supports auxiliary text inputs such as scene text within the image itself. To finetune a general image caption model for prompt-guided captioning, we propose a pipeline to synthesize and filter training examples with GPT-3 and existing VQA datasets. For evaluation, we start with an existing pipeline in which a language model is prompted with image captions to carry out VQA. With the same language model, a higher QA accuracy shows that our generated captions are more relevant to the question prompts. PromptCap outperforms generic captions by a large margin on a variety of VQA tasks and achieves state-of-the-art accuracy of 58.8% on OK-VQA and 58.0% on A-OKVQA. Zero-shot experiments on WebQA show that PromptCap generalizes well to unseen domains.
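
The caption-then-answer flow described above can be illustrated with a minimal sketch. It is not the PromptCap implementation: the prompt wording, the function names, and the `captioner` and `language_model` callables are hypothetical placeholders for whatever captioning model and language model are actually used.

```python
from typing import Any, Callable, Optional


def build_caption_prompt(question: str, scene_text: Optional[str] = None) -> str:
    """Compose the natural-language prompt that steers the captioner.

    The wording is an assumption made for illustration only.
    """
    prompt = f"Describe the image so that this question can be answered: {question}"
    if scene_text:
        # Auxiliary text input, e.g. scene text read from the image.
        prompt += f" Text in the image: {scene_text}"
    return prompt


def build_qa_prompt(caption: str, question: str) -> str:
    """Compose the text-only prompt given to the language model (assumed layout)."""
    return f"Context: {caption}\nQuestion: {question}\nAnswer:"


def answer_with_caption(
    image: Any,
    question: str,
    captioner: Callable[[Any, str], str],   # hypothetical: (image, prompt) -> caption
    language_model: Callable[[str], str],   # hypothetical: prompt -> completion
    scene_text: Optional[str] = None,
) -> str:
    """Caption the image under the question prompt, then answer from the caption alone."""
    caption = captioner(image, build_caption_prompt(question, scene_text))
    return language_model(build_qa_prompt(caption, question)).strip()
```

Because the language model sees only the caption, keeping it fixed and swapping the captioner isolates caption quality: any change in QA accuracy is attributable to how relevant the captions are to the questions.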