ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions

Abstract

Asking insightful questions is crucial for acquiring knowledge and expanding our understanding of the world. However, the importance of questioning has been largely overlooked in AI research, where models have been primarily developed to answer questions. With the recent advancements of large language models (LLMs) like ChatGPT, we discover their capability to ask high-quality questions when provided with a suitable prompt. This discovery presents a new opportunity to develop an automatic questioning system. In this paper, we introduce ChatCaptioner, a novel automatic-questioning method deployed in image captioning. Here, ChatGPT is prompted to ask a series of informative questions about images to BLIP-2, a strong visual question-answering model. By continually acquiring new visual information from BLIP-2's answers, ChatCaptioner is able to generate more enriched image descriptions. We conduct human-subject evaluations on common image caption datasets such as COCO, Conceptual Captions, and WikiArt, and compare ChatCaptioner with BLIP-2 as well as ground truth. Our results demonstrate that ChatCaptioner's captions are significantly more informative, receiving three times as many votes from human evaluators for providing the most image information. In addition, ChatCaptioner identifies 53% more objects within the image than BLIP-2 alone, as measured by WordNet synset matching. Code is available at https://github.com/Vision-CAIR/ChatCaptioner
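
The questioning loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function chat_caption, its three callables (ask, answer, summarize), and the toy stand-ins are all hypothetical names introduced here. In a real system, ask and summarize would wrap calls to a language model such as ChatGPT, and answer would wrap a visual question-answering model such as BLIP-2 applied to a fixed image.

```python
from typing import Callable, List, Tuple

# A dialogue is an ordered list of (question, answer) pairs.
Dialogue = List[Tuple[str, str]]


def chat_caption(
    ask: Callable[[Dialogue], str],        # hypothetical: questioner (e.g. an LLM)
    answer: Callable[[str], str],          # hypothetical: VQA model bound to one image
    summarize: Callable[[Dialogue], str],  # hypothetical: turns the dialogue into a caption
    n_rounds: int = 3,
) -> Tuple[str, Dialogue]:
    """Run n_rounds of question/answer exchange, then summarize into a caption."""
    dialogue: Dialogue = []
    for _ in range(n_rounds):
        # The questioner sees the full history so it can ask something new.
        question = ask(dialogue)
        reply = answer(question)
        dialogue.append((question, reply))
    return summarize(dialogue), dialogue


# Toy stand-ins so the sketch runs without any model or network access.
_QUESTIONS = ["What is the main subject?", "What color is it?", "Where is it?"]
_ANSWERS = {
    "What is the main subject?": "a dog",
    "What color is it?": "brown",
    "Where is it?": "on a beach",
}


def toy_ask(dialogue: Dialogue) -> str:
    # Pick the next scripted question based on how many rounds have passed.
    return _QUESTIONS[len(dialogue) % len(_QUESTIONS)]


def toy_answer(question: str) -> str:
    # Look up a canned answer; fall back to an explicit "not sure".
    return _ANSWERS.get(question, "not sure")


def toy_summarize(dialogue: Dialogue) -> str:
    # Join the answers in order; a real summarizer would write fluent prose.
    return "; ".join(reply for _, reply in dialogue)


caption, dialogue = chat_caption(toy_ask, toy_answer, toy_summarize, n_rounds=3)
print(caption)
```

Passing the models in as plain callables keeps the loop independent of any particular API, so the same structure can be exercised with scripted stubs, as here, before real models are plugged in.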
