GroundCap: A Visually Grounded Image Captioning Dataset

Abstract

Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking, and present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded on detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and segmentation of background elements through K-means clustering. We propose gMETEOR, a metric combining caption quality with grounding accuracy, and establish baseline performance by fine-tuning Pixtral-12B. Human evaluation demonstrates our approach's effectiveness in producing verifiable descriptions with coherent object references.
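To make the idea of persistent object IDs and action-object linking concrete, the following is a minimal sketch, not the dataset's actual format. The abstract does not specify the tag syntax, so the `<obj id="...">` and `<act id="...">` tags, the ID strings, and the function names below are all hypothetical and chosen only for illustration. The sketch groups every mention and every action under the ID of the object it refers to, which is what allows a repeated reference such as "he" to be traced back to the same detected object.

```python
import re
from collections import defaultdict

# Hypothetical tag syntax, invented for this illustration only:
#   <obj id="person-0">text</obj>  marks a mention of an object
#   <act id="person-0">text</act>  marks an action performed by that object
TAG = re.compile(r'<(obj|act) id="([^"]+)">(.*?)</\1>')


def parse_grounded_caption(caption):
    """Group object mentions and linked actions by persistent object ID."""
    groups = defaultdict(lambda: {"mentions": [], "actions": []})
    for kind, obj_id, text in TAG.findall(caption):
        key = "mentions" if kind == "obj" else "actions"
        groups[obj_id][key].append(text)
    return dict(groups)


def plain_text(caption):
    """Strip the tags and return the caption as ordinary prose."""
    return TAG.sub(lambda m: m.group(3), caption)


# Assumed example caption: the ID "person-0" is reused for "A man" and "he".
caption = (
    '<obj id="person-0">A man</obj> <act id="person-0">walks</act> toward '
    '<obj id="car-0">a car</obj>; <obj id="person-0">he</obj> '
    '<act id="person-0">opens</act> its door.'
)

parsed = parse_grounded_caption(caption)
```

Here `parsed["person-0"]` collects both mentions of the same person together with the two actions linked to that person, while `plain_text(caption)` recovers the untagged sentence.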
