Abstract
We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. To achieve this, we carefully design an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for individual objects within images, enabling models to learn more granular relationships between objects and their contexts. This approach results in 167,254 detailed captions, with an average of 22.94 words per caption. Building on Pix2Cap-COCO, we introduce a novel task, panoptic segmentation-captioning, which challenges models to recognize instances in an image and simultaneously provide a detailed description for each. To benchmark this task, we design a robust baseline based on X-Decoder. The experimental results demonstrate that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel in both fine-grained visual understanding and detailed language generation. Furthermore, we leverage Pix2Cap-COCO for Supervised Fine-Tuning (SFT) on large multimodal models (LMMs) to enhance their performance. For example, training with Pix2Cap-COCO significantly improves the performance of GPT4RoI, yielding gains of +1.4% in CIDEr, +0.4% in ROUGE, and +0.5% in SPICE on the Visual Genome dataset, and strengthens its region understanding ability on ViP-BENCH, with an overall improvement of +5.1%, including notable increases in recognition accuracy (+11.2%) and language generation quality (+22.2%).