SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning

Abstract

Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in web-crawled data, and are ill-suited for the distinct challenges of synthetic data, where captions are typically well-formed but images may be inaccurate representations. To address this gap, we introduce SynC, a novel framework specifically designed to refine synthetic image-caption datasets for ZIC. Instead of conventional filtering or regeneration, SynC focuses on reassigning captions to the most semantically aligned images already present within the synthetic image pool. Our approach employs a one-to-many mapping strategy by initially retrieving multiple relevant candidate images for each caption. We then apply a cycle-consistency-inspired alignment scorer that selects the best image by verifying its ability to retrieve the original caption via image-to-text retrieval. Extensive evaluations demonstrate that SynC consistently and significantly improves performance across various ZIC models on standard benchmarks (MS-COCO, Flickr30k, NoCaps), achieving state-of-the-art results in several scenarios. SynC offers an effective strategy for curating refined synthetic data to enhance ZIC.
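
The two-stage procedure described above (one-to-many candidate retrieval, then a cycle-consistency check) can be illustrated with a minimal sketch. The abstract does not specify the encoder, the similarity measure, or the exact scoring formula, so everything concrete here is an assumption made for illustration: the function name `reassign_captions`, the parameter `num_candidates`, the use of cosine similarity over precomputed embeddings in a shared space, and the choice of reciprocal rank as the alignment score. The toy 2-D vectors stand in for real image and caption embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def reassign_captions(caption_embs, image_embs, num_candidates=2):
    """Return, for each caption index, the index of the selected image.

    Assumptions (not taken from the paper): embeddings share one space,
    and the alignment score is the reciprocal rank of the original
    caption when a candidate image is used as an image-to-text query.
    """
    assignments = []
    for c_idx, c_emb in enumerate(caption_embs):
        # Stage 1 (one-to-many): text-to-image retrieval of top candidates.
        sims = [cosine(c_emb, i_emb) for i_emb in image_embs]
        order = sorted(range(len(image_embs)), key=lambda i: -sims[i])
        candidates = order[:num_candidates]

        # Stage 2 (cycle check): query all captions from each candidate
        # image and see how highly the original caption is ranked.
        best_image, best_score = None, -1.0
        for i_idx in candidates:
            back = [cosine(image_embs[i_idx], other) for other in caption_embs]
            rank = 1 + sum(1 for s in back if s > back[c_idx])
            score = 1.0 / rank
            # Strict comparison keeps the earlier (more similar) candidate on ties.
            if score > best_score:
                best_image, best_score = i_idx, score
        assignments.append(best_image)
    return assignments

# Toy pool: image k was "generated" from caption k, but images 0 and 2
# drifted so that each matches the other's caption better.
captions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
images = [[0.6, 0.8], [0.1, 1.0], [0.9, 0.2]]

print(reassign_captions(captions, images))  # prints [2, 1, 0]
```

In this toy pool, captions 0 and 2 swap images while caption 1 keeps its own, which mirrors the idea of reassigning captions within the existing image pool rather than filtering pairs out or regenerating images.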
