Distinctive Image Captioning via CLIP Guided Group Optimization

Abstract

Image captioning models are usually trained on human-annotated ground-truth captions, which tends to produce accurate but generic captions. In this paper, we focus on generating distinctive captions that can distinguish the target image from other similar images. To evaluate the distinctiveness of captions, we introduce a series of metrics that use the large-scale vision-language pre-training model CLIP to quantify distinctiveness. To further improve the distinctiveness of captioning models, we propose a simple and effective training strategy that trains the model by comparing the target image with a group of similar images and optimizing the group embedding gap. Extensive experiments on various baseline models demonstrate the wide applicability of our strategy and the consistency of the metric results with human evaluation. Comparing our best model with existing state-of-the-art models, we claim that it achieves a new state of the art on the distinctiveness objective.
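
To make the idea of a group embedding gap concrete, the following is a minimal sketch of one plausible reading: the caption's similarity to the target image minus its mean similarity to the group of similar images. This is an assumption for illustration, not necessarily the exact formulation used in the paper. The function names `cosine` and `group_embedding_gap` are hypothetical, and small hand-written vectors stand in for real CLIP text and image embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_embedding_gap(caption_emb, target_img_emb, similar_img_embs):
    """Assumed definition: similarity to the target image minus the
    mean similarity to the similar-image group. Larger means the
    caption matches the target more than it matches its neighbours."""
    target_sim = cosine(caption_emb, target_img_emb)
    group_sim = sum(cosine(caption_emb, e) for e in similar_img_embs) / len(
        similar_img_embs
    )
    return target_sim - group_sim


# Toy vectors standing in for CLIP embeddings (hypothetical values).
target_image = [1.0, 0.0, 0.0]
similar_images = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

generic_caption = [1.0, 1.0, 1.0]      # matches every image equally
distinctive_caption = [1.0, 0.1, 0.1]  # matches mostly the target image

generic_gap = group_embedding_gap(generic_caption, target_image, similar_images)
distinctive_gap = group_embedding_gap(
    distinctive_caption, target_image, similar_images
)
```

Under this reading, a generic caption that fits all images in the group equally well has a gap near zero, while a caption that fits the target image better than its neighbours has a clearly positive gap. A training objective could reward larger gaps, and an evaluation metric could report them.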
