Abstract
One property that remains lacking in image captions generated by contemporary methods is discriminability: being able to tell two images apart given the caption for one of them. We propose a way to improve this aspect of caption generation. By incorporating into the captioning training objective a loss component directly related to the ability of a machine to disambiguate image/caption matches, we obtain systems that produce much more discriminative captions, according to human evaluation. Remarkably, our approach also leads to improvement in other aspects of the generated captions, as reflected by a battery of standard scores such as BLEU and SPICE. Our approach is modular and can be applied to a variety of model/loss combinations commonly proposed for image captioning.