What You See is What You Read? Improving Text-Image Alignment Evaluation

Abstract

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
