Abstract
Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., "Relevant" vs. "Not Relevant", is a fundamental problem. However, this is a challenging task, considering that texts have diverse formats and the definition of relevancy varies across scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice for building such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt at binary image-text relevancy evaluation with an MLLM. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. In addition, we propose a novel binary relevancy data set that covers various tasks. Experimental results validate the effectiveness of our framework.