Abstract
The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We perform a human evaluation study upon which we base our decision on the final metric. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.