Evaluating Attribute Comprehension in Large Vision-Language Models

Abstract

Large vision-language models have made promising progress on many downstream tasks. However, they still face many challenges in fine-grained visual understanding tasks, such as object attribute comprehension. Moreover, although there have been growing efforts to evaluate large vision-language models, in-depth studies of attribute comprehension and of the vision-language fine-tuning process are still lacking. In this paper, we propose to evaluate the attribute comprehension ability of large vision-language models from two perspectives: attribute recognition and attribute hierarchy understanding. We evaluate three vision-language interactions: visual question answering, image-text matching (ITM), and image-text cosine similarity (ITC). Furthermore, we explore the factors affecting attribute comprehension during fine-tuning. Through a series of quantitative and qualitative experiments, we report three main findings: (1) Large vision-language models possess good attribute recognition ability, but their hierarchical understanding ability is relatively limited. (2) Compared to ITC, ITM exhibits superior capability in capturing finer details, making it more suitable for attribute understanding tasks. (3) The attribute information in the captions used for fine-tuning plays a crucial role in attribute understanding. We hope this work can help guide future progress in fine-grained visual understanding of large vision-language models.
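
As a rough illustration of the image-text cosine similarity interaction mentioned above, the following minimal sketch scores candidate attribute captions against an image embedding and picks the best match. It is not the paper's implementation: the function names, the captions, and the three-dimensional embeddings are all hypothetical, and in practice the embeddings would come from the image and text encoders of a vision-language model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_attributes(image_embedding, caption_embeddings):
    """Return (caption, score) pairs sorted from most to least similar."""
    scores = {
        caption: cosine_similarity(image_embedding, embedding)
        for caption, embedding in caption_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up embeddings (assumption): real ones would be produced by the
# image encoder and text encoder of the model under evaluation.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a red car": [0.8, 0.2, 0.1],
    "a blue car": [0.1, 0.9, 0.2],
}

ranking = rank_attributes(image_embedding, caption_embeddings)
predicted_caption = ranking[0][0]  # caption with the highest similarity
```

In this setup, attribute recognition reduces to checking whether the caption carrying the correct attribute receives the highest score among captions that differ only in that attribute.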
