AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

Abstract

Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information, facilitating a wide range of complex applications and tasks. However, evaluating LVLMs presents significant challenges: benchmarks demand substantial human effort to construct and, once built, remain static and inflexible. Although automatic evaluation has been explored in the textual modality, the visual modality remains under-explored. In this work, we therefore address the question: "Can LVLMs serve as a path to automatic benchmarking?" We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability. Upon receiving an evaluation capability, AutoBench-V leverages text-to-image models to generate relevant image samples and then utilizes LVLMs to orchestrate visual question-answering (VQA) tasks, completing the evaluation process efficiently and flexibly. Through an extensive evaluation of seven popular LVLMs across five user-requested inputs (i.e., evaluation capabilities), the framework shows effectiveness and reliability. We observe the following: (1) our constructed benchmark accurately reflects varying task difficulties; (2) as task difficulty rises, the performance gap between models widens; (3) while models exhibit strong performance in abstract-level understanding, they underperform in detail-reasoning tasks; and (4) constructing a dataset with varying levels of difficulty is critical for a comprehensive and exhaustive evaluation. Overall, AutoBench-V not only successfully utilizes LVLMs for automated benchmarking but also reveals that LVLMs as judges have significant potential in various domains.
