Abstract
The recent emergence of Large Vision-Language Models (VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions, each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at https://github.com/maum-ai/KOFFVQA
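As a rough illustration of grading against pre-determined criteria, the following minimal sketch builds a judge prompt from a per-question rubric and parses the score the judge returns. It is not taken from the released evaluation code: the names `Item`, `build_judge_prompt`, and `parse_score`, the prompt wording, the point values, and the `Score: <n>` reply format are all assumptions made for this example, and the sketch handles only the text side of the judging step.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Item:
    """One benchmark entry (hypothetical layout, not the released format)."""
    question: str
    criteria: list[tuple[str, int]]  # (rule text, points if the rule is met)


def build_judge_prompt(item: Item, response: str) -> str:
    """Format the question, the model response, and the fixed rubric."""
    rules = "\n".join(f"- {rule} (+{pts})" for rule, pts in item.criteria)
    max_score = sum(pts for _, pts in item.criteria)
    return (
        f"Question: {item.question}\n"
        f"Response: {response}\n"
        f"Grading criteria:\n{rules}\n"
        "Award points only as the criteria specify. "
        f"Reply with 'Score: <integer between 0 and {max_score}>'."
    )


def parse_score(judge_output: str, max_score: int) -> int | None:
    """Extract the score; return None if it is missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= max_score else None
```

Because the rubric fixes what earns points, the judge's task reduces to checking the response against explicit rules, which is the property that would let a small judge model grade consistently.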