Abstract
Multimodal large language models (MLLMs) have emerged as powerful tools for visual question answering (VQA), enabling reasoning and contextual understanding across visual and textual modalities. Despite their advancements, the evaluation of MLLMs on VQA benchmarks often relies on point estimates, overlooking the significant variance in performance caused by factors such as stochastic model outputs, training seed sensitivity, and hyperparameter configurations. This paper critically examines these issues by analyzing variance across 14 widely used VQA benchmarks, covering diverse tasks such as visual reasoning, text understanding, and commonsense reasoning. We systematically study the impact of training seed, framework non-determinism, model scale, and extended instruction finetuning on performance variability. Additionally, we explore Cloze-style evaluation as an alternate assessment strategy, studying its effectiveness in reducing stochasticity and improving reliability across benchmarks. Our findings highlight the limitations of current evaluation practices and advocate for variance-aware methodologies to foster more robust and reliable development of MLLMs.