Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction

Abstract

This study addresses the critical challenge of hallucination mitigation in Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA) tasks through a Split Conformal Prediction (SCP) framework. While LVLMs excel in multi-modal reasoning, their outputs often exhibit hallucinated content with high confidence, posing risks in safety-critical applications. We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification. By partitioning data into calibration and test sets, the framework computes nonconformity scores to construct prediction sets with statistical guarantees under user-defined risk levels ($\alpha$). Key innovations include: (1) rigorous control of marginal coverage to ensure empirical error rates remain strictly below $\alpha$; (2) dynamic adjustment of prediction set sizes inversely with $\alpha$, filtering low-confidence outputs; (3) elimination of prior distribution assumptions and retraining requirements. Evaluations on benchmarks (ScienceQA, MMMU) with eight LVLMs demonstrate that SCP enforces theoretical guarantees across all $\alpha$ values. The framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains. This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
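
The calibrate-then-predict procedure described in the abstract can be illustrated with a minimal sketch of generic split conformal prediction for multiple-choice answers. It is not the authors' implementation. The nonconformity score used here (one minus the probability assigned to an answer option) is a common choice that the abstract does not confirm, and the function names and toy probabilities are hypothetical.

```python
import math


def nonconformity(probs, answer_idx):
    """Assumed score: 1 - probability the model assigns to the given option."""
    return 1.0 - probs[answer_idx]


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every option.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """Keep every option whose nonconformity score does not exceed the threshold."""
    return [i for i in range(len(probs)) if nonconformity(probs, i) <= threshold]


# Toy calibration split: per-option probabilities and the correct option index.
cal_probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]
cal_labels = [0, 1, 0]
cal_scores = [nonconformity(p, y) for p, y in zip(cal_probs, cal_labels)]

# Calibrate once at risk level alpha, then build a set for a new test question.
threshold = conformal_threshold(cal_scores, alpha=0.5)
test_set = prediction_set([0.65, 0.25, 0.10], threshold)
print(threshold, test_set)
```

Under the usual exchangeability assumption between calibration and test examples, sets built this way contain the correct answer with marginal probability at least $1 - \alpha$. A smaller $\alpha$ raises the threshold and so tends to produce larger sets, which matches the inverse relationship noted in the abstract.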
