Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models

Abstract

Language and Vision-Language Models (LLMs/VLMs) have revolutionized the field of AI through their ability to generate human-like text and understand images, but ensuring their reliability is crucial. This paper evaluates the ability of LLMs (GPT4, GPT-3.5, LLaMA2, and PaLM 2) and VLMs (GPT4V and Gemini Pro Vision) to estimate their verbalized uncertainty via prompting. We propose the new Japanese Uncertain Scenes (JUS) dataset, aimed at testing VLM capabilities via difficult queries and object counting, and the Net Calibration Error (NCE) to measure the direction of miscalibration. Results show that both LLMs and VLMs have a high calibration error and are overconfident most of the time, indicating a poor capability for uncertainty estimation. Additionally, we develop prompts for regression tasks, and we show that VLMs have poor calibration when producing mean/standard deviation and 95% confidence intervals.
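
To illustrate what "direction of miscalibration" could mean in practice, the following is a minimal Python sketch. The abstract does not give the formula for NCE, so the signed quantity below is only an assumption: a bin-weighted average of (accuracy minus mean confidence), shown next to the standard unsigned Expected Calibration Error (ECE). The function name `calibration_errors`, the equal-width binning, and the sign convention (negative means overconfident) are hypothetical choices made for this example, not the paper's definitions.

```python
from typing import List, Sequence, Tuple


def calibration_errors(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> Tuple[float, float]:
    """Return (unsigned_error, signed_error) over equal-width confidence bins.

    unsigned_error is the usual ECE: sum of bin_weight * |accuracy - confidence|.
    signed_error drops the absolute value (an ASSUMED stand-in for a "net"
    calibration error): negative values indicate overconfidence, positive
    values indicate underconfidence.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin by its stated confidence in [0, 1].
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))

    unsigned_error = 0.0
    signed_error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        gap = accuracy - mean_conf
        weight = len(members) / n
        unsigned_error += weight * abs(gap)
        signed_error += weight * gap
    return unsigned_error, signed_error


if __name__ == "__main__":
    # Four answers stated at 90% confidence, only one of them correct.
    ece, signed = calibration_errors([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
    print(f"unsigned: {ece:.2f}, signed: {signed:.2f}")
```

Under these assumptions, a model that states 90% confidence but is right only a quarter of the time gets a large unsigned error and a negative signed error, which is the overconfident pattern the abstract reports. The unsigned value alone would not distinguish this case from an equally large underconfident gap.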
