Abstract
This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 and AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare.
Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification
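For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how a Brier score and an AUROC can be computed from per-question confidence and correctness. It uses the standard textbook definitions and assumes confidence is rescaled to the range 0-1; the abstract does not describe the study's exact computation, and the function names and sample values here are hypothetical illustrations, not data from the study.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence (0-1) and outcome (1 = correct, 0 = incorrect)."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def auroc(confidences, correct):
    """Chance that a correct answer received higher confidence than an incorrect one.

    Computed by comparing every (correct, incorrect) pair; ties count as 0.5.
    """
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus observed accuracy; positive values indicate overconfidence."""
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)


# Made-up values for illustration only (NOT results from the study).
confidences = [0.9, 0.8, 0.9, 0.7, 0.6]  # self-reported certainty per question
correct = [1, 0, 1, 1, 0]                # whether each answer was right

print(brier_score(confidences, correct))
print(auroc(confidences, correct))
print(overconfidence_gap(confidences, correct))
```

A lower Brier score indicates better-calibrated confidence, while an AUROC near 0.5 means confidence barely separates correct from incorrect answers. A positive gap between mean confidence and accuracy is one simple way to express overconfidence.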