Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework

Abstract

Large language models (LLMs) are increasingly adopted in medical question-answering (QA) scenarios. However, LLMs can generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. Conformal Prediction (CP) provides a statistically rigorous framework for marginal (average) coverage guarantees but has seen limited exploration in medical QA. This paper proposes an enhanced CP framework for medical multiple-choice question-answering (MCQA) tasks. By associating the non-conformance score with the frequency score of correct options and leveraging self-consistency, the framework addresses internal model opacity and incorporates a risk control strategy with a monotonic loss function. Evaluated on the MedMCQA, MedQA, and MMLU datasets using four off-the-shelf LLMs, the proposed method meets the specified error rate guarantees, and its average prediction set size decreases as the risk level increases, offering a promising uncertainty evaluation metric for LLMs.
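
The frequency-based scoring described above can be illustrated with a minimal split conformal sketch. It is not the paper's implementation: the abstract does not give the exact score or the risk control procedure, so the sketch assumes the non-conformance score of an option is one minus its frequency among repeated sampled answers, and it uses the standard split conformal quantile. The names `option_frequencies`, `nonconformity`, `calibrate_threshold`, `prediction_set`, and the four-option set `OPTIONS` are hypothetical.

```python
import math
from collections import Counter

OPTIONS = ("A", "B", "C", "D")  # assumed option labels for a four-choice question


def option_frequencies(samples, options=OPTIONS):
    """Fraction of sampled answers that selected each option (self-consistency)."""
    counts = Counter(samples)
    n = len(samples)
    return {o: counts[o] / n for o in options}


def nonconformity(samples, option):
    """Assumed score: 1 - sampling frequency of the given option."""
    return 1.0 - option_frequencies(samples)[option]


def calibrate_threshold(cal_samples, cal_labels, alpha):
    """Standard split conformal quantile of the scores of the correct options."""
    scores = sorted(nonconformity(s, y) for s, y in zip(cal_samples, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration points: keep every option
    return scores[k - 1]


def prediction_set(samples, threshold):
    """All options whose score does not exceed the calibrated threshold."""
    freqs = option_frequencies(samples)
    return [o for o in OPTIONS if 1.0 - freqs[o] <= threshold]


# Toy calibration data: five sampled answers per question plus the true label.
cal_samples = [list("AAAAB"), list("BBBCC"), list("CCCCC"), list("DDABC")]
cal_labels = ["A", "B", "C", "D"]

threshold = calibrate_threshold(cal_samples, cal_labels, alpha=0.25)
result = prediction_set(list("AAABD"), threshold)
```

Because the score depends only on sampled outputs, the sketch needs no access to logits or other model internals. A larger `alpha` (a higher tolerated error rate) yields a lower threshold and therefore smaller prediction sets. The risk control strategy with a monotonic loss function mentioned in the abstract is not covered here.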
