Abstract
Large language models (LLMs) are increasingly adopted in medical question-answering (QA) scenarios. However, LLMs can generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. Conformal Prediction (CP) provides a statistically rigorous framework for marginal (average) coverage guarantees but has seen limited exploration in medical QA. This paper proposes an enhanced CP framework for medical multiple-choice question-answering (MCQA) tasks. By associating the non-conformance score with the frequency score of correct options and leveraging self-consistency, the framework addresses internal model opacity and incorporates a risk control strategy with a monotonic loss function. Evaluated on the MedMCQA, MedQA, and MMLU datasets using four off-the-shelf LLMs, the proposed method meets the specified error rate guarantees while reducing the average prediction set size as the risk level increases, offering a promising uncertainty evaluation metric for LLMs.
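The frequency-based idea summarized above can be illustrated with a minimal sketch of standard split conformal prediction. The abstract does not give the exact score definition or the risk control procedure, so everything here is an assumption made for illustration: the non-conformance score is taken to be 1 minus the sampled frequency of an option, the threshold is the usual finite-sample conformal quantile, and the function names (`option_frequencies`, `calibrate_threshold`, `prediction_set`) and toy data are hypothetical. The monotonic-loss risk control step is not reproduced.

```python
import math
from collections import Counter


def option_frequencies(sampled_answers, options):
    """Fraction of sampled answers that chose each option (self-consistency)."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    return {opt: counts.get(opt, 0) / total for opt in options}


def calibrate_threshold(cal_scores, alpha):
    """Finite-sample conformal quantile of the calibration scores.

    Assumed score: 1 - frequency of the correct option. Returns the
    ceil((n + 1) * (1 - alpha))-th smallest score, or infinity when the
    calibration set is too small for the requested risk level alpha.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(freqs, threshold):
    """Keep every option whose assumed score (1 - frequency) is within the threshold."""
    return [opt for opt, f in freqs.items() if 1.0 - f <= threshold]


# Toy calibration data (hypothetical): sampled answers and the correct option.
OPTIONS = ["A", "B", "C", "D"]
calibration = [
    (["A", "A", "A", "B"], "A"),
    (["B", "B", "C", "C"], "B"),
    (["D", "D", "D", "D"], "D"),
    (["A", "C", "C", "C"], "A"),
]
cal_scores = [
    1.0 - option_frequencies(samples, OPTIONS)[correct]
    for samples, correct in calibration
]

# A new question with four sampled answers.
test_freqs = option_frequencies(["A", "A", "B", "C"], OPTIONS)
for alpha in (0.2, 0.5):
    q_hat = calibrate_threshold(cal_scores, alpha)
    print(alpha, q_hat, prediction_set(test_freqs, q_hat))
```

In this toy run, the larger risk level yields the smaller prediction set, which mirrors the qualitative trend the abstract reports. The frequencies rely only on sampled outputs, not on model logits, which is one way to read the abstract's point about internal model opacity.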