Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation

Abstract

Hallucinations are a common issue that undermines the reliability of large language models (LLMs). Recent studies have identified a specific subset of hallucinations, known as confabulations, which arise from the predictive uncertainty of LLMs. To detect confabulations, various methods for estimating predictive uncertainty in natural language generation (NLG) have been developed. These methods are typically evaluated by correlating uncertainty estimates with the correctness of generated text, with question-answering (QA) datasets serving as the standard benchmark. However, commonly used approximate correctness functions disagree substantially with one another and, consequently, in how they rank uncertainty estimation methods. This makes it possible to inflate the apparent performance of uncertainty estimation methods. We propose using several alternative risk indicators for risk correlation experiments that improve the robustness of the empirical assessment of uncertainty estimation (UE) algorithms for NLG. For QA tasks, we show that marginalizing over multiple LLM-as-a-judge variants reduces evaluation biases. Furthermore, we explore structured tasks as well as out-of-distribution and perturbation detection tasks, which provide robust and controllable risk indicators. Finally, we propose using an Elo rating of uncertainty estimation methods to give an objective summary across extensive evaluation settings.
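To make the Elo-based summary concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how matches are formed or which constants are used, so everything here is an assumption: each evaluation setting is treated as a round of pairwise matches, the method with the higher score in that setting wins, and the usual Elo defaults apply (initial rating 1000, K-factor 32, scale 400). The method names and scores are invented toy values, not results from the paper.

```python
from itertools import combinations


def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(scores_per_setting, k=32.0, initial=1000.0):
    """Compute Elo ratings from per-setting scores (higher score is better).

    scores_per_setting: list of dicts mapping a method name to its score in
    one evaluation setting. The pairing scheme and constants are assumptions.
    """
    ratings = {}
    for setting in scores_per_setting:
        for name in setting:
            ratings.setdefault(name, initial)

    for setting in scores_per_setting:
        # Every pair of methods evaluated in this setting plays one match.
        for a, b in combinations(sorted(setting), 2):
            if setting[a] > setting[b]:
                outcome = 1.0   # a wins
            elif setting[a] < setting[b]:
                outcome = 0.0   # b wins
            else:
                outcome = 0.5   # tie
            delta = k * (outcome - expected_score(ratings[a], ratings[b]))
            ratings[a] += delta
            ratings[b] -= delta
    return ratings


if __name__ == "__main__":
    # Hypothetical methods and made-up scores, for illustration only.
    toy_settings = [
        {"method_a": 0.80, "method_b": 0.70, "method_c": 0.60},
        {"method_a": 0.75, "method_b": 0.65, "method_c": 0.70},
    ]
    for name, rating in sorted(elo_ratings(toy_settings).items()):
        print(f"{name}: {rating:.1f}")
```

Sequential Elo updates depend on the order in which matches are processed, so a practical variant might average the ratings over several random orderings of the settings; whether the paper does so is not stated in the abstract.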
