Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection

Abstract

Reliability and failure detection of large language models (LLMs) are critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to a 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
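To make the comparison between holistic and step-by-step scoring concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a scorer has already produced either one confidence per response (holistic) or one confidence per reasoning step (stepwise). Collapsing step scores with their minimum is an assumption made here for illustration, since the abstract does not state an aggregation rule. The function names and the toy numbers are hypothetical and are not results from the paper.

```python
from typing import Sequence


def aggregate_stepwise(step_scores: Sequence[float]) -> float:
    """Collapse per-step confidences into one response-level score.

    Assumption: a response is only as trustworthy as its weakest step,
    so the minimum is used. Other rules (product, mean) are possible.
    """
    if not step_scores:
        raise ValueError("need at least one step score")
    return min(step_scores)


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC-ROC as the probability that a randomly chosen correct response
    (label 1) receives a higher confidence than a randomly chosen
    incorrect one (label 0). Ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up toy data: 1 = response was correct, 0 = response contained an error.
labels = [1, 1, 0, 0]

# Hypothetical holistic confidences: one score per full response.
holistic = [0.9, 0.6, 0.7, 0.3]

# Hypothetical stepwise confidences: one score per reasoning step.
stepwise_steps = [[0.9, 0.95], [0.8, 0.85, 0.9], [0.9, 0.2], [0.4, 0.5]]
stepwise = [aggregate_stepwise(steps) for steps in stepwise_steps]

auc_holistic = auc_roc(holistic, labels)
auc_stepwise = auc_roc(stepwise, labels)
```

In this toy setup the third response has one low-confidence step that a single holistic score can mask, which is the kind of case where step-level scoring would be expected to help. The numbers only illustrate the mechanics of the metric and say nothing about the magnitude of the effect reported above.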
