AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions

Abstract

For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly. Real-world user queries, which can be underspecified, ill-posed, or fundamentally unanswerable, require LLMs to reason about uncertainty and selectively abstain -- i.e., refuse to answer definitively. However, abstention remains understudied, without a systematic evaluation framework for modern LLMs. In this work, we introduce AbstentionBench, a large-scale benchmark for holistically evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information. Evaluating 20 frontier LLMs reveals abstention is an unsolved problem, and one where scaling models is of little use. While recent reasoning LLMs have shown impressive results in complex problem solving, surprisingly, we find that reasoning fine-tuning degrades abstention (by $24\%$ on average), even for math and science domains on which reasoning models are explicitly trained. We find that while a carefully crafted system prompt can boost abstention in practice, it does not resolve models' fundamental inability to reason about uncertainty. We release AbstentionBench to foster research into advancing LLM reliability.
