Abstract
Modern LLMs can now produce highly readable abstractive summaries, to the point that traditional automated metrics for evaluating summary quality, such as ROUGE, have saturated. However, LLMs still sometimes introduce inaccuracies into summaries, i.e., information inconsistent with or unsupported by the corresponding source. Measuring the occurrence of these often subtle factual inconsistencies automatically has proved challenging. This in turn has motivated development of metrics intended to measure the factual consistency of generated summaries against sources. But are these approaches measuring what they purport to? Or are they mostly exploiting artifacts? In this work, we stress test a range of automatic factuality metrics, including specialized models and LLM-based prompting methods, to probe what they actually capture. Using a shallow classifier to separate "easy" examples for factual evaluation, where surface features suffice, from "hard" cases requiring deeper reasoning, we find that all metrics show substantial performance drops on the latter. Furthermore, some metrics are more sensitive to benign, fact-preserving edits than to factual corrections. Building on this observation, we demonstrate that most automatic factuality metrics can be gamed, i.e., their scores can be artificially inflated by appending innocuous, content-free sentences to summaries. Among the metrics tested, the prompt-based ChatGPT-DA approach is the most robust and reliable. However, this comes with a notable caveat: prompting LLMs to assess factuality may rely too heavily on their parametric knowledge rather than the provided reference when making judgments. Taken together, our findings call into question the reliability of current factuality metrics and prompt a broader reflection on what these metrics are truly measuring.