STORYSUMM: Evaluating Faithfulness in Story Summarization

Abstract

Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, STORYSUMM, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.
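For readers unfamiliar with the headline number, balanced accuracy is the mean of the recall on each class, so a metric that labels every summary faithful scores 50% regardless of class imbalance. The sketch below illustrates the standard definition only; the function name, the boolean label encoding (True for faithful), and the toy labels are assumptions for illustration and are not taken from the paper or its data.

```python
def balanced_accuracy(gold, pred):
    """Mean of recall on the faithful (True) and unfaithful (False) classes.

    `gold` and `pred` are equal-length lists of booleans. The encoding
    True = faithful is an assumption made for this illustration.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pos = sum(1 for g in gold if g)
    neg = len(gold) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must appear in the gold labels")
    # Correctly predicted faithful and unfaithful summaries.
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    return 0.5 * (tp / pos + tn / neg)


# Toy labels (hypothetical): three faithful summaries, one unfaithful.
gold = [True, True, True, False]
always_faithful = [True, True, True, True]
print(balanced_accuracy(gold, always_faithful))  # 0.5
```

On these toy labels, plain accuracy would be 75%, which is why balanced accuracy is the more informative figure when unfaithful summaries are the minority class.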
