Abstract
While long-context large language models (LLMs) can technically summarize book-length documents (>100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study mitigates the issue of data contamination by focusing on summaries of books published in 2023 or 2024, and we hire annotators who have fully read each book prior to the annotation task to minimize cost and cognitive burden. We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD, which allows us to rank LLM summarizers based on faithfulness: Claude-3-Opus significantly outperforms all closed-source LLMs, while the open-source Mixtral is on par with GPT-3.5-Turbo. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate. While LLM-based auto-raters have proven reliable for factuality and coherence in other settings, we implement several LLM raters of faithfulness and find that none correlates strongly with human annotations, especially with regard to detecting unfaithful claims. Our experiments suggest that detecting unfaithful claims is an important future direction not only for summarization evaluation but also as a testbed for long-context understanding. Finally, we move beyond faithfulness by exploring content selection errors in book-length summarization: we develop a typology of omission errors related to crucial narrative elements and also identify a systematic over-emphasis on events occurring towards the end of the book.