Abstract
Single-document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, with summaries generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries, along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate a significant number of factual errors in the dialogue domain, regardless of model size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conduct an analysis of hallucination types with a curated error taxonomy. We find that model-generated summaries contain diverse errors and error distributions, and that non-LLM-based metrics capture all error types better than LLM-based evaluators.