Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

Abstract

While large language models (LLMs) already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement for the desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct human evaluation on 5 LLM-based summarization systems. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods in total. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) none of the LLM-based evaluation methods achieves strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation. We make our collected benchmark, InstruSum, publicly available to facilitate future research in this direction.
