A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model

Abstract

Existing research on news summarization primarily focuses on single-language single-document (SLSD), single-language multi-document (SLMD), or cross-language single-document (CLSD) settings. However, in real-world scenarios, news about an international event often involves multiple documents in different languages, i.e., mixed-language multi-document (MLMD). Summarizing MLMD news is therefore of great significance, but the lack of datasets for MLMD news summarization has constrained research in this area. To fill this gap, we construct a mixed-language multi-document news summarization dataset (MLMD-news), which contains four different languages and 10,992 pairs of source document clusters and target summaries. Additionally, we propose a graph-based extract-generate model, benchmark various methods on the MLMD-news dataset, and publicly release our dataset and code (https://github.com/Southnf9/MLMD-news), aiming to advance research on summarization in MLMD scenarios.
