WikiHow: A Large Scale Text Summarization Dataset

Abstract

Sequence-to-sequence models have recently achieved state-of-the-art performance in summarization. However, few large-scale, high-quality datasets are available, and almost all of them consist mainly of news articles with a specific writing style. Moreover, abstractive human-style systems involving description of the content at a deeper level require data with higher levels of abstraction. In this paper, we present WikiHow, a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and therefore represent a high diversity of styles. We evaluate the performance of existing methods on WikiHow to present its challenges and set some baselines for further improvement.
