Open Subtitles Paraphrase Corpus for Six Languages

Abstract

This paper accompanies the release of Opusparcus, a new paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The corpus consists of paraphrases, that is, pairs of sentences in the same language that mean approximately the same thing. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows. The informal and colloquial genre that occurs in subtitles makes such data a very interesting language resource, for instance, from the perspective of computer-assisted language learning. For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development, and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been checked manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.
