Detecting Cross-Language Plagiarism using Open Knowledge Graphs

Abstract

Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs such as Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA's performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available.
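
To illustrate the idea of comparing documents through language-independent entity vectors, the following is a minimal sketch and not the CL-OSA implementation. It assumes a hypothetical toy lexicon (`TOY_LEXICON`) with placeholder entity IDs in place of Wikidata lookups, uses plain term counts as vector weights, and scores similarity with the cosine measure. It performs no homonym disambiguation. The function names `entity_vector` and `cosine` are likewise illustrative assumptions.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon: surface forms in two languages mapped to shared,
# language-independent entity IDs. The IDs are placeholders, not Wikidata IDs.
TOY_LEXICON = {
    "en": {"dog": "E1", "house": "E2", "river": "E3"},
    "de": {"hund": "E1", "haus": "E2", "fluss": "E3"},
}


def entity_vector(text, lang):
    """Map a text to a bag of entity IDs; tokens without an entry are dropped."""
    lexicon = TOY_LEXICON[lang]
    tokens = text.lower().split()
    return Counter(lexicon[t] for t in tokens if t in lexicon)


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    english = entity_vector("the dog near the house", "en")
    german = entity_vector("der hund im haus", "de")
    print(cosine(english, german))
```

Because both texts map to the same entity IDs, they can be compared directly without translating either one; in this toy setting, a document and its translation end up with near-identical vectors.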
