A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model

Abstract

Synthetic data construction of Grammatical Error Correction (GEC) for non-English languages relies heavily on human-designed and language-specific rules, which produce limited error-corrected patterns. In this paper, we propose a generic and language-independent strategy for multilingual GEC, which can train a GEC system effectively for a new non-English language with only two easy-to-access resources: 1) a pretrained cross-lingual language model (PXLM) and 2) parallel translation data between English and the language. Our approach creates diverse parallel GEC data without any language-specific operations by taking the non-autoregressive translation generated by PXLM and the gold translation as error-corrected sentence pairs. Then, we reuse PXLM to initialize the GEC model and pretrain it with the synthetic data generated by itself, which yields further improvement. We evaluate our approach on three public benchmarks of GEC in different languages. It achieves the state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian). Further analysis demonstrates that our data construction method is complementary to rule-based approaches.
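
As an illustration of the data construction idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It pairs a model-generated translation (treated as the erroneous sentence) with the gold translation (treated as the correction). The names `build_gec_pairs` and `toy_translate` are hypothetical, the PXLM is replaced by a lookup-table stand-in, and dropping hypotheses that are empty or identical to the gold sentence is an assumption made here, not something stated in the abstract.

```python
from typing import Callable, Iterable, List, Tuple


def build_gec_pairs(
    parallel: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn (english, gold_translation) pairs into (erroneous, corrected) pairs.

    `translate` stands in for the PXLM's non-autoregressive translation;
    any callable mapping an English sentence to a target-language string works.
    """
    pairs: List[Tuple[str, str]] = []
    for english, gold in parallel:
        hypothesis = translate(english).strip()
        gold = gold.strip()
        # Assumption (not from the abstract): skip hypotheses that are empty
        # or already identical to the gold sentence, since they carry no error.
        if hypothesis and hypothesis != gold:
            pairs.append((hypothesis, gold))
    return pairs


# Hypothetical stand-in for the PXLM: a lookup table of imperfect translations.
_TOY_OUTPUTS = {
    "I went to school yesterday .": "Ich gehe gestern zur Schule .",
    "Thank you .": "Danke .",
}


def toy_translate(english: str) -> str:
    """Return a canned, possibly erroneous translation (empty if unknown)."""
    return _TOY_OUTPUTS.get(english, "")
```

Calling `build_gec_pairs` with English-German parallel data and `toy_translate` yields source-target pairs in which the imperfect translation is the input and the gold sentence is the target, which is the shape of training data a sequence-to-sequence GEC model would consume.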
