Zero-shot Cross-Lingual Transfer for Synthetic Data Generation in Grammatical Error Detection

Abstract

Grammatical Error Detection (GED) methods rely heavily on human-annotated error corpora. However, these annotations are unavailable in many low-resource languages. In this paper, we investigate GED in this context. Leveraging the zero-shot cross-lingual transfer capabilities of multilingual pre-trained language models, we train a model using data from a diverse set of languages to generate synthetic errors in other languages. These synthetic error corpora are then used to train a GED model. Specifically, we propose a two-stage fine-tuning pipeline where the GED model is first fine-tuned on multilingual synthetic data from target languages, followed by fine-tuning on human-annotated GED corpora from source languages. This approach outperforms current state-of-the-art annotation-free GED methods. We also analyse the errors produced by our method and other strong baselines, finding that our approach produces errors that are more diverse and more similar to human errors.
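The ordering of the two fine-tuning stages can be illustrated with a minimal sketch. Everything named here (`ToyGEDModel`, `toy_corrupt`, `make_synthetic`, `two_stage_fine_tune`) is hypothetical and is not taken from the paper: the toy model only records which data it saw, and the rule-based corruption function is a placeholder for the learned multilingual error-generation model described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A labelled sentence: (tokens, per-token labels), where 1 marks an erroneous token.
Example = Tuple[List[str], List[int]]


@dataclass
class ToyGEDModel:
    """Hypothetical stand-in for a multilingual token classifier."""
    history: List[str] = field(default_factory=list)

    def fine_tune(self, data: List[Example], stage: str) -> None:
        # A real model would run gradient updates here; this toy only logs the stage.
        self.history.append(f"{stage}:{len(data)}")


def toy_corrupt(tokens: List[str]) -> Example:
    """Placeholder error generator: swap the first two tokens and flag them.

    Assumption: the approach above uses a learned generator instead of a rule.
    """
    if len(tokens) < 2:
        return list(tokens), [0] * len(tokens)
    corrupted = [tokens[1], tokens[0]] + list(tokens[2:])
    labels = [1, 1] + [0] * (len(tokens) - 2)
    return corrupted, labels


def make_synthetic(clean: List[List[str]],
                   corrupt: Callable[[List[str]], Example]) -> List[Example]:
    """Turn clean target-language sentences into labelled synthetic examples."""
    return [corrupt(sentence) for sentence in clean]


def two_stage_fine_tune(model: ToyGEDModel,
                        synthetic_target: List[Example],
                        human_source: List[Example]) -> ToyGEDModel:
    # Stage 1: synthetic errors in the target languages.
    model.fine_tune(synthetic_target, stage="stage1_synthetic_target")
    # Stage 2: human-annotated GED corpora from the source languages.
    model.fine_tune(human_source, stage="stage2_human_source")
    return model


if __name__ == "__main__":
    synthetic = make_synthetic([["she", "goes", "home"]], toy_corrupt)
    human = [(["He", "go", "home"], [0, 1, 0])]
    trained = two_stage_fine_tune(ToyGEDModel(), synthetic, human)
    print(trained.history)
```

In this sketch the target-language data reaches the model only in synthetic form, while human annotations come exclusively from source languages, which mirrors the annotation-free setting for the target languages.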
