Data Augmentation for Spoken Grammatical Error Correction

Abstract

While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Moreover, we propose a series of objective metrics that can be used to evaluate the generated data and choose the more suitable dataset for SGEC. The goal is to generate an augmented dataset that maintains the textual and acoustic characteristics of the original data while providing new types of errors. This augmented dataset should augment and enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the use of the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S&I Corpus, the first publicly available speech dataset with grammar error annotations.
