Error-preserving Automatic Speech Recognition of Young English Learners' Language

Abstract

One of the central skills that language learners need to practice is speaking the language. Currently, students in school do not get enough speaking opportunities and lack conversational practice. Recent advances in speech technology and natural language processing allow for the creation of novel tools to practice their speaking skills. In this work, we tackle the first component of such a pipeline, namely, the automated speech recognition module (ASR), which faces a number of challenges: first, state-of-the-art ASR models are often trained on adult read-aloud data by native speakers and do not transfer well to young language learners' speech. Second, most ASR systems contain a powerful language model, which smooths out errors made by the speakers. To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. For this, we collected a corpus containing around 85 hours of English audio spoken by learners in Switzerland from grades 4 to 6 on different language learning tasks, which we used to train an ASR model. Our experiments show that our model benefits from direct fine-tuning on children's voices and has a much higher error preservation rate than other models.
