Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition

Abstract

The SENCOTEN language, spoken on the Saanich peninsula of southern Vancouver Island, is in the midst of vigorous language revitalization efforts to turn the tide of language loss as a result of colonial language policies. To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENCOTEN is challenging due to limited data and significant vocabulary variation from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). An n-gram language model is also incorporated via shallow fusion or n-best rescoring to maximize the use of available data. Experiments on the SENCOTEN dataset show a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09% on the test set with a 57.02% out-of-vocabulary (OOV) rate. After filtering minor cedilla-related errors, WER improves to 14.32% (26.48% on unseen words) and CER to 3.45%, demonstrating the potential of our ASR-driven pipeline to support SENCOTEN language documentation.
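
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how WER, CER, and an OOV rate are conventionally computed. It is not the authors' scoring code: the function names are hypothetical, and the choices to split words on whitespace, to count spaces as characters in CER, and to compute the OOV rate over word tokens rather than word types are assumptions, since the abstract does not state them.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length.
    Assumes whitespace tokenization and a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference
    length. Assumes spaces count as characters and a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def oov_rate(train_words, test_words):
    """Fraction of test word tokens that never occur in the training data.
    Assumption: token-based; a type-based rate would deduplicate test_words."""
    vocab = set(train_words)
    unseen = [w for w in test_words if w not in vocab]
    return len(unseen) / len(test_words)
```

Under these definitions a single wrong character makes the whole word count as a WER error while adding only one error to CER, which is consistent with the abstract reporting a much lower CER than WER and with both figures improving once minor cedilla-related errors are filtered.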
