Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training

Abstract

We present improvements in automatic speech recognition (ASR) for Somali, a currently extremely under-resourced language. This forms part of a continuing United Nations (UN) effort to employ ASR-based keyword spotting systems to support humanitarian relief programmes in rural Africa. Using just 1.57 hours of annotated speech data as a seed corpus, we increase the pool of training data by applying semi-supervised training to 17.55 hours of untranscribed speech. We make use of factorised time-delay neural networks (TDNN-F) for acoustic modelling, since these have recently been shown to be effective in resource-scarce situations. Three semi-supervised training passes were performed, where the decoded output from each pass was used for acoustic model training in the subsequent pass. The automatic transcriptions from the best performing pass were used for language model augmentation. To ensure the quality of the automatic transcriptions, decoder confidence is used as a threshold. The acoustic and language models obtained from the semi-supervised approach show significant improvement in terms of WER and perplexity compared to the baseline. Incorporating the automatically generated transcriptions yields a 6.55% improvement in language model perplexity. The use of 17.55 hours of Somali acoustic data in semi-supervised training yields a relative improvement of 7.74% over the baseline.
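
The multi-pass procedure summarised in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Hypothesis`, `select_confident`, `semi_supervised_passes`, `train` and `decode` are hypothetical placeholders, and the sketch assumes that each pass retrains on the seed corpus plus the confidence-filtered automatic transcriptions, which the abstract does not state explicitly. The threshold value is left to the caller because the abstract does not give one.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# (utterance id, transcription) pair; a stand-in for real training data.
Transcript = Tuple[str, str]


@dataclass(frozen=True)
class Hypothesis:
    """One decoded utterance with a decoder confidence score (hypothetical type)."""
    utterance_id: str
    text: str
    confidence: float


def select_confident(hyps: Sequence[Hypothesis], threshold: float) -> List[Hypothesis]:
    """Keep only hypotheses whose confidence meets the threshold."""
    return [h for h in hyps if h.confidence >= threshold]


def semi_supervised_passes(
    seed: Sequence[Transcript],
    untranscribed: Sequence[str],
    train: Callable[[List[Transcript]], Any],
    decode: Callable[[Any, Sequence[str]], List[Hypothesis]],
    threshold: float,
    num_passes: int = 3,
) -> Tuple[Any, List[List[Hypothesis]]]:
    """Run repeated decode -> filter -> retrain passes.

    Assumption: every pass trains on the seed data plus the filtered
    automatic transcriptions produced by the previous model.
    """
    model = train(list(seed))  # baseline trained on the annotated seed corpus only
    selected_per_pass: List[List[Hypothesis]] = []
    for _ in range(num_passes):
        hyps = decode(model, untranscribed)
        selected = select_confident(hyps, threshold)
        selected_per_pass.append(selected)
        augmented = list(seed) + [(h.utterance_id, h.text) for h in selected]
        model = train(augmented)
    return model, selected_per_pass
```

In a real system `train` and `decode` would wrap an ASR toolkit; here they are caller-supplied callables so the control flow can be read on its own. The per-pass selections are returned so that the transcriptions from the best performing pass can be reused, for example for language model augmentation.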
