Improved ASR for Under-Resourced Languages Through Multi-Task Learning with Acoustic Landmarks

Abstract

Furui first demonstrated that the identity of both consonant and vowel can be perceived from the C-V transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception, and that steady-state regions are secondary or supplemental. Acoustic landmarks are perceptually salient, even in a language one doesn't speak, and it has been demonstrated that non-speakers of the language can identify features such as the primary articulator of the landmark. These factors suggest a strategy for developing language-independent automatic speech recognition: landmarks can potentially be learned once from a suitably labeled corpus and rapidly applied to many other languages. This paper proposes enhancing the cross-lingual portability of a neural network by using landmarks as the secondary task in multi-task learning (MTL). The network is trained in a well-resourced source language with both phone and landmark labels (English), then adapted to an under-resourced target language with only word labels (Iban). Landmark-tasked MTL reduces source-language phone error rate by 2.9% relative, and reduces target-language word error rate by 1.9%-5.9% depending on the amount of target-language training data. These results suggest that landmark-tasked MTL causes the DNN to learn hidden-node features that are useful for cross-lingual adaptation.
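
As a rough illustration of what using landmarks as a secondary task can look like, the sketch below computes a multi-task loss from one shared hidden layer feeding two softmax heads, one for phones and one for landmarks. It is a minimal sketch under stated assumptions, not the configuration used in this paper: the layer sizes, the tanh nonlinearity, the single shared layer, the `landmark_weight` value, and all function and parameter names are hypothetical.

```python
import numpy as np


def softmax(z):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    # Mean negative log-probability of the correct class per frame.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


def init_params(rng, n_in=40, n_hidden=64, n_phones=48, n_landmarks=8):
    # All sizes here are illustrative assumptions.
    return {
        "W_shared": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b_shared": np.zeros(n_hidden),
        "W_phone": rng.normal(0.0, 0.1, (n_hidden, n_phones)),
        "b_phone": np.zeros(n_phones),
        "W_landmark": rng.normal(0.0, 0.1, (n_hidden, n_landmarks)),
        "b_landmark": np.zeros(n_landmarks),
    }


def mtl_loss(x, phone_labels, landmark_labels, params, landmark_weight=0.5):
    # Shared hidden features are used by both task heads.
    hidden = np.tanh(x @ params["W_shared"] + params["b_shared"])
    phone_probs = softmax(hidden @ params["W_phone"] + params["b_phone"])
    landmark_probs = softmax(hidden @ params["W_landmark"] + params["b_landmark"])
    primary = cross_entropy(phone_probs, phone_labels)
    secondary = cross_entropy(landmark_probs, landmark_labels)
    # The secondary (landmark) loss is added with an assumed weight.
    return primary + landmark_weight * secondary, primary, secondary
```

Because both heads read the same hidden layer, gradients from the landmark loss also shape the shared features. In a cross-lingual setup of the kind described above, one would presumably keep the shared layers and replace the output head when adapting to the target language.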
