Differentiable Allophone Graphs for Language-Universal Speech Recognition

Abstract

Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. While speech annotations at the language-specific phoneme or surface levels are readily available, annotations at a universal phone level are relatively rare and difficult to produce. In this work, we present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings with learnable weights represented using weighted finite-state transducers, which we call differentiable allophone graphs. By training multilingually, we build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language. These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen languages. We demonstrate the aforementioned benefits of our proposed framework with a system trained on 7 diverse languages.
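
To make the idea of a probabilistic phone-to-phoneme mapping concrete, the following is a minimal sketch and not the paper's transducer-based implementation. It assumes a hypothetical toy inventory of three phones and two phonemes, a fixed set of allowed phone-to-phoneme arcs, and one learnable logit per arc. The names `ALLOWED`, `mapping_probs`, and `phoneme_posterior` are illustrative only. Each phone's outgoing arc weights are normalized with a softmax, and a phoneme's score is the sum of phone probabilities weighted by those arcs, so phoneme-level supervision could in principle pass gradients back to both the phone scores and the arc weights.

```python
import math

# Hypothetical toy inventory (an assumption for illustration, not from the paper).
PHONES = ["p", "ph", "b"]
PHONEMES = ["/p/", "/b/"]

# Allowed arcs: which phonemes each phone may be realized under in this toy language.
ALLOWED = {"p": ["/p/", "/b/"], "ph": ["/p/"], "b": ["/b/"]}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mapping_probs(logits):
    """Turn per-arc logits into a distribution over phonemes for each phone."""
    probs = {}
    for phone, targets in ALLOWED.items():
        weights = softmax([logits[(phone, t)] for t in targets])
        probs[phone] = dict(zip(targets, weights))
    return probs


def phoneme_posterior(phone_post, probs):
    """Sum phone probabilities into phonemes, weighted by the arc probabilities."""
    out = {phoneme: 0.0 for phoneme in PHONEMES}
    for phone, p in phone_post.items():
        for phoneme, w in probs[phone].items():
            out[phoneme] += p * w
    return out


# Untrained arcs: all logits zero, so each phone splits evenly over its allowed phonemes.
logits = {(phone, t): 0.0 for phone, targets in ALLOWED.items() for t in targets}
probs = mapping_probs(logits)

# A made-up phone posterior for a single frame.
phone_post = {"p": 0.5, "ph": 0.3, "b": 0.2}
post = phoneme_posterior(phone_post, probs)
print(post)
```

In this sketch the phone `p` is the only ambiguous unit, so training would mainly adjust how its probability mass is split between `/p/` and `/b/`. Inspecting `probs` after training is what would make such a mapping interpretable.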
