Seed Words Based Data Selection for Language Model Adaptation

Abstract

We address the problem of language model customization in applications where the ASR component needs to manage domain-specific terminology; although current state-of-the-art speech recognition technology provides excellent results for generic domains, adaptation to specialized dictionaries or glossaries is still an open issue. In this work we present an approach for automatically selecting sentences from a text corpus that match, both semantically and morphologically, a glossary of terms (words or composite words) furnished by the user. The final goal is to rapidly adapt the language model of a hybrid ASR system with a limited amount of in-domain text data in order to successfully cope with the linguistic domain at hand; the vocabulary of the baseline model is expanded and tailored, reducing the resulting OOV rate. Data selection strategies based on shallow morphological seeds and semantic similarity via word2vec are introduced and discussed; the experimental setting consists of a simultaneous interpreting scenario, where ASR systems in three languages are designed to recognize the terms of a specific domain (dentistry). Results using different metrics (OOV rate, WER, precision and recall) show the effectiveness of the proposed techniques.
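
As an illustration only, the seed-based selection idea and the OOV rate metric can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the paper: a "seed" is assumed here to be the first four characters of each glossary word, and the helper names (`make_seeds`, `select_sentences`, `oov_rate`) and the toy data are hypothetical. The word2vec-based semantic selection is not covered.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def make_seeds(glossary, seed_len=4):
    """Build shallow morphological seeds from glossary terms.

    Assumption: a seed is the first `seed_len` characters of a glossary
    word; shorter words are skipped to avoid overly generic seeds.
    """
    seeds = set()
    for term in glossary:
        for word in tokenize(term):
            if len(word) >= seed_len:
                seeds.add(word[:seed_len])
    return seeds


def select_sentences(corpus, seeds):
    """Keep sentences containing at least one token that starts with a seed."""
    selected = []
    for sentence in corpus:
        tokens = tokenize(sentence)
        if any(tok.startswith(seed) for tok in tokens for seed in seeds):
            selected.append(sentence)
    return selected


def oov_rate(tokens, vocabulary):
    """Fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok not in vocabulary) / len(tokens)


# Toy data, invented for this example.
glossary = ["dental implant", "gingivitis"]
corpus = [
    "The dentist placed two implants.",
    "The weather is nice today.",
    "Gingival inflammation is common.",
]
baseline_vocab = {"the", "is", "nice", "weather", "today", "placed", "two", "common"}

selected = select_sentences(corpus, make_seeds(glossary))

# Expand the baseline vocabulary with the words of the selected sentences.
adapted_vocab = baseline_vocab | {tok for s in selected for tok in tokenize(s)}

test_tokens = tokenize("The dentist placed two implants.")
print(oov_rate(test_tokens, baseline_vocab), oov_rate(test_tokens, adapted_vocab))
```

On this toy data the first and third sentences are selected, and adding their words to the vocabulary removes the out-of-vocabulary tokens from the test sentence. A prefix match is a deliberately crude stand-in for morphological matching; it catches inflected and derived forms that share a stem with a glossary term, at the cost of some false positives.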
