A Hierarchical Model for Spoken Language Recognition

Abstract

Spoken language recognition (SLR) refers to the automatic process used to determine the language present in a speech sample. SLR is an important task in its own right, for example, as a tool to analyze or categorize large amounts of multi-lingual data. Further, it is also an essential tool for selecting downstream applications in a workflow, for example, to choose appropriate speech recognition or machine translation models. SLR systems are usually composed of two stages, one where an embedding representing the audio sample is extracted and a second one which computes the final scores for each language. In this work, we approach the SLR task as a detection problem and implement the second stage as a probabilistic linear discriminant analysis (PLDA) model. We show that discriminative training of the PLDA parameters gives large gains with respect to the usual generative training. Further, we propose a novel hierarchical approach where two PLDA models are trained, one to generate scores for clusters of highly related languages and a second one to generate scores conditional on each cluster. The final language detection scores are computed as a combination of these two sets of scores. The complete model is trained discriminatively to optimize a cross-entropy objective. We show that this hierarchical approach consistently outperforms the non-hierarchical one for detection of highly related languages, in many cases by large margins. We train our systems on a collection of datasets including 100 languages and test them both on matched and mismatched conditions, showing that the gains are robust to condition mismatch.
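
As a rough illustration of how two sets of scores might be combined, the sketch below applies the product rule in the log domain: log P(language | x) = log P(cluster | x) + log P(language | cluster, x). This is an assumption made for illustration only. The abstract does not state the exact combination rule, and the sketch does not implement PLDA. The input arrays stand in for whatever scores the two models produce, and the names `hierarchical_log_posteriors`, `cluster_of_language`, and `cross_entropy` are hypothetical.

```python
import numpy as np


def log_softmax(scores, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = scores - np.max(scores, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))


def hierarchical_log_posteriors(cluster_scores, language_scores, cluster_of_language):
    """Combine cluster-level and within-cluster scores (assumed product rule).

    cluster_scores:      (n_samples, n_clusters) raw scores, one per cluster
    language_scores:     (n_samples, n_languages) raw scores, one per language
    cluster_of_language: (n_languages,) cluster index of each language
    Returns an (n_samples, n_languages) array of log-posteriors.
    """
    cluster_scores = np.asarray(cluster_scores, dtype=float)
    language_scores = np.asarray(language_scores, dtype=float)
    cluster_of_language = np.asarray(cluster_of_language)

    # log P(cluster | x), normalized over all clusters
    log_p_cluster = log_softmax(cluster_scores, axis=1)

    combined = np.full(language_scores.shape, -np.inf)
    for c in range(cluster_scores.shape[1]):
        members = np.flatnonzero(cluster_of_language == c)
        if members.size == 0:
            continue  # an empty cluster contributes no languages
        # log P(language | cluster, x), normalized within the cluster only
        log_p_within = log_softmax(language_scores[:, members], axis=1)
        combined[:, members] = log_p_cluster[:, [c]] + log_p_within
    return combined


def cross_entropy(log_posteriors, labels):
    """Mean negative log-posterior of the true language."""
    labels = np.asarray(labels)
    return -np.mean(log_posteriors[np.arange(len(labels)), labels])
```

With this combination, the posteriors of the languages inside a cluster sum to that cluster's posterior, so errors between highly related languages are handled by the within-cluster scores alone.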
