Robust Open-Set Spoken Language Identification and the CU MultiLang Dataset

Abstract

Most state-of-the-art spoken language identification models are closed-set; in other words, they can only output a language label from the set of classes they were trained on. Open-set spoken language identification systems, however, gain the ability to detect when an input exhibits none of the original languages. In this paper, we implement a novel approach to open-set spoken language identification that uses MFCC and pitch features, a TDNN model to extract meaningful feature embeddings, confidence thresholding on softmax outputs, and LDA and pLDA for learning to classify new unknown languages. We present a spoken language identification system that achieves 91.76% accuracy on trained languages and has the capability to adapt to unknown languages on the fly. To that end, we also built the CU MultiLang Dataset, a large and diverse multilingual speech corpus which was used to train and evaluate our system.
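
As a rough illustration of confidence thresholding on softmax outputs, the sketch below rejects an input as "unknown" when the highest class probability falls below a cutoff. This is a minimal example under stated assumptions, not the system described here: the function names, the label list, and the threshold value of 0.5 are all hypothetical, and the logits stand in for the output of a trained classifier.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_open_set(logits, labels, threshold=0.5):
    """Return (label, confidence); label is "unknown" if confidence < threshold.

    `threshold` is an illustrative value, not one taken from the paper.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return labels[best], probs[best]


if __name__ == "__main__":
    labels = ["en", "es", "zh"]  # hypothetical trained languages
    print(classify_open_set([5.0, 0.0, 0.0], labels))   # confident prediction
    print(classify_open_set([0.1, 0.0, 0.05], labels))  # near-uniform scores, rejected
```

In practice the threshold would be tuned on held-out data, trading off false rejections of trained languages against missed detections of unseen ones.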
