Towards Relevance and Sequence Modeling in Language Recognition

Abstract

Automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the temporal sequence of the speech signal. Conventional approaches to LID (and to speaker recognition) ignore the sequence information by extracting a long-term statistical summary of the recording, assuming independence of the feature frames. In this paper, we propose a neural network framework that utilizes short-sequence information in language recognition. In particular, a new model is proposed for incorporating relevance in language recognition, where parts of the speech data are weighted more heavily based on their relevance to the language recognition task. This relevance weighting is achieved using a bidirectional long short-term memory (BLSTM) network with attention modeling. We explore two approaches: the first uses segment-level i-vector/x-vector representations that are aggregated in the neural model, and the second models the acoustic features directly in an end-to-end neural model. Experiments are performed on the language recognition task of the NIST LRE 2017 Challenge, using clean, noisy and multi-speaker speech data, as well as on the RATS language recognition corpus. In these experiments on the noisy LRE tasks and the RATS dataset, the proposed approach yields significant improvements over conventional i-vector/x-vector based language recognition approaches as well as over previous models incorporating sequence information.
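
The relevance weighting described above can be illustrated with a minimal sketch of attention-based pooling over segment-level embeddings. This is not the paper's model: it omits the BLSTM layer and applies the attention directly to the segment embeddings, and the function name `attention_pool`, the parameters `w` and `v`, and the toy values are all assumptions made for illustration. In a trained system these parameters would be learned.

```python
import math

def attention_pool(segments, w, v):
    """Return (pooled_vector, weights) for a list of segment embeddings.

    segments: list of T embeddings, each a list of D floats
              (standing in for segment-level i-vectors/x-vectors)
    w: D x H projection matrix (list of lists), hypothetical parameter
    v: length-H scoring vector, hypothetical parameter
    """
    scores = []
    for x in segments:
        # Project the embedding and squash it, then score it against v.
        hidden = [math.tanh(sum(x[d] * w[d][h] for d in range(len(x))))
                  for h in range(len(v))]
        scores.append(sum(hj * vj for hj, vj in zip(hidden, v)))
    # Softmax over segments, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Relevance-weighted average: higher-weight segments contribute more.
    dim = len(segments[0])
    pooled = [sum(a * x[d] for a, x in zip(weights, segments))
              for d in range(dim)]
    return pooled, weights

# Toy input (assumed values): three 2-dimensional segment embeddings.
segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0], [1.0]]   # D = 2, H = 1
v = [1.0]
pooled, weights = attention_pool(segments, w, v)
```

The weights sum to one, so the pooled vector is a convex combination of the segment embeddings. Setting `v` to zeros makes every score equal and reduces the pooling to a plain average, which corresponds to the conventional long-term summary that treats all parts of the recording as equally informative.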
