Low-Resource Spoken Language Identification Using Self-Attentive Pooling and Deep 1D Time-Channel Separable Convolutions

Abstract

This memo describes the NTR/TSU winning submission to the language identification track of the Low Resource ASR challenge at the Dialog2021 conference. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) system pipeline. Traditionally, the ASR task requires large volumes of labeled data that are unattainable for most of the world's languages, including most of the languages of Russia. In this memo, we show that a convolutional neural network with a Self-Attentive Pooling layer achieves promising results on the language identification task in a low-resource setting and sets the state of the art (SOTA) on the Low Resource ASR challenge dataset. Additionally, we compare the structure of the confusion matrices for this dataset and the significantly more diverse VoxForge dataset, and we state and substantiate the hypothesis that whenever a dataset is diverse enough that other classification factors, such as gender and age, are well averaged, the confusion matrix of an LID system reflects a measure of language similarity.
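For orientation, a self-attentive pooling layer collapses a variable-length sequence of frame-level features into a single utterance-level vector by taking a learned weighted average over time. The abstract does not specify the exact variant used, so the following is only a minimal sketch of one common formulation, with scores e_t = v · tanh(W h_t + b), weights given by a softmax over time, and the output as the weighted sum of frames. The function name and the parameters `W`, `b`, and `v` are hypothetical and chosen purely for illustration.

```python
import math


def self_attentive_pooling(frames, W, b, v):
    """Pool a list of T frame vectors (each of length D) into one D-vector.

    Assumed formulation (not taken from the memo):
        e_t   = v . tanh(W h_t + b)
        alpha = softmax(e) over time
        out   = sum_t alpha_t * h_t
    W is an A x D matrix, b and v are length-A vectors (all hypothetical).
    """
    scores = []
    for h in frames:
        # Project the frame into the attention space and squash with tanh.
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)
        ]
        # Scalar relevance score for this frame.
        scores.append(sum(vi * ui for vi, ui in zip(v, hidden)))

    # Softmax over time, shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted average of the original frame vectors.
    dim = len(frames[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Toy input: three 2-dimensional frames and a 1-dimensional attention space.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled, weights = self_attentive_pooling(
    frames, W=[[0.1, 0.2]], b=[0.0], v=[1.0]
)
```

With all entries of `v` set to zero, every frame receives the same score, so the layer reduces to plain average pooling over time. The learned parameters are what allow the model to emphasize the frames that are most informative for the classification.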
