Is Attention always needed? A Case Study on Language Identification from Speech

Abstract

Language Identification (LID), a recommended initial step to Automatic Speech Recognition (ASR), is used to detect a spoken language from audio specimens. In state-of-the-art systems capable of multilingual speech processing, however, users have to explicitly set one or more languages before using them. LID, therefore, plays a very important role in situations where ASR-based systems cannot parse the uttered language in multilingual contexts, causing failure in speech recognition. We propose an attention-based convolutional recurrent neural network (CRNN with Attention) that works on Mel-frequency Cepstral Coefficient (MFCC) features of audio specimens. Additionally, we reproduce some state-of-the-art approaches, namely Convolutional Neural Network (CNN) and Convolutional Recurrent Neural Network (CRNN), and compare them to our proposed method. We performed an extensive evaluation on thirteen different Indian languages, and our model achieves a classification accuracy of over 98%. Our LID model is robust to noise and provides 91.2% accuracy in a noisy scenario. The proposed model is easily extensible to new languages.
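
As a rough illustration of what an attention step over frame-level features can look like, the following minimal sketch pools a sequence of per-frame vectors into one utterance-level vector and maps it to thirteen language scores. It is not the authors' implementation: the additive (tanh) scoring form, the dimensions `T`, `D` and `A`, the random weights `w`, `v` and `w_out`, and the random `frames` array standing in for recurrent-layer outputs computed from MFCC features are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_pool(frames, w, v):
    # frames: (T, D) frame-level features; w: (D, A); v: (A,)
    scores = np.tanh(frames @ w) @ v   # one scalar score per frame, shape (T,)
    weights = softmax(scores)          # attention weights over frames, sum to 1
    context = weights @ frames         # weighted average of frames, shape (D,)
    return context, weights

rng = np.random.default_rng(0)
T, D, A, n_langs = 50, 64, 32, 13      # hypothetical sizes; 13 languages as in the abstract

frames = rng.standard_normal((T, D))   # stand-in for recurrent outputs (assumption)
w = rng.standard_normal((D, A)) * 0.1  # untrained, random parameters (assumption)
v = rng.standard_normal(A) * 0.1
w_out = rng.standard_normal((D, n_langs)) * 0.1

context, weights = attention_pool(frames, w, v)
probs = softmax(context @ w_out)       # per-language probabilities
print(int(np.argmax(probs)))           # index of the highest-scoring language
```

With trained parameters, the attention weights would indicate which frames contribute most to the language decision; here they are meaningless because the weights are random.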
