Spoken Language Identification using ConvNets

Abstract

Language Identification (LI) is an important first step in several speech processing systems. With a growing number of voice-based assistants, speech LI has emerged as a widely researched field. To identify languages, we can adopt either an implicit approach, where only the speech for a language is available, or an explicit one, where the speech is accompanied by its corresponding transcript. This paper focuses on the implicit approach because transcribed data is absent. It benchmarks existing models and proposes a new attention-based model for language identification that uses log-Mel spectrogram images as input. We also evaluate the effectiveness of raw waveforms as input features to neural network models for LI tasks. For training and evaluation, we used the VoxForge dataset and classified six languages (English, French, German, Spanish, Russian and Italian) with an accuracy of 95.4% and four languages (English, French, German, Spanish) with an accuracy of 96.3%. This approach can be scaled further to incorporate more languages.
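As an illustration of the log-Mel spectrogram input mentioned above, the following is a minimal NumPy sketch of how such an image can be computed from a raw waveform. It is not the paper's preprocessing pipeline: the function names and every parameter value (16 kHz sampling rate, 512-point FFT, hop of 160 samples, 64 Mel bands, Hann window) are assumptions chosen only for the example.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style Hz -> Mel conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose centres are evenly spaced on the Mel scale.
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=512, hop=160, n_mels=64):
    # All defaults are illustrative assumptions, not values from the paper.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[t * hop:t * hop + n_fft] * window for t in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (frames, bins)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T     # (frames, n_mels)
    # Small offset avoids log(0); result is in decibels, shape (n_mels, frames).
    return 10.0 * np.log10(mel + 1e-10).T

# Usage: one second of a 440 Hz tone stands in for a speech clip.
sr = 16000
t = np.arange(sr) / sr
image = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
```

The resulting 2-D array (Mel bands by time frames) can be treated as a single-channel image and passed to a convolutional network, whereas a raw-waveform model would consume `wave` directly.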
