Abstract
In this work, we propose a new pooling strategy for language identification, with a focus on Indian languages. The idea is to obtain utterance-level features from audio of any length for robust language recognition. We use the GhostVLAD approach to generate an utterance-level feature vector for variable-length input audio by aggregating the local frame-level features across time. The resulting feature vector is shown to be highly language discriminative and yields state-of-the-art results on the language identification task. We conduct our experiments on 635 hours of audio data covering 7 Indian languages. Our method outperforms the previous state-of-the-art x-vector [11] method by an absolute improvement of 1.88% in F1-score and achieves a 98.43% F1-score on the held-out test data. We compare our system with various pooling approaches and show that GhostVLAD performs best among them for this task. We also provide a visualization of the utterance-level embeddings generated using GhostVLAD pooling and show that this method creates embeddings that have very good language-discriminative properties.
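To make the aggregation step concrete, the following is a minimal NumPy sketch of GhostVLAD-style pooling as it is commonly described: frames are softly assigned to K real clusters plus G "ghost" clusters, residuals to the cluster centers are summed over time, and the ghost clusters are dropped before normalization. It is an illustration under stated assumptions, not the implementation used in this work. The function name `ghostvlad_pool`, its parameters, and the sizes in the usage lines are hypothetical, and the centers, weights, and biases are random placeholders for parameters that would normally be learned jointly with the frame-level network.

```python
import numpy as np


def ghostvlad_pool(frames, centers, weights, biases, num_ghost):
    """Pool (T, D) frame-level features into one fixed-size vector.

    frames:    (T, D) frame-level features, T may vary per utterance
    centers:   (K + G, D) cluster centers, the last G are ghost clusters
    weights:   (D, K + G) soft-assignment projection (assumed learned)
    biases:    (K + G,) soft-assignment biases (assumed learned)
    num_ghost: G, number of ghost clusters to discard
    """
    # Soft-assign every frame to all K + G clusters (numerically stable softmax).
    logits = frames @ weights + biases                    # (T, K + G)
    logits = logits - logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)

    # Assignment-weighted residuals to each center, summed across time.
    residuals = frames[:, None, :] - centers[None, :, :]  # (T, K + G, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)   # (K + G, D)

    # Drop ghost clusters: they absorb uninformative frames but do not
    # contribute to the final descriptor.
    num_real = centers.shape[0] - num_ghost
    vlad = vlad[:num_real]                                # (K, D)

    # Intra-normalize each cluster, flatten, then L2-normalize the whole vector.
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
    flat = vlad.reshape(-1)                               # (K * D,)
    return flat / (np.linalg.norm(flat) + 1e-12)


# Usage with illustrative sizes: two utterances of different length
# map to vectors of the same dimension K * D.
rng = np.random.default_rng(0)
D, K, G = 8, 4, 2
centers = rng.normal(size=(K + G, D))
weights = rng.normal(size=(D, K + G))
biases = np.zeros(K + G)

short_utt = ghostvlad_pool(rng.normal(size=(50, D)), centers, weights, biases, G)
long_utt = ghostvlad_pool(rng.normal(size=(400, D)), centers, weights, biases, G)
print(short_utt.shape, long_utt.shape)
```

Because the sum runs over the time axis, the output dimension depends only on K and D, which is what allows utterances of any duration to be compared in the same embedding space.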