Exploring wav2vec 2.0 on speaker verification and language identification

Abstract

Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning. It follows a two-stage training process of pre-training and fine-tuning, and performs well in speech recognition tasks, especially in ultra-low-resource cases. In this work, we attempt to extend this self-supervised framework to speaker verification and language identification. First, we use preliminary experiments to show that wav2vec 2.0 can capture information about the speaker and language. Then we demonstrate the effectiveness of wav2vec 2.0 on the two tasks respectively. For speaker verification, we obtain a new state-of-the-art result, an Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset. For language identification, we obtain an EER of 12.02% on the 1-second condition and an EER of 3.47% on the full-length condition of the AP17-OLR dataset. Finally, we use a single model to achieve unified modeling of the two tasks through multi-task learning.
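
Since the results above are reported as Equal Error Rate, the following is a minimal sketch of how an EER can be approximated from trial scores. It is an illustration, not the evaluation code used in this work: the function name `equal_error_rate`, the toy scores, and the simple threshold sweep (no interpolation between operating points) are all assumptions made for this example.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration: a trial is accepted when its
    score is >= the threshold. The EER is taken at the threshold where the
    false acceptance and false rejection rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = float("inf")
    eer = 1.0
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores (made up): one target trial scores below one non-target trial.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))  # 0.25
```

A lower EER means the score distributions of target and non-target trials overlap less; with perfectly separated scores this sketch returns 0.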
