Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition

Abstract

Pretrained contextual word representations in NLP have greatly improved performance on various downstream tasks. For speech, we propose contextual frame representations that capture phonetic information at the acoustic frame level and can be used for utterance-level language, speaker, and speech recognition. These representations come from the frame-wise intermediate representations of an end-to-end, self-attentive ASR model (SAN-CTC) on spoken utterances. We first train the model on the Fisher English corpus with context-independent phoneme labels, then use its representations at inference time as features for task-specific models on the NIST LRE07 closed-set language recognition task and a Fisher speaker recognition task, giving significant improvements over the state-of-the-art on both (e.g., language EER of 4.68% on 3sec utterances, 23% relative reduction in speaker EER). Results remain competitive when using a novel dilated convolutional model for language recognition, or when ASR pretraining is done with character labels only.
