Abstract
In this paper, we explore the encoding/pooling layer and the loss function in end-to-end speaker and language recognition systems. First, a unified and interpretable end-to-end system for both speaker and language recognition is developed. It accepts variable-length input and produces an utterance-level result. In the end-to-end system, the encoding layer aggregates the variable-length input sequence into an utterance-level representation. Besides the basic temporal average pooling, we introduce a self-attentive pooling layer and a learnable dictionary encoding layer to obtain the utterance-level representation. As for the loss function in open-set speaker verification, center loss and angular softmax loss are introduced into the end-to-end system to obtain more discriminative speaker embeddings. Experimental results on the VoxCeleb and NIST LRE 07 datasets show that the performance of the end-to-end learning system can be significantly improved by the proposed encoding layers and loss functions.
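To illustrate how an encoding layer maps a variable-length sequence to a fixed-size utterance-level vector, the following is a minimal sketch in plain Python. It contrasts temporal average pooling with a simple form of self-attentive pooling. The function names, the dot-product scoring with a single `attention_vector`, and the toy inputs are assumptions made for illustration only; they are not the parameterisation used in this paper.

```python
import math


def temporal_average_pooling(frames):
    """Average frame-level features over time: T x D list -> D-dim list."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / num_frames for d in range(dim)]


def self_attentive_pooling(frames, attention_vector):
    """Weighted average over time using softmax-normalised frame scores.

    Assumption: each frame is scored by a dot product with one vector
    (learned in a real system); this is a hypothetical simplification.
    """
    scores = [sum(x * w for x, w in zip(f, attention_vector)) for f in frames]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


# Two toy "utterances" of different lengths, each with 2-dim frame features.
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]

avg_short = temporal_average_pooling(short_utt)
avg_long = temporal_average_pooling(long_utt)

# A zero attention vector gives uniform weights, so the result
# coincides with temporal average pooling.
att_uniform = self_attentive_pooling(long_utt, [0.0, 0.0])

# A non-zero attention vector shifts weight towards high-scoring frames.
att_focused = self_attentive_pooling(long_utt, [1.0, 1.0])
```

Both utterances yield a vector of the same dimension regardless of their length, which is the property the encoding layer provides. In this sketch, average pooling is the special case of attentive pooling in which every frame receives the same weight.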