Abstract
We propose an end-to-end deep model for speaker verification in the wild. Our model uses a thin-ResNet to extract speaker embeddings from utterances and a Siamese capsule network with dynamic routing as the Back-end to calculate a similarity score between the embeddings. We conduct a series of experiments comparing our model with state-of-the-art solutions, showing that it outperforms all the other models while using substantially less training data. We also perform additional experiments to study the impact of different speaker embeddings on the Siamese capsule network. We show that the best performance is achieved by using embeddings obtained directly from the feature aggregation module of the Front-end and passing them to higher capsules using dynamic routing.
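
To make the phrase "passing them to higher capsules using dynamic routing" concrete, the following is a minimal sketch of generic routing-by-agreement between two capsule layers. It is an illustration under stated assumptions, not the configuration used in this work: the function names `squash`, `softmax`, and `dynamic_routing`, the nested-list layout of `u_hat`, and the default of three routing iterations are all hypothetical choices made for readability.

```python
import math


def squash(v):
    """Squashing non-linearity: keeps the direction of v, maps its norm into [0, 1)."""
    sq = sum(x * x for x in v)
    if sq == 0.0:
        return [0.0 for _ in v]
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in v]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, num_iters=3):
    """Route prediction vectors from lower capsules to higher capsules.

    u_hat[i][j] is the prediction vector (list of floats) that lower
    capsule i makes for higher capsule j (layout is an assumption).
    Returns (v, c): higher-capsule outputs v[j] and the coupling
    coefficients c[i][j] that were used to compute them.
    """
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    for _ in range(num_iters):
        # Each lower capsule distributes its output over the higher capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            # Weighted sum of the predictions for higher capsule j.
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                 for d in range(dim)]
            v.append(squash(s))
        # Raise the logit where a prediction agrees with the capsule output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c
```

In this sketch, lower capsules whose predictions agree on a higher capsule reinforce each other and receive larger coupling coefficients toward it, while conflicting predictions cancel out in the weighted sum and leave that capsule with a short output vector.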