Speaker diarisation using 2D self-attentive combination of embeddings

Abstract

Speaker diarisation systems often cluster audio segments using speaker embeddings such as i-vectors and d-vectors. Since different types of embeddings are often complementary, this paper proposes a generic framework to improve performance by combining them into a single embedding, referred to as a c-vector. This combination uses a 2-dimensional (2D) self-attentive structure, which extends the standard self-attentive layer by averaging not only across time but also across different types of embeddings. Two types of 2D self-attentive structure are presented in this paper: the simultaneous combination and the consecutive combination, which adopt a single self-attentive layer and multiple self-attentive layers respectively. The penalty term in the original self-attentive layer, which is jointly minimised with the objective function to encourage diversity of annotation vectors, is also modified to obtain not only different local peaks but also the overall trends in the multiple annotation vectors. Experiments on the AMI meeting corpus show that our modified penalty term improves the relative speaker error rate (SER) by 6% and 21% for d-vector systems, and a 10% further relative SER reduction can be obtained using the c-vector from our best 2D self-attentive structure.
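As a rough illustration of the ideas summarised above, the following NumPy sketch shows a standard self-attentive pooling layer, the original diversity penalty on its annotation matrix (the squared Frobenius norm of A Aᵀ − I), and one possible reading of a "simultaneous" combination in which frames from several embedding types are pooled by a single layer. It is not the paper's implementation: the function names, the dimensions, the random weights, and the assumption that all embedding types share one dimension D are hypothetical choices made only for this example, and the paper's modified penalty term is not reproduced here.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attentive_pool(H, W1, W2):
    """Standard self-attentive pooling.

    H  : (N, D) input vectors to be averaged
    W1 : (Da, D) first projection (hypothetical size Da)
    W2 : (R, Da) second projection, one row per attention head
    Returns the annotation matrix A (R, N) and the pooled output (R, D).
    """
    scores = W2 @ np.tanh(W1 @ H.T)  # (R, N)
    A = softmax(scores, axis=1)      # each row is a weight distribution over N
    return A, A @ H


def diversity_penalty(A):
    """Original penalty: squared Frobenius norm of A A^T - I."""
    R = A.shape[0]
    return np.linalg.norm(A @ A.T - np.eye(R), ord="fro") ** 2


# Toy data: K embedding types, T frames each, shared dimension D (assumed).
rng = np.random.default_rng(0)
K, T, D, Da, R = 2, 5, 8, 16, 3
embeddings = rng.standard_normal((K, T, D))
W1 = rng.standard_normal((Da, D)) * 0.1
W2 = rng.standard_normal((R, Da)) * 0.1

# "Simultaneous" reading: one layer attends over all K*T vectors at once,
# so the weighted average runs across time and embedding type together.
H = embeddings.reshape(K * T, D)
A, pooled = self_attentive_pool(H, W1, W2)
c_vector = pooled.reshape(-1)        # concatenate the R heads
penalty = diversity_penalty(A)
```

In training, a penalty of this kind would be added to the main objective with a weighting coefficient. A consecutive variant would instead apply separate layers in sequence, for example first across time and then across embedding types.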
