Abstract
In this paper, we propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with a sequence-to-sequence architecture (NSD-MS2S). The system leverages a memory module to enhance speaker embeddings and employs a sequence-to-sequence (Seq2Seq) framework to efficiently map acoustic features to speaker labels. Additionally, we explore the application of mixture of experts to speaker diarization and introduce a Shared and Soft Mixture of Experts (SS-MoE) module to further mitigate model bias and enhance performance. Incorporating SS-MoE leads to the extended model NSD-MS2S-SSMoE. Experiments on multiple complex acoustic datasets, including the CHiME-6, DiPCo, Mixer 6, and DIHARD-III evaluation sets, demonstrate meaningful improvements in robustness and generalization. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.