Wavesplit: End-to-End Speech Separation by Speaker Clustering

Abstract

We introduce Wavesplit, an end-to-end speech separation system. From a single recording of mixed speech, the model infers and clusters representations of each speaker and then estimates each source signal conditioned on the inferred representations. The model is trained on the raw waveform to jointly perform the two tasks. Our model infers a set of speaker representations through clustering, which addresses the fundamental permutation problem of speech separation. Moreover, the sequence-wide speaker representations provide a more robust separation of long, challenging sequences, compared to previous approaches. We show that Wavesplit outperforms the previous state-of-the-art on clean mixtures of 2 or 3 speakers (WSJ0-2mix, WSJ0-3mix), as well as in noisy (WHAM!) and reverberated (WHAMR!) conditions. As an additional contribution, we further improve our model by introducing online data augmentation for separation.
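To illustrate how clustering can sidestep the permutation problem, here is a minimal sketch in Python. It is not the Wavesplit model: it assumes that some upstream network (not shown) has already produced one vector per speaker at every time step, in an arbitrary order that may change from step to step, and it uses plain k-means as a stand-in for the clustering step. The function name `cluster_speaker_vectors`, the array shapes, and the toy data are all hypothetical.

```python
import numpy as np


def cluster_speaker_vectors(frames, n_speakers, n_iters=20):
    """Cluster per-frame speaker vectors into sequence-wide centroids.

    frames: array of shape (T, n_speakers, D). The order of the vectors
    within each frame is arbitrary (this is the permutation ambiguity).
    Returns an array of shape (n_speakers, D), one centroid per speaker.
    Plain k-means is an assumption made for illustration only.
    """
    # Pool all vectors: the per-frame ordering is deliberately discarded.
    points = frames.reshape(-1, frames.shape[-1])

    # Deterministic farthest-point initialisation of the centroids.
    chosen = [points[0]]
    for _ in range(1, n_speakers):
        dist_to_nearest = np.min(
            [np.sum((points - c) ** 2, axis=1) for c in chosen], axis=0
        )
        chosen.append(points[np.argmax(dist_to_nearest)])
    centroids = np.stack(chosen)

    for _ in range(n_iters):
        # Assign every vector to its nearest centroid.
        dists = np.sum((points[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for k in range(n_speakers):
            members = points[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids


# Toy data: two hypothetical speakers with well-separated vectors,
# whose order is randomly swapped in roughly half of the frames.
rng = np.random.default_rng(0)
true_vectors = np.array([[5.0, 0.0], [-5.0, 0.0]])
frames = true_vectors[None] + 0.1 * rng.standard_normal((50, 2, 2))
for t in range(frames.shape[0]):
    if rng.random() < 0.5:
        frames[t] = frames[t, ::-1].copy()

centroids = cluster_speaker_vectors(frames, n_speakers=2)
```

Because the centroids are computed from the pooled vectors, they do not depend on how the speakers happened to be ordered in any individual frame. In this toy setting, that is the property that lets a single set of sequence-wide representations condition the separation of the whole recording.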
