Continual self-training with bootstrapped remixing for speech enhancement

Abstract

We propose RemixIT, a simple and novel self-supervised training method for speech enhancement. The proposed method is based on a continuous self-training scheme that overcomes limitations of previous studies, including assumptions about the in-domain noise distribution and the need for access to clean target signals. Specifically, a separation teacher model is pre-trained on an out-of-domain dataset and is used to infer estimated target signals for a batch of in-domain mixtures. Next, we bootstrap the mixing process by generating artificial mixtures using permuted estimated clean and noise signals. Finally, the student model is trained using the permuted estimated sources as targets, while we periodically update the teacher's weights using the latest student model. Our experiments show that RemixIT outperforms several previous state-of-the-art self-supervised methods on multiple speech enhancement tasks. Additionally, RemixIT provides a seamless alternative for semi-supervised and unsupervised domain adaptation for speech enhancement tasks, while being general enough to be applied to any separation task and paired with any separation model.
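The remixing step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `toy_teacher`, `remix`, and `refresh_teacher` are hypothetical names, the teacher is a fixed-ratio stand-in rather than a trained separation model, only the noise estimates are shuffled across the batch, and the teacher update is shown as a plain weight copy. All of these are assumptions made for illustration.

```python
import numpy as np


def toy_teacher(mixtures):
    """Hypothetical stand-in for a pre-trained separation teacher.

    Splits each mixture into a "speech" and a "noise" estimate with a fixed
    ratio so that the example runs without a trained network.
    """
    est_speech = 0.6 * mixtures
    est_noise = mixtures - est_speech
    return est_speech, est_noise


def remix(est_speech, est_noise, rng):
    """Create bootstrapped mixtures by shuffling noise estimates across the batch."""
    perm = rng.permutation(est_noise.shape[0])
    permuted_noise = est_noise[perm]
    new_mixtures = est_speech + permuted_noise
    # The student would be trained to recover these two sources from new_mixtures.
    return new_mixtures, est_speech, permuted_noise


def refresh_teacher(student_weights):
    """Periodic teacher update, shown here as a plain copy of the student's weights (assumption)."""
    return {name: w.copy() for name, w in student_weights.items()}


rng = np.random.default_rng(0)
in_domain_mixtures = rng.standard_normal((4, 16))  # batch of 4 toy "waveforms"
est_speech, est_noise = toy_teacher(in_domain_mixtures)
new_mixtures, target_speech, target_noise = remix(est_speech, est_noise, rng)
```

In a real training loop, the student's loss would compare its outputs on `new_mixtures` against `target_speech` and `target_noise`, and `refresh_teacher` would be called at some chosen interval.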
