Abstract
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio, which is critical for artificial auditory perception. However, current methods heavily rely on artificially mixed audio for training, which limits their ability to generalize to naturally mixed audio collected in real-world environments. To overcome this limitation, we propose ClearSep, an innovative framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks, thereby allowing effective sound separation in real-world scenarios. We introduce two remix-based evaluation metrics to quantitatively assess separation quality and use these metrics as thresholds to iteratively apply the data engine alongside model training, progressively optimizing separation performance. In addition, we propose a series of training strategies tailored to these separated independent tracks to make the best use of them. Extensive experiments demonstrate that ClearSep achieves state-of-the-art performance across multiple sound separation tasks, highlighting its potential for advancing sound separation in natural audio scenarios. For more examples and detailed results, please visit our demo page at https://clearsep.github.io.
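To make the idea of thresholding on a remix-based score concrete, the following is a minimal sketch and not the ClearSep implementation. The abstract does not define the two metrics, so this example assumes one plausible form: the separated tracks are summed back together and compared with the original mixture using scale-invariant SDR. The function names `si_sdr`, `remix_score`, and `keep_for_training`, as well as the 30 dB threshold, are hypothetical.

```python
import math
from typing import Sequence


def si_sdr(reference: Sequence[float], estimate: Sequence[float],
           eps: float = 1e-12) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    # Project the estimate onto the reference to remove any gain mismatch.
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def remix_score(mixture: Sequence[float],
                tracks: Sequence[Sequence[float]]) -> float:
    """Assumed metric: how well the summed tracks reconstruct the mixture."""
    remix = [sum(samples) for samples in zip(*tracks)]
    return si_sdr(mixture, remix)


def keep_for_training(mixture: Sequence[float],
                      tracks: Sequence[Sequence[float]],
                      threshold_db: float) -> bool:
    """Hypothetical filter: keep a decomposition only if it clears the threshold."""
    return remix_score(mixture, tracks) >= threshold_db


# Toy signals: two tracks that sum to the mixture, and an incomplete split.
mixture = [1.0, -0.5, 0.25, 0.0]
track_a = [0.6, -0.2, 0.1, 0.0]
track_b = [0.4, -0.3, 0.15, 0.0]

print(keep_for_training(mixture, [track_a, track_b], threshold_db=30.0))
print(keep_for_training(mixture, [track_a], threshold_db=30.0))
```

In an iterative setup of the kind the abstract describes, a filter like this would be re-applied after each round of model training, so that only decompositions passing the threshold feed the next round.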