Diffusion Buffer for Online Generative Speech Enhancement

  • 2025-10-21 15:52:33
  • Bunlong Lay, Rostislav Makarov, Simon Welker, Maris Hillemann, Timo Gerkmann

Abstract

Online Speech Enhancement has so far been mainly reserved for predictive models. A key advantage of these models is that, for each incoming signal frame from a stream of data, the model is called only once for enhancement. In contrast, generative Speech Enhancement models often require multiple calls, resulting in a computational complexity that is too high for many online speech enhancement applications. This work presents the Diffusion Buffer, a generative diffusion-based Speech Enhancement model that requires only one neural network call per incoming signal frame from a stream of data and performs enhancement in an online fashion on a consumer-grade GPU. The key idea of the Diffusion Buffer is to align physical time with Diffusion time-steps. The approach progressively denoises frames through physical time, so that past frames have more noise removed. Consequently, an enhanced frame is output to the listener with a delay defined by the Diffusion Buffer, and the output frame has a corresponding look-ahead. In this work, we extend our previous work by carefully designing a 2D convolutional UNet architecture that specifically aligns with the Diffusion Buffer's look-ahead. We observe that the proposed UNet improves performance, particularly when the algorithmic latency is low. Moreover, we show that using a Data Prediction loss instead of a Denoising Score Matching loss enables flexible control over the trade-off between algorithmic latency and quality during inference. The extended Diffusion Buffer, equipped with a novel NN and loss function, drastically reduces the algorithmic latency from 320-960 ms to 32-176 ms while also improving performance. While it has been shown before that offline generative diffusion models outperform predictive approaches on unseen noisy speech data, we confirm that the online Diffusion Buffer also outperforms its predictive counterpart on unseen noisy speech data.
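
The buffer bookkeeping described in the abstract (one network call per incoming frame, older frames further along in denoising, output delayed by the buffer length) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `DiffusionBufferSketch`, the function `toy_denoiser`, the zero-filled warm-up, and the linear spacing of diffusion times are all assumptions made for illustration, and the stand-in "denoiser" only counts refinement steps instead of removing noise.

```python
import numpy as np


def toy_denoiser(buffer, levels):
    """Hypothetical stand-in for the neural network (assumption, not the paper's model).

    It receives the whole buffer and the diffusion time of every slot and would
    normally perform one reverse step per frame. Here it just adds 1 to each frame
    so that the number of refinement steps a frame received can be read off.
    """
    return buffer + 1.0


class DiffusionBufferSketch:
    """Toy buffer in which slot position stands for the diffusion time-step."""

    def __init__(self, num_slots, frame_dim, denoise_fn):
        self.denoise_fn = denoise_fn
        # Slot 0 holds the oldest frame (smallest diffusion time),
        # the last slot holds the newest frame (largest diffusion time).
        # Linear spacing is an assumption for this sketch.
        self.levels = np.linspace(1.0 / num_slots, 1.0, num_slots)
        # Warm-up assumption: start with zero frames so every push can emit output.
        self.frames = [np.zeros(frame_dim) for _ in range(num_slots - 1)]
        self.calls = 0

    def push(self, noisy_frame):
        """Insert one incoming frame, make ONE denoiser call, emit the oldest frame."""
        self.frames.append(np.asarray(noisy_frame, dtype=float))
        stacked = np.stack(self.frames)               # shape (num_slots, frame_dim)
        refined = self.denoise_fn(stacked, self.levels)
        self.calls += 1
        # The remaining frames each move one slot toward the output.
        self.frames = [row for row in refined[1:]]
        return refined[0]                             # fully processed oldest frame


buf = DiffusionBufferSketch(num_slots=4, frame_dim=2, denoise_fn=toy_denoiser)
# Frame k carries the value 10 * k so its path through the buffer can be traced.
outputs = [buf.push(np.full(2, 10.0 * k)) for k in range(10)]
```

In this sketch a frame pushed at step k leaves the buffer at step k + 3, after having been part of four denoiser calls, while the total number of calls equals the number of incoming frames. The frames that entered after it play the role of its look-ahead.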

 
