Abstract
Speech Super-Resolution (SSR) is the task of enhancing low-resolution speech signals by restoring missing high-frequency components. Conventional approaches typically reconstruct log-mel features, followed by a vocoder that generates high-resolution speech in the waveform domain. However, because log-mel features lack phase information, this can degrade performance during the reconstruction phase. Motivated by recent advances in Selective State Space Models (SSMs), we propose a method, referred to as Wave-U-Mamba, that directly performs SSR in the time domain. In our comparative study, which includes models such as WSRGlow, NU-Wave 2, and AudioSR, Wave-U-Mamba demonstrates superior performance, achieving the lowest Log-Spectral Distance (LSD) across various low-resolution sampling rates, ranging from 8 kHz to 24 kHz. Additionally, subjective human evaluations, scored using Mean Opinion Score (MOS), reveal that our method produces super-resolved speech with natural and human-like quality. Furthermore, Wave-U-Mamba achieves these results while generating high-resolution speech over nine times faster than baseline models on a single A100 GPU, with a parameter count less than 2% of that of the baseline models.
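
For readers unfamiliar with the LSD metric named above, the following is a minimal sketch of one commonly used formulation: the per-frame root-mean-square difference between log10 power spectra, averaged over frames. The abstract does not state the exact definition used in the evaluation, so this formulation, the function name `log_spectral_distance`, and the `eps` floor are assumptions made for illustration only. The inputs are power spectrograms that have already been computed (for example by an STFT) and are passed as plain lists of frames.

```python
import math


def log_spectral_distance(ref_power, est_power, eps=1e-10):
    """Illustrative LSD between two power spectrograms.

    ref_power, est_power: lists of frames; each frame is a list of
    non-negative power values |S(t, f)|^2 on the same frequency grid.
    eps: assumed floor that avoids log10(0); not taken from the paper.
    """
    if not ref_power or len(ref_power) != len(est_power):
        raise ValueError("spectrograms must be non-empty with equal frame counts")

    per_frame = []
    for ref_frame, est_frame in zip(ref_power, est_power):
        if not ref_frame or len(ref_frame) != len(est_frame):
            raise ValueError("frames must be non-empty with equal bin counts")
        # Squared difference of log10 power in each frequency bin.
        squared = [
            (math.log10(r + eps) - math.log10(e + eps)) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        # Root-mean-square over frequency bins for this frame.
        per_frame.append(math.sqrt(sum(squared) / len(squared)))

    # Average the per-frame distances over time.
    return sum(per_frame) / len(per_frame)
```

Under this formulation a lower value means the estimated spectrum is closer to the reference. Identical spectrograms give a distance of zero, and a uniform factor-of-ten power error gives a distance of about one.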