SLNSpeech: solving extended speech separation problem by the help of sign language

Abstract

Speech separation can be roughly divided into audio-only separation and audio-visual separation. To make speech separation technology applicable in real scenarios involving people with disabilities, this paper presents an extended speech separation problem, namely sign language assisted speech separation. However, most existing datasets for speech separation consist of audio and video recordings that contain only audio and/or visual modalities. To address the extended speech separation problem, we introduce a large-scale dataset named the Sign Language News Speech (SLNSpeech) dataset, in which three modalities (audio, visual, and sign language) coexist. We then design a general deep learning network for self-supervised learning over the three modalities, in particular using sign language embeddings together with audio or audio-visual information to better solve the speech separation task. Specifically, we use a 3D residual convolutional network to extract sign language features and a pretrained VGGNet model to extract visual features. After that, an improved U-Net with skip connections in the feature extraction stage is applied to learn the embeddings among the mixed spectrogram transformed from the source audios, the sign language features, and the visual features. Experimental results show that, besides the visual modality, the sign language modality can also be used alone to supervise the speech separation task. Moreover, we show the effectiveness of sign language assisted speech separation when the visual modality is disturbed. Source code will be released at http://cheertt.top/homepage/
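
The self-supervised setup mentioned in the abstract, in which a mixed spectrogram is built from known source audios, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a mix-and-separate scheme with a ratio-mask training target, which the abstract does not specify, and it treats magnitude spectrograms as additive, a common simplification. The names `mix`, `ratio_mask`, and `apply_mask` are hypothetical. In the described system, a network conditioned on sign language or visual features would predict such a mask; here the mask is computed directly from the known source to show what the supervision signal could look like.

```python
from typing import List

# A magnitude spectrogram as nested lists: [frequency bin][time frame].
Spectrogram = List[List[float]]


def mix(a: Spectrogram, b: Spectrogram) -> Spectrogram:
    """Build a synthetic mixture by adding two source spectrograms.

    Assumption: magnitudes are treated as additive, which is only an
    approximation of mixing in the waveform domain.
    """
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def ratio_mask(source: Spectrogram, mixture: Spectrogram, eps: float = 1e-8) -> Spectrogram:
    """Hypothetical training target: the share of each mixture bin owned by `source`."""
    return [
        [s / (m + eps) for s, m in zip(row_s, row_m)]
        for row_s, row_m in zip(source, mixture)
    ]


def apply_mask(mask: Spectrogram, mixture: Spectrogram) -> Spectrogram:
    """Recover an estimate of one source by masking the mixture bin by bin."""
    return [[k * m for k, m in zip(row_k, row_m)] for row_k, row_m in zip(mask, mixture)]


# Two toy "speakers" with 2 frequency bins and 2 time frames each.
speaker_a: Spectrogram = [[1.0, 0.0], [2.0, 2.0]]
speaker_b: Spectrogram = [[1.0, 4.0], [0.0, 2.0]]

mixture = mix(speaker_a, speaker_b)
target_mask_a = ratio_mask(speaker_a, mixture)
estimate_a = apply_mask(target_mask_a, mixture)
```

Because the sources are known before mixing, the target mask comes for free, so no manual labels are needed. The conditioning modality (sign language, visual, or both) only tells the network which speaker's mask to predict.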
