My lips are concealed: Audio-visual speech enhancement through obstructions

Abstract

Our objective is an audio-visual model for separating a single speaker from a mixture of sounds such as other speakers and background noise. Moreover, we wish to hear the speaker even when the visual cues are temporarily absent due to occlusion. To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on the speaker's lip movements and/or a representation of their voice. The voice representation can be obtained either (i) by enrollment, or (ii) by self-enrollment, that is, learning the representation on-the-fly given sufficient unobstructed visual input. The model is trained by blending audios, and by introducing artificial occlusions around the mouth region that prevent the visual modality from dominating. The method is speaker-independent, and we demonstrate it on real examples of speakers unheard (and unseen) during training. The method also improves over previous models, in particular for cases of occlusion in the visual modality.
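
The two training-time augmentations mentioned in the abstract, blending audios and artificially occluding the mouth region, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `blend_audio` and `occlude_frames`, the SNR-based mixing rule, and the choice to blank whole frames over a time span are all assumptions made for illustration only.

```python
import numpy as np


def blend_audio(target, interferer, snr_db=0.0):
    """Mix two equal-length waveforms at a chosen signal-to-noise ratio in dB.

    Hypothetical helper: the mixing rule is an assumption, not taken from the paper.
    """
    target = np.asarray(target, dtype=np.float64)
    interferer = np.asarray(interferer, dtype=np.float64)
    if target.shape != interferer.shape:
        raise ValueError("waveforms must have the same shape")
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    if p_interf == 0.0:
        # A silent interferer contributes nothing to the mixture.
        return target.copy()
    # Choose gain so that p_target / (gain**2 * p_interf) == 10 ** (snr_db / 10).
    gain = np.sqrt(p_target / (p_interf * 10.0 ** (snr_db / 10.0)))
    return target + gain * interferer


def occlude_frames(frames, start, length, fill=0.0):
    """Blank frames [start, start + length) of a (T, H, W) mouth-crop clip.

    Hypothetical helper: whole-frame blanking is a simplification (assumption).
    Returns the occluded copy and a boolean mask, True where a frame is visible.
    """
    if start < 0 or length < 0:
        raise ValueError("start and length must be non-negative")
    occluded = np.array(frames, dtype=np.float64, copy=True)
    stop = min(start + length, occluded.shape[0])
    occluded[start:stop] = fill
    visible = np.ones(occluded.shape[0], dtype=bool)
    visible[start:stop] = False
    return occluded, visible


# Example usage on synthetic data (random noise stands in for real speech and video).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mixture = blend_audio(speech, noise, snr_db=0.0)

clip = rng.random((25, 64, 64))
occluded_clip, visible = occlude_frames(clip, start=10, length=8)
```

In a setup like this, the network would see `mixture` and `occluded_clip` as inputs and `speech` as the target, so it cannot rely on lip movements alone during the blanked span.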
