VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

Abstract

We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate, we propose to leverage the speaker's face appearance as an additional prior to isolate the corresponding vocal qualities they are likely to produce. Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video. It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement, and generalizes well to challenging real-world videos of diverse scenarios. Our video results and code: http://vision.cs.utexas.edu/projects/VisualVoice/.
