TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning

Abstract

The goal of this work is Active Speaker Detection (ASD), a task to determinewhether a person is speaking or not in a series of video frames. Previous workshave dealt with the task by exploring network architectures while learningeffective representations has been less explored. In this work, we proposeTalkNCE, a novel talk-aware contrastive loss. The loss is only applied to partof the full segments where a person on the screen is actually speaking. Thisencourages the model to learn effective representations through the naturalcorrespondence of speech and facial movements. Our loss can be jointlyoptimized with the existing objectives for training ASD models without the needfor additional supervision or training data. The experiments demonstrate thatour loss can be easily integrated into the existing ASD frameworks, improvingtheir performance. Our method achieves state-of-the-art performances onAVA-ActiveSpeaker and ASW datasets.

Quick Read (beta)

loading the full paper ...