Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

  • 2019-07-18 17:55:38
  • Kalin Stefanov, Jonas Beskow, Giampiero Salvi

Abstract

This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their face. Furthermore, the method does not rely on external annotations, thus complying with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker dependent setting. However, in a speaker independent setting the proposed method yields a significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.