Learning Spatial-Temporal Graphs for Active Speaker Detection

Abstract

We address the problem of active speaker detection through a new framework, called SPELL, that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data. We cast active speaker detection as a node classification task that is aware of longer-term dependencies. We first construct a graph from a video so that each node corresponds to one person. Nodes representing the same identity share edges between them within a defined temporal window. Nodes within the same video frame are also connected to encode inter-person interactions. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations, owing to their explicit spatial and temporal structure, significantly improves overall performance. SPELL outperforms several relevant baselines and performs on par with state-of-the-art models at an order of magnitude lower computational cost.
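
The graph construction described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each node is one person observed in one frame, stored as a hypothetical (frame_index, person_id) pair, and that the temporal window is measured in frames. The function name `build_edges` and the `window` parameter are illustrative only.

```python
from typing import List, Set, Tuple

# Assumption: a node is one person in one frame, given as (frame_index, person_id).
Node = Tuple[int, int]


def build_edges(nodes: List[Node], window: int) -> Set[Tuple[int, int]]:
    """Return undirected edges as (i, j) node-index pairs with i < j."""
    edges: Set[Tuple[int, int]] = set()
    for i, (frame_i, person_i) in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            frame_j, person_j = nodes[j]
            # Temporal edge: same identity, frames at most `window` apart.
            same_person_nearby = (
                person_i == person_j and abs(frame_i - frame_j) <= window
            )
            # Spatial edge: different people visible in the same frame.
            same_frame = frame_i == frame_j
            if same_person_nearby or same_frame:
                edges.add((i, j))
    return edges


# Two people in frames 0 and 1, then person 0 alone in frame 5.
nodes = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 0)]
edges = build_edges(nodes, window=2)
print(sorted(edges))
```

With `window=2`, the node at frame 5 stays isolated because it is more than two frames away from any other node of the same identity. A graph neural network would then classify each node as speaking or not speaking from its audio-visual features; that step is omitted here.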
