Audiovisual Speaker Tracking using Nonlinear Dynamical Systems with Dynamic Stream Weights

Abstract

Data fusion plays an important role in many technical applications that require efficient processing of multimodal sensory observations. A prominent example is audiovisual signal processing, which has gained increasing attention in automatic speech recognition, speaker localization and related tasks. If appropriately combined with acoustic information, additional visual cues can help to improve the performance in these applications, especially under adverse acoustic conditions. A dynamic weighting of acoustic and visual streams based on instantaneous sensor reliability measures is an efficient approach to data fusion in this context. This paper presents a framework that extends the well-established theory of nonlinear dynamical systems with the notion of dynamic stream weights for an arbitrary number of sensory observations. It comprises a recursive state estimator based on the Gaussian filtering paradigm, which incorporates dynamic stream weights into a framework closely related to the extended Kalman filter. Additionally, a convex optimization approach to estimate oracle dynamic stream weights in fully observed dynamical systems utilizing a Dirichlet prior is presented. This serves as a basis for a generic parameter learning framework of dynamic stream weight estimators. The proposed system is application-independent and can be easily adapted to specific tasks and requirements. A study using audiovisual speaker tracking tasks is considered as an exemplary application in this work. An improved tracking performance of the dynamic stream weight-based estimation framework over state-of-the-art methods is demonstrated in the experiments.
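As a rough illustration of the stream-weighting idea, and not the estimator proposed in the paper, the following sketch performs a single measurement update for a scalar state observed directly by several sensors. It assumes Gaussian observation noise and stream weights that are non-negative and sum to one, applied as exponents on the per-sensor likelihoods; for a Gaussian likelihood, an exponent w is equivalent to dividing that sensor's precision contribution by 1/w. The function name `weighted_gaussian_update` and all parameter names are hypothetical.

```python
def weighted_gaussian_update(prior_mean, prior_var, observations, weights):
    """One stream-weighted Gaussian measurement update for a scalar state.

    Simplifying assumptions (not taken from the paper): linear, scalar model
    in which every sensor observes the state directly with Gaussian noise.

    observations: list of (value, noise_variance) pairs, one per sensor.
    weights: stream weights, non-negative and summing to one.
    Returns the posterior (mean, variance).
    """
    if len(observations) != len(weights):
        raise ValueError("need exactly one weight per observation")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")

    # Work in information form: precision and precision-weighted mean.
    precision = 1.0 / prior_var
    information = prior_mean / prior_var

    for (value, noise_var), w in zip(observations, weights):
        # Raising N(value; x, noise_var) to the power w scales its
        # precision contribution by w.
        precision += w / noise_var
        information += w * value / noise_var

    post_var = 1.0 / precision
    return information * post_var, post_var


# Hypothetical usage: an acoustic and a visual position estimate, with the
# weights shifted toward the visual stream.
mean, var = weighted_gaussian_update(
    prior_mean=0.0,
    prior_var=1.0,
    observations=[(1.0, 1.0), (3.0, 1.0)],
    weights=[0.2, 0.8],
)
```

Setting a weight to zero removes that sensor from the update entirely, which is the behaviour one would want when a stream is judged unreliable at a given time step.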
