Object Referring in Visual Scene with Spoken Language

Abstract

Object referring has important applications, especially for human-machine interaction. While the task has received great attention, it is mainly tackled with written language (text) as input rather than spoken language (speech), which is more natural. This paper investigates Object Referring with Spoken Language (ORSpoken) by presenting two datasets and one novel approach. Objects are annotated with their locations in images, text descriptions, and speech descriptions, which makes the datasets ideal for multi-modality learning. The approach is developed by carefully breaking down the ORSpoken problem into three sub-problems and introducing task-specific vision-language interactions at the corresponding levels. Experiments show that our method outperforms competing methods consistently and significantly. The approach is also evaluated in the presence of audio noise, showing the efficacy of the proposed vision-language interaction methods in counteracting background noise.
