Language-Conditioned Observation Models for Visual Object Search

Abstract

Object search is a challenging task because when given complex languagedescriptions (e.g., "find the white cup on the table"), the robot must move itscamera through the environment and recognize the described object. Previousworks map language descriptions to a set of fixed object detectors withpredetermined noise models, but these approaches are challenging to scalebecause new detectors need to be made for each object. In this work, we bridgethe gap in realistic object search by posing the search problem as a partiallyobservable Markov decision process (POMDP) where the object detector and visualsensor noise in the observation model is determined by a single Deep NeuralNetwork conditioned on complex language descriptions. We incorporate the neuralnetwork's outputs into our language-conditioned observation model (LCOM) torepresent dynamically changing sensor noise. With an LCOM, any languagedescription of an object can be used to generate an appropriate object detectorand noise model, and training an LCOM only requires readily availablesupervised image-caption datasets. We empirically evaluate our method bycomparing against a state-of-the-art object search algorithm in simulation, anddemonstrate that planning with our observation model yields a significantlyhigher average task completion rate (from 0.46 to 0.66) and more efficient andquicker object search than with a fixed-noise model. We demonstrate our methodon a Boston Dynamics Spot robot, enabling it to handle complex natural languageobject descriptions and efficiently find objects in a room-scale environment.

Quick Read (beta)

loading the full paper ...