Abstract
In this paper, we address the task of segmenting an object given a natural language expression that references it, \textit{i.e.}, a referring expression. Current techniques tackle this task by either (\textit{i}) directly or recursively merging the linguistic and visual information in the channel dimension and then performing convolutions; or (\textit{ii}) mapping the expression to a space in which it can be thought of as a filter, whose response is directly related to the presence of the object at a given spatial coordinate in the image, so that a convolution can be applied to look for the object. We propose a novel method that merges the best of both worlds to exploit the recursive nature of language, and that also, during the upsampling process, takes advantage of the intermediate information generated when downsampling the image, so that detailed segmentations can be obtained. Our method is compared with state-of-the-art approaches on four standard datasets, on which it yields high performance and surpasses all previous methods in six of the eight standard dataset splits for this task. The full implementation of our method and training routines, written in PyTorch, can be found at \url{https://github.com/andfoy/query-objseg}.
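
To make the two families of existing techniques concrete, the following toy sketch contrasts them on a tiny feature map. It is only an illustration under stated assumptions, not the implementation from the repository: the function names `concat_fusion` and `dynamic_filter`, the nested-list feature layout, and all numeric values are hypothetical, and both convolutions are reduced to the 1x1 case with a single output channel.

```python
# Toy sketch (hypothetical names and values): two ways to combine an
# expression embedding with a visual feature map.
# Layout assumption: feats[y][x] is a list of C channel values.

def concat_fusion(feats, expr, weights):
    """Strategy (i): tile the expression embedding at every location,
    concatenate it to the visual channels, then apply a 1x1 convolution
    (one weight per concatenated channel, single output channel)."""
    out = []
    for row in feats:
        out_row = []
        for pixel in row:
            merged = pixel + expr  # concatenation along the channel dimension
            out_row.append(sum(w * v for w, v in zip(weights, merged)))
        out.append(out_row)
    return out


def dynamic_filter(feats, expr_filter):
    """Strategy (ii): treat the (already projected) expression as a 1x1
    filter and slide it over the feature map; a high response suggests
    the referred object is present at that spatial coordinate."""
    return [
        [sum(f * v for f, v in zip(expr_filter, pixel)) for pixel in row]
        for row in feats
    ]


# 2x2 feature map with 2 channels per location (made-up values).
feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0]],
]
expr = [1.0, 0.0]  # hypothetical 2-d expression embedding

response = dynamic_filter(feats, expr)
fused = concat_fusion(feats, expr, weights=[1.0, 1.0, 0.5, 0.5])
```

In strategy (i) the weights are learned and shared across expressions, whereas in strategy (ii) the filter itself changes with each expression; the proposed method is described as combining both ideas.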