Abstract
How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that also using language to condition the bottom-up processing from pixels to high-level features can improve overall performance. To support our claim, we propose a model for language-vision problems involving dense prediction, and perform experiments on two different multi-modal tasks: image segmentation from referring expressions and language-guided image colorization. We compare results where either one or both of the top-down and bottom-up visual branches are conditioned on language. Our experiments reveal that using language to control the filters for bottom-up visual processing, in addition to top-down attention, leads to better results on both tasks and achieves state-of-the-art performance. Our analysis of different word types in input expressions suggests that bottom-up conditioning is especially helpful in the presence of low-level visual concepts like color.