Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters

Abstract

How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that the use of language to also condition the bottom-up processing from pixels to high-level features can provide benefits to the overall performance. To support our claim, we propose a U-Net-based model and perform experiments on two language-vision dense-prediction tasks: referring expression segmentation and language-guided image colorization. We compare results where either one or both of the top-down and bottom-up visual branches are conditioned on language. Our experiments reveal that using language to control the filters for bottom-up visual processing in addition to top-down attention leads to better results on both tasks and achieves competitive performance. Our linguistic analysis suggests that bottom-up conditioning improves segmentation of objects especially when input text refers to low-level visual concepts. Code is available at https://github.com/ilkerkesen/bvpr.
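
To make the idea of language-conditional filters concrete, the following minimal Python sketch generates 1x1 convolution kernels from a sentence embedding and applies them to a visual feature map. It is not taken from the released code: the function names make_filters and conditional_conv1x1, the single linear projection, and the toy dimensions are all assumptions made for illustration, and the actual model may generate and apply its filters differently.

```python
def make_filters(text_embedding, projection, c_out, c_in):
    """Map a sentence embedding to a bank of c_out kernels of size c_in.

    Hypothetical helper: `projection` is a (c_out * c_in) x d_text matrix
    given as nested lists; a single linear layer is assumed for simplicity.
    """
    # Linear projection: one scalar per kernel weight.
    flat = [sum(w * e for w, e in zip(row, text_embedding)) for row in projection]
    # Reshape the flat vector into c_out kernels with c_in weights each.
    return [flat[o * c_in:(o + 1) * c_in] for o in range(c_out)]


def conditional_conv1x1(feature_map, kernels):
    """Apply text-generated 1x1 kernels to an H x W x c_in feature map."""
    return [
        [[sum(k * x for k, x in zip(kernel, pixel)) for kernel in kernels]
         for pixel in row]
        for row in feature_map
    ]


# Toy setup: 2-dimensional text embedding, 2 input and 2 output channels.
projection = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
features = [[[3.0, 4.0], [1.0, 0.0]]]  # H=1, W=2, c_in=2

# Two different "sentences" yield two different filter banks,
# so the same pixels are processed differently.
out_a = conditional_conv1x1(features, make_filters([1.0, 0.0], projection, 2, 2))
out_b = conditional_conv1x1(features, make_filters([0.0, 1.0], projection, 2, 2))
```

In this sketch the visual weights are not fixed parameters: they are recomputed from the text for every input, which is what allows language to influence processing starting from low-level features rather than only reweighting high-level ones.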
