Towards Language-guided Visual Recognition via Dynamic Convolutions

  • 2021-10-17 11:29:13
  • Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Xinghao Ding, Yongjian Wu, Feiyue Huang, Yue Gao, Rongrong Ji

Abstract

In this paper, we aim to establish a unified, end-to-end multi-modal network by exploring language-guided visual recognition. To this end, we first propose a novel multi-modal convolution module called Language-dependent Convolution (LaConv). Its convolution kernels are dynamically generated from natural language information, which helps extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build the first fully language-driven convolution network, termed LaConvNet, which unifies visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on four benchmark datasets of two vision-and-language tasks, i.e., visual question answering (VQA) and referring expression comprehension (REC). The experimental results not only show the performance gains of LaConv over existing multi-modal modules, but also demonstrate the merits of LaConvNet as a unified network, including a compact architecture, high generalization ability, and excellent performance, e.g., +4.7% on RefCOCO+.
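
The core idea of generating convolution kernels from language can be illustrated with a toy sketch. This is not the authors' LaConv implementation: the linear kernel generator, the single-channel input, and the names `generate_kernel`, `conv2d_same`, and `language_conditioned_conv` are all assumptions made for illustration. It only shows that the same image yields different features when the language vector changes.

```python
def generate_kernel(lang_vec, proj, k):
    """Map a language vector to a k x k kernel via a linear projection.

    `proj` is a hypothetical (k*k) x d weight matrix; a real model would
    learn it and would likely use a richer generator.
    """
    flat = [sum(w * x for w, x in zip(row, lang_vec)) for row in proj]
    return [flat[i * k:(i + 1) * k] for i in range(k)]


def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = acc
    return out


def language_conditioned_conv(image, lang_vec, proj, k=3):
    """Convolve `image` with a kernel generated from `lang_vec`."""
    return conv2d_same(image, generate_kernel(lang_vec, proj, k))


# Toy setup (assumed values): a 3x3 "feature map" and a 9x2 projection.
# Language vector [1, 0] produces an identity kernel; [0, 1] a box-sum kernel.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
proj = [[1 if r == 4 else 0, 1] for r in range(9)]

out_identity = language_conditioned_conv(image, [1, 0], proj)
out_box = language_conditioned_conv(image, [0, 1], proj)
```

Here the two language vectors select two different filters for the same input, which is the sense in which the extracted visual features depend on the text. A practical module would operate on multi-channel feature maps and sentence embeddings from a language encoder.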

 
