Explainability by Parsing: Neural Module Tree Networks for Natural Language Visual Grounding

Abstract

Grounding natural language in images essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplets. They may perform well on short phrases, but they generally fail on longer sentences, mainly because they overfit to certain vision-language biases. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion, as it should be. In particular, we develop a novel modular network called Neural Module Tree network (NMTree) that regularizes the visual grounding along the dependency parsing tree of the sentence, where each node is a module network that calculates or accumulates the grounding score in a bottom-up direction as needed. NMTree disentangles the visual grounding from the composite reasoning, allowing the former to focus only on primitive and easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end using the Gumbel-Softmax approximation and its straight-through gradient estimator, which account for the discrete process of module selection. Overall, the proposed NMTree not only consistently outperforms the state of the art on several benchmarks and tasks, but also shows explainable reasoning in its grounding score calculation. NMTree therefore points in a promising direction for closing the gap between explainability and performance.
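
The discrete module selection mentioned above can be illustrated with a minimal, framework-free sketch of Gumbel-Softmax sampling. This is not the authors' implementation: the function name `gumbel_softmax_sample`, the temperature `tau`, and the module names in the usage lines are hypothetical and chosen only for illustration. The sketch covers the forward pass alone. In an autodiff framework, a straight-through estimator would typically emit the hard one-hot vector in the forward pass while taking gradients through the soft sample.

```python
import math
import random


def gumbel_softmax_sample(logits, tau=1.0, rng=random):
    """Return (soft, hard): a relaxed categorical sample and its one-hot argmax.

    Illustrative only; `tau` is an assumed temperature hyperparameter.
    """
    # Perturb each logit with Gumbel(0, 1) noise, g = -log(-log(u)),
    # then scale by the temperature.
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)

    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]

    # Hard one-hot selection: the discrete choice used in the forward pass.
    chosen = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == chosen else 0.0 for i in range(len(soft))]
    return soft, hard


# Usage: pick one of three hypothetical module types for a tree node.
module_names = ["module_a", "module_b", "module_c"]  # placeholder names
soft, hard = gumbel_softmax_sample([2.0, 0.5, -1.0], tau=0.5)
selected = module_names[hard.index(1.0)]
```

Lower values of `tau` push the soft sample closer to the one-hot choice, while higher values keep it smoother.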
