Learning to Compose and Reason with Language Tree Structures for Visual Grounding

Abstract

Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it requires comprehending the fine-grained and compositional language space. However, existing solutions rely on the association between holistic language features and visual features, while neglecting the compositional reasoning implied in the language. In this paper, we propose a natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion. We call our model RVG-TREE: Recursive Grounding Tree, which is inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and the grounding confidence score can be recursively accumulated from the grounding scores returned by the sub-trees. RVG-TREE can be trained end-to-end using the Straight-Through Gumbel-Softmax estimator, which allows the gradients from the continuous score functions to pass through the discrete tree construction. Experiments on several benchmarks show that our model achieves state-of-the-art performance with more explainable reasoning.
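
The two ideas named in the abstract, sampling a discrete binary tree and accumulating grounding scores bottom-up, can be illustrated with a minimal sketch. This is not the paper's implementation: the names `Node`, `gumbel_softmax_hard`, `compose_tree`, `uniform_merge_logit`, and `ground_score` are hypothetical, the merge scorer is a constant placeholder for a learned function, and regions are reduced to sets of tag words so that a leaf score is a simple membership test. Only the forward pass is shown. A straight-through estimator would additionally route gradients through the soft probabilities, which requires an autograd framework.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class Node:
    """Binary tree node: a leaf holds a word, an internal node holds two children."""
    word: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def gumbel_softmax_hard(
    logits: List[float], tau: float, rng: random.Random
) -> Tuple[int, List[float]]:
    """Forward pass only: return the hard (argmax) choice and the soft relaxation."""
    # Perturb each logit with Gumbel(0, 1) noise: -log(-log(u)), u in (0, 1).
    noisy = [
        l - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9))) for l in logits
    ]
    # Temperature-scaled softmax over the perturbed logits (numerically stable).
    m = max(noisy)
    exps = [math.exp((x - m) / tau) for x in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    hard = max(range(len(noisy)), key=noisy.__getitem__)
    return hard, soft


def uniform_merge_logit(a: Node, b: Node) -> float:
    """Placeholder (assumption): a trained model would score the pair (a, b)."""
    return 0.0


def compose_tree(
    words: List[str],
    merge_logit: Callable[[Node, Node], float],
    tau: float,
    rng: random.Random,
) -> Node:
    """Repeatedly merge one sampled pair of adjacent nodes until a single root remains."""
    nodes = [Node(word=w) for w in words]
    while len(nodes) > 1:
        logits = [merge_logit(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]
        k, _ = gumbel_softmax_hard(logits, tau, rng)
        nodes[k:k + 2] = [Node(left=nodes[k], right=nodes[k + 1])]
    return nodes[0]


def ground_score(node: Node, region_tags: Set[str]) -> float:
    """Bottom-up accumulation: a node's score is the sum of its sub-tree scores."""
    if node.word is not None:
        # Toy leaf score (assumption): 1 if the word matches a region tag, else 0.
        return 1.0 if node.word in region_tags else 0.0
    return ground_score(node.left, region_tags) + ground_score(node.right, region_tags)


def leaves(node: Node) -> List[str]:
    """Read the words back in left-to-right order."""
    if node.word is not None:
        return [node.word]
    return leaves(node.left) + leaves(node.right)


words = ["black", "dog", "left", "tree"]
tree = compose_tree(words, uniform_merge_logit, tau=1.0, rng=random.Random(0))
regions = {"r1": {"black", "dog", "left"}, "r2": {"brown", "dog"}, "r3": {"tree"}}
best = max(regions, key=lambda r: ground_score(tree, regions[r]))
print(best, leaves(tree))
```

Because the toy node score is a plain sum of leaf matches, the selected region here does not depend on which tree was sampled. In a learned model the node scores would depend on composed features, so the tree structure would affect the result.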
