Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

Abstract

We marry two powerful ideas: deep representation learning for visual recognition and language understanding, and symbolic program execution for reasoning. Our neural-symbolic visual question answering (NS-VQA) system first recovers a structural scene representation from the image and a program trace from the question. It then executes the program on the scene representation to obtain an answer. Incorporating symbolic structure as prior knowledge offers three unique advantages. First, executing programs on a symbolic space is more robust to long program traces; our model can solve complex reasoning tasks better, achieving an accuracy of 99.8% on the CLEVR dataset. Second, the model is more data- and memory-efficient: it performs well after learning on a small amount of training data; it can also encode an image into a compact representation, requiring less storage than existing methods for offline question answering. Third, symbolic program execution offers full transparency to the reasoning process; we are thus able to interpret and diagnose each execution step.
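To make the execution step concrete, the following is a minimal sketch of running a program trace on a structural scene representation. It is not the authors' implementation: the scene format, the operator names (`scene`, `filter_color`, `filter_shape`, `filter_size`, `count`, `query_color`), and the `execute` function are all assumptions chosen for illustration, and the scene and programs are written by hand here rather than recovered by neural networks from an image and a question.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical scene representation: one attribute dictionary per object.
Scene = List[Dict[str, str]]
# Hypothetical program trace: a sequence of (operator name, argument) steps.
Program = List[Tuple[str, Optional[str]]]


def op_scene(scene: Scene, state: Any, arg: Optional[str]) -> Scene:
    """Start a program by selecting every object in the scene."""
    return list(scene)


def make_filter(attr: str) -> Callable[[Scene, Any, Optional[str]], Scene]:
    """Build an operator that keeps objects whose `attr` equals the argument."""
    def run(scene: Scene, state: Any, arg: Optional[str]) -> Scene:
        return [obj for obj in state if obj[attr] == arg]
    return run


def op_count(scene: Scene, state: Any, arg: Optional[str]) -> int:
    """Return the number of currently selected objects."""
    return len(state)


def make_query(attr: str) -> Callable[[Scene, Any, Optional[str]], str]:
    """Build an operator that reads `attr` from a uniquely selected object."""
    def run(scene: Scene, state: Any, arg: Optional[str]) -> str:
        if len(state) != 1:
            raise ValueError(f"query_{attr} expects one object, got {len(state)}")
        return state[0][attr]
    return run


# Operator table (assumed names, for illustration only).
OPS: Dict[str, Callable[[Scene, Any, Optional[str]], Any]] = {
    "scene": op_scene,
    "filter_color": make_filter("color"),
    "filter_shape": make_filter("shape"),
    "filter_size": make_filter("size"),
    "count": op_count,
    "query_color": make_query("color"),
}


def execute(program: Program, scene: Scene, trace: Optional[list] = None) -> Any:
    """Run each step in order; optionally record every intermediate result."""
    state: Any = None
    for name, arg in program:
        state = OPS[name](scene, state, arg)
        if trace is not None:
            trace.append((name, arg, state))
    return state


example_scene: Scene = [
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "sphere", "color": "blue", "size": "small"},
    {"shape": "cube", "color": "blue", "size": "small"},
]

# "How many blue cubes are there?"
count_blue_cubes: Program = [
    ("scene", None),
    ("filter_color", "blue"),
    ("filter_shape", "cube"),
    ("count", None),
]

# "What color is the large object?"
color_of_large: Program = [
    ("scene", None),
    ("filter_size", "large"),
    ("query_color", None),
]

steps: list = []
answer_count = execute(count_blue_cubes, example_scene, trace=steps)
answer_color = execute(color_of_large, example_scene)
```

Because every step operates on explicit symbols, the `steps` list holds the intermediate result of each operator, which illustrates the kind of step-by-step inspection the abstract refers to. The scene itself is only a short list of attribute records, in line with the compact-representation claim.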
