TAB-VCR: Tags and Attributes based VCR Baselines

Abstract

Reasoning is an important ability that we learn from a very early age. Yet reasoning is extremely hard for algorithms. Despite impressive recent progress on tasks that necessitate reasoning, such as visual question answering and visual dialog, models often exploit biases in datasets. To develop models with better reasoning abilities, the visual commonsense reasoning (VCR) task has recently been introduced. Models must not only answer questions but also provide a reason for the given answer. The proposed baseline achieved compelling results, leveraging a meticulously designed model composed of LSTM modules and attention nets. Here we show that a much simpler model, obtained by ablating and pruning the existing intricate baseline, can perform better with half the number of trainable parameters. By associating visual features with attribute information and improving text-to-image grounding, we obtain further improvements for our simpler and effective baseline, TAB-VCR. We show that this approach results in absolute improvements of 5.3%, 4.4%, and 6.5% over the previous state of the art on question answering, answer justification, and holistic VCR, respectively.
