Abstract
We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) tasks in that the premise is defined by an image rather than a natural language sentence. We propose a novel dataset, SNLI-VE, for VE tasks, built from the Stanford Natural Language Inference corpus and Flickr30K. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights into how modern VQA-based models perform.