Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

Abstract

Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label on a warning sign warns people about the danger in the scene. Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question. However, existing approaches for TextVQA are mostly based on custom pairwise fusion mechanisms between a pair of two modalities and are restricted to a single prediction step by casting TextVQA as a classification task. In this work, we propose a novel model for the TextVQA task based on a multimodal transformer architecture accompanied by a rich representation for text in images. Our model naturally fuses different modalities homogeneously by embedding them into a common semantic space where self-attention is applied to model inter- and intra-modality context. Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification. Our model outperforms existing approaches on three benchmark datasets for the TextVQA task by a large margin.
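
To make the contrast between one-step classification and multi-step decoding concrete, the following is a minimal sketch of a pointer-style decoding loop. It is not the model described above: the abstract gives no scoring details, so the dot-product scoring, the greedy argmax, the `<end>` token, and every name here (`pointer_step`, `decode_answer`, `next_query`) are assumptions made for illustration. The `next_query` callable is a hypothetical stand-in for the multimodal transformer, which would produce one decoder output vector per step.

```python
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pointer_step(query: Vector,
                 vocab_weights: Sequence[Vector],
                 ocr_features: Sequence[Vector]) -> int:
    """Score fixed-vocabulary words and OCR tokens with the same query.

    Returns an index into the concatenation [vocabulary, OCR tokens].
    Dot-product scoring is an assumption of this sketch.
    """
    scores = [dot(query, w) for w in vocab_weights]
    scores += [dot(query, f) for f in ocr_features]
    return max(range(len(scores)), key=scores.__getitem__)


def decode_answer(next_query: Callable[[int, Optional[Vector]], Vector],
                  vocab: Sequence[str],
                  vocab_weights: Sequence[Vector],
                  ocr_tokens: Sequence[str],
                  ocr_features: Sequence[Vector],
                  max_steps: int = 12,
                  end_token: str = "<end>") -> List[str]:
    """Greedily build an answer one token per step.

    Each step either selects a vocabulary word or copies an OCR token
    found in the image, then feeds the chosen vector to the next step.
    """
    answer: List[str] = []
    previous: Optional[Vector] = None
    for step in range(max_steps):
        query = next_query(step, previous)  # hypothetical model call
        index = pointer_step(query, vocab_weights, ocr_features)
        if index < len(vocab):
            word, previous = vocab[index], vocab_weights[index]
        else:
            offset = index - len(vocab)
            word, previous = ocr_tokens[offset], ocr_features[offset]
        if word == end_token:
            break
        answer.append(word)
    return answer
```

Because the set of OCR tokens differs from image to image, the candidate list scored at each step has a different length for every input, which is one way to read "dynamic" in this setting. A single classification step over a fixed answer list could not copy an arbitrary multi-word string from the image in this way.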
