Abstract
The multimodal models used in the emerging field at the intersection of computational linguistics and computer vision implement the bottom-up processing of the 'Hub and Spoke' architecture proposed in cognitive science to represent how the brain processes and combines multi-sensory inputs. In particular, the Hub is implemented as a neural network encoder. We investigate the effect on this encoder of various vision-and-language tasks proposed in the literature: visual question answering, visual reference resolution, and visually grounded dialogue. To measure the quality of the representations learned by the encoder, we use two kinds of analyses. First, we evaluate the encoder pre-trained on the different vision-and-language tasks on an existing diagnostic task designed to assess multimodal semantic understanding. Second, we carry out a battery of analyses aimed at studying how the encoder merges and exploits the two modalities.