Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!

Abstract

Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task. This function projection modifies model predictions so that cross-modal interactions are eliminated, isolating the additive, unimodal structure. For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation. Surprisingly, this holds even when expressive models, with capacity to consider interactions, otherwise outperform less expressive models; thus, performance improvements, even when present, often cannot be attributed to consideration of cross-modal feature interactions. We hence recommend that researchers in multimodal machine learning report the performance not only of unimodal baselines, but also the EMAP of their best-performing model.
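
To make the idea of projecting away cross-modal interactions concrete, here is a minimal sketch in Python. It is not taken from the abstract: it assumes the projection can be computed by scoring every text with every image in an evaluation set and then combining a text-only average, an image-only average, and a grand mean. The function name `emap_project` and the `(N, N, C)` input layout are hypothetical choices made for illustration.

```python
import numpy as np


def emap_project(pair_scores):
    """Additive projection of model scores (illustrative sketch).

    pair_scores[i, j, :] is assumed to hold the model's class scores
    (e.g. logits) for text i paired with image j, shape (N, N, C).
    Returns projected scores for the original pairs (i, i), shape (N, C).
    """
    scores = np.asarray(pair_scores, dtype=float)
    # Average over all images: what the model says about text i alone.
    text_part = scores.mean(axis=1)           # (N, C)
    # Average over all texts: what the model says about image j alone.
    image_part = scores.mean(axis=0)          # (N, C)
    # Grand mean, subtracted so the constant offset is not counted twice.
    grand_mean = scores.mean(axis=(0, 1))     # (C,)
    return text_part + image_part - grand_mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, c = 5, 3
    # Hypothetical stand-in for model outputs on all text-image pairings.
    pair_scores = rng.normal(size=(n, n, c))
    original = pair_scores[np.arange(n), np.arange(n)]
    projected = emap_project(pair_scores)
    # Compare predictions before and after removing interactions.
    agreement = (original.argmax(axis=1) == projected.argmax(axis=1)).mean()
    print("fraction of predictions unchanged:", agreement)
```

Under these assumptions, a model whose scores are purely additive in the two modalities is left unchanged by the projection, so a small gap between the original and projected accuracy suggests that the model gains little from cross-modal interactions on that task. Note that the sketch needs N × N forward passes to fill `pair_scores`, which is a cost worth budgeting for on larger evaluation sets.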
