Abstract
While the incipient internet was largely text-based, the modern digital world is becoming increasingly multi-modal. Here, we examine multi-modal classification where one modality is discrete, e.g. text, and the other is continuous, e.g. visual representations transferred from a convolutional neural network. In particular, we focus on scenarios where we have to be able to classify large quantities of data quickly. We investigate various methods for performing multi-modal fusion and analyze their trade-offs in terms of classification accuracy and computational efficiency. Our findings indicate that the inclusion of continuous information improves performance over text-only on a range of multi-modal classification tasks, even with simple fusion methods. In addition, we experiment with discretizing the continuous features in order to speed up and simplify the fusion process even further. Our results show that fusion with discretized features outperforms text-only classification, at a fraction of the computational cost of full multi-modal fusion, with the additional benefit of improved interpretability.