OmniNet: A unified architecture for multi-modal multi-task learning

Abstract

The Transformer is a widely used neural network architecture, especially for language understanding. We introduce an extended and unified architecture which can be used for tasks involving a variety of modalities such as images, text, and videos. We propose a spatio-temporal cache mechanism that enables learning the spatial dimension of the input in addition to the hidden states corresponding to the temporal input sequence. The proposed architecture further enables a single model to support tasks with multiple input modalities as well as asynchronous multi-task learning; we therefore refer to it as OmniNet. For example, a single instance of OmniNet can concurrently learn to perform the tasks of part-of-speech tagging, image captioning, visual question answering and video activity recognition. We demonstrate that training these four tasks together results in a model that is about three times more compact than training them individually, while retaining performance. We also show that using this neural network pre-trained on some modalities assists in learning an unseen task. This illustrates the generalization capacity of the self-attention mechanism on the spatio-temporal cache present in OmniNet.
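
To make the idea of attending over a spatio-temporal cache more concrete, the following is a minimal sketch in plain Python and not the authors' implementation. The class `SpatioTemporalCache`, its methods `add_text` and `add_image`, and the helper `attend` are hypothetical names introduced only for illustration. The sketch assumes that temporal entries are one vector per time step, that spatial entries are one vector per grid location, and that a query attends over both with ordinary scaled dot-product attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, cache_entries):
    """Scaled dot-product attention of one query vector over cached vectors.

    Returns the attention-weighted sum of the cached vectors and the weights.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in cache_entries
    ]
    weights = softmax(scores)
    output = [
        sum(w * vec[i] for w, vec in zip(weights, cache_entries))
        for i in range(d)
    ]
    return output, weights


class SpatioTemporalCache:
    """Hypothetical container (an assumption, not the paper's data structure)."""

    def __init__(self):
        self.temporal = []  # one vector per time step (e.g. per token)
        self.spatial = []   # one vector per spatial location (e.g. per grid cell)

    def add_text(self, token_vectors):
        # Text is treated as purely temporal: append one vector per token.
        self.temporal.extend(token_vectors)

    def add_image(self, feature_grid):
        # An image feature grid (rows of vectors) is flattened into spatial entries.
        for row in feature_grid:
            self.spatial.extend(row)

    def entries(self):
        # A query attends over temporal and spatial entries together.
        return self.temporal + self.spatial


cache = SpatioTemporalCache()
cache.add_text([[1.0, 0.0], [0.0, 1.0]])      # two token vectors
cache.add_image([[[1.0, 1.0], [0.5, 0.5]]])   # a 1 x 2 grid of feature vectors
output, weights = attend([1.0, 0.0], cache.entries())
```

Because both modalities end up as vectors in one shared cache, the same attention routine can serve text-only, image-only, or mixed inputs, which is the property the abstract attributes to the architecture.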
