Learning Multimodal Representations for Unseen Activities

Abstract

We present a method to learn a joint multimodal representation space that enables recognition of unseen activities in videos. We first compare the effects of placing various constraints on the embedding space using paired text and video data. We also propose a method to improve the joint embedding space using an adversarial formulation, allowing it to benefit from unpaired text and video data. We show that using unpaired text data yields a representation that better captures unseen activities. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that using paired and unpaired data to learn a shared embedding space benefits three difficult tasks: (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning, outperforming the state of the art.
