Multi-View Video-Based Learning: Leveraging Weak Labels for Frame-Level Perception

Abstract

For training a video-based action recognition model that accepts multi-view video, annotating frame-level labels is tedious and difficult, whereas annotating sequence-level labels is relatively easy. Such coarse annotations are called weak labels. However, training a multi-view video-based action recognition model with weak labels for frame-level perception is challenging. In this paper, we propose a novel learning framework in which the weak labels are first used to train a multi-view video-based base model, which is subsequently used for downstream frame-level perception tasks. The base model is trained to obtain individual latent embeddings for each view in the multi-view input. To train the model using the weak labels, we propose a novel latent loss function. We also propose a model that uses the view-specific latent embeddings for downstream frame-level action recognition and detection tasks. The proposed framework is evaluated on the MM Office dataset by comparison with several baseline algorithms. The results show that the proposed base model is effectively trained using weak labels and that the latent embeddings help the downstream models improve accuracy.
