Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup

Abstract

Video action recognition is a challenging but important task for understanding and discovering what happens in a video. However, acquiring annotations for video is costly, so semi-supervised learning (SSL) has been studied to improve performance with only a small amount of labeled data. Prior studies on semi-supervised video action recognition have mostly focused on a single modality, visuals, but video is multi-modal, so utilizing both visuals and audio would be desirable and could improve performance further; this direction has not been explored well. Therefore, we propose audio-visual SSL for video action recognition, which uses visuals and audio together even with only a few labeled data, a challenging setting. In addition, to maximize the information from audio and video, we propose a novel audio source localization-guided mixup method that considers inter-modal relations between the video and audio modalities. In experiments on the UCF-51, Kinetics-400, and VGGSound datasets, the proposed semi-supervised audio-visual action recognition framework and audio source localization-guided mixup show superior performance.
