Abstract
Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end times, imposes constraints on the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting, where only modality-agnostic, video-level labels are available in the training data. To this end, recently proposed approaches seek to generate segment-level pseudo-labels to better guide model training. However, the absence of inter-segment dependencies when generating these pseudo-labels and the general bias towards predicting labels that are absent in a segment limit their performance. This work proposes a novel approach, called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV), to overcome these weaknesses. Additionally, our innovative approach factors in the uncertainty associated with these estimated pseudo-labels and incorporates a feature mixup based training regularization for improved training. Empirical results show that UWAV outperforms state-of-the-art methods for the AVVP task on multiple metrics, across two different datasets, attesting to its effectiveness and generalizability.
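As a rough illustration of the two training ingredients named above, the following minimal Python sketch shows one way an uncertainty-weighted loss on segment-level pseudo-labels and a feature mixup step could look. It is not the UWAV implementation: the abstract does not specify the weighting scheme or the mixup formulation, so the function names (`weighted_bce`, `mixup`), the use of a per-label confidence as the weight, and the fixed mixing coefficient `lam` are all assumptions made for illustration.

```python
import math


def weighted_bce(pred, pseudo, weight, eps=1e-7):
    """Mean binary cross-entropy with each term scaled by a confidence weight.

    pred   -- predicted probabilities, one per (segment, class) entry
    pseudo -- pseudo-label targets in [0, 1]
    weight -- assumed confidence in each pseudo-label (hypothetical scheme)
    """
    total = 0.0
    for p, y, w in zip(pred, pseudo, weight):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(pred)


def mixup(feat_a, feat_b, label_a, label_b, lam):
    """Convex combination of two feature vectors and their (soft) labels."""
    mixed_feat = [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_feat, mixed_label


# Toy usage: two segments' features are mixed, and the model's prediction on
# the mixed input is scored against the mixed pseudo-label.
feat, label = mixup([1.0, 0.0], [0.0, 1.0], [1.0], [0.0], lam=0.25)
loss = weighted_bce(pred=[0.5], pseudo=label, weight=[0.8])
```

In this sketch a pseudo-label with low confidence contributes proportionally less to the loss, which is one simple way to keep uncertain segment-level labels from dominating training.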