VGGSounder: Audio-Visual Evaluations for Foundation Models

Abstract

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, our new modality confusion metric reveals model limitations by analysing the performance degradation that occurs when another input modality is added.
