Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition

Abstract

Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods based on saliency produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature -- intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets -- KTH, Penn Action, HAA500, and UCF-101 -- demonstrate that DANCE significantly improves explanation clarity with competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.
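
To make the concept bottleneck idea concrete, the following is a minimal sketch of how a prediction that is forced through named concepts can be decomposed by concept type. It is a generic illustration of a linear concept bottleneck, not the DANCE implementation: the concept names, activation values, class names, weights, and the helper functions `contributions`, `predict`, and `contributions_by_type` are all hypothetical.

```python
# Hypothetical concept activations for one video, keyed as "type:name".
# Values are illustrative only and are not taken from the paper.
concepts = {
    "motion:arm_swing_pose_sequence": 0.9,
    "object:racket": 0.7,
    "scene:tennis_court": 0.8,
}

# Hypothetical weights of a final linear layer: class -> concept -> weight.
weights = {
    "tennis_serve": {
        "motion:arm_swing_pose_sequence": 1.5,
        "object:racket": 1.0,
        "scene:tennis_court": 0.5,
    },
    "golf_swing": {
        "motion:arm_swing_pose_sequence": 0.5,
        "object:racket": -1.0,
        "scene:tennis_court": -0.5,
    },
}


def contributions(concepts, class_weights):
    """Per-concept contribution to one class logit (weight * activation)."""
    return {name: class_weights[name] * act for name, act in concepts.items()}


def predict(concepts, weights):
    """Class logits are sums of concept contributions; nothing else feeds the output."""
    logits = {
        cls: sum(contributions(concepts, w).values()) for cls, w in weights.items()
    }
    best = max(logits, key=logits.get)
    return best, logits


def contributions_by_type(concepts, class_weights):
    """Aggregate contributions into motion / object / scene totals."""
    totals = {}
    for name, value in contributions(concepts, class_weights).items():
        concept_type = name.split(":", 1)[0]
        totals[concept_type] = totals.get(concept_type, 0.0) + value
    return totals


best, logits = predict(concepts, weights)
print(best, contributions_by_type(concepts, weights[best]))
```

Because the class score in this sketch is only a sum of per-concept terms, grouping those terms by type shows directly whether a prediction leaned on motion, objects, or scene context.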
