Action Scene Graphs for Long-Form Understanding of Egocentric Videos

Abstract

We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos. EASGs extend standard manually-annotated representations of egocentric videos, such as verb-noun action labels, by providing a temporally evolving graph-based description of the actions performed by the camera wearer, including interacted objects, their relationships, and how actions unfold in time. Through a novel annotation procedure, we extend the Ego4D dataset by adding manually labeled Egocentric Action Scene Graphs, offering a rich set of annotations designed for long-form egocentric video understanding. We then define the EASG generation task and provide a baseline approach, establishing preliminary benchmarks. Experiments on two downstream tasks, egocentric action anticipation and egocentric activity summarization, highlight the effectiveness of EASGs for long-form egocentric video understanding. We will release the dataset and the code to replicate experiments and annotations.
