NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative

Abstract

Existing video captioning benchmarks and models lack coherent representations of causal-temporal narrative, that is, sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address this gap, we propose NarrativeBridge, an approach comprising: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting, explicitly encoding cause-effect temporal relationships in video descriptions, evaluated automatically to ensure caption quality and relevance; and (2) a dedicated Cause-Effect Network (CEN) architecture with separate encoders for capturing cause and effect dynamics independently, enabling effective learning and generation of captions with causal-temporal narrative. Extensive experiments demonstrate that CEN is more accurate in articulating the causal and temporal aspects of video content than the second-best model (GIT): 17.88 and 17.44 CIDEr on the MSVD and MSR-VTT datasets, respectively. The proposed framework understands and generates nuanced text descriptions with intricate causal-temporal narrative structures present in videos, addressing a critical limitation in video captioning. For project details, visit https://narrativebridge.github.io/.
