Causality Matters: How Temporal Information Emerges in Video Language Models

Abstract

Video language models (VideoLMs) have made significant progress in multimodal understanding. However, temporal understanding, which involves identifying event order, duration, and relationships across time, remains a core challenge. Prior work emphasizes positional encodings (PEs) as a key mechanism for encoding temporal structure. Surprisingly, we find that removing or modifying PEs in video inputs yields minimal degradation in temporal understanding performance. In contrast, reversing the frame sequence while preserving the original PEs causes a substantial drop. To explain this behavior, we conduct extensive analysis experiments to trace how temporal information is integrated within the model. We uncover a causal information pathway: temporal cues are progressively synthesized through inter-frame attention, aggregated in the final frame, and subsequently integrated into the query tokens. This mechanism shows that temporal reasoning emerges from interactions among visual tokens under the constraints of causal attention, which implicitly encodes temporal structure. Based on these insights, we propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation. Experiments on two benchmarks validate the effectiveness of both approaches. To the best of our knowledge, this is the first work to systematically investigate video temporal understanding in VideoLMs, offering insights for future model improvement.
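The claim that causal attention can encode order without any PE can be illustrated with a toy example. The sketch below is not the paper's model or experimental setup. It assumes a single-head attention layer with identity projections and a residual connection, no positional encoding, random vectors standing in for frame tokens, and one query token appended after the frames. All function names (`attention_layer`, `query_readout`, `order_sensitivity`) are hypothetical. It measures how much the query token's final state changes when the frame order is reversed.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(tokens, causal):
    """Toy single-head self-attention: identity projections, residual
    connection, and no positional encoding (illustrative assumptions)."""
    dim = len(tokens[0])
    outputs = []
    for i, query in enumerate(tokens):
        # Under a causal mask, token i only sees tokens 0..i.
        visible = tokens[: i + 1] if causal else tokens
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(dim)]
        outputs.append([q + m for q, m in zip(query, mixed)])
    return outputs


def query_readout(frames, query, causal, num_layers=2):
    """Run the frames followed by one query token; return the query's final state."""
    tokens = [list(f) for f in frames] + [list(query)]
    for _ in range(num_layers):
        tokens = attention_layer(tokens, causal)
    return tokens[-1]


def order_sensitivity(frames, query, causal, num_layers=2):
    """Largest change in the query state when the frame order is reversed."""
    forward = query_readout(frames, query, causal, num_layers)
    backward = query_readout(frames[::-1], query, causal, num_layers)
    return max(abs(a - b) for a, b in zip(forward, backward))


rng = random.Random(0)
frames = [[rng.gauss(0, 0.5) for _ in range(4)] for _ in range(5)]  # stand-in frame tokens
query = [rng.gauss(0, 0.5) for _ in range(4)]  # stand-in query token

print("bidirectional, 2 layers:", order_sensitivity(frames, query, causal=False))
print("causal, 1 layer:        ", order_sensitivity(frames, query, causal=True, num_layers=1))
print("causal, 2 layers:       ", order_sensitivity(frames, query, causal=True, num_layers=2))
```

In this toy setting, the bidirectional stack and the single causal layer leave the query state unchanged up to floating-point error, because the query only sees an unordered set of frame tokens. With two causal layers, each frame token first absorbs its own prefix, so the query then reads order-dependent states. This is consistent with the pathway described above, in which inter-frame attention comes first and integration into the query follows, but it is only an illustration under the stated assumptions.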
