Abstract
The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the existence of static visual shortcuts. To remedy this issue, we present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding. Specifically, we first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects. Furthermore, to disentangle the correlation between static and temporal information, we generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect. We employ a semi-automatic data collection framework using large language models and human-in-the-loop annotation to obtain high-quality counterfactual descriptions efficiently. Evaluation of representative video-language understanding models confirms their deficiency in temporal understanding, revealing the need for greater emphasis on the temporal elements in video-language research.