Abstract
Understanding a procedural activity requires modeling both how action steps transform the scene, and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Existing work has studied procedure-aware video representations by modeling the temporal order of actions, but has not explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by Large Language Models (LLMs) as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining unseen "What if" scenarios. This counterfactual reasoning facilitates the model's ability to understand the cause and effect of each step in an activity. We conduct extensive experiments on procedure-aware tasks, including temporal action segmentation, error detection, action phase classification, frame retrieval, multi-instance retrieval, and action recognition. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals, and achieve significant improvements on multiple tasks. Code is available at https://github.com/HCIS-Lab/counterfactual-video-pretrain.