As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes). Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.