Abstract
Large-scale pre-trained video generation models excel at content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of a simple yet fundamental physics task: modeling object freefall. We show that state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small number of simulated videos is effective in inducing the dropping behavior in the model, and that results can be further improved through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for tracking physical accuracy in large-scale video generative model development.