Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

Abstract

Despite recent progress in video generation, producing videos that adhere to physical laws remains a significant challenge. Traditional diffusion-based methods struggle to extrapolate to unseen physical conditions (e.g., velocity) due to their reliance on data-driven approximations. To address this, we propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. We first introduce the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process. The recursive visual tokens enable symbolic reasoning by a large language model. Building on this tokenizer, we propose the Phys-AR framework, which consists of two stages: the first stage uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model's reasoning abilities through reward functions based on physical conditions. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws. Experimental results demonstrate that Phys-AR can generate videos that are physically consistent.
