Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation

  • 2025-08-14 17:30:37
  • Harold Haodong Chen, Haojian Huang, Qifeng Chen, Harry Yang, Ser-Nam Lim

Abstract

Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize "good data" from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.
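To make the idea of preference optimization at several granularities concrete, the following is a minimal sketch and not the authors' implementation. It shows the standard Direct Preference Optimization loss for one preferred/rejected pair, and then a weighted sum of that loss over the four levels named in the abstract. The abstract does not state how the levels are combined, so the weighted sum, the function names `dpo_loss` and `hierarchical_loss`, the `weights` argument, and the value of `beta` are all assumptions made for illustration. The scalar log-likelihoods are placeholders; for a diffusion-based video model they would typically be replaced by a surrogate score.

```python
import math

# The four granularities named in the abstract.
LEVELS = ("instance", "state", "motion", "semantic")


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Computes -log(sigmoid(margin)), where the margin compares how much
    more the policy favors the preferred sample than the reference does.
    """
    margin = beta * ((policy_logp_win - ref_logp_win)
                     - (policy_logp_lose - ref_logp_lose))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def hierarchical_loss(per_level_inputs, weights=None, beta=0.1):
    """Hypothetical combination: weighted sum of per-level DPO losses.

    per_level_inputs maps each level name to a 4-tuple
    (policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose).
    The weighting scheme is an assumption, not taken from the paper.
    """
    if weights is None:
        weights = {level: 1.0 for level in LEVELS}
    return sum(
        weights[level] * dpo_loss(*per_level_inputs[level], beta=beta)
        for level in LEVELS
    )
```

When the policy and the reference agree on every level, each term equals log 2; the loss falls as the policy favors the preferred video more strongly than the reference does.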

 
