Abstract
Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality. We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications.