Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers

Abstract

Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction that operates on continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments using probing models to identify network regions that encode information useful for accurately estimating PDE simulation parameters, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter-efficient, and interpretable approach.
