VideoFlow: A Flow-Based Generative Model for Video

Abstract

Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. In particular, learning predictive models of videos offers an especially appealing mechanism to enable a rich understanding of the physical world: videos of real-world interactions are plentiful and readily available, and a model that can predict future video frames can not only capture useful representations of the world, but can be useful in its own right, for problems such as model-based robotic control. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally (as in the case of pixel-level autoregressive models), or do not directly optimize the likelihood of the data. In this work, we propose a model for video prediction based on normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows. We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modeling of video.
