High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks

Abstract

Predicting future video frames is extremely challenging, as there are manyfactors of variation that make up the dynamics of how frames change throughtime. Previously proposed solutions require complex inductive biases insidenetwork architectures with highly specialized computation, includingsegmentation masks, optical flow, and foreground and background separation. Inthis work, we question if such handcrafted architectures are necessary andinstead propose a different approach: finding minimal inductive bias for videoprediction while maximizing network capacity. We investigate this question byperforming the first large-scale empirical study and demonstratestate-of-the-art performance by learning large models on three differentdatasets: one for modeling object interactions, one for modeling human motion,and one for modeling car driving.

Quick Read (beta)

loading the full paper ...