Clockwork Variational Autoencoders for Video Prediction

Abstract

Deep learning has enabled algorithms to generate realistic images. However, accurately predicting long video sequences requires understanding long-term dependencies and remains an open challenge. While existing video prediction models succeed at generating sharp images, they tend to fail at accurately predicting far into the future. We introduce the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences, where higher levels tick at slower intervals. We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets with sequences of up to 1000 frames, where CW-VAE outperforms top video prediction models. Additionally, we propose a Minecraft benchmark for long-term video prediction. We conduct several experiments to gain insights into CW-VAE and confirm that slower levels learn to represent objects that change more slowly in the video, and faster levels learn to represent faster objects.
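
The idea of levels that "tick at slower intervals" can be illustrated with a small scheduling sketch. This is not the model itself, only a toy showing which levels of a hierarchy would update at each time step. The function names (`active_levels`, `tick_schedule`) and the rule that level `l` updates every `factor ** l` steps are assumptions made for illustration; the abstract does not state the exact schedule.

```python
def active_levels(t, num_levels, factor):
    """Return the indices of the levels that update at time step t.

    Assumption (not stated in the abstract): level l ticks once every
    factor ** l steps, so level 0 updates every step and higher levels
    update exponentially less often.
    """
    return [l for l in range(num_levels) if t % (factor ** l) == 0]


def tick_schedule(num_steps, num_levels=3, factor=2):
    """List the active levels for each of the first num_steps time steps."""
    return [active_levels(t, num_levels, factor) for t in range(num_steps)]


if __name__ == "__main__":
    for t, levels in enumerate(tick_schedule(8)):
        print(f"t={t}: levels updating = {levels}")
```

Under this assumed schedule, the lowest level updates on every frame while the top level updates only once every `factor ** (num_levels - 1)` frames, which is one way a slower level could be pushed toward representing slowly changing content.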
