MaskViT: Masked Visual Pre-Training for Video Prediction

Abstract

The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
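
The inference procedure described above (start fully masked, then fill in tokens over a few refinement steps while a schedule lowers the masking ratio) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the cosine form of the schedule, the rule of committing the most confident predictions first, and the names `cosine_mask_ratio`, `iterative_decode`, and `toy_predict`. The toy predictor returns random tokens and stands in for a trained model.

```python
import math
import random


def cosine_mask_ratio(step, total_steps):
    """Fraction of tokens left masked after `step` of `total_steps` steps.

    Assumption: a cosine schedule. The abstract only states that some
    mask scheduling function is used.
    """
    return math.cos(0.5 * math.pi * step / total_steps)


def iterative_decode(num_tokens, total_steps, predict, schedule=cosine_mask_ratio):
    """Fill all token positions over at most `total_steps` refinement steps.

    `predict(tokens, masked)` must return a dict mapping each masked
    position to a (token_id, confidence) pair.
    """
    tokens = [None] * num_tokens  # None marks a masked position
    for step in range(1, total_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t is None]
        if not masked:
            break
        # How many positions the schedule wants masked after this step.
        keep_masked = math.floor(schedule(step, total_steps) * num_tokens)
        if step == total_steps:
            keep_masked = 0  # the last step must fill everything
        num_to_fill = max(1, len(masked) - keep_masked)
        preds = predict(tokens, masked)
        # Assumption: commit the most confident predictions first.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens, masked, vocab_size=1024, rng=random.Random(0)):
    """Hypothetical stand-in for a model: random token ids and confidences."""
    return {i: (rng.randrange(vocab_size), rng.random()) for i in masked}


if __name__ == "__main__":
    decoded = iterative_decode(num_tokens=16, total_steps=4, predict=toy_predict)
    print(decoded)
```

In this sketch the predictor is called once per refinement step rather than once per token, which is the kind of saving iterative decoding offers over generating tokens one at a time.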
