Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work has studied these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single-modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. In particular, our single pretrained model can be finetuned to achieve 86.5% on ImageNet and 75.3% on the challenging Something Something-v2 video benchmark. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training.
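
To make the effect of these masking ratios concrete, the following is a minimal sketch of patch dropping, not the authors' implementation. It assumes patches are dropped uniformly at random, and the patch counts (196 for an image, 1568 for a video clip) are hypothetical values chosen for illustration; the function name `sample_visible_patches` is likewise an assumption.

```python
import random


def sample_visible_patches(num_patches, mask_ratio, seed=None):
    """Return sorted indices of the patches kept visible after random masking."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Keep at least one patch so the encoder always receives some input.
    num_keep = max(1, round(num_patches * (1.0 - mask_ratio)))
    # Uniform sampling without replacement (an assumption of this sketch).
    return sorted(rng.sample(range(num_patches), num_keep))


# Hypothetical patch grids, not taken from the text:
#   image: 224x224 pixels, 16x16 patches        -> 14 * 14 = 196 patches
#   video: 16 frames, 224x224, 2x16x16 patches  -> 8 * 14 * 14 = 1568 patches
image_visible = sample_visible_patches(196, 0.90, seed=0)
video_visible = sample_visible_patches(1568, 0.95, seed=0)
print(len(image_visible), len(video_visible))
```

Under these assumed grid sizes, the encoder would process only 20 of 196 image patches and 78 of 1568 video patches per sample, which illustrates why such high masking ratios can make training fast.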