Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.