Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection and semantic segmentation. In this paper, our ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. However, directly using the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy. To tackle these issues, we adopt masked convolution to prevent information leakage in the convolution blocks. A simple block-wise masking strategy is proposed to ensure computational efficiency. We also propose to supervise the multi-scale features of the encoder more directly in order to strengthen them. Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, ConvMAE-Base finetuned for only 25 epochs surpasses MAE-Base finetuned for 100 epochs by 2.9% box AP and 2.2% mask AP, respectively. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE.
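
As a rough illustration of the block-wise masking idea, the sketch below samples a random mask on a coarse token grid and repeats each cell to obtain aligned masks at finer resolutions, so that a masked convolution can zero out the same image regions at every stage. It is a minimal sketch under stated assumptions, not the released implementation: the 14 x 14 coarse grid, the upsampling factors of 2 and 4, the 0.75 mask ratio, and the function names are all hypothetical choices made for the example.

```python
import random


def sample_coarse_mask(grid, mask_ratio, seed=None):
    """Return a grid x grid list of 0/1 flags (1 = masked).

    The number of masked cells is fixed at int(grid * grid * mask_ratio).
    """
    rng = random.Random(seed)
    num_tokens = grid * grid
    num_masked = int(num_tokens * mask_ratio)
    flags = [1] * num_masked + [0] * (num_tokens - num_masked)
    rng.shuffle(flags)
    return [flags[r * grid:(r + 1) * grid] for r in range(grid)]


def upsample_mask(mask, factor):
    """Repeat every cell factor x factor times so masked blocks stay aligned."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def apply_mask(feature, mask):
    """Zero the masked positions of a single-channel 2-D feature map."""
    return [
        [0.0 if m else f for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(feature, mask)
    ]


# Assumed sizes: a 14 x 14 coarse grid with 28 x 28 and 56 x 56 finer stages.
coarse = sample_coarse_mask(grid=14, mask_ratio=0.75, seed=0)
stage2_mask = upsample_mask(coarse, 2)
stage1_mask = upsample_mask(coarse, 4)

# Masking the input of a convolution keeps visible outputs free of masked content.
feature = [[1.0] * 56 for _ in range(56)]
masked_feature = apply_mask(feature, stage1_mask)
```

Because the fine masks are exact repetitions of the coarse one, a position that is visible at the coarsest stage is visible at every earlier stage as well, which is the property that lets the later transformer stage process only the visible tokens.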