MLP-based architectures, which consist of a sequence of consecutive multi-layer perceptron blocks, have recently been found to reach results comparable to convolutional and transformer-based methods. However, most adopt spatial MLPs, which take fixed-dimension inputs, making it difficult to apply them to downstream tasks such as object detection and semantic segmentation. Moreover, single-stage designs further limit performance in other computer vision tasks, and fully connected layers are computationally heavy. To tackle these problems, we propose ConvMLP: a hierarchical Convolutional MLP for visual recognition, which is a light-weight, stage-wise co-design of convolution layers and MLPs. In particular, ConvMLP-S achieves 76.8% top-1 accuracy on ImageNet-1k with 9M parameters and 2.4G MACs (15% and 19% of MLP-Mixer-B/16, respectively). Experiments on object detection and semantic segmentation further show that visual representations learned by ConvMLP can be seamlessly transferred and achieve competitive results with fewer parameters. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Convolutional-MLPs.
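
The fixed-dimension limitation of spatial MLPs can be illustrated with a small shape calculation. The sketch below is not taken from the ConvMLP code: the helper names, the single N x N token-mixing layer (a simplification of a full spatial MLP), the channel width of 64, and the expansion ratio of 4 are all assumptions chosen for illustration. It shows that a spatial weight is tied to the token count, and therefore to the input resolution, while a channel MLP weight depends only on the channel dimension.

```python
# Illustrative sketch only: helper names and sizes are assumptions,
# not part of the ConvMLP implementation.

def num_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)

def spatial_mlp_weight_shape(height, width, patch):
    """A spatial (token-mixing) linear layer mixes across tokens.

    Simplified here to a single N x N weight, where N is the token count.
    """
    n = num_tokens(height, width, patch)
    return (n, n)

def channel_mlp_weight_shape(channels, expansion=4):
    """A channel MLP's first linear layer acts on each token independently.

    Its weight is C x (C * expansion); the expansion ratio is an assumption.
    """
    return (channels, channels * expansion)

def fits(spatial_weight_shape, height, width, patch):
    """True if a spatial weight matches the token count of the given input size."""
    return spatial_weight_shape[0] == num_tokens(height, width, patch)

# Spatial weight sized for 224 x 224 inputs with 16 x 16 patches.
train_shape = spatial_mlp_weight_shape(224, 224, 16)
print(train_shape)                        # (196, 196)
print(fits(train_shape, 224, 224, 16))    # True
print(fits(train_shape, 512, 512, 16))    # False: 512 x 512 yields 1024 tokens

# The channel MLP weight has no dependence on the input resolution.
print(channel_mlp_weight_shape(64))       # (64, 256)
```

Detection and segmentation pipelines typically feed images of varying and larger sizes, which is why a weight shaped by the token count is hard to reuse there, whereas per-token channel MLPs and convolutions carry over unchanged.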