Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that can outperform not only the canonical transformers, but also the high-performance convolutional models. We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features. Furthermore, we scale it to obtain a family of models, called CMTs, which achieve much better accuracy and efficiency than previous convolution-based and transformer-based models. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while requiring 14x and 2x fewer FLOPs than the existing DeiT and EfficientNet, respectively. The proposed CMT-S also generalizes well on CIFAR10 (99.2%), CIFAR100 (91.7%), Flowers (98.7%), and other challenging vision datasets such as COCO (44.3% mAP), with considerably less computational cost.
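To make the hybrid idea concrete, the following is a minimal NumPy sketch of one block that combines a convolution for local features with self-attention for long-range dependencies. It is an illustration under stated assumptions, not the CMT architecture itself: the abstract does not specify the block design, so the choice of a 3x3 depthwise convolution, single-head attention, residual connections, and all names (`depthwise_conv3x3`, `self_attention`, `hybrid_block`, `init_params`) are hypothetical.

```python
import numpy as np


def depthwise_conv3x3(x, w):
    """Zero-padded, stride-1 depthwise 3x3 convolution (cross-correlation form).

    x: feature map of shape (H, W, C); w: per-channel kernels of shape (3, 3, C).
    Each output pixel only sees its 3x3 neighbourhood: the "local" part.
    """
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, C), dtype=float)
    for di in range(3):
        for dj in range(3):
            out += xp[di:di + H, dj:dj + W, :] * w[di, dj, :]
    return out


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product attention over tokens of shape (N, C).

    Every token attends to every other token: the "long-range" part.
    """
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def hybrid_block(x, params):
    """Assumed block layout: local conv, then global attention, both residual."""
    H, W, C = x.shape
    x = x + depthwise_conv3x3(x, params["dw"])      # local feature mixing
    tokens = x.reshape(H * W, C)                    # flatten pixels to tokens
    tokens = tokens + self_attention(
        tokens, params["wq"], params["wk"], params["wv"]
    )                                               # global token mixing
    return tokens.reshape(H, W, C)


def init_params(C, seed=0):
    """Small random weights for illustration only (not a trained model)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(C)
    return {
        "dw": rng.normal(scale=0.1, size=(3, 3, C)),
        "wq": rng.normal(scale=scale, size=(C, C)),
        "wk": rng.normal(scale=scale, size=(C, C)),
        "wv": rng.normal(scale=scale, size=(C, C)),
    }


if __name__ == "__main__":
    x = np.random.default_rng(1).normal(size=(8, 8, 16))
    y = hybrid_block(x, init_params(16))
    print(y.shape)
```

The sketch omits normalization, multi-head attention, feed-forward layers, and any downsampling, all of which a practical model would likely need. Its only purpose is to show how a convolutional step and an attention step can be composed on the same feature map.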