Most existing methods in vision language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. We argue that the use of object detection may not be suitable for vision language pre-training. Instead, we point out that the task should be performed so that the regions of `visual concepts' mentioned in the texts are located in the images, and in the meantime alignments between texts and visual concepts are identified, where the alignments are made at multiple granularities. This paper proposes a new method called X-VLM to perform `multi-grained vision language pre-training'. Experimental results show that X-VLM consistently outperforms state-of-the-art methods in many downstream vision language tasks.