Vehicle search is a basic task for efficient traffic management in the AI City. Most existing practices focus on image-based vehicle matching, including vehicle re-identification and vehicle tracking. In this paper, we apply a new modality, i.e., the language description, to search for the vehicle of interest and explore the potential of this task in real-world scenarios. Natural language-based vehicle search poses a new challenge: fine-grained understanding of both the vision and language modalities. To connect language and vision, we propose to jointly train state-of-the-art vision models with a transformer-based language model in an end-to-end manner. Besides the network structure design and the training strategy, several optimization objectives are also revisited in this work. Qualitative and quantitative experiments verify the effectiveness of the proposed method. Our method achieved 1st place in the 5th AI City Challenge, yielding a competitive 18.69% MRR accuracy on the private test set. We hope this work can pave the way for future study on using language descriptions effectively and efficiently in real-world vehicle retrieval systems. The code will be available at https://github.com/ShuaiBai623/AIC2021-T5-CLV.
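
For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean reciprocal rank (MRR) is commonly computed for a retrieval task: each query contributes the reciprocal of the rank at which its correct item appears, and the values are averaged over all queries. The function name, the dictionary layout, and the toy query and track identifiers are illustrative assumptions, not taken from the challenge's evaluation code.

```python
from typing import Dict, List


def mean_reciprocal_rank(rankings: Dict[str, List[str]],
                         targets: Dict[str, str]) -> float:
    """Average of 1/rank of the correct item over all queries.

    rankings: query id -> candidate ids, best match first (hypothetical layout).
    targets:  query id -> id of the single correct candidate.
    A query whose target is missing from its ranking contributes 0.
    """
    if not rankings:
        raise ValueError("rankings must contain at least one query")
    total = 0.0
    for query_id, ranked in rankings.items():
        target = targets[query_id]
        if target in ranked:
            # list.index is 0-based, ranks are 1-based
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(rankings)


# Toy data: three text queries, each ranking the same three vehicle tracks.
toy_rankings = {
    "q1": ["track_a", "track_b", "track_c"],  # correct track at rank 1
    "q2": ["track_b", "track_a", "track_c"],  # correct track at rank 2
    "q3": ["track_c", "track_b", "track_a"],  # correct track at rank 3
}
toy_targets = {"q1": "track_a", "q2": "track_a", "q3": "track_a"}
toy_mrr = mean_reciprocal_rank(toy_rankings, toy_targets)
```

In this toy case the score is the average of 1, 1/2, and 1/3. A score reported as a percentage, as in the abstract, is this value multiplied by 100.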