Natural language-based vehicle retrieval is the task of finding a target vehicle within a given image based on a natural language description used as a query. This technology can be applied in various areas, including police searches for a suspect vehicle. However, it is challenging due to the ambiguity of language descriptions and the difficulty of processing multi-modal data. To tackle this problem, we propose a deep neural network called SBNet that performs natural language-based segmentation for vehicle retrieval. We also propose two task-specific modules to improve performance: a substitution module that helps features from different domains to be embedded in the same space, and a future prediction module that learns temporal information. SBNet was trained on the CityFlow-NL dataset, which contains 2,498 vehicle tracks with three unique natural language descriptions each, and tested on 530 unique vehicle tracks and their corresponding query sets. SBNet achieved a significant improvement over the baseline in the natural language-based vehicle tracking track of the AI City Challenge 2021.