Eyes Will Shut: A Vision-Based Next GPS Location Prediction Model by Reinforcement Learning from Visual Map Feed Back

Abstract

Next Location Prediction is a fundamental task in the study of human mobility, with wide-ranging applications in transportation planning, urban governance, and epidemic forecasting. In practice, when humans attempt to predict the next location in a trajectory, they often visualize the trajectory on a map and reason based on road connectivity and movement trends. However, the vast majority of existing next-location prediction models do not reason over maps in the way that humans do. Fortunately, the recent development of Vision-Language Models (VLMs) has demonstrated strong capabilities in visual perception and even visual reasoning. This opens up a new possibility: by rendering both the road network and trajectory onto an image and leveraging the reasoning abilities of VLMs, we can enable models to perform trajectory inference in a human-like manner. To explore this idea, we first propose a method called Vision-Guided Location Search (VGLS), which evaluates whether a general-purpose VLM is capable of trajectory-based reasoning without modifying any of its internal parameters. Based on insights from the VGLS results, we further propose our main approach: VLMLocPredictor, which is composed of two stages: In the first stage, we design two Supervised Fine-Tuning (SFT) tasks that help the VLM understand road network and trajectory structures and acquire basic reasoning ability on such visual inputs. In the second stage, we introduce Reinforcement Learning from Visual Map Feedback, enabling the model to self-improve its next-location prediction ability through interaction with the environment. Experiments conducted on datasets from four different cities show that our method achieves state-of-the-art (SOTA) performance and exhibits superior cross-city generalization compared to other LLM-based approaches.
