Abstract
Understanding and predicting urban dynamics is crucial for managing transportation systems, optimizing urban planning, and enhancing public services. While neural network-based approaches have achieved success, they often rely on task-specific architectures and large volumes of data, limiting their ability to generalize across diverse urban scenarios. Meanwhile, Large Language Models (LLMs) offer strong reasoning and generalization capabilities, yet their application to spatial-temporal urban dynamics remains underexplored. Existing LLM-based methods struggle to effectively integrate multifaceted spatial-temporal data and fail to address distributional shifts between training and testing data, limiting their predictive reliability in real-world applications. To bridge this gap, we propose UrbanMind, a novel spatial-temporal LLM framework for multifaceted urban dynamics prediction that ensures both accurate forecasting and robust generalization. At its core, UrbanMind introduces Muffin-MAE, a multifaceted fusion masked autoencoder with specialized masking strategies that capture intricate spatial-temporal dependencies and intercorrelations among multifaceted urban dynamics. Additionally, we design a semantic-aware prompting and fine-tuning strategy that encodes spatial-temporal contextual details into prompts, enhancing LLMs' ability to reason over spatial-temporal patterns. To further improve generalization, we introduce a test time adaptation mechanism with a test data reconstructor, enabling UrbanMind to dynamically adjust to unseen test data by reconstructing LLM-generated embeddings. Extensive experiments on real-world urban datasets across multiple cities demonstrate that UrbanMind consistently outperforms state-of-the-art baselines, achieving high accuracy and robust generalization, even in zero-shot settings.