MobCLIP: Learning General-purpose Geospatial Representation at Scale

Abstract

Representation learning of geospatial locations remains a core challenge in achieving general geospatial intelligence. Current embedding methods often lack versatility, limiting their utility across diverse tasks in both human and natural domains. We present MobCLIP, the first nationwide general-purpose location encoder, integrating an unprecedented diversity of data modalities through effective and scalable multimodal fusion. Adopting a novel CLIP-based architecture, our framework aligns 100M+ POIs, nationwide remote sensing imagery, and structured demographic statistics with a billion-edge mobility graph. By tokenizing spatial locations into grid cells, inspired by Vision Transformers, we establish a unified representation space bridging mobility patterns and multimodal features. To rigorously evaluate the general-purpose effectiveness of MobCLIP, we construct a benchmark dataset composed of 11 downstream prediction tasks across social, economic, and natural domains. Experiments show that MobCLIP, with four input modalities and a compact 128-dimensional representation space, outperforms state-of-the-art models in general-purpose predictive performance by an average of 35%. Thanks to the effective integration of human-centric modalities, the performance gain is particularly pronounced in human-centric tasks, such as predicting energy consumption (+260%), offline retail consumption amount (+98%), and crime cases (+95%). Echoing LLM scaling laws, we further demonstrate scaling behavior in geospatial representation learning. We open-source code and pretrained models at: https://github.com/ylzhouchris/MobCLIP.
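
To make the CLIP-based alignment idea concrete, the following is a minimal sketch of a standard CLIP-style symmetric contrastive loss, in which each grid cell's mobility embedding is pulled toward the multimodal feature embedding of the same cell and pushed away from those of other cells. It is an illustration under stated assumptions, not MobCLIP's confirmed objective: the function names, the temperature value of 0.07, and the toy 2-dimensional embeddings are all hypothetical, and the abstract does not specify the exact loss.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_loss(mobility_emb, feature_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired grid-cell embeddings.

    mobility_emb[i] and feature_emb[i] are assumed to describe the same cell.
    The temperature is an assumed hyperparameter, not taken from the paper.
    """
    a = [l2_normalize(v) for v in mobility_emb]
    b = [l2_normalize(v) for v in feature_emb]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the mobility-to-feature and feature-to-mobility directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch of two grid cells with 2-dimensional embeddings.
mobility = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]      # same direction as the paired cell
misaligned = [[0.0, 3.0], [2.0, 0.0]]   # pairs swapped

print(clip_loss(mobility, aligned) < clip_loss(mobility, misaligned))
```

In this toy batch the correctly paired embeddings yield a much lower loss than the swapped ones, which is the signal such an objective would use to place mobility patterns and multimodal features in one shared space.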
