Abstract
Remote sensing (RS) images from multiple modalities and platforms exhibit diverse details due to differences in sensor characteristics and imaging perspectives. Existing vision-language research in RS largely relies on relatively homogeneous data sources. Moreover, existing methods remain limited to conventional visual perception tasks such as classification or captioning. As a result, they fail to serve as a unified and standalone framework capable of effectively handling RS imagery from diverse sources in real-world applications. To address these issues, we propose RingMo-Agent, a model designed to handle multi-modal and multi-platform data that performs perception and reasoning tasks based on user textual instructions. Compared with existing models, RingMo-Agent 1) is supported by a large-scale vision-language dataset named RS-VL3M, comprising over 3 million image-text pairs, spanning optical, SAR, and infrared (IR) modalities collected from both satellite and UAV platforms, covering perception and challenging reasoning tasks; 2) learns modality-adaptive representations by incorporating separate embedding layers to construct isolated features for heterogeneous modalities and reduce cross-modal interference; 3) unifies task modeling by introducing task-specific tokens and employing a token-based high-dimensional hidden state decoding mechanism designed for long-horizon spatial tasks. Extensive experiments on various RS vision-language tasks demonstrate that RingMo-Agent is effective in both visual understanding and sophisticated analytical tasks and generalizes well across different platforms and sensing modalities.