GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis

Abstract

The continuous operation of Earth-orbiting satellites generates vast and ever-growing archives of Remote Sensing (RS) images. Natural language presents an intuitive interface for accessing, querying, and interpreting the data from such archives. However, existing Vision-Language Models (VLMs) are predominantly trained on web-scraped, noisy image-text data, exhibiting limited exposure to the specialized domain of RS. This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead focus solely on attributes like date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 205,150 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset provides a spatially and temporally balanced distribution, spanning the globe and covering the last 25 years. GAIA's construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval and image captioning tasks.
