RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model

Abstract

Pre-trained Vision-Language Models (VLMs) utilizing extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to make use of existing large-scale pre-trained VLMs, which are trained on common objects, to perform the domain-specific transfer needed for domain-related downstream tasks. In this paper, we propose a new framework that includes the Domain pre-trained Vision-Language Model (DVLM), bridging the gap between the General Vision-Language Model (GVLM) and domain-specific downstream tasks. Moreover, we present RS5M, an image-text paired dataset in the field of remote sensing (RS), which has 5 million RS images with English descriptions. The dataset is obtained by filtering publicly available image-text paired datasets and by captioning label-only RS datasets with a pre-trained VLM. Together, these constitute the first large-scale RS image-text paired dataset. Additionally, we fine-tuned the CLIP model and evaluated several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DVLM. Experimental results show that our proposed dataset is highly effective for various tasks, and our model GeoRSCLIP improves upon the baseline or previous state-of-the-art model by $3\%\sim20\%$ in Zero-shot Classification (ZSC), $3\%\sim6\%$ in Remote Sensing Cross-Modal Text-Image Retrieval (RSCTIR), and $4\%\sim5\%$ in Semantic Localization (SeLo) tasks. The dataset and models have been released at \url{https://github.com/om-ai-lab/RS5M}.
