Abstract
The prominence of generalized foundation models in vision-languageintegration has witnessed a surge, given their multifarious applications.Within the natural domain, the procurement of vision-language datasets toconstruct these foundation models is facilitated by their abundant availabilityand the ease of web crawling. Conversely, in the remote sensing domain,although vision-language datasets exist, their volume is suboptimal forconstructing robust foundation models. This study introduces an approach tocurate vision-language datasets by employing an image decoding machine learningmodel, negating the need for human-annotated labels. Utilizing thismethodology, we amassed approximately 9.6 million vision-language paireddatasets in VHR imagery. The resultant model outperformed counterparts that didnot leverage publicly available vision-language datasets, particularly indownstream tasks such as zero-shot classification, semantic localization, andimage-text retrieval. Moreover, in tasks exclusively employing vision encoders,such as linear probing and k-NN classification, our model demonstrated superiorefficacy compared to those relying on domain-specific vision-language datasets.