Abstract
Abundant, well-annotated multimodal data in remote sensing are pivotal for aligning complex visual remote sensing (RS) scenes with human language, enabling the development of specialized vision language models across diverse RS interpretation tasks. However, annotating RS images with rich linguistic semantics at scale demands expertise in RS and substantial human labor, making it costly and often impractical. In this study, we propose a workflow that leverages large language models (LLMs) to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform. This approach facilitates the generation of paired remote sensing data and can be readily scaled up using openly available data. Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions. Extensive experiments demonstrate that RSTeller enhances the performance of multiple existing vision language models for RS scene understanding through continual pre-training. Our methodology significantly reduces the manual effort and expertise needed for annotating remote sensing imagery while democratizing access to high-quality annotated data. This advancement fosters progress in visual language modeling and encourages broader participation in remote sensing research and applications. The RSTeller dataset is available at https://github.com/SlytherinGe/RSTeller.