Abstract
Abundant, well-annotated multimodal data in remote sensing are pivotal for aligning complex visual remote sensing (RS) scenes with human language, enabling the development of specialized vision language models across diverse RS interpretation tasks. However, annotating RS images with rich linguistic semantics at scale demands expertise in RS and substantial human labor, making it costly and often impractical. In this study, we propose a workflow that leverages large language models (LLMs) to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform. This approach facilitates the generation of paired remote sensing data and can be readily scaled up using openly available data. Within this framework, we present RSTeller, a multimodal dataset comprising over 1.3 million RS images, each accompanied by two descriptive captions. Extensive experiments demonstrate that RSTeller enhances the performance of multiple existing vision language models for RS scene understanding through continual pre-training. Our methodology significantly reduces the manual effort and expertise needed for annotating remote sensing imagery while democratizing access to high-quality annotated data. This advancement fosters progress in visual language modeling and encourages broader participation in remote sensing research and applications. The RSTeller dataset is available at https://github.com/SlytherinGe/RSTeller.