Abstract
The emergence of large language models, with GPT-4 as a prominent example, has significantly propelled the rapid advancement of artificial general intelligence and sparked the revolution of Artificial Intelligence 2.0. In the realm of remote sensing (RS), there is a growing interest in developing large vision language models (VLMs) specifically tailored for data analysis in this domain. However, current research predominantly revolves around visual recognition tasks, and the field lacks comprehensive, large-scale, aligned image-text datasets suitable for training large VLMs, which poses significant challenges to effectively training such models for RS applications. In computer vision, recent research has demonstrated that fine-tuning large vision language models on small-scale, high-quality datasets can yield impressive performance in visual and language understanding. These results are comparable to those of state-of-the-art VLMs trained from scratch on massive amounts of data, such as GPT-4. Inspired by this idea, in this work we build a high-quality Remote Sensing Image Captioning dataset (RSICap) that facilitates the development of large VLMs in the RS field. Unlike previous RS datasets that employ either model-generated captions or short descriptions, RSICap comprises 2,585 human-annotated captions with rich and high-quality information. This dataset offers detailed descriptions for each image, encompassing scene descriptions (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, and absolute position). To facilitate the evaluation of VLMs in the field of RS, we also provide a benchmark evaluation dataset called RSIEval. This dataset consists of human-annotated captions and visual question-answer pairs, allowing for a comprehensive assessment of VLMs in the context of RS.