Abstract
The remarkable achievements of ChatGPT and GPT-4 have sparked a wave of interest and research in the field of large language models for Artificial General Intelligence (AGI). These models provide us with intelligent solutions that are more similar to human thinking, enabling us to use general artificial intelligence to solve problems in various applications. However, in the field of remote sensing, the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research primarily focuses on visual understanding tasks while neglecting the semantic understanding of the objects and their relationships. This is where vision-language models excel, as they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. Vision-language models can go beyond recognizing the objects in an image and can infer the relationships between them, as well as generate natural language descriptions of the image. This makes them better suited for tasks that require both visual and textual understanding, such as image captioning, text-based image retrieval, and visual question answering. This paper provides a comprehensive review of the research on vision-language models in remote sensing, summarizing the latest progress, highlighting the current challenges, and identifying potential research opportunities. Specifically, we review the application of vision-language models in several mainstream remote sensing tasks, including image captioning, text-based image generation, text-based image retrieval, visual question answering, scene classification, semantic segmentation, and object detection. For each task, we briefly describe the task background and review some representative works. Finally, we summarize the limitations of existing work and provide some possible directions for future development.