Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation

Abstract

Vision-language models for Earth observation (EO) typically rely on the visual spectrum of data as the only model input, thus failing to leverage the rich spectral information available in the multispectral channels recorded by satellites. Therefore, we introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset and report on the performance gains due to the extended spectral range. Furthermore, we present the largest-to-date image-caption dataset for multispectral data, consisting of one million Sentinel-2 samples and corresponding textual descriptions generated using Llama3-LLaVA-Next and Overture Maps data. We develop a scalable captioning pipeline, which is validated by domain experts. We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity. Our results demonstrate that Llama3-MS-CLIP significantly outperforms other RGB-based approaches, improving classification accuracy by +6.77% on average and retrieval performance by +4.63% mAP compared to the second-best model. Our results emphasize the relevance of multispectral vision-language learning. The image-caption dataset, code, and model weights are available at https://github.com/IBM/MS-CLIP.
