Abstract
Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable representations. In this work, we introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery, combined with large spatial tiles and land-cover aware sampling to enrich spatial and semantic coverage. By treating sensing modalities as natural augmentations in our self-supervised approach, we unify radar and optical inputs via modality-specific patch embeddings and adaptive cross-attention fusion. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism that incorporates class-frequency-aware regularization to address long-tailed distributions in land cover. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench. Our code and pretrained models are publicly available at: https://github.com/mbzuai-oryx/TerraFM.