OpticalRS-4M: Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset

Abstract

Masked Image Modeling (MIM) has become an essential method for building foundational visual models in remote sensing (RS). However, the limited size and diversity of existing RS datasets restrict the ability of MIM methods to learn generalizable representations. Additionally, conventional MIM techniques, which require reconstructing all tokens, introduce unnecessary computational overhead. To address these issues, we present a new pre-training pipeline for RS models, featuring the creation of a large-scale RS dataset and an efficient MIM approach. We curated a high-quality dataset named OpticalRS-4M by collecting publicly available RS datasets and processing them through exclusion, slicing, and deduplication. OpticalRS-4M comprises 4 million optical images covering various RS tasks, such as object detection and pixel segmentation. To enhance efficiency, we propose SelectiveMAE, a pre-training method that dynamically encodes and reconstructs semantically rich patch tokens, thereby reducing the inefficiencies of traditional MIM models caused by redundant background pixels in RS images. Extensive experiments demonstrate that OpticalRS-4M significantly improves classification, detection, and segmentation performance, while SelectiveMAE increases training efficiency by more than 2 times. This highlights the effectiveness and scalability of our pipeline in developing RS foundational models.
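
As a rough illustration of the exclusion, slicing, and deduplication steps named above, the following minimal sketch runs all three on toy grayscale images stored as nested lists. Everything in it is an assumption made for illustration: the function names, the minimum-side-length exclusion rule, the non-overlapping tiling, and the exact-match SHA-256 deduplication are hypothetical stand-ins, and the abstract does not state which criteria the OpticalRS-4M pipeline actually uses.

```python
import hashlib


def slice_into_tiles(image, tile):
    """Cut a 2D image (list of rows of 0..255 ints) into non-overlapping tile x tile crops."""
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            tiles.append([row[left:left + tile] for row in image[top:top + tile]])
    return tiles


def tile_digest(tile_rows):
    """Hash the raw pixel values; only byte-identical tiles share a digest."""
    digest = hashlib.sha256()
    for row in tile_rows:
        digest.update(bytes(row))
    return digest.hexdigest()


def deduplicate(tiles):
    """Keep the first occurrence of each distinct tile, preserving order."""
    seen = set()
    unique = []
    for tile_rows in tiles:
        key = tile_digest(tile_rows)
        if key not in seen:
            seen.add(key)
            unique.append(tile_rows)
    return unique


def build_tiles(images, tile, min_side):
    """Exclusion (hypothetical size filter), then slicing, then deduplication."""
    kept = [img for img in images if len(img) >= min_side and len(img[0]) >= min_side]
    tiles = [t for img in kept for t in slice_into_tiles(img, tile)]
    return deduplicate(tiles)


# Toy inputs: a uniform image, a varied image, and one too small to keep.
flat = [[0] * 4 for _ in range(4)]
varied = [[r * 4 + c for c in range(4)] for r in range(4)]
small = [[1]]
result = build_tiles([flat, varied, small], tile=2, min_side=2)
```

Here the uniform image yields four identical tiles that collapse to one, the varied image contributes four distinct tiles, and the undersized image is excluded. A real pipeline would more plausibly rely on near-duplicate detection, since exact hashing misses crops that differ by a single pixel.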
