We explore the potential of large-scale, noisily labeled data to enhance feature learning by pretraining semantic segmentation models within a multi-modal framework for geospatial applications. We propose Cross-modal Sample Selection (CromSS), a novel weakly supervised pretraining strategy designed to improve feature representations through cross-modal consistency and noise mitigation. Unlike conventional pretraining approaches, CromSS exploits massive amounts of noisy, easy-to-come-by labels for improved feature learning beneficial to semantic segmentation tasks. We investigate middle and late fusion strategies to optimize the multi-modal pretraining architecture. To mitigate the adverse effects of label noise, we introduce a cross-modal sample selection module that employs a cross-modal entangling strategy to refine the confidence masks estimated within each modality and guide the sampling process. Additionally, we introduce a spatial-temporal label smoothing technique that counteracts overconfidence for enhanced robustness against noisy labels.
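The pixel-wise selection at the heart of CromSS can be illustrated with a short PyTorch sketch. This is a minimal, illustrative rendering of the idea rather than the repository's implementation; the function name, the multiplicative entangling, and the top-k selection rule are assumptions for exposition.

import torch
import torch.nn.functional as F

def cross_modal_selection(logits_s1, logits_s2, noisy_labels, keep_prop=0.5):
    # Illustrative cross-modal sample selection; NOT the repository API.
    # Pixels whose noisy label looks credible under both modalities are
    # kept; the rest are masked out of the supervised loss.
    prob_s1 = F.softmax(logits_s1, dim=1)                   # (B,C,H,W)
    prob_s2 = F.softmax(logits_s2, dim=1)
    idx = noisy_labels.unsqueeze(1)                         # (B,1,H,W)
    conf_s1 = prob_s1.gather(1, idx).squeeze(1)             # (B,H,W)
    conf_s2 = prob_s2.gather(1, idx).squeeze(1)
    # Cross-modal entangling: each modality's confidence refines the other's.
    conf = conf_s1 * conf_s2
    # Keep the most confident `keep_prop` fraction of pixels in the batch.
    k = max(1, int(keep_prop * conf.numel()))
    threshold = conf.flatten().topk(k).values.min()
    mask = (conf >= threshold).float()
    loss_s1 = F.cross_entropy(logits_s1, noisy_labels, reduction='none')
    loss_s2 = F.cross_entropy(logits_s2, noisy_labels, reduction='none')
    return ((loss_s1 + loss_s2) * mask).sum() / mask.sum().clamp(min=1)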
To validate our approach, we assembled a multi-modal dataset, NoLDO-S12, which consists of a large-scale noisy-label subset sourced from Google's Dynamic World (DW) project for pretraining and two downstream subsets with high-quality labels from DW and OpenStreetMap (OSM) for transfer learning. Experimental results on the two downstream tasks and the publicly available DFC2020 dataset demonstrate that, when effectively utilized, low-cost noisy labels can significantly enhance feature learning for segmentation tasks. All data, code, and pretrained weights will be made publicly available.
NoLDO-S12 consists of two splits: SSL4EO-S12@NoL, with noisy labels for pretraining, and a downstream split with exact labels for transfer learning, comprising the SSL4EO-S12@DW and SSL4EO-S12@OSM datasets.

• SSL4EO-S12@NoL pairs the large-scale, multi-modal, and multi-temporal self-supervised SSL4EO-S12 dataset with 9-class noisy labels (NoL) sourced from the Google Dynamic World (DW) project on Google Earth Engine (GEE). To preserve the dataset's multi-temporal characteristics, we retain only the S1-S2-noisy-label triples from locations where all 4 timestamps of S1-S2 pairs have corresponding DW labels, yielding noisy labels for about 41% of the SSL4EO-S12 dataset (103,793 of its 251,079 locations). SSL4EO-S12@NoL reflects real-world use cases well, where even noisy labels remain more difficult to obtain than bare S1-S2 image pairs.
The paired noisy label masks, along with their corresponding image IDs in SSL4EO-S12, can be downloaded from ssl4eo_s12_nol.zip; a quick inspection sketch follows.
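To get a feel for the archive before wiring it into a pipeline, something like the snippet below can help. The internal layout assumed here (one mask entry per location, named by its SSL4EO-S12 ID) is hypothetical, so adapt the parsing to the actual file names.

import zipfile

# Hypothetical first look at ssl4eo_s12_nol.zip; the real internal layout
# may differ from what this assumes.
with zipfile.ZipFile('ssl4eo_s12_nol.zip') as zf:
    members = [n for n in zf.namelist() if not n.endswith('/')]
    print(f'{len(members)} entries')
    print(members[:5])  # inspect how SSL4EO-S12 location IDs appear in paths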

We construct two downstream datasets, SSL4EO-S12@DW and SSL4EO-S12@OSM, for transfer learning experiments. Both build on the DW project's manually annotated training and validation tiles, but pair them with different label sources: DW and OSM, respectively.
• SSL4EO-S12@DW was constructed from the DW expert-labeled training subset of 4,194 tiles of 510×510 pixels and its hold-out validation set of 409 tiles of 512×512 pixels. The human labeling process left some ambiguous areas unmarked (white spots in the DW masks in Fig. 2). We spatio-temporally aligned the S1 and S2 data for the training and test tiles with GEE, leading to 3,574 training tiles and 340 test tiles, that is, a total of 656,758,064 labeled training pixels and 60,398,506 labeled test pixels (unmarked pixels excluded). The class distributions are shown in Fig. 2.
The SSL4EO-S12@DW downstream dataset can be downloaded from ssl4eo_s12_dw.zip
• SSL4EO-S12@OSM adopts 13-class fine-grained labels derived from OpenStreetMap (OSM) following the work of Schultz et al. We retrieved OSM label masks for 2,996 of the 3,914 (3,574 + 340) DW tiles; the remaining tiles have no OSM labels. After an automatic check with the DW labels as reference, assisted by some manual inspection (one plausible form of such a check is sketched below), we construct SSL4EO-S12@OSM with 1,375 training tiles and 400 test tiles, that is, a total of 165,993,707 labeled training pixels and 44,535,192 labeled test pixels.
The SSL4EO-S12@OSM downstream dataset can be downloaded from ssl4eo_s12_osm.zip
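The automatic check against the DW labels is described only briefly above; the sketch below shows one plausible form of such a filter. The co-registered masks, the OSM-to-DW class mapping, and the agreement threshold are all assumptions for illustration, not the values used to build the dataset.

import numpy as np

def passes_dw_check(osm_mask, dw_mask, osm_to_dw, min_agreement=0.7):
    # Toy agreement filter between a 13-class OSM mask and a 9-class DW
    # reference; the threshold and class mapping are illustrative only,
    # and `osm_to_dw` must cover every OSM class that occurs.
    mapped = np.vectorize(osm_to_dw.get)(osm_mask)
    valid = dw_mask >= 0          # assumes unlabeled DW pixels are negative
    agreement = np.mean(mapped[valid] == dw_mask[valid])
    return agreement >= min_agreement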
• Download scripts can be found in data_prepare/data_check_SSL4EO/get_dw_data and data_prepare/data_check_SSL4EO/get_osm_labels
• Write the SSL4EO-S12@NoL pretraining images (S1/S2) to LMDB files: data_prepare/data_check_SSL4EO/construct_pretrain_lmdb/write_labels_to_lmdb.py
• Write the SSL4EO-S12@NoL pretraining noisy labels to LMDB: data_prepare/data_check_SSL4EO/construct_pretrain_lmdb/write_labels_to_lmdb.py
• Write the DFC2020 downstream dataset to LMDB: data_prepare/data_check_DFC20/write_data_to_lmdb.py
• Write the SSL4EO-S12@DW/OSM downstream datasets to LMDB: data_prepare/data_check_SSL4EO/construct_dw_osm_lmdb/read_dw_data_to_lmdb.py (a minimal sketch of this conversion follows the list)
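As a rough idea of what these conversion scripts produce, the snippet below serializes label masks into an LMDB with the py-lmdb package. The keys, dtypes, and dummy data are illustrative; consult the scripts above for the exact record format.

import lmdb
import numpy as np

# Dummy stand-in for real label masks (SSL4EO-S12 patches are 264x264 px).
masks = [np.zeros((264, 264), dtype=np.uint8) for _ in range(4)]

env = lmdb.open('dw_labels.lmdb', map_size=1 << 40)  # generous address space
with env.begin(write=True) as txn:                   # commits on clean exit
    for idx, mask in enumerate(masks):
        txn.put(str(idx).encode(), mask.tobytes())   # key/encoding assumed
env.close()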
Run pretraining on a SLURM system with fusion_type=mid or fusion_type=late for the middle and late fusion settings, respectively; a sketch of a plausible sample-selection schedule follows the command.
srun python -u py_scripts_SSL4EO/train_SSL4EO_unet_pl_pretrain_mm_sscom.py \
--data_path $data_directory_path \
--data_name_s1 0k_251k_uint8_s1.lmdb \
--data_name_s2 0k_251k_uint8_s2c.lmdb \
--data_name_label dw_labels.lmdb \
--input_type s12 \
--fusion_type mid \
--save_dir $save_path \
--model_type resnet50 \
--n_channels 13 \
--n_classes 9 \
--loss_type cd \
--consist_loss_type ce \
--label_smoothing \
--label_smoothing_factor 0.15 0.05 \
--label_smoothing_prior_type ts \
--sample_selection \
--sample_selection_rmup_func exp \
--sample_selection_rmdown_epoch 80 \
--sample_selection_prop 0.5 \
--sample_selection_confidence_type ce \
--experiment ns_labels \
--validation 0.01 \
--batch_size 128 \
--num_workers 12 \
--accelerator gpu \
--slurm \
--epochs $ep \
--optimizer adam \
--learning_rate 0.005 \
--lr_adjust_type rop \
--lr_adjust 30
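The exact interplay of --sample_selection_rmup_func exp, --sample_selection_rmdown_epoch 80, and --sample_selection_prop 0.5 is defined in the training script. Purely as an assumption for intuition, such schedules often start by keeping (nearly) all pixels and ramp exponentially toward the target proportion, as in this sketch:

import math

def selection_prop(epoch, target_prop=0.5, ramp_epochs=80):
    # Illustrative exponential ramp toward the target kept-proportion;
    # the script's actual schedule may differ.
    if epoch >= ramp_epochs:
        return target_prop
    frac = 1.0 - math.exp(-5.0 * epoch / ramp_epochs)   # 0 -> ~1
    return 1.0 - (1.0 - target_prop) * frac             # 1.0 -> target_prop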
#S2 bands | Fusion type | Link
13B | middle | weights-cromss-13B-midFusion-epoch=199.ckpt
13B | late | weights-cromss-13B-lateFusion-epoch=199.ckpt
9B | middle | weights-cromss-9B-midFusion-epoch=199.ckpt
9B | late | weights-cromss-9B-lateFusion-epoch=199.ckpt
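The checkpoint names follow the PyTorch Lightning convention (epoch=199.ckpt), so they can likely be inspected and partially loaded as below. The 'encoder.' key prefix is an assumption about the module layout; check state_dict.keys() for the real prefixes.

import torch

# Inspect a released checkpoint and pull out encoder weights for transfer
# learning; the 'encoder.' prefix is an assumption, not the confirmed layout.
ckpt = torch.load('weights-cromss-13B-midFusion-epoch=199.ckpt',
                  map_location='cpu')
state_dict = ckpt['state_dict']   # Lightning stores weights under this key
encoder = {k.removeprefix('encoder.'): v
           for k, v in state_dict.items() if k.startswith('encoder.')}
print(f'{len(encoder)} encoder tensors')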
For multi-modal pretraining without sample selection, use the py_scripts_SSL4EO/train_SSL4EO_unet_pl_pretrain_mm.py script.
For single-modality pretraining, use the py_scripts_SSL4EO/train_SSL4EO_unet_pl_pretrain.py script with input_type=s1 or input_type=s2.
Example bash scripts using a single GPU:
train_SSL4EO_pl_ft_DFC2020.sh
train_SSL4EO_pl_ft_DW.sh
train_SSL4EO_pl_ft_OSM.sh
Table 2. Transfer learning results (%) on the SSL4EO-S12@DW dataset with DeepLabv3+ (ResNet-50 backbone) and UperNet (ViT-Large backbone).
Table 3. Transfer learning results (%) on the SSL4EO-S12@OSM dataset with FPN (ResNet-50 backbone) and UperNet (ViT-Large backbone).
@ARTICLE{liu-cromss,
  author={Liu, Chenying and Albrecht, Conrad M and Wang, Yi and Zhu, Xiao Xiang},
  journal={IEEE Transactions on Geoscience and Remote Sensing},
  title={CromSS: Cross-modal pretraining with noisy labels for remote sensing image segmentation},
  year={2025},
  volume={},
  number={},
  pages={in press}
}