Robust Multi-modal Remote Sensing Image Semantic Segmentation Using Tuple Perturbation-based Contrastive Learning
Keywords: Multi-modal Remote Sensing Image, Contrastive Learning, Tuple Perturbation, Negative Samples, Semantic Segmentation
Abstract. Deep learning models have shown promising potential for multi-modal remote sensing image semantic segmentation (MRSISS). However, the limited availability of labeled samples for training deep networks significantly constrains the performance of these models. To address this, self-supervised learning (SSL) methods have attracted considerable interest in the remote sensing community. Accordingly, this article proposes a novel multi-modal contrastive learning framework based on tuple perturbation. First, a tuple perturbation-based multi-modal contrastive learning network (TMCNet) is designed to better explore shared and modality-specific feature representations during the pre-training stage, and a tuple perturbation module is introduced to improve the network’s ability to extract multi-modal features by generating more complex negative samples. In the fine-tuning stage, we develop a simple and effective multi-modal semantic segmentation network (MSSNet), which reduces noise by exploiting complementary information across modalities to fuse multi-modal features more effectively, yielding better semantic segmentation performance. Extensive experiments on two published multi-modal image datasets consisting of optical and SAR pairs show that the proposed framework achieves superior semantic segmentation performance compared with current state-of-the-art methods when labeled samples are limited.
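To make the contrastive objective concrete, the following is a minimal PyTorch sketch of an InfoNCE-style multi-modal loss augmented with perturbation-generated negatives. This is an illustrative assumption of what "tuple perturbation" could look like, not the paper's actual module: the function name, the noise-based perturbation of one modality's embedding, and the hyperparameters (tau, perturb_std) are all hypothetical.

```python
import torch
import torch.nn.functional as F

def tuple_perturbation_infonce(z_opt, z_sar, tau=0.07, perturb_std=0.2):
    """InfoNCE-style cross-modal contrastive loss with extra synthetic negatives.

    z_opt, z_sar: (N, D) embeddings of paired optical / SAR patches.
    Hypothetical sketch: the paper's tuple perturbation module is not
    specified here, so harder negatives are approximated by jittering
    the SAR half of each (optical, SAR) tuple with Gaussian noise.
    """
    z_opt = F.normalize(z_opt, dim=1)
    z_sar = F.normalize(z_sar, dim=1)

    # Perturb one element of each tuple to synthesize additional negatives
    # that lie close to the positive pairs (i.e., "more complex" negatives).
    z_neg = F.normalize(z_sar + perturb_std * torch.randn_like(z_sar), dim=1)

    logits_pos = z_opt @ z_sar.t() / tau   # (N, N): diagonal entries are positives
    logits_neg = z_opt @ z_neg.t() / tau   # (N, N): every entry is a negative

    logits = torch.cat([logits_pos, logits_neg], dim=1)   # (N, 2N)
    labels = torch.arange(z_opt.size(0), device=z_opt.device)
    return F.cross_entropy(logits, labels)
```

Under this reading, perturb_std controls negative hardness: small noise keeps the synthetic negatives near the positive pairs, forcing the encoders to learn finer cross-modal distinctions than standard in-batch negatives would.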