CROSS-MODAL CHANGE DETECTION FLOOD EXTRACTION BASED ON SELF-SUPERVISED CONTRASTIVE PRE-TRAINING

ABSTRACT: Flood extraction is a critical task in remote sensing analysis. Accurate flood extraction faces challenges such as complex scenes, image differences across modalities, and a shortage of labeled samples. Supervised deep learning algorithms show promise for flood extraction, but they mostly rely on abundant labeled data. In practical applications, however, labeled samples of flood change regions are scarce, making such data expensive to acquire. In contrast, unlabeled remote sensing imagery is plentiful. Self-supervised contrastive learning (SSCL) provides a solution, allowing learning from unlabeled data without explicit labels. Inspired by SSCL, we utilized the open-source CAU-Flood dataset and developed a framework for cross-modal change detection in flood extraction (CMCDFE). We employed the Barlow Twins (BT) SSCL algorithm to learn effective visual feature representations of flood change regions from unlabeled cross-modal bi-temporal remote sensing data. These well-initialized weight parameters were then transferred to the flood extraction task, achieving optimal accuracy. We introduced the improved CS-DeepLabV3+ network, incorporating the CBAM dual attention mechanism, for extracting flood change regions from cross-modal bi-temporal remote sensing data. Experiments on the CAU-Flood dataset demonstrate that fine-tuning with only a pre-trained encoder can surpass widely used ImageNet pre-training without additional data, effectively addressing downstream cross-modal change detection flood extraction tasks.


INTRODUCTION
In recent years, frequent global flood disasters have caused substantial damage to both property and community safety. The essence of flood extraction lies in recognizing floodwaters, specifically in determining the extent of inundation (Zhang et al., 2021). With the advancement of satellite remote sensing technology, remote sensing images have become crucial tools, providing essential means and data support for acquiring flood-related information. Their rapid acquisition, strong timeliness, and capacity for large-scale repetitive observations significantly contribute to flood monitoring. Both multispectral remote sensing images and synthetic aperture radar (SAR) remote sensing images are applied in flood monitoring. The fusion of multispectral and SAR remote sensing images harnesses the complementary advantages of each, thereby enhancing the effectiveness of remote sensing-based flood monitoring (He et al., 2023; Zhang et al., 2021; Zhao et al., 2023; Zhang et al., 2023).
Since the introduction of fully convolutional networks (Long et al., 2014) in 2015, a multitude of end-to-end deep learning methodologies has been integrated into the task of cross-modal change detection (CD) for flood extraction, playing a pivotal role. These deep learning approaches primarily rely on supervised learning, demanding a substantial volume of labeled data (Konapala et al., 2021; Zhao et al., 2023; Zhang et al., 2023). However, flood disasters are characterized by their sudden and transient nature, with a scarcity of high-resolution satellite imagery during such events. Additionally, annotating remote sensing images incurs a high cost, making the acquisition of well-labeled flood samples a time-consuming and labor-intensive endeavor. To mitigate reliance on annotated data, various cross-modal flood extraction methods use models pre-trained on the large-scale ImageNet dataset, followed by fine-tuning with a limited amount of pixel-level annotations. However, substantial distribution disparities between ImageNet data and cross-modal flood monitoring datasets pose a considerable risk of domain shift. Recently, self-supervised learning has garnered significant research interest as a method to derive effective visual representations from a vast pool of unlabeled images (Caron et al., 2020; Chen et al., 2020; Chen and He, 2021; Grill et al., 2020; He et al., 2019; Tian et al., 2019; Zbontar et al., 2021). Essentially, self-supervised learning consists of two steps. First, a pretext task uses well-designed self-supervised signals and pseudo-labels (i.e., automatically generated labels) to aid in initializing model parameters, enabling the model to extract rich visual feature knowledge directly from unlabeled image data. This acquired knowledge is then transferred to specific downstream tasks to reduce reliance on large numbers of labeled samples, thereby enhancing the model's performance in those tasks.
Self-supervised contrastive pre-training represents a novel feature learning paradigm, primarily focused on defining positive and negative sample pairs. Its main objective is to maximize the similarity within positive pairs while minimizing it for negative pairs in the feature space, embodying the principle of "attraction within the same class, exclusion between different classes." Recent research highlights the high generalization ability of features pre-trained by most self-supervised contrastive learning (SSCL) methods, a notable advantage in initializing backbone networks for downstream tasks. Among the early SSCL methods, MoCo (He et al., 2019) and SimCLR (Chen et al., 2020) introduced innovative concepts like the momentum encoder and queue sampling, effectively handling negative samples. However, MoCo's computational speed is somewhat hindered by queue sampling, while SimCLR, with its larger batch size, may incur higher GPU memory demands and increased computational costs to prevent the learning of trivial solutions. BYOL (Grill et al., 2020) addresses trivial solutions through positive-sample contrastive learning, employing a symmetric network and stop-gradient methods; nonetheless, it is sensitive to task requirements and hyperparameter tuning. SimSiam (Chen and He, 2021), which avoids negative samples, leverages an asymmetric network structure and cross-gradient updates to counteract trivial solutions, but requires additional computational resources and is likewise sensitive to hyperparameters. SwAV (Caron et al., 2020), rooted in online clustering and multi-view prediction encoding, successfully circumvents negative-sample requirements, proving advantageous in high-label-cost scenarios; nevertheless, the intricate online clustering algorithm entails higher computational resources. In contrast, Barlow Twins (BT) (Zbontar et al., 2021) introduces an innovative approach to SSCL, unencumbered by batch size restrictions and negative samples. It emphasizes the embedding itself, steering clear of asymmetric structure design. By computing the cross-correlation matrix of augmented samples and utilizing a loss function to mitigate redundancy, BT drives the cross-correlation matrix toward an identity matrix. This indicates that the feature vectors of different augmentations of the same sample are similar, thereby minimizing redundancy across different dimensions and enhancing feature representation efficiency.
Given the inherent strengths of BT, we have chosen to employ it as the objective function for SSCL within our proposed framework for cross-modal CD flood extraction (CMCDFE). This correspondence presents the results of experimental validation conducted using the publicly available CAU-Flood dataset (He et al., 2023). CAU-Flood stands out as a remote sensing dataset explicitly crafted for cross-modal flood extraction, featuring multiple sets of pre-disaster Sentinel-2 optical images, post-disaster Sentinel-1 SAR images, and corresponding ground truth label images delineating altered regions. The articulated CMCDFE framework unfolds in two sequential phases: self-supervised contrastive pre-training and fine-tuning. In the initial stage, we construct a three-channel false-color image by amalgamating post-disaster Sentinel-1 VV polarization mode data, near-infrared band images extracted from pre-disaster Sentinel-2 optical images, and computed NDWI (Normalized Difference Water Index) index images. Subsequently, the BT algorithm is enlisted to distill effective visual representations of altered areas from these unlabeled false-color images. In the subsequent stage, we leverage the SSCL methodology to pre-train the encoder of the refined CS-DeepLabV3+ model. The encoder demonstrates noteworthy parameter initialization, and empirical evidence derived from the CAU-Flood flood monitoring dataset attests that fine-tuning exclusively with the pre-trained encoder outperforms the widely embraced ImageNet pre-training approach, eliminating the need for additional data. This methodological refinement efficiently addresses downstream tasks associated with cross-modal flood extraction.
The remainder of this paper is structured as follows. Section 2 outlines the proposed methodology in detail. In Section 3, we present the results of our experiments and engage in a comprehensive discussion. Concluding remarks are offered in Section 4.

CMCDFE framework
We utilized pre-disaster Sentinel-2 multispectral images to extract near-infrared band images and selected NDWI as the representation of water bodies in the multispectral data. NDWI is computed from the green and near-infrared bands, and its calculation formula is as follows:

NDWI = (Green − NIR) / (Green + NIR)
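The NDWI computation above can be sketched in a few lines; a minimal NumPy version, where the function name and the small `eps` stabilizer (to avoid division by zero over dark pixels) are illustrative assumptions, not part of the original formulation:

```python
import numpy as np

def ndwi(green: np.ndarray, nir: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalized Difference Water Index: (Green - NIR) / (Green + NIR).

    Values near +1 indicate open water (high green, low NIR reflectance);
    vegetation and bare soil tend toward negative values.
    """
    green = green.astype(np.float64)
    nir = nir.astype(np.float64)
    return (green - nir) / (green + nir + eps)
```

Applied per pixel to the Sentinel-2 green and NIR bands, this yields the NDWI index image that later serves as one channel of the false-color composite.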

SSCL pre-training method
The method employed in this paper can be divided into two parts. First, the BT algorithm is utilized for SSCL pre-training. Subsequently, the pre-trained weights are transferred to the downstream cross-modal flood extraction task. The BT algorithm remains broadly consistent with the SimCLR model in several aspects, including image augmentation, the encoder, and the projection module. It employs ResNet50 (excluding the final classification layer) as the feature extractor, followed by a projector network. This projector network consists of two linear layers, each with a hidden size of 512 output units. Due to high computational requirements, the output of the projection network is modified to generate embeddings of size 256, whereas the original BT network produces embeddings of size 8192 (Zbontar et al. 2021). The first layer of the projector is followed by a batch normalization layer and rectified linear units. Figure 2 provides an overview of the BT algorithm. As the BT algorithm does not require distinguishing between positive and negative samples, the input false-color image X undergoes transformations t ~ T to obtain two augmented views X_0 and X_1. After passing through the encoder f_θ, they respectively yield features f'_0 and f'_1. Following the projector g_θ, the extracted embeddings are denoted as z'_0 and z'_1. For a given batch, the network's loss function is:

L_BT = Σ_i (1 − C_ii)² + λ Σ_i Σ_{j≠i} C_ij²

The first component of the loss function is denoted the "invariance term," while the second is termed the "redundancy reduction term." Here, C signifies the cross-correlation matrix and can be computed as follows:

C_ij = Σ_b z'_{b,i,0} z'_{b,j,1} / ( sqrt(Σ_b (z'_{b,i,0})²) · sqrt(Σ_b (z'_{b,j,1})²) )

where C_ii denotes the diagonal elements of the cross-correlation matrix C, C_ij represents the off-diagonal elements, and λ is a trade-off hyperparameter. The index b runs over the samples in a batch, indicating that each element of C is computed across the batch dimension. It can be observed that the optimization objective drives the diagonal elements of the cross-correlation matrix C toward 1 and the off-diagonal elements toward 0.
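The BT objective can be sketched numerically; a minimal NumPy version, where the per-dimension standardization (zero mean, unit standard deviation over the batch) and the small stabilizer constant are implementation assumptions in the spirit of the original algorithm:

```python
import numpy as np

def barlow_twins_loss(z0: np.ndarray, z1: np.ndarray, lam: float = 5e-3) -> float:
    """Barlow Twins loss: sum_i (1 - C_ii)^2 + lam * sum_{i != j} C_ij^2.

    z0, z1: (batch, dim) embeddings of two augmented views of the same batch.
    """
    b = z0.shape[0]
    # standardize each embedding dimension across the batch
    z0n = (z0 - z0.mean(0)) / (z0.std(0) + 1e-12)
    z1n = (z1 - z1.mean(0)) / (z1.std(0) + 1e-12)
    c = z0n.T @ z1n / b                                # cross-correlation matrix C
    on_diag = ((1.0 - np.diag(c)) ** 2).sum()          # invariance term
    off = c - np.diag(np.diag(c))
    off_diag = (off ** 2).sum()                        # redundancy reduction term
    return float(on_diag + lam * off_diag)
```

With identical views the diagonal of C is exactly 1, so only the (λ-weighted) redundancy term remains; a view and its negation drive the diagonal to −1 and the loss up sharply, which matches the stated optimization objective.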

Dataset description and evaluation metrics
To evaluate the efficacy of the proposed algorithm, we employed the CAU-Flood cross-modal flood extraction dataset, comprising pre-disaster Sentinel-2 optical images and post-disaster Sentinel-1 SAR images for 18 distinct study regions.
Encompassing a comprehensive area of 95,142 square kilometers, the CAU-Flood dataset spans diverse geographical locations, including China, Bangladesh, Australia, the United States, Canada, and Germany. Notably, the Sentinel-1 images exhibit spatial resolutions ranging from 3.5 meters to 40 meters. Radar images from Sentinel-1 were acquired at the Ground Range Detected (GRD) level, with exclusive processing of data from the VV-polarization mode to optimize flood detection accuracy. Sentinel-2 images consist of four bands (red, green, blue, and near-infrared) with a spatial resolution of 10 meters.
To ensure semantic consistency in cross-modal interpretation, the CAU-Flood dataset underwent resampling to enforce uniform image sizes for SAR and optical pairs, and grayscale values were stretched to a standardized range of 0 to 255 (He et al., 2023). Processed Sentinel-2 and Sentinel-1 images served as pre-disaster and post-disaster inputs, respectively, with manual annotations identifying flood areas. This yielded a dataset comprising 18,302 image patches sized at 256×256, with 15,231 patches allocated for training and 3,071 for testing.
During the pre-training phase, we utilized pairs of pre-disaster and post-disaster images from the training set to construct false-color images, employing the BT algorithm for SSCL without the use of labeled images. Subsequently, the pre-trained ResNet50 encoder segment, endowed with well-tuned parameters, was transferred to the downstream CS-DeepLabV3+ model for the cross-modal flood extraction task. Four metrics, namely precision, recall, F1 score, and IoU, were employed to evaluate our algorithm's performance, and comparisons were made with state-of-the-art (SOTA) methods.
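The four evaluation metrics follow directly from the pixel-level confusion counts; a small self-contained sketch (the function name is illustrative):

```python
def flood_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, F1, and IoU from pixel-level confusion counts.

    tp: flood pixels correctly detected; fp: non-flood pixels flagged as flood;
    fn: flood pixels missed. TN is not needed for these four metrics.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return precision, recall, f1, iou
```

Note that IoU is always at most F1 for the same counts, which is why IoU differences between methods are typically the most discriminative of the four.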

Implementation details
This paper utilizes the PyTorch framework to implement the BT algorithm. The specifications of our experimental machine are as follows: a 12th Gen Intel Core i9-12900K @ 3.19 GHz processor, 64.00 GB RAM, and an NVIDIA GeForce RTX 3090 graphics card. To optimize the model, we adhere to the BT protocol (Zbontar et al. 2021). During the SSCL pre-training phase, we set the batch size to 20 and employed the LARS optimizer for training over 400 epochs. The initial learning rate was set at 0.005, adjusted by multiplying with the batch size and dividing by 256. We introduced a learning rate warm-up period of 10 epochs, followed by a cosine decay schedule (Zbontar et al. 2021), reducing the learning rate by a factor of 1000. The trade-off parameter λ of the loss function follows the BT protocol, and the weight decay parameter is set to 1.5×10⁻⁶.
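The learning-rate schedule described above (linear scaling by batch size, 10-epoch warm-up, cosine decay to one-thousandth of the peak) can be sketched as a per-epoch function; the function name and the exact linear warm-up shape are assumptions for illustration:

```python
import math

def lr_at_epoch(epoch: int, total_epochs: int = 400, warmup: int = 10,
                base_lr: float = 0.005, batch_size: int = 20,
                final_factor: float = 1e-3) -> float:
    """Linear warm-up followed by cosine decay down to peak_lr / 1000."""
    lr_max = base_lr * batch_size / 256                 # linear-scaling rule from the text
    if epoch < warmup:
        return lr_max * (epoch + 1) / warmup            # linear warm-up
    t = (epoch - warmup) / max(1, total_epochs - warmup)
    lr_min = lr_max * final_factor                      # "reduced by a factor of 1000"
    return lr_min + (lr_max - lr_min) * 0.5 * (1 + math.cos(math.pi * t))
```

In practice the same shape would be handed to the LARS optimizer via a per-step scheduler; the per-epoch view here is only to make the warm-up and decay phases explicit.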
We evaluated the performance of the CS-DeepLabV3+ algorithm on downstream cross-modal flood extraction using the SSCL and fine-tuning strategies. For consistency, we employed ResNet50 as the feature extractor for the CS-DeepLabV3+ algorithm. In the downstream task experiments, we used the Adam optimizer with β1 = 0.9, β2 = 0.999, and ε = 1.0×10⁻⁸. The initial learning rate was set to 0.005, with a batch size of 10, and a total of 150 epochs were conducted. We adopted a hybrid loss function (Fang et al., 2022), combining weighted cross-entropy and Dice loss with equal weights.
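For the binary (changed/unchanged) setting, the hybrid loss combining weighted cross-entropy and Dice loss with equal weights can be sketched as below. This is an illustrative NumPy formulation, not the exact implementation of Fang et al. (2022); the class weights `w_pos`/`w_neg` and the clipping constant are assumptions:

```python
import numpy as np

def hybrid_loss(prob: np.ndarray, target: np.ndarray,
                w_pos: float = 1.0, w_neg: float = 1.0, eps: float = 1e-7) -> float:
    """Equal-weight sum of weighted binary cross-entropy and Dice loss.

    prob: predicted flood probabilities in [0, 1]; target: binary ground truth mask.
    """
    prob = np.clip(prob, eps, 1 - eps)                 # avoid log(0)
    wce = -(w_pos * target * np.log(prob)
            + w_neg * (1 - target) * np.log(1 - prob)).mean()
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    return float(0.5 * (wce + dice))
```

The cross-entropy term supervises each pixel independently, while the Dice term counteracts the foreground/background imbalance typical of flood masks, where changed pixels are a small minority.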

Evaluation on CAU-Flood dataset
We conducted a comparison between the cross-modal flood extraction method CS-DeepLabV3+ proposed in this paper and several SOTA methods, including UNet++ (Peng et al., 2019), ResUNet (Zhang et al., 2018), PSPNet (Zhao et al., 2017), HRNet (Sun et al., 2019), and DeeplabV3+. Among them, both UNet++ and ResUNet represent advancements over the traditional UNet network, which has become a fundamental network in various remote sensing applications. UNet++ inherits the structure of UNet while incorporating dense skip connections, thereby maximizing the preservation of fine-grained details and global information. In contrast, ResUNet leverages the benefits of both residual networks and UNet, with residual connections alleviating the gradient vanishing problem in deep networks, contributing to faster convergence and improved training efficiency. PSPNet aggregates contextual information from different regions of the image using pyramid pooling modules, integrating complex contextual information into the pixel-level semantic segmentation framework. HRNet transforms the connection between high-resolution and low-resolution feature maps from a serial to a parallel structure, thereby maintaining the representation of high-resolution feature maps throughout the entire network. To assess the effectiveness of the SSCL pre-training method proposed in this paper, we initialized the parameters of the five comparative methods during training using ImageNet pre-trained weights. In contrast, our approach transfers the pre-trained weights of the BT algorithm to the cross-modal flood extraction task.
The results of cross-modal flood extraction on the CAU-Flood dataset using various SOTA methods are presented in Figure 4. From top to bottom, these scenarios include pre-disaster Sentinel-2 multispectral imagery, post-disaster Sentinel-1 VV polarization mode data, false-color images, ground truth images (where white represents changed areas and black represents unchanged areas), and the results of UNet++, ResUNet, PSPNet, HRNet, DeeplabV3+, and CS-DeepLabV3+. The results in Figure 4 demonstrate that all six comparative methods are proficient in handling the cross-modal flood extraction task, with each SOTA method yielding satisfactory segmentation results. Nonetheless, there are instances of suboptimal segmentation outcomes. While each model exhibits some missed detections, the extracted results overall remain acceptable. In Figure 4, grey represents true-negative (TN) pixels, green represents true-positive (TP) pixels, blue indicates false-positive (FP) pixels, and red corresponds to false-negative (FN) pixels.
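The TN/TP/FP/FN color coding used in Figure 4 is a simple per-pixel comparison of prediction and ground truth; a minimal sketch (function name and exact RGB triples are illustrative, chosen to match the stated grey/green/blue/red scheme):

```python
import numpy as np

def confusion_map(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Color-code a binary prediction against ground truth.

    Grey = TN, green = TP, blue = FP, red = FN (as in Figure 4).
    pred, gt: 2-D binary arrays (1 = changed/flooded, 0 = unchanged).
    """
    out = np.zeros(pred.shape + (3,), dtype=np.uint8)
    out[(pred == 0) & (gt == 0)] = (128, 128, 128)   # true negative
    out[(pred == 1) & (gt == 1)] = (0, 255, 0)       # true positive
    out[(pred == 1) & (gt == 0)] = (0, 0, 255)       # false positive
    out[(pred == 0) & (gt == 1)] = (255, 0, 0)       # false negative
    return out
```

Such maps make systematic error modes visible at a glance: scattered blue speckle suggests SAR-induced false alarms, while contiguous red regions indicate missed inundation.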

Ablation experiment
We fine-tuned the proposed CS-DeepLabV3+ algorithm on the CAU-Flood dataset using three strategies: random initialization (Rand-init), supervised ImageNet pre-training (ImageNet-sup), and pre-training with BT self-supervised learning. To assess the effectiveness of the adopted self-supervised contrastive pre-training method, we conducted a detailed comparative analysis with three additional self-supervised pre-training methods, namely SimCLR (Chen et al., 2020), MoCo (He et al., 2019), and CMC (Tian et al., 2019). These methods all employ a contrastive loss, which differs from the BT loss function used in our approach. According to the results in Table 2,

CONCLUSION
Supervised deep learning models demand a substantial amount of annotated data when tasked with flood extraction from cross-modal remote sensing images. However, the collection and annotation of samples containing the desired flood change regions are both time-consuming and labor-intensive. To tackle this challenge, the adoption of transfer learning with a self-supervised contrastive pre-training strategy has proven effective. In this study, we applied the BT self-supervised learning algorithm to learn effective visual feature representations of flood change regions from unlabeled cross-modal bi-temporal remote sensing data. These well-initialized weight parameters were then transferred to the flood extraction task. We introduced an improved CS-DeepLabV3+ network for extracting flood change regions from cross-modal bi-temporal remote sensing data, incorporating the CBAM dual attention mechanism. Experimental analysis on the open-source CAU-Flood dataset validated the effectiveness of our proposed method. The results demonstrated that fine-tuning with only a pre-trained encoder can surpass widely used ImageNet pre-training methods without the need for additional data, effectively addressing downstream cross-modal flood extraction tasks. Even with a limited number of labeled samples, our self-supervised pre-training strategy remains effective. This is particularly beneficial for flood extraction applications facing challenges in acquiring labeled data for flood change regions due to cost constraints. In the future, we plan to replace the ResNet50 encoder component of our approach with a vision transformer to further enhance the accuracy of flood extraction.

This paper synthesizes post-disaster Sentinel-1 VV polarization mode data, near-infrared band images, and NDWI index images through channel fusion, constructing a three-channel false-color image. Taking X_train to be the set of synthesized three-channel false-color images, we use the BT SSCL algorithm to pre-train the model on this unlabeled training set. Our goal is to use self-supervised contrastive pre-training to learn effective visual feature representations of cross-modal flood changes from the synthesized three-channel false-color images. Subsequently, the learned encoder weights are used as the initial weights for the downstream flood change region extraction network, the CS-DeepLabV3+ algorithm. Figure 1 illustrates the schematic diagram of the proposed CMCDFE framework, where the knowledge transfer of self-supervised contrastive learning feature representation is well-validated in the downstream task.
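The channel-fusion step described above can be sketched as follows; a minimal NumPy version, where the function names and the per-band min-max stretch to 0-255 (mirroring the dataset's grayscale stretching) are illustrative choices rather than the authors' exact preprocessing:

```python
import numpy as np

def to_uint8(band: np.ndarray) -> np.ndarray:
    """Min-max stretch one band to the 0-255 range (illustrative normalization)."""
    lo, hi = float(band.min()), float(band.max())
    if hi == lo:
        return np.zeros(band.shape, dtype=np.uint8)
    return np.round((band - lo) / (hi - lo) * 255).astype(np.uint8)

def make_false_color(vv: np.ndarray, nir: np.ndarray, ndwi: np.ndarray) -> np.ndarray:
    """Stack post-disaster SAR VV, pre-disaster NIR, and NDWI into a 3-channel image.

    Each input is a co-registered 2-D array; output is (H, W, 3) uint8.
    """
    return np.stack([to_uint8(vv), to_uint8(nir), to_uint8(ndwi)], axis=-1)
```

Because the three channels mix post-disaster SAR backscatter with pre-disaster optical water evidence, flood change regions tend to appear with a distinctive hue in the composite, which is what the BT pretext task then learns to represent.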

Figure 1. Pipeline of the proposed CMCDFE framework.
After multiple training iterations, the cross-correlation matrices calculated for positive examples of the same image under different transformations tend to approach the identity matrix (Zbontar et al. 2021).

Figure 2. Flow chart of the proposed SSCL pre-training algorithm.

Figure 4. Visual comparisons of the different models applied to CAU-Flood. Grey: TN pixels; green: TP pixels; blue: FP pixels; red: FN pixels.
our proposed BT self-supervised contrastive pre-training method outperforms the widely used ImageNet pre-training method, as well as the SimCLR, MoCo, and CMC methods. Compared to the Rand-init strategy, applying our BT pre-training process increased the F1 score on the CAU-Flood test set by nearly 1.5%. Similarly, our proposed pre-training method slightly outperformed the supervised ImageNet pre-training method and the other SSCL methods. Overall, the ablation experiments indicate that our proposed method for unlabeled cross-modal remote sensing images can match or even surpass the performance of widely used ImageNet pre-training, as well as SSCL methods such as SimCLR, MoCo, and CMC, which utilize over a million labeled images. These results also indirectly confirm that our method mitigates the domain shift problem caused by transfer learning from ImageNet weights in the task of cross-modal flood extraction.

Table 1. Performance comparison for the CAU-Flood dataset. (All values are in percentages.)

The qualitative comparison results in Figure 4 demonstrate that flood detection based on deep learning exhibits good adaptability to different types of land cover and can be employed in situations with frequent flooding. It can effectively identify flooded areas in environments such as estuaries, inland river plains, villages, and lakes. The flood detection accuracy evaluation results of these comparative methods on the test set are presented in Table 1. Thanks to the CAU-Flood dataset, various deep learning models demonstrate excellent performance in the cross-modal flood extraction task.

Table 2. Ablation experiment results on the CAU-Flood dataset. (All values are in percentages. ❌ indicates excluded steps during the training process, while ✔ denotes their inclusion.)

As is well known, large-scale flood extraction tasks currently face a challenge due to the absence of a sufficiently large and publicly available annotated dataset. This limitation hinders the widespread application of deep learning methods in cross-modal flood extraction tasks. On one hand, annotating cross-modal bi-temporal remote sensing images for flood change regions from large-scale datasets is an expensive, tedious, time-consuming, and primarily manual process. On the other hand, there is an urgent need for methods capable of learning and expressing visual information in cross-modal images without reliance on labeled samples. To thoroughly validate the performance of the proposed BT self-supervised contrastive pre-training method with a small number of labeled samples, we fine-tune a cross-modal flood extraction network with a limited number of labeled samples and evaluate the accuracy of the final flood extraction results. This paper compares the impact of Rand-init, ImageNet-sup, SimCLR, MoCo, and CMC on the performance of cross-modal flood extraction tasks, as presented in Table 3.

Table 3. Performance of the different pre-training methods evaluated using the improved CS-DeepLabV3+ model with limited labels. (All values are in percentages.)