A Novel Hybrid Model Based on CNN and Multi-scale Transformer for Extracting Water Bodies from High Resolution Remote Sensing Images

ABSTRACT: Extracting water bodies from high-resolution remote sensing images has long been a challenging and actively studied task in the field of remote sensing. Considering that the accuracy and reliability of water body extraction still have room for improvement, this paper proposes a hybrid network model based on CNN and multi-scale transformer for water body extraction from high-resolution remote sensing images. Specifically, the proposed network first uses a CNN model to extract a series of multi-scale features, from shallow to deep, from remote sensing images. These multi-scale features are then fed into a designed multi-scale transformer module to extract global contextual association information of water bodies. Afterwards, the water separability of the new multi-scale features output by the multi-scale transformer module is evaluated separately, and the features at different scales are adaptively weighted and fused according to their water separability. Subsequently, the network adaptively refines the fused features with the aid of a hybrid attention model to generate refined features that can effectively distinguish between water bodies and non-water bodies. Finally, these refined features are input into the prediction head to generate the final water body extraction results. The proposed network integrates the ability of CNN to capture local detail features and the ability of the transformer to model global contextual semantic associations over a large range. Therefore, it can more accurately identify water bodies in remote sensing images, and the extracted water body boundaries have high accuracy and continuity. Water body extraction experiments on a public dataset demonstrate the effectiveness of the proposed network. Moreover, the results of comparative experiments show that, compared with existing networks such as U-Net, FCN8s, DeepLabv3+, and MSFA-Net, the proposed network has certain advantages in terms of water body extraction accuracy.


INTRODUCTION
The spatial distribution information of surface water bodies plays an important role in many applications such as water resources management and protection, flood monitoring, and sustainable development.
With the rapid development of remote sensing technology, water body extraction based on remote sensing images has become the mainstream way to obtain the spatial distribution information of surface water bodies. In particular, high-resolution remote sensing images have become the main data source for water body extraction due to their increasing availability. For instance, Liu et al. (2023) proposed a multi-scale feature extraction network for water extraction from optical high-resolution remote sensing imagery. Zhang et al. (2021) also used high-resolution remote sensing images as a data source for fine-grained tidal flat water body extraction.
Meanwhile, as deep learning continues to make breakthroughs in various related fields, water body extraction from high-resolution remote sensing images based on deep learning has gradually become the focus of attention. Correspondingly, many deep learning-based methods have been proposed for water body extraction from remote sensing images. As one of the mainstream deep learning models, the Convolutional Neural Network (CNN) is widely used in these existing methods. This is because a CNN can effectively extract rich features of objects in the image, including shallow detailed features and deep abstract features, for better distinction between water bodies and non-water bodies. For example, Lu et al. (2022) proposed a neighbor feature aggregation network for weakly supervised water extraction from high-resolution remote sensing imagery. Duan et al. (2021) proposed a new lightweight CNN named Lightweight Multi-Scale Land Surface Water Extraction Network (LMSWENet) to extract land surface water information from GaoFen-1D satellite images. Hu et al. (2022) also adopted a CNN model to extract rich features for water body segmentation. Similar studies on CNN-based water body extraction can be found in the literature (Chen et al., 2018; Tao et al., 2020; Liu et al., 2021). However, although these CNN-based methods can achieve higher accuracy in water body extraction than traditional methods, the generated water body maps still suffer from problems such as many false detections, inaccurate water body boundaries, and poor continuity.
In recent years, the transformer has demonstrated excellent performance in the field of image processing, so it has gradually been introduced into water body extraction tasks based on remote sensing images. For example, Zhong et al. (2022) proposed a transformer-based water extraction network called NT-Net. Song et al. (2023) used the Swin Transformer network for water extraction from remote sensing images. Since the transformer has a larger receptive field and stronger context modeling capabilities, it can capture semantic dependence information of water bodies over a wide range in remote sensing images. Therefore, water bodies segmented from remote sensing images using these transformer-based methods tend to have higher boundary accuracy and continuity than those produced by CNN-based methods. Generally speaking, these existing methods can achieve acceptable results in water body extraction tasks. However, there is still room for improvement in the accuracy of water body extraction, especially in the accuracy and continuity of water body boundaries and in complex surface scenes. Therefore, exploring how to effectively integrate the advantages of CNN and transformer to further improve the accuracy of water body extraction from remote sensing images is a direction worthy of research.
In view of the above considerations, this paper proposes a hybrid network model based on CNN and multi-scale transformer for water body extraction from high-resolution remote sensing images. The proposed network first uses a CNN backbone to extract multi-scale features, from shallow to deep, from remote sensing images. These multi-scale features are then fed into a designed multi-scale transformer module to extract global contextual association information of water bodies. Afterwards, the separability of water bodies in all features is evaluated separately, and the features at different scales are adaptively weighted and fused under the guidance of water separability. Subsequently, the network adaptively refines the fused features with the aid of a hybrid attention model to generate refined features that can effectively distinguish between water bodies and non-water bodies. Finally, these refined features are input into the prediction head to generate the final water body extraction results.
The rest of this paper is organized as follows. Section 2 describes the details of the proposed network. Section 3 presents the experimental results and analysis. Finally, the conclusions are drawn in Section 4.

METHODOLOGY
Figure 1 shows the overall architecture of the proposed network. As can be seen from Figure 1, the proposed network model mainly includes five stages: 1) Multi-scale feature extraction based on CNN; 2) Global contextual association information extraction based on multi-scale transformer; 3) Multi-scale feature fusion guided by water separability; 4) Feature refinement based on attention mechanism; 5) Water body prediction. Overall, the proposed network first uses a CNN backbone to extract a series of multi-scale features, from shallow to deep, from the remote sensing images. Then, these extracted features are input into the corresponding processing channels of the multi-scale transformer module to extract global association information at different scales. Correspondingly, a series of new features containing global dependence information of water bodies at different scales are generated. Subsequently, the water separability is evaluated on the features output by each processing channel of the multi-scale transformer module, and the features of different scales are weighted and fused according to the water separability. Afterwards, a hybrid attention model is adopted to adaptively refine the fused features and generate refined features that can effectively distinguish water bodies from non-water bodies. Finally, these refined features are input into the prediction head to generate the final water body extraction results. Considering that rich local and global features have been extracted from the imagery in the previous modules, a shallow FCN is used for water body prediction in the prediction head (Chen et al., 2022).

Multi-scale feature extraction based on CNN
CNN has shown excellent performance in image feature extraction. In particular, U-Net, a modified fully convolutional network, can combine high-level semantic information with low-level texture information through skip connections to achieve feature extraction and detail recovery. U-Net has been widely adopted in segmentation and classification tasks on remote sensing images. Therefore, the proposed network uses U-Net as the backbone for multi-scale feature extraction.

Global contextual association information extraction based on multi-scale transformer
In the proposed network, the purpose of the multi-scale transformer module is to obtain the global dependence information of water bodies at different scales in the image to enhance the accuracy of water body extraction, especially the accuracy and continuity of the extracted water body boundaries. The series of features extracted from images by the CNN contain different levels of semantic information. Using the transformer to model the global semantic dependencies separately at each scale therefore extracts the spatial features and contextual semantic associations of water bodies in the image more effectively.
The core of the transformer lies in its multi-head self-attention mechanism. Figure 2 shows the structure of self-attention, which is the scaled dot-product attention mechanism (Ma et al., 2022). The self-attention head function can be represented as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \quad (1)$$
where $Q$, $K$, and $V$ represent the Query, Key, and Value, respectively, which are obtained from the input data $X$ through three linear mapping layers (Dosovitskiy et al., 2020), $d$ denotes the dimension of $K$, and $\mathrm{softmax}(\cdot)$ represents the softmax function used to generate attention scores.
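As an illustration of the scaled dot-product self-attention described above, the following is a minimal single-head NumPy sketch. The token count, feature dimensions, and randomly initialized projection matrices `W_q`, `W_k`, `W_v` are assumptions chosen for the example; they are not the settings of the proposed network.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (num_tokens, d_model) input tokens.
    W_q, W_k, W_v: (d_model, d) linear projections (hypothetical weights).
    """
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    d = K.shape[-1]                                # dimension of the keys
    scores = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # attention scores per query
    return scores @ V, scores

# Toy example: 6 tokens with 8-dimensional embeddings, projected to d = 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, scores = self_attention(X, W_q, W_k, W_v)
```

Each row of `scores` sums to one, so every output token is a convex combination of the value vectors of all tokens, which is how a single head captures dependencies across the whole input.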

Multi-scale feature fusion guided by water separability
A series of new multi-scale features containing the global dependence information of water bodies is obtained from each processing channel of the multi-scale transformer. The separability of water bodies and non-water bodies varies from feature to feature, and features with high water separability are more helpful for improving the accuracy and reliability of water extraction. Therefore, in this section, the proposed network first evaluates the water separability of features at different scales, and then performs water separability-guided adaptive weighted fusion of features at different scales. Correspondingly, a fused feature set is obtained.
Suppose a total of $M$ processing channels are used in the multi-scale transformer module, which correspond to $M$ different scales. At scale $m$ ($m = 1, 2, \ldots, M$), the generated series of features is denoted as $F^m = \{f_1, f_2, \ldots, f_N\}$, where $N$ represents the total number of features output by the corresponding channel. For the $n$-th feature $f_n$ ($n = 1, 2, \ldots, N$) in $F^m$, let $\mu_i$, $\mu_j$ and $\delta_i$, $\delta_j$ denote the means and standard deviations of water and non-water samples in this feature, respectively. Then the water separability $S_n$ of the feature $f_n$ is expressed by Equation (2). Furthermore, the total separability $TS^m$ of water bodies and non-water bodies in the feature set $F^m$ corresponding to scale $m$ is expressed by Equation (3). After evaluating the total separability of the feature sets at all scales, the features of different scales are weighted and fused according to Equation (4) to generate a new fused feature set $F$. In this weighted fusion process, features with higher water separability receive higher weights.
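The separability-guided fusion described above can be sketched as follows. The exact forms of Equations (2) to (4) are not reproduced here, so the sketch makes three labeled assumptions: separability is measured as the absolute difference of class means divided by the sum of class standard deviations (a common normalized-distance measure), the total separability of a scale is the sum over its features, and the fusion weights are the total separabilities normalized to sum to one. It also assumes that the feature sets of all scales have already been resampled to a common shape. The function names are hypothetical.

```python
import numpy as np

def separability(feature, water_mask, eps=1e-8):
    # ASSUMPTION: normalized distance |mu_i - mu_j| / (delta_i + delta_j);
    # the paper's Equation (2) may differ.
    water = feature[water_mask]
    non_water = feature[~water_mask]
    return abs(water.mean() - non_water.mean()) / (water.std() + non_water.std() + eps)

def total_separability(features, water_mask):
    # ASSUMPTION: total separability of a scale = sum over its N features.
    return sum(separability(f, water_mask) for f in features)

def fuse_scales(feature_sets, water_mask):
    """Weight each scale by its share of the total separability and sum.

    feature_sets: list of M arrays of shape (N, H, W), one per scale,
                  assumed to be resampled to the same shape beforehand.
    water_mask:   boolean (H, W) array marking water samples.
    """
    ts = np.array([total_separability(fs, water_mask) for fs in feature_sets])
    weights = ts / ts.sum()  # ASSUMPTION: weights normalized to sum to one
    fused = sum(w * fs for w, fs in zip(weights, feature_sets))
    return fused, weights

# Toy example: two scales with three 8 x 8 features each.
rng = np.random.default_rng(0)
water_mask = np.zeros((8, 8), dtype=bool)
water_mask[:, :4] = True  # left half of the tile is water
informative = water_mask.astype(float)[None] + 0.05 * rng.normal(size=(3, 8, 8))
uninformative = rng.normal(size=(3, 8, 8))
fused, weights = fuse_scales([informative, uninformative], water_mask)
```

In this toy setting the first scale separates water from non-water far better than the second, so it receives the larger weight, which matches the behavior stated in the text.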

Feature refinement based on attention mechanism
To eliminate redundant features in the newly generated feature set $F$ and further increase the difference between water bodies and non-water bodies in the features, this paper adopts the hybrid attention model CBAM to adaptively refine the feature set $F$. As shown in Figure 3, for the input feature $F$, CBAM sequentially infers a channel attention map $M_c(F)$ with its channel attention module (CAM) and a spatial attention map $M_s(F')$ with its spatial attention module. These attention maps are multiplied with the input feature maps to achieve adaptive feature refinement (Woo et al., 2018). This process can be expressed by Equations (5) and (6):
$$F' = M_c(F) \otimes F \quad (5)$$
$$F'' = M_s(F') \otimes F' \quad (6)$$
where $\otimes$ represents element-wise multiplication, $F'$ represents the intermediate features generated after processing the input features with the CAM module, and $F''$ represents the final refined features generated after CBAM processing.
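The two-step CBAM refinement can be illustrated with a small NumPy sketch that follows the general structure described by Woo et al. (2018): channel attention from a shared two-layer MLP applied to average-pooled and max-pooled channel descriptors, followed by spatial attention from a convolution over channel-wise average and max maps. The weights `W1`, `W2`, and `kernel` are random and untrained, and the channel count, reduction ratio, and 7 x 7 kernel size are assumptions for the example rather than the configuration used in this paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    # F: (C, H, W). Pool over the spatial axes to get two channel descriptors.
    avg = F.mean(axis=(1, 2))
    mx = F.max(axis=(1, 2))

    def mlp(v):
        # Shared MLP: reduce with W1, apply ReLU, expand back with W2.
        return W2 @ np.maximum(W1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))  # shape (C,)

def spatial_attention(F, kernel):
    # Pool over the channel axis, then apply a k x k cross-correlation
    # with zero padding so the output keeps the (H, W) size.
    pooled = np.stack([F.mean(axis=0), F.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    H, W = F.shape[1:]
    out = np.zeros((H, W))
    for i in range(k):
        for j in range(k):
            out += (kernel[:, i, j, None, None] * padded[:, i:i + H, j:j + W]).sum(axis=0)
    return sigmoid(out)  # shape (H, W)

def cbam(F, W1, W2, kernel):
    F1 = channel_attention(F, W1, W2)[:, None, None] * F  # channel-refined features
    F2 = spatial_attention(F1, kernel)[None, :, :] * F1   # final refined features
    return F2

# Toy example: 8 channels, reduction ratio 2, 7 x 7 spatial kernel.
rng = np.random.default_rng(0)
F = rng.normal(size=(8, 16, 16))
W1 = 0.1 * rng.normal(size=(4, 8))
W2 = 0.1 * rng.normal(size=(8, 4))
kernel = 0.1 * rng.normal(size=(2, 7, 7))
refined = cbam(F, W1, W2, kernel)
```

Because both attention maps take values between 0 and 1, the refinement can only attenuate activations; after training, the attenuation concentrates on channels and locations that contribute little to separating water from non-water.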

Water body prediction
In the above CNN backbone and multi-scale transformer modules, rich local and global features have been mined from remote sensing images. Therefore, after obtaining the refined features, the proposed network adopts a shallow FCN as the prediction head to generate the final water body map.

EXPERIMENT AND ANALYSIS
To verify the effectiveness of the proposed network, we test it on a publicly available dataset. This section first introduces the dataset used and the related experimental settings, and then presents and analyzes the experimental results in detail.

Data Set
A popular and publicly available dataset GID (Tong et al., 2018) with 150 images is used in this study, 120 of which are used as training set and the rest as test set.In the GID dataset, the image size is 6800 pixels × 7200 pixels, and the spatial resolution is 4 meters.
In addition to the unknown class, GID includes five classes: built-up, farmland, forest, meadow, and waters. To make it suitable for water segmentation tasks, we merge built-up, farmland, forest, and meadow into a single non-water class. Unknown data are not considered in the training and testing processes. Figure 4 shows a part of the experimental data and its ground truth.
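The class merging described above can be sketched as a lookup-table remapping. The integer class codes and the ignore value of 255 below are hypothetical, chosen only for illustration; the actual label encoding of GID should be taken from the dataset documentation.

```python
import numpy as np

# HYPOTHETICAL integer codes for the GID classes (not the dataset's real encoding).
UNKNOWN, BUILT_UP, FARMLAND, FOREST, MEADOW, WATER = range(6)
IGNORE = 255  # assumed marker for pixels excluded from training and testing

# Lookup table: water -> 1, the four land classes -> 0, unknown -> IGNORE.
lut = np.zeros(6, dtype=np.uint8)
lut[WATER] = 1
lut[UNKNOWN] = IGNORE

def to_binary_water_mask(label_map):
    """Map a multi-class label map to water (1), non-water (0), or IGNORE."""
    return lut[label_map]

labels = np.array([[UNKNOWN, BUILT_UP, WATER],
                   [FOREST, WATER, MEADOW]])
binary = to_binary_water_mask(labels)
# binary is [[255, 0, 1], [0, 1, 0]]
```

Pixels marked with `IGNORE` can then be masked out when computing the loss and the evaluation indicators.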

Experimental settings and evaluation indicators
To verify the effectiveness of the proposed network, a recently proposed water segmentation network, MSFA-Net (Hu et al., 2022), and several deep learning network models commonly used in water segmentation tasks were used for comparison experiments, including U-Net (Ronneberger et al., 2015), FCN8s (Shelhamer et al., 2017), and DeepLabv3+ (Chen et al., 2018). To quantitatively compare the water extraction results of different methods, several commonly used evaluation indexes were adopted, namely intersection over union (IoU), precision, recall, F1-score, and kappa coefficient (KC) (Carletta, 1996; Duan and Hu, 2020).
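For reference, the five evaluation indexes can be computed from the binary confusion counts using their standard definitions, with water as the positive class. The function name and the example counts below are illustrative assumptions, not values from the experiments.

```python
def water_metrics(tp, fp, fn, tn):
    """Standard binary segmentation metrics from confusion counts.

    tp: water pixels predicted as water
    fp: non-water pixels predicted as water
    fn: water pixels predicted as non-water
    tn: non-water pixels predicted as non-water
    """
    total = tp + fp + fn + tn
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / total
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return {"iou": iou, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}

# Made-up counts for illustration only.
m = water_metrics(tp=40, fp=10, fn=10, tn=40)
# IoU = 40 / 60 (about 0.667); precision, recall, and F1 are about 0.8; kappa is about 0.6
```

The sketch does not guard against empty classes; a production version would need to handle zero denominators, for example when a test tile contains no water.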

Experimental results and analysis
Figure 5 shows some results of water body extraction by different methods on the GID dataset. As can be seen from Figure 5, the water body extraction results of the proposed network (Ours) are highly consistent with the ground truth, and the water body regions extracted by the proposed network have good continuity and high boundary accuracy. In contrast, the water body extraction results generated by U-Net, FCN8s, and DeepLabv3+ contain serious false detections. The proposed network can also accurately identify narrow water bodies, while U-Net and MSFA-Net perform poorly in such scenarios. Meanwhile, Figure 5 shows that the networks based on a pure CNN structure cannot accurately locate water body boundaries, resulting in low boundary accuracy of the extracted water bodies.
Table 1 shows the water body extraction accuracy of different methods on the GID dataset. From Table 1, it can be seen that the proposed network exhibits the best overall water body extraction performance. Specifically, the proposed network is superior to the other models in four indicators (IoU, recall, F1-score, and KC); only its precision is slightly lower than that of DeepLabv3+. These experimental results demonstrate the effectiveness of the network proposed in this paper. They also show that, compared with existing network models such as U-Net, FCN8s, DeepLabv3+, and MSFA-Net, the proposed network has certain advantages in terms of water body extraction accuracy.

Table 1. Water body extraction accuracies of different methods on the GID dataset.

CONCLUSIONS
To more accurately extract water bodies from high-resolution remote sensing images, this paper proposes a new hybrid network model based on CNN and multi-scale transformer. The proposed network integrates the ability of CNN to capture local detail features and the ability of the transformer to model global contextual semantic associations over a large range. Therefore, it can more accurately identify water bodies in remote sensing images, and the extracted water body boundaries have high accuracy and continuity. In this study, the proposed network is tested on a publicly available dataset, and the results of comparative experiments demonstrate its effectiveness and superiority.

Figure 1. The overall architecture of the proposed network.
Figure 2. The self-attention mechanism in transformer.
Figure 3. The structure of the CBAM.
Figure 5. Partial results of water body extraction by different methods on the GID dataset.