SHIP DETECTION IN COSMO-SKYMED SAR IMAGERY USING A NOVEL CNN-BASED DETECTOR: A CASE STUDY FROM THE SUEZ CANAL

The Suez Canal, strategically located as the shortest international sea route, plays a crucial role in facilitating the transportation of goods between Asia and Europe. However, traffic disruptions within the Canal pose a serious threat to global trade, as evidenced by the incident of the container ship Ever Given, which ran aground in the Suez Canal on March 23, 2021. This event led to a complete blockade of the Canal that lasted for six days, leaving a fleet of ships waiting to pass through. This highlights the need to monitor the Canal to prevent similar disturbances in the future. In this paper, we propose a CNN-based attention-guided self-learning framework for ship detection from 3 m high-resolution COSMO-SkyMed SAR imagery acquired in April 2021 over the Egyptian Suez Canal. We introduce a self-learning augmented segmentation (SLAS) technique to augment the dataset with new ship samples by pseudo-labeling an unlabeled dataset. We also present the Attention-guided Feature Refinement (AFR) module to extract more significant semantic features and contextual information, especially for ships of varying sizes in SAR images. Finally, the output of the AFR module is fed into a Region Proposal Network (RPN) to generate a set of proposal anchors, which are later used in a Deep Detection Network (DDN) for ship classification and localization. Our experimental results demonstrate that the proposed method outperforms current state-of-the-art detection models in terms of detection accuracy, particularly in complex coastal scenes, reaching up to 87% mean average precision (mAP).


INTRODUCTION
The Suez Canal is a vital international sea route that has linked the Mediterranean and the Red Sea since 1869. Annually, nearly 20,000 vessels, including 2,500 tankers, cross the canal, representing 15% of the world's maritime freight traffic. However, this high volume of traffic also poses a risk of disrupting global trade in the marine environment (Lee and Wong, 2021). The grounding of the Ever Given, a large container ship passing through the southern part of the canal on its way to Rotterdam on March 23, 2021, exemplified this risk (Forti et al., 2021), as shown in Figure 1. Because the canal is one of the busiest trade routes globally, this blockage had a considerable negative impact on global trade, highlighting the need for effective monitoring and management of the channel to prevent similar disturbances in the future or, at least, to reduce marine anomalies. It is therefore crucial to detect ships in the channel to ensure the safe and smooth operation of the waterway and to manage fisheries, wrecks, and other potential risks.
In recent years, synthetic aperture radar (SAR) images have become a popular tool for ship detection, as SAR can image day and night, regardless of weather or lighting conditions (Maître, 2013; Yoshida et al., 2021). The SAR beam interacts with the physical surface of the water, allowing it to detect objects such as waves and ships that differ from the water surface. Of the various SAR systems available, the COSMO-SkyMed SAR (CSK-SAR) constellation stands out because of its high resolution and short revisit time (a few hours). This makes it particularly suitable for monitoring high-traffic areas and detecting ships within channels. However, identifying vessels in SAR images is a challenging task due to the complex marine background. Figure 2 illustrates typical challenges in detecting ships using high-resolution SAR images, including lighthouses (a), buildings (b), gantry cranes (c), side lobes (d), ghosts caused by azimuth ambiguity (e), sea clutter (f), and beach clutter (g). These objects can appear so similar to the vessel in Figure 2(h) that distinguishing them is difficult even for experts. Thus, they have to be mitigated before the data are passed to machine learning algorithms, so that they do not lead to false alarms.
Recently, researchers have increasingly employed machine learning techniques to develop methods for detecting ships in SAR images. Wu et al. (2011) presented a multi-scale processing algorithm that optimizes target pixels and suppresses speckle and background clutter. The algorithm's efficacy was tested on a COSMO-SkyMed image with a 3-meter pixel spacing, demonstrating its ability to detect ships effectively. Liu et al. (2017) introduced a Sea-Land Segmentation based CNN (SLC-CNN) that combines corner features and saliency. The model was tested on high-resolution images from the TerraSAR-X and ALOS PALSAR datasets. Huang et al. (2022) introduced WA-CNN, a novel CNN-based method for ship detection in SAR images, which employs wavelets and an attention mechanism. The U-Net architecture was utilized to develop the network, reducing depth and enhancing complexity. The approach was evaluated on two public SAR image datasets and yielded promising results. Despite these promising outcomes, detecting ships in SAR images remains a challenging task, particularly in complex coastal scenes. These approaches also encounter other challenges, such as the need for large annotated datasets and the selection of appropriate hyper-parameters.
Recent advancements in deep learning methods have significantly improved ship detection in SAR images. Wang et al. (2018) introduced an enhanced Single-Shot Detector (SSD) model that can simultaneously detect ships and estimate their orientation angles. Attention modules were incorporated to enhance the model's performance. Similarly, Miao et al. (2022) enhanced the RetinaNet model for vessel detection by introducing attention modules and adjusting aspect ratios using the k-means clustering algorithm. These modifications improved the accuracy and effectiveness of the model. Li et al. (2017) proposed an improved Faster R-CNN-based technique for ship detection, utilizing transfer learning and feature fusion. Their approach demonstrated robustness and effectiveness when evaluated on high-resolution images from various datasets. Li et al. (2022) further improved the Cascade R-CNN method by incorporating the Swin transformer, leading to enhanced performance and accuracy. Despite these advancements, challenges persist in detecting ships in complex coastal scenes, and a significant resolution gap between beach and marine scenes presents an additional obstacle in ship detection. Our experimental results demonstrate that the proposed method outperforms the latest ship detection models in terms of detection accuracy and efficiency, particularly in complex coastal scenes. This indicates the potential of our approach for real-world applications.
The remainder of this paper is organized as follows. Section 2 presents the proposed method. Section 3 describes the CSK-SAR dataset, training details, evaluation criteria, and the results obtained. Finally, Section 4 draws some conclusions.

Overall Scheme of Network Structure
The current state-of-the-art object detection models can be classified into two categories: single-stage models and two-stage models. Single-stage detectors are known for their fast inference speeds, which are particularly important in applications that require low latency, such as video detection. On the other hand, two-stage detectors are characterized by their high accuracy in locating and recognizing objects within an image.
Considering the challenges involved in detecting ships in SAR images and the significant number of false alarms generated, we opted for the two-stage Faster R-CNN model as the basic design framework for accurate ship recognition (Ren et al., 2015). The ability of Faster R-CNN to learn features at different scales is particularly important in detecting ships of varying sizes and orientations. For comparison purposes, we chose three other representative methods in our work: Cascade R-CNN (Cai and Vasconcelos, 2018), RetinaNet (Lin et al., 2017b), and SSD (Liu et al., 2016). The network structure of the proposed method is detailed in Figure 3 and can be divided into five parts: the self-learning augmentation segmentation (SLAS) part, the feature extraction network (FEN) part, the attention-guided feature refinement (AFR) part, the region proposal network (RPN) part, and the deeply detection network (DDN) part. We present a detailed explanation of each of these five parts in the following sections.

Self-Learning Augmentation Segmentation
In this section, we present self-learning augmentation segmentation (SLAS), which is inspired by Ghiasi et al. (2021) and augments the dataset with new ship samples through self-learning and iterative pseudo-segmentation. We generated a mask for each ship in the image using the Otsu method (Yu et al., 2010) on the COSMO-SkyMed SAR (CSK-SAR) dataset. This enabled us to identify all the pixels that belonged to each ship in the image. From our dataset, we selected 1,512 annotated images, some of which contained only background while others contained ships. We identified a total of 136 segmentation instances of ships in these images. To augment the data, we applied random rotations, brightness adjustments, contrast enhancements, lighting modifications, and saturation changes to the 1,512-image dataset. Furthermore, the iterative pseudo-segmentation process produced two or three additional copies of each ship instance segmentation. We added these additional copies to a new dataset in which the images contain labeled ships rather than pure background only. After this initial processing, we used the new augmented dataset during iterative pseudo-segmentation to train our proposed network.
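The mask-generation step can be illustrated with a minimal sketch of Otsu thresholding. This is not the authors' implementation: the function names `otsu_threshold` and `ship_mask` are hypothetical, and the sketch assumes the patch has already been quantized to integer gray levels in [0, 255] (the calibrated dB values used in this paper would need such a quantization first).

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    pixels: iterable of integers in [0, levels). Class 0 holds values <= t,
    class 1 holds values > t.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in class 0
    sum0 = 0  # sum of intensities in class 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break             # class 1 empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def ship_mask(patch, threshold):
    """Binary mask: 1 for bright (candidate ship) pixels, 0 for background."""
    return [[1 if v > threshold else 0 for v in row] for row in patch]


# Toy 3x4 patch (illustrative values): dark sea with a few bright pixels.
patch = [[10, 12, 11, 200],
         [9, 11, 210, 205],
         [10, 10, 12, 11]]
t = otsu_threshold(v for row in patch for v in row)
mask = ship_mask(patch, t)
```

On real SAR patches, bright land structures and clutter would also exceed the threshold, which is why the resulting masks are only candidates that the annotated bounding boxes must confirm.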

Feature Extraction Network
During the feature extraction network (FEN) phase, we slice the new augmented dataset into smaller patches of 500 × 500 pixels, filter out patches that do not contain ships, and randomly split the remaining patches into training, validation, and testing sets. These patches are then fed into a deep convolutional neural network (CNN) to obtain feature maps at various scales. To construct the FEN, we utilized the ResNet-50 (He et al., 2016) architecture as the backbone network of Faster R-CNN. This design plays a crucial role in reducing the size of the network while increasing its depth. Instead of stacking convolutional layers directly, ResNet-50 connects these layers to fit a residual mapping. Specifically, the input x is processed with two weighted convolutional layers and an activation function to yield F(x). This is then added to x to obtain H(x), which is further processed to yield the final output z, as depicted in Figure 4. As shown in Figure 3(b), the FEN consists of five hierarchical convolutional layers, C1 to C5, which generate deep feature maps from the input SAR images.
The lower layers (C1 and C2) have higher spatial resolution but contain less semantic information, while higher layers (C4 and C5) have more abstract, semantic information but lower spatial resolution. As a result, the location of the ships in the upper layers is coarser. To address this, we adopt the idea of fine-tuning attention-guided feature refinement (AFR), which integrates the feature information of all convolutional layers to make full use of semantic and spatial information.
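The residual mapping described in this section can be written compactly as follows. This is a sketch in the notation of the text; the weight symbols W_1 and W_2 and the choice of ReLU for the activation σ are assumptions taken from the standard ResNet formulation, not stated in this paper.

```latex
% Residual block: two weighted layers with an activation in between,
% a shortcut connection, and a final activation.
\begin{align}
  F(x) &= W_2\,\sigma(W_1 x), \\
  H(x) &= F(x) + x, \\
  z    &= \sigma\bigl(H(x)\bigr).
\end{align}
```

Because the shortcut passes x through unchanged, the stacked layers only need to learn the residual F(x) = H(x) - x, which is what makes deeper networks easier to optimize.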

Attention-guided Feature Refinement
In our proposed attention-guided feature refinement (AFR) fusion network, we first adopt a feature pyramid network (FPN) (Lin et al., 2017a) to generate multi-scale feature maps that capture rich semantic information at different levels. The FPN architecture involves both bottom-up and top-down processes, as depicted in Figure 3(b-c). Combining ResNet-50 with FPN makes the detector suitable for detecting inshore ships of different sizes in SAR images. Next, we introduce the AFR module, which consists of four feature fusions (F4, F3, F2, and F1), as shown in Figure 3(d). First, we up-sample the last convolutional layer C5 to the resolution of C4 and combine the two using an element-wise sum operation to obtain the feature map L4. This process is repeated down the pyramid until we obtain the finest feature map. However, since SAR images often contain complex background information, it is essential to direct the network's attention to the features that are most distinguishable for the current detection task, namely ships. To this end, we propose to use the Squeeze-Excitation module (SEM) (Hu et al., 2018) to encode the feature maps and provide a weight for each channel of the feature maps.
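The channel weighting performed by a Squeeze-Excitation module can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the network used in this paper: the function name `squeeze_excite` and the toy weight matrices are hypothetical, and bias terms are omitted for brevity.

```python
import math


def squeeze_excite(feature_map, w1, w2):
    """Apply squeeze-and-excitation channel weighting.

    feature_map: list of C channels, each an H x W list of lists.
    w1: (C/r) x C weight matrix of the first fully connected layer.
    w2: C x (C/r) weight matrix of the second fully connected layer.
    Returns the rescaled feature map and the per-channel weights.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives weights in (0, 1).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]
    # Scale: multiply every pixel of a channel by that channel's weight.
    scaled = [[[v * wt for v in row] for row in ch]
              for ch, wt in zip(feature_map, weights)]
    return scaled, weights


# Toy example: two 2x2 channels and a bottleneck of size 1.
fmap = [[[1.0, 1.0], [1.0, 1.0]],
        [[3.0, 3.0], [3.0, 3.0]]]
w1 = [[1.0, 0.0]]      # hidden unit looks only at channel 0
w2 = [[0.0], [1.0]]    # channel 0 gets logit 0, channel 1 gets logit 1
scaled, weights = squeeze_excite(fmap, w1, w2)
```

In a trained network the two weight matrices are learned, so channels that respond to ships receive weights close to 1 while channels dominated by background clutter are suppressed.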

Region Proposal Network
In this section, we present the use of a Region Proposal Network (RPN) in combination with each feature map fusion F_i to achieve high performance in the detection of ships in SAR images, as depicted in Figure 3(e). The RPN generates a set of reference boxes, or anchors, called region proposals. To cover ship targets of different sizes across the F_i layers, we used four anchor scales, Scale_i ∈ {64 × 64, 128 × 128, 256 × 256, 512 × 512}, where i ∈ {1, 2, 3, 4}. In addition, we used three aspect ratios, {1:1, 1:2, 2:1}, resulting in a total of 12 anchors (4 scales × 3 aspect ratios). To generate fixed-dimension 7 × 7 Region of Interest (RoI) features from the anchor proposals, we adopted a RoI pooling layer. Subsequently, the anchors were sent to both the ship classification layer (CL) and the box regression layer (BL) through two convolutional layers. The CL has 2K outputs that estimate the object probability for each proposal, while the BL has 4K outputs that encode the box coordinates (where K = 12).
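The anchor set described above can be enumerated with a short sketch. The function name `make_anchors` is hypothetical, and two details are assumptions not stated in the text: the aspect ratios are read as height:width, and each anchor keeps the area of its square scale.

```python
import math

SCALES = [64, 128, 256, 512]              # anchor side lengths in pixels
ASPECT_RATIOS = [(1, 1), (1, 2), (2, 1)]  # assumed to mean height:width


def make_anchors(cx, cy):
    """Return the anchors centred on (cx, cy) as (x, y, w, h) tuples,
    where (x, y) is the top-left corner."""
    anchors = []
    for s in SCALES:
        area = s * s
        for rh, rw in ASPECT_RATIOS:
            # Keep the area fixed while enforcing w / h == rw / rh.
            w = math.sqrt(area * rw / rh)
            h = area / w
            anchors.append((cx - w / 2, cy - h / 2, w, h))
    return anchors


anchors = make_anchors(250.0, 250.0)
K = len(anchors)          # 12 anchors
cls_outputs = 2 * K       # ship / background score per anchor
box_outputs = 4 * K       # four box coordinates per anchor
```

With K = 12, the classification layer therefore produces 24 values and the regression layer 48 values for this anchor set.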

Deeply Detection Network
In Figure 3(f), the deeply detection network (DDN) represents the final stage of our proposed framework. This stage takes as inputs the enhanced features extracted by the AFR and the proposal anchors generated by the RPN. To enhance the semantic information pertaining to small ships, we combine the features generated by the RoI pooling layer with the AFR-enhanced features. The output is then fed to the fully connected layers with the sigmoid activation function to obtain the final detection result. After the fully connected layers, the network splits into two parallel branches: the classification layer (CL) branch and the bounding box regression layer (BL) branch. The CL predicts a confidence score for the prior anchors at each pixel of the feature maps at different scales. This confidence is the probability that the anchor belongs to the ship class. Meanwhile, the BL outputs the coordinate offsets between an anchor assigned to a ship and the ground-truth bounding box of that ship. In our work, we adopt a double loss from the CL and the BL for each layer, as described in Equation 1. Specifically, the CL applies the sigmoid activation function to obtain the detection probability P, which is compared with the ground truth Y to calculate the classification loss L_Cls. We use λ as the balancing parameter, and the condition Y ≥ 1 restricts the bounding box regression loss L_Bbr to positive anchors, since the background is meaningless for training the BL. In L_Bbr, we predict four offsets for anchor box i, A_i = (a_i^x, a_i^y, a_i^w, a_i^h), where a_i^x and a_i^y are the top-left coordinates of the predicted area, and a_i^w and a_i^h are its width and height. If the predicted area of anchor box Â has the highest IoU with the ground truth box A*, we assign it a positive label Y_i ≥ 1.
Conversely, if the IoU of the predicted box is less than 0.3 for all ground truth boxes, we assign it a negative label Y_i = 0; the remaining regions are ignored. The IoU is defined as in Equation 2:

    IoU = area(A ∩ A*) / area(A ∪ A*)    (2)

where area(A ∪ A*) is the area of the union of the predicted box and the ground truth box, and area(A ∩ A*) is the area of their intersection.
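The IoU computation and the labeling rule can be sketched as follows, assuming boxes in the (x, y, w, h) top-left format used in this section. The function names `iou` and `assign_labels` are hypothetical, and the value -1 used for ignored anchors is a convention chosen here, not one stated in the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h),
    with (x, y) the top-left corner."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def assign_labels(anchors, gt_boxes, neg_thresh=0.3):
    """Label anchors: 1 = ship, 0 = background, -1 = ignored."""
    labels = [-1] * len(anchors)
    table = [[iou(a, g) for g in gt_boxes] for a in anchors]
    # Negative: IoU below the threshold for every ground-truth box.
    for i, row in enumerate(table):
        if not row or max(row) < neg_thresh:
            labels[i] = 0
    # Positive: the anchor with the highest IoU for each ground-truth box.
    for j in range(len(gt_boxes)):
        best = max(range(len(anchors)), key=lambda i: table[i][j])
        if table[best][j] > 0:
            labels[best] = 1
    return labels


# Toy example: one ground-truth ship and three anchors.
gt = [(0, 0, 10, 10)]
anchors = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]
labels = assign_labels(anchors, gt)
```

In this toy case the first anchor matches the ship exactly, the second overlaps it with an IoU of 1/3 and is therefore ignored, and the third does not overlap and becomes background.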

Datasets
In this study, X-band SAR images from the COSMO-SkyMed system were utilized. COSMO-SkyMed is a constellation of four SAR satellites developed by Agenzia Spaziale Italiana (ASI), which can revisit the same location on Earth within a single day. Two scenes of COSMO-SkyMed StripMap SAR (CSK-SAR) data with horizontal-horizontal (HH) and vertical-vertical (VV) polarizations were used in this research.
The datasets used were level 1A (L1A) HIMAGE Single-look Complex Slant (SCS) products, provided by ASI as part of the COSMO-SkyMed project (Open Call Id 797). Table 1 presents detailed parameters of the test image datasets, while Figure 5(e) shows the test images acquired from the COSMO-SkyMed satellite. The original CSK-SAR datasets were calibrated, converted to floating-point numbers in dB, and exported in Tagged Image File Format (TIFF), with each image containing 18000 × 15000 pixels. The entire image was divided into 2160 sub-images of 500 × 500 pixels. Of these sub-images, 1512 were used as the training set, and 648 were used as the testing set. The test images were further divided into 486 offshore sub-images (test offshore) and 162 inshore sub-images (test inshore). The annotated dataset was created through a manual annotation process using the LabelMe Toolbox.
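The sub-image counts above can be checked with a short sketch. The helper `tile_origins` is hypothetical, and the sketch assumes non-overlapping tiles and that the figure of 2160 sub-images is the total over the two 18000 × 15000 scenes (a single scene yields 1080 tiles); this reading is an inference, not stated explicitly in the text.

```python
def tile_origins(width, height, tile):
    """Yield the top-left (x, y) corner of every non-overlapping tile."""
    for y in range(0, height - tile + 1, tile):
        for x in range(0, width - tile + 1, tile):
            yield (x, y)


per_scene = len(list(tile_origins(18000, 15000, 500)))  # 36 * 30 tiles
total = 2 * per_scene                                    # two scenes assumed

train, test = 1512, 648
test_offshore, test_inshore = 486, 162

train_fraction = train / total               # share of tiles used for training
offshore_fraction = test_offshore / test     # share of offshore test tiles
```

Under these assumptions the split corresponds to 70% training and 30% testing tiles, with the test set divided 75% offshore and 25% inshore.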

Training Details
To implement our proposed method, we utilized Detectron2, a flexible and powerful toolkit for re-implementing existing object detection models (Wu et al., 2019). We used four open-source deep learning detectors, namely Faster R-CNN, Cascade R-CNN, SSD, and RetinaNet, and employed ResNet-50+FPN pre-trained on the COCO dataset as the backbone for these models. To improve the accuracy of location and segmentation, we resized all samples of CSK-SAR imagery to 512 × 512 pixels for both network training and testing. We trained all detectors on GPU for 52 epochs using stochastic gradient descent (SGD) as the optimizer. The momentum, weight decay, learning rate, and batch size were set to 0.9, 0.0001, 0.02, and 4, respectively. We performed all experiments on our dataset using a virtual machine desktop with a 64-bit Windows 10 Pro operating system. The software configuration included the Python programming language, PyTorch 1.6.0, CUDA 10.1, and cuDNN 7.6.1. Additionally, the hardware comprised an NVIDIA GRID RTX8000-8Q with 8 GB memory, an Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00 GHz, and 32.0 GB RAM.
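Because Detectron2 schedules training by iterations rather than epochs, the settings above have to be converted. The sketch below collects the stated hyper-parameters in a plain dictionary whose keys mirror Detectron2's `SOLVER` configuration names, and derives the iteration count; it assumes the 1512 training sub-images are used as-is, so the SLAS-augmented set would give a larger number.

```python
import math

TRAIN_IMAGES = 1512   # training sub-images reported in the Datasets section
EPOCHS = 52

# Plain dictionary; key names follow Detectron2's cfg.SOLVER fields.
solver = {
    "BASE_LR": 0.02,
    "MOMENTUM": 0.9,
    "WEIGHT_DECAY": 0.0001,
    "IMS_PER_BATCH": 4,
}


def epochs_to_iterations(num_images, epochs, batch_size):
    """Convert an epoch budget into a number of SGD iterations."""
    iters_per_epoch = math.ceil(num_images / batch_size)
    return iters_per_epoch * epochs


solver["MAX_ITER"] = epochs_to_iterations(
    TRAIN_IMAGES, EPOCHS, solver["IMS_PER_BATCH"])
```

With a batch size of 4, one epoch corresponds to 378 iterations, so 52 epochs amount to 19,656 iterations under this assumption.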

Evaluation Indices
To evaluate the performance of the proposed model against the state-of-the-art models, the following evaluation metrics were considered for the testing set: Detection Probability (Pd), False Alarm (Pf), Missed Detection (Pm), Recall, Precision, Mean Average Precision (mAP), F1-score, True Positive (TP), Ground Truth (GT), False Positive (FP), and False Negative (FN).
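The count-based metrics can be derived from TP, FP, and FN as in the sketch below. The paper does not give explicit formulas at this point, so the definitions of Pd, Pf, and Pm used here (Pd = TP/GT, Pf = FP/(TP+FP), Pm = FN/GT, with GT = TP + FN) are assumptions based on common usage in SAR ship detection, and the function name `detection_metrics` is hypothetical.

```python
def detection_metrics(tp, fp, fn):
    """Compute count-based detection metrics from TP, FP, and FN."""
    gt = tp + fn                      # number of ground-truth ships
    detections = tp + fp              # number of predicted boxes
    precision = tp / detections if detections else 0.0
    recall = tp / gt if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "GT": gt,
        "Pd": recall,                                    # detection probability
        "Pf": fp / detections if detections else 0.0,    # false alarm rate
        "Pm": fn / gt if gt else 0.0,                    # missed detection rate
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


# Illustrative counts only, not results from this paper.
m = detection_metrics(tp=80, fp=20, fn=20)
```

Under these definitions Pd equals recall and Pf equals one minus precision, so the two groups of metrics convey the same information from different viewpoints.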

Results and Analysis
Table 2 presents the ship detection statistics for inshore and offshore scenes in terms of bounding box AP on the testing set of the CSK-SAR datasets. The AP50 of the state-of-the-art detectors was above 82% on offshore scenes, whereas on inshore scenes it was above 75%. Hence, the detection accuracy drops by about 7 percentage points on inshore scenes. Moreover, Faster R-CNN with SLAS achieves a higher bounding box AP than the state-of-the-art detectors.
From Table 3, one can observe that on the combined offshore and inshore scenes, SLAS improved SSD, RetinaNet, Cascade R-CNN, and Faster R-CNN by 5%, 3%, 4%, and 3% mAP, respectively. For validation and testing inference, the precision-recall (PR) curve and the IoU between predictions and ground truth, obtained on inshore and offshore scenes using Faster R-CNN with SLAS, are shown in Figure 7(a-b). The learning curves for training, namely the validation loss and validation mAP, appear in Figure 7(c-d). Figure 8 displays the outcomes on a small subset of the CSK-SAR dataset used to evaluate the detection capability of our model. When the proposed SLAS is combined with ResNet50+FPN and Faster R-CNN, the results show commendable ship detection performance. To study the ability of the detectors to handle complex CSK-SAR scenes intuitively, we selected four representative scenes from the test set and used Faster R-CNN with SLAS to detect ships. At the same time, we used auxiliary data, such as optical remote sensing images from Google Earth with the same imaging date as the CSK-SAR images, to determine the potential interference between targets with similar features in the Canal. The visual comparison between predictions and ground truth on the test set without/with back-scattering appears in Figure 6.
The number on each predicted bounding box represents the confidence of the detection; detections with a confidence below 0.7 were filtered out. As a result, most small and large ships were detected with correct bounding boxes in the offshore scenes.

CONCLUSIONS
In this paper, we propose an improved attention-guided self-learning Faster R-CNN-based framework for detecting ships inshore and offshore from COSMO-SkyMed SAR datasets across the Egyptian Suez Canal. Our experiments demonstrate that the proposed Self-Learning Augmented Segmentation (SLAS) technique effectively augments the dataset by placing pseudo-labeled ships on an unlabeled dataset, thereby significantly improving model performance. Additionally, the proposed Attention-guided Feature Refinement (AFR) module enables convolutional layers to extract more meaningful semantic features, particularly around coastal ships of varying sizes in SAR images, by leveraging global contextual information. This feature refinement enhances the model's detection ability. Our model outperforms the competing detectors in bounding box AP and demonstrates superior ship detection ability from SAR images, as evidenced by our experiments. However, the current model has limitations: it mistakenly identifies some non-ship targets at sea, and it marks ship targets with horizontal rectangular bounding boxes, which results in poor performance when ships are in close proximity inshore. To improve the accuracy of ship detection and reduce the performance gap between inshore and offshore scenes, future research should consider land and sea segmentation before detection, along with the use of more diverse SAR datasets.

Figure 1. A view captured by a Sentinel-1 satellite on March 25, 2021, shows a fleet of ships at a standstill, waiting to pass through the southern entrance of the Suez Canal. The magnified COSMO-SkyMed SAR and high-resolution optical images show the Ever Given container ship still stranded in the canal's narrow passage. Credits: Copernicus Sentinel-1 data © ESA, CC BY-SA 3.0 IGO; COSMO-SkyMed image © ASI, processed and distributed by e-GEOS; satellite image © 2021 Maxar Technologies.
This paper proposes a novel framework for ship detection from high-resolution COSMO-SkyMed SAR (CSK-SAR) datasets acquired in April 2021 over the Egyptian Suez Canal. Our approach employs the open-source Faster R-CNN model as the primary attention-guided self-learning design framework. Specifically, we propose a Self-Learning Augmented Segmentation (SLAS) technique to augment the dataset with new ship samples by pseudo-labeling an unlabeled dataset. Furthermore, we propose an Attention-guided Feature Refinement (AFR) module to enable the convolutional layers to extract more meaningful semantic features, especially about coastal ships of different sizes in SAR images, by leveraging global contextual information. The output of the AFR module is then fed into a Region Proposal Network (RPN) to generate a set of proposal anchors that are later utilized in a Deeply Detection Network (DDN) for classification and localization.

Figure 3. Overview of our proposed network for ship detection from SAR images.

Figure 4. The shortcut connection of ResNet.

Figure 5. An example of a large-scale image and sub-images from COSMO-SkyMed SAR imagery with 3000 × 3000 pixels. Sub-figure (a) shows the SAR imagery acquisition area of our dataset and the ships distributed in the Suez Canal. Sub-figures (b) and (c) show inshore scenes, and sub-figures (d) and (e) show offshore scenes with multiple ships. COSMO-SkyMed® Products © ASI - Italian Space Agency - 2021. All rights reserved.

Figure 6. Comparison between predictions and ground truth on test set without/with back-scattering.

Figure 7. Learning curves for training and inference results for testing: (a) the detection PR curve; (b) the IoU between predictions and ground truth; (c) the validation loss curve; (d) the validation mAP.

Figure 8. The outcomes of a small subset of the CSK-SAR dataset by Faster R-CNN+ResNet50+FPN+SLAS.

Table 1. Detailed image parameters of the X-band SAR images from COSMO-SkyMed SAR datasets.