FUSING SENTINEL-1 AND SENTINEL-2 IMAGES WITH TRANSFORMER-BASED NETWORK FOR DEFORESTATION DETECTION IN THE BRAZILIAN AMAZON UNDER DIVERSE CLOUD CONDITIONS

ABSTRACT: Deforestation is an environmental problem that significantly impacts biodiversity and climate change. Deforestation detection is usually performed using optical remote sensing images, limiting the detection capability to the dry season, when images are not obscured by clouds. In this work, we propose Transformer-based models that fuse bitemporal Sentinel-1 and Sentinel-2 images to identify new deforestation areas in the Brazilian Amazon under diverse cloud conditions. The models were evaluated under clear and cloud-covered pixel conditions. The results confirm previous works in which the fusion of optical and SAR images improved deforestation detection capabilities. We also conclude that the best Transformer-based network reached an F1-score of 0.92, considering all pixels, outperforming the best Convolution-based network, which reached an F1-score of 0.86, without increasing the training and prediction times.


INTRODUCTION
The Brazilian Amazon forest is the largest rainforest on Earth, playing a significant role in regulating the planet's climate and providing habitat for millions of species (Strand et al., 2018). However, deforestation severely threatens the forest, contributing to climate change, biodiversity loss, and social conflicts (Baccini et al., 2017). In 2021, for example, 13,235 km² of the Brazilian Amazon forest were lost (INPE, 2022).
To monitor deforestation, the Brazilian Institute for Space Research (INPE) developed the Program for the Estimation of Deforestation in the Brazilian Amazon (PRODES) in the 1980s. This program uses satellite images to monitor clear-cutting deforestation in the Brazilian Amazon. PRODES relies on optical satellite imagery to track changes in forest cover, providing accurate and reliable data on deforestation rates and trends. However, it still requires a significant amount of visual interpretation, with undesirable implications in terms of time and cost. As a solution, semi-automatic approaches have been explored to keep accuracy loss to a minimum (INPE, 2022).
Deep Learning (DL) algorithms have become a valuable tool for detecting deforestation from satellite images and analyzing changes in forest cover by comparing multitemporal satellite images. Usually, the approach involves convolutional neural networks (CNNs). Fully convolutional networks (FCNs) have achieved state-of-the-art results in various remote sensing tasks (Ma et al., 2019), including deforestation detection (Adarme et al., 2020). A typical FCN architecture consists of an encoder module for feature extraction and a decoder module that delivers a prediction at the input image resolution. Many FCN variants have been applied for deforestation detection from optical images (Zhang et al., 2018; Torres et al., 2021; de Bem et al., 2020) and combined optical and SAR data (Ortega et al., 2021; Ferrari et al., 2023).
Recently, an alternative approach called vision transformers (ViTs), or simply transformers, emerged as a substitute for FCNs in dense image prediction tasks (Dosovitskiy et al., 2020). The primary drawback of FCNs is related to their core operation, convolution, which confines the computation of a new pixel representation to a narrow spatial context. Conversely, ViTs bypass convolution and can capture the global context of the input. In fact, over the last years, many transformer architectures have outperformed FCNs in dense prediction tasks like semantic segmentation (Liu et al., 2021; Cao et al., 2021).
The primary data sources for monitoring deforestation in tropical areas are images acquired by optical sensors. These images allow for the easy recognition of changes in forest cover resulting from clear-cutting through visual interpretation elements such as tone, form, color, and texture. Many Earth observation satellites also carry optical systems, improving data availability (Belward and Skøien, 2015). However, a significant limitation of optical images is the lack of information in areas covered by clouds, which is very common in tropical environments (Asner, 2001).
When cloud-free optical images are unavailable, synthetic aperture radar (SAR) emerges as an alternative (Silva et al., 2022; Doblas et al., 2020; Bouvet et al., 2018). SAR is sensitive to surface structural properties such as roughness and moisture content, which may aid deforestation monitoring. However, classification results obtained from cloud-free optical images usually outperform those achieved solely with SAR data (e.g., Ortega et al. (2021)).
There are three primary categories of optical-SAR fusion strategies. The first category, early fusion, involves stacking multispectral and SAR images into a single tensor to feed the network. The second category, joint fusion, combines the feature maps produced by two encoders, one for optical images and the other for SAR images. Lastly, the late fusion strategy concatenates the decoder outputs of encoder-decoder networks trained with optical and SAR images (Stahlschmidt et al., 2022).
Earlier studies have demonstrated the effectiveness of optical-SAR fusion with FCNs, utilizing cloud-free images obtained under the same atmospheric conditions (e.g., Rosa et al. (2021); Ebel et al. (2021); Benedetti et al. (2018); Li et al. (2022)). The emphasis of this study is on a scenario where the optical images can be partly covered by clouds. The three strategies to fuse optical and SAR images to identify new deforestation areas under diverse cloud conditions have already been evaluated, with the late fusion strategy reaching the best F1-score of 0.69 (Ferrari et al., 2023).
In this work, we propose and assess techniques for fusing Sentinel-1 and Sentinel-2 images, whether impacted by cloud cover or not, to identify instances of clear-cut deforestation in the Brazilian Amazon using Transformer-based networks. The remainder of this paper is organized as follows. Section 2 presents the material and methods, describing the dataset, the proposed fusion strategies, and the experimental protocol. Section 3 shows and discusses the results. Finally, Section 4 summarizes the study's findings.

PROPOSED APPROACHES
All proposed Transformer-based models are variants of the SwinUnet (Cao et al., 2021) architecture, while the Convolution-based models are variants of the ResUnet (Zhang et al., 2018). Both architectures are organized into Encoder, Decoder, and Classifier blocks, similar to Ferrari et al. (2023), as described in Figure 1.
Henceforth, we use the term "Data" to refer to the pair of coregistered images from the same sensor modality, taken in two consecutive years, concatenated along the third dimension with the Previous Deforestation Map (refer to Section 3.2).

Single-modality models
Single-modality models use only one type of data (either optical or SAR) for model training and prediction. These models are organized as shown in Table 1 and were used to assess the models' performance without the fusion strategy, using Convolution- and Transformer-based architectures.

Multi-modality models
We considered three strategies for optical and SAR data fusion: early fusion, joint fusion, and late fusion. Figure 3 presents how the Encoder, Decoder, and Classifier blocks are organized for all multi-modality models. In the early fusion strategy (Figure 3a), the optical and SAR data are concatenated along the third dimension before the encoder. The encoder, decoder, and classifier setup is the same as in the single-modality models. Just one previous deforestation map was kept, because its information is independent of the optical or SAR images. In the joint fusion strategy (Figure 3b), the model architecture has two independent encoder blocks. The encoder outputs (including the skip connections) are concatenated before entering the decoder. In the late fusion strategy (Figure 3c), each modality has an independent encoder and decoder. The decoder outputs are concatenated before entering the classifier block.
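The three fusion strategies can be sketched in PyTorch as follows. The simple convolutional blocks and channel widths below are illustrative stand-ins, not the actual SwinUnet/ResUnet stages used in the paper; skip connections are omitted for brevity.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    # Illustrative stand-in for the Encoder/Decoder blocks
    # (the real models use SwinUnet / ResUnet stages).
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())

class EarlyFusion(nn.Module):
    # Optical and SAR data are concatenated along the channel (third)
    # dimension before a single encoder-decoder-classifier pipeline.
    def __init__(self, c_opt, c_sar, n_classes=2):
        super().__init__()
        self.encoder, self.decoder = block(c_opt + c_sar, 64), block(64, 32)
        self.classifier = nn.Conv2d(32, n_classes, 1)

    def forward(self, opt, sar):
        x = torch.cat([opt, sar], dim=1)
        return self.classifier(self.decoder(self.encoder(x)))

class JointFusion(nn.Module):
    # Two independent encoders; their outputs are concatenated
    # before a shared decoder.
    def __init__(self, c_opt, c_sar, n_classes=2):
        super().__init__()
        self.enc_opt, self.enc_sar = block(c_opt, 64), block(c_sar, 64)
        self.decoder = block(128, 32)
        self.classifier = nn.Conv2d(32, n_classes, 1)

    def forward(self, opt, sar):
        x = torch.cat([self.enc_opt(opt), self.enc_sar(sar)], dim=1)
        return self.classifier(self.decoder(x))

class LateFusion(nn.Module):
    # Independent encoder-decoder per modality; the decoder outputs
    # are concatenated before the classifier block.
    def __init__(self, c_opt, c_sar, n_classes=2):
        super().__init__()
        self.opt = nn.Sequential(block(c_opt, 64), block(64, 32))
        self.sar = nn.Sequential(block(c_sar, 64), block(64, 32))
        self.classifier = nn.Conv2d(64, n_classes, 1)

    def forward(self, opt, sar):
        return self.classifier(torch.cat([self.opt(opt), self.sar(sar)], dim=1))
```

All three variants accept the two modality tensors separately; the strategy only changes where the concatenation happens in the pipeline.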

Study area and satellite images
We used Sentinel-1 and Sentinel-2 images acquired between 2018 and 2020 over an area of approximately 12,500 km² located in the southeastern portion of Amazonas State, Brazil (Figure 4). We chose this area because of the high deforestation rates and the availability of optical images with diverse cloud coverage conditions. We downloaded all data from Google Earth Engine (GEE). The GEE Sentinel-1 data includes Level-1 Ground Range Detected (GRD) images with VV and VH polarizations. The images are processed by GEE using the Sentinel-1 Toolbox following the steps: thermal noise removal, radiometric calibration, and terrain correction using SRTM 30. The GEE

Data
As reference data, we used the deforestation polygons produced by PRODES, which can be accessed through the TerraBrasilis platform (Assis et al., 2019). PRODES employs trained photointerpreters to manually identify the annual increment of the deforested areas on optical satellite images. Only deforestation areas larger than 6.25 ha where the primary forest was entirely removed are outlined. For visual interpretation, the specialists select images with reduced cloud cover (usually cloud-free) acquired within the dry season, July to September (INPE, 2022).
We derived the pixel-wise labels for the whole study area for each pair of years (Y0-Y1) from the PRODES deforestation polygons. We labeled the image pixels as No Deforestation, Deforestation, or Background. Pixels in which no deforestation was detected by PRODES until Y1 were classified as No Deforestation. Pixels in which new deforestation areas were identified in Y1 were classified as Deforestation. Pixels in which new deforestation areas were identified before Y0 were classified as Background. Pixels classified by PRODES as "Hydrography" or "No Forest" (or other classes in which deforestation cannot occur) were also classified as Background. Due to the difference in spatial resolution between the images used to generate the PRODES polygons (Landsat, 30 m) and those used in this work (Sentinel, 10 m), a 3-pixel buffer was applied to the Deforestation polygons.
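A possible derivation of these labels, assuming a per-pixel raster `def_year` holding the PRODES deforestation year (0 where no deforestation is recorded), is sketched below. Sending the 3-pixel buffer ring to Background (so the resolution-mismatch zone is ignored) is an assumption of this sketch, not a detail stated in the text.

```python
import numpy as np
from scipy import ndimage

NO_DEF, DEF, BG = 0, 1, 2

def make_labels(def_year, y0, y1, invalid_mask):
    """def_year: per-pixel year of PRODES deforestation (0 = no record).
    invalid_mask: True where PRODES marks Hydrography / No Forest etc."""
    labels = np.full(def_year.shape, NO_DEF, dtype=np.uint8)
    labels[(def_year > 0) & (def_year <= y0)] = BG  # old deforestation
    labels[def_year == y1] = DEF                    # new increment in Y1
    labels[invalid_mask] = BG
    # 3-pixel buffer around Deforestation polygons to absorb the
    # Landsat (30 m) vs. Sentinel (10 m) resolution mismatch; here the
    # ring is sent to Background (assumption of this sketch).
    ring = ndimage.binary_dilation(labels == DEF, iterations=3) & (labels == NO_DEF)
    labels[ring] = BG
    return labels
```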
PRODES usually employs one Landsat image per year to detect the deforestation increment areas. Henceforth, the date this image was taken will be called the PRODES reference date.
Within a ±15-day window around each PRODES reference date, we selected three optical images with diverse cloud conditions and three SAR images. Figure 5 shows how we chose the images for this work, where R_i denotes the PRODES reference date of year i, and I^j_i denotes the j-th optical-SAR image pair from Sentinel-2 and Sentinel-1, respectively. The images from the first two years (Y0 = 2018 and Y1 = 2019) were used for training, while those from the last two years (Y0 = 2019 and Y1 = 2020) were used for testing. Past deforestation is already known, and this knowledge can aid in identifying future deforestation areas. One way to represent this prior information is through a temporal distance map, called the previous deforestation map, in which pixel values vary from 0 to 1. If no previous deforestation occurred in a pixel, its value is 0. However, if a pixel represents a previously deforested area, its value is based on the year of the deforestation identification. For example, if deforestation occurred in year Y0, the pixel value in the previous deforestation map is 1. This value decreases linearly as the year of deforestation moves away from Y0 towards the past.
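The previous deforestation map can be computed as below. The text does not state how many years back the linear decay reaches 0, so the `span` parameter here is an assumed placeholder.

```python
import numpy as np

def previous_deforestation_map(def_year, y0, span=10):
    """Temporal-distance map in [0, 1]. Pixels deforested in year y0 get 1;
    the value decays linearly toward the past, reaching 0 for deforestation
    `span` or more years before y0 (the 10-year span is an assumption).
    Never-deforested pixels (def_year == 0) stay at 0."""
    m = np.zeros(def_year.shape, dtype=np.float32)
    past = (def_year > 0) & (def_year <= y0)
    m[past] = np.clip(1.0 - (y0 - def_year[past]) / span, 0.0, 1.0)
    return m
```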
We estimated the cloud coverage probability of each pixel in every optical image using the Sen2Cor algorithm (Louis et al., 2016). This method generates a cloud coverage probability map with values ranging from 0 (no cloud) to 1 (cloud). For each pair of optical images used for testing (Y0 = 2019 and Y1 = 2020), we took the maximum value between the respective cloud coverage probability maps for each pixel, resulting in the Maximum Cloud Probability (MCP). MCP values are employed to classify how each pixel is affected by clouds.
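In code, the MCP reduces to a per-pixel maximum over the two dates; the 0.5 threshold used later for the Cloudy/Cloud-Free split is included here for illustration.

```python
import numpy as np

def max_cloud_probability(prob_y0, prob_y1, threshold=0.5):
    """Per-pixel Maximum Cloud Probability from the Sen2Cor cloud maps of
    the two dates; pixels with MCP >= threshold are treated as Cloudy."""
    mcp = np.maximum(prob_y0, prob_y1)
    return mcp, mcp >= threshold
```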

Experimental protocol
All procedures in this work were conducted on an Intel Core i9-10900F processor with 128 GB of RAM and a GeForce RTX 3090 GPU with 24 GB of dedicated memory, using the Python language and the PyTorch library.

Training protocol
The entire study area was divided into training and validation tiles, as shown in Figure 6. Subsequently, we extracted patches of 128 × 128 pixels from these tiles, with a 70% overlap. To reduce the class imbalance inherent to the deforestation detection task, we classified the generated patches based on the presence of pixels belonging to the Deforestation class. Patches with at least 2% of Deforestation pixels were regarded as Deforestation Presence patches, while the rest were regarded as Deforestation Absence patches. To train and validate the models, we took the same number of Deforestation Absence as Deforestation Presence patches, randomly discarding the excess Deforestation Absence patches.
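The patch extraction and balancing steps above can be sketched as follows; `labels` is the pixel-wise label raster, with the Deforestation class encoded as 1 (an assumption of this sketch).

```python
import numpy as np

def extract_balanced_patches(labels, patch=128, overlap=0.70, min_def=0.02, seed=0):
    """Slide a patch x patch window with the given overlap; patches with at
    least min_def Deforestation pixels (class 1) are Presence, the rest
    Absence; randomly keep as many Absence as Presence patches."""
    stride = max(1, int(patch * (1 - overlap)))
    presence, absence = [], []
    h, w = labels.shape
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            frac = np.mean(labels[i:i + patch, j:j + patch] == 1)
            (presence if frac >= min_def else absence).append((i, j))
    rng = np.random.default_rng(seed)
    keep = rng.permutation(len(absence))[:len(presence)]
    return presence + [absence[k] for k in keep]
```

The function returns top-left patch coordinates, so the balanced set can be materialized lazily during training.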
We trained five models from scratch for each architecture described in Tables 1 and 2. The 9 possible combinations between {I^0_0, I^1_0, I^2_0} and {I^0_1, I^1_1, I^2_1} were utilized to extract the patches, increasing the diversity of the images seen by the models during training. We chose the Adam optimizer with a learning rate of 5 × 10^-5. The loss function was the weighted cross entropy, with the weights shown in Table 3. Each model was trained until the validation loss stopped improving for 10 epochs. We performed the prediction for the 9 possible combinations of the testing images: {I^0_1, I^1_1, I^2_1} and {I^0_2, I^1_2, I^2_2}. To minimize the patch effect on the outcome, the prediction was generated multiple times for each model, with different overlaps (0.15, 0.2, 0.25, 0.3) between the patches, and an 8-pixel border on all sides was discarded. The final architecture's probability prediction was the average over all models and overlaps. We adopted 0.5 as the probability threshold for assigning a pixel to one of the two classes.
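The training protocol above can be sketched as a minimal loop; the class weights below are placeholders (Table 3 is not reproduced here), with the Background class weighted 0 so it does not contribute to the loss.

```python
import torch
import torch.nn as nn

# Placeholder class weights (the actual values are in Table 3); index 2
# (Background) gets weight 0 so those pixels do not contribute.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.2, 0.8, 0.0]))

def train(model, train_loader, val_loader, patience=10, lr=5e-5, max_epochs=500):
    """Adam at 5e-5 with early stopping once the validation loss has not
    improved for `patience` consecutive epochs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(x), y).item() for x, y in val_loader)
        if val < best:
            best, stale = val, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return model
```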

Evaluation protocol
To evaluate each architecture under diverse cloud presence conditions, the MCP was estimated from the images {I^0_1, I^1_1, I^2_1} and {I^0_2, I^1_2, I^2_2}. If the MCP of a pixel was ≥ 0.5, it was classified as a Cloudy Pixel, because its optical observations are likely affected by clouds in Y1 or Y2. The remaining pixels were classified as Cloud-Free Pixels.
We evaluated the precision, recall, and F1-score of each architecture's classified prediction, discarding all Deforestation predictions with areas smaller than 6.25 ha to be compatible with the label criteria generated from the PRODES data. All pixels belonging to the Background class were also discarded from the accuracy computation. The number of trainable parameters of the Convolution-based architectures was lower than that of the Transformer-based ones. However, this difference did not manifest in the training and prediction times, which we expected to be much higher. This result indicates that, despite the higher number of parameters of the Transformer-based architectures, these networks remain viable for employment over large areas such as the Brazilian Amazon forest. Figure 7d presents the accuracy metrics for each architecture.
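The small-area filtering can be implemented with connected-component labeling; at a 10 m pixel size, 6.25 ha corresponds to 625 pixels.

```python
import numpy as np
from scipy import ndimage

def filter_small_areas(pred, min_area_ha=6.25, pixel_size_m=10):
    """Remove connected Deforestation components (class 1) smaller than
    min_area_ha; 6.25 ha at 10 m pixels is 625 pixels."""
    min_pixels = int(min_area_ha * 1e4 / pixel_size_m ** 2)
    lab, n = ndimage.label(pred == 1)
    sizes = ndimage.sum(pred == 1, lab, range(1, n + 1))
    small = np.isin(lab, np.where(sizes < min_pixels)[0] + 1)
    out = pred.copy()
    out[small] = 0
    return out
```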

Results and Discussion
The results are organized based on the cloud-coverage classification, which labeled each pixel as a Cloudy Pixel or a Cloud-Free Pixel, depending on whether the pixel is affected by clouds or not. The results over the set of all pixels are also presented as All Pixels.
The Convolution-based models employing optical data (CNN-OPT) presented the expected behavior, with poorer results for Cloudy Pixels compared to Cloud-Free Pixels across all evaluated metrics. However, the Transformer-based optical model (TRA-OPT), which produced fewer positive predictions when affected by clouds, reached slightly better precision for Cloudy Pixels, although its recall was strongly affected. The error maps corroborate the metrics' results, especially for the models with low recall but high precision, showing that these models predicted fewer deforestation areas in the presence of clouds.

CONCLUSION
We investigated new fusion models, replacing Convolution-based with Transformer-based models, under diverse cloud coverage conditions.
In general, we expected cloud presence to decrease the quality of the optical models' predictions. However, the TRA-OPT precision was higher for Cloudy Pixels than for Cloud-Free Pixels. This behavior may result from the few positive predictions generated by TRA-OPT in locations affected by clouds and should be investigated further in the future.
Despite the increase in the number of parameters, the Transformer-based models did not take more time to train or to generate predictions than the Convolution-based ones. Very high training times, and especially prediction times, could hamper the use of such models.
The results of this work indicate that using the transformer operation in substitution of convolution may improve the deforestation detection capability of the models, regardless of the cloud coverage condition. Investigations using other architectures are required to validate these findings.
Further investigations into the role of the information source (optical and SAR) in the fusion models may clarify its behavior in diverse cloud conditions.
The Transformer-based joint fusion strategy presented the best F1-score, emerging as a suitable candidate for systems that detect new deforestation areas using optical and SAR images outside the dry season, increasing their deforestation detection capability.

Figure 1 .
Figure 1. Base architectures: Transformer-based (a) and Convolution-based (e) architectures organized into Encoder, Decoder, and Classifier blocks, delimited by the dashed boxes.

Figure 2
Figure 2 presents how the Encoder, Decoder, and Classifier blocks are organized for all single-modality optical (Figure 2a) and SAR (Figure 2b) models.

Figure 5 .
Figure 5. Reference dates and the selected images.
The Transformer-based and Convolution-based architectures have arbitrary values C and D, respectively, which affect the number of trainable parameters. Our study set C = 96 and D = 32, following Cao et al. (2021) and Ferrari et al. (2023), respectively.

Figure 6 .
Figure 6. Distribution of the training (blue) and validation (green) tiles over the study area.

Figure 7 .
Figure 7. Model parameters (a), training (b) and prediction (c) times, and the respective metric results (d).

Figures 7a, 7b, and 7c present the number of trainable parameters of each architecture and the training and prediction times, respectively. The Convolution-based networks have fewer trainable parameters than the respective Transformer-based ones. However, this difference did not manifest in greater training and prediction times.

Figure 8 .
Figure 8. Error maps from the same area under diverse cloud conditions, in which white, black, red, and blue represent true positives, true negatives, false positives, and false negatives, respectively. The pixels classified as Cloudy Pixels are highlighted in yellow.
All models using SAR data were almost unaffected by cloud conditions. The Transformer-based SAR model (TRA-SAR) outperformed its Convolution-based counterpart (CNN-SAR) on all evaluated metrics. The early fusion strategy did not improve the results of the Convolution-based model (CNN-EF) in Cloud-Free Pixels compared to the respective optical model (CNN-OPT). However, it achieved better results in Cloudy Pixels than the respective SAR model (CNN-SAR). The Transformer-based early fusion model (TRA-EF) presented results similar to the respective Convolution-based model in Cloud-Free Pixels but worse results than the respective SAR model (TRA-SAR) in Cloudy Pixels. The last two fusion strategies, joint fusion and late fusion, presented the best F1-score results, with the respective Transformer-based models outperforming the Convolution-based architectures. Considering diverse pixel cloud conditions, TRA-JF presented better F1-score results, while TRA-LF presented slightly lower F1-scores. However, the low recall of TRA-LF (0.86) indicates that this model failed to identify new deforestation areas.

Figure 8
Figure 8 presents examples of error maps from two regions in the study area, where white, black, red, and blue represent true positives, true negatives, false positives, and false negatives, respectively. The pixels classified as Cloudy Pixels are highlighted in yellow. For each region, the first row shows the Convolution-based models' prediction errors, while the second shows the Transformer-based models'. The first two rows (first region) refer to images fully affected by clouds, while the last two (second region) refer to a cloud-free region.

Table 2. Multi-modality models, with columns: Model Name, Base Architecture, and Fusion Strategy.