AN INVESTIGATION OF SUPER-RESOLUTION FOR CROSS-DOMAIN BUILDING EXTRACTION USING TRANSFORMER

ABSTRACT: The density of buildings is an important index reflecting the productivity and prosperity of an economic entity. Automatically monitoring the change and development of buildings through satellite imagery can not only benefit the assessment of the status of urban development but also contribute to suburban construction planning. Naturally, higher-resolution remote sensing images enable more accurate building extraction. However, the desired high-resolution images are not always available, limited by remote sensing imaging technology and the expensive cost of updating sensors and equipment. Super-resolution technology, which aims at restoring high-resolution images from given low-resolution images, is therefore a promising way out of this dilemma. In this paper, we investigate the potential application of super-resolution technology for cross-domain building extraction. The experimental results demonstrate that super-resolution can indeed improve building extraction accuracy.


INTRODUCTION
As an essential carrier of human productive activities, buildings have become one of the most changeable land use types (Hu et al., 2023). The dynamic information of buildings is beneficial for urban planning (Guo et al., 2021), map production (Lafarge et al., 2008), population statistics (Ji et al., 2019), and disaster assessment (Gupta and Shah, 2021). With more and more satellites having been launched worldwide in recent years, automatically extracting and monitoring the change and development of buildings through remote sensing images has become a feasible and efficient option (Chen et al., 2023).
In recent years, deep learning methods, especially the convolutional neural network (CNN) represented by fully convolutional networks (FCN) (Long et al., 2015), have become the mainstream approach for building extraction from remote sensing images, benefiting from their flexibility and adaptability (Ji et al., 2018). Following the pioneering FCN structure, encoder-decoder segmentation networks such as UNet (Ronneberger et al., 2015) and SegNet (Badrinarayanan et al., 2017), which aim at addressing the coarse resolution of FCN-based segmentation, were also introduced and improved for building extraction (Shi et al., 2022; Qiu et al., 2023; Deng et al., 2023).
The performance of these CNN-based methods, although promising, encounters bottlenecks in building extraction (Wang et al., 2022a). Specifically, CNN naturally lacks the capability of capturing long-range and non-local dependencies, as it was originally designed to extract local patterns (Li et al., 2021b). However, in remote sensing images, buildings normally have diverse appearances and are surrounded by complex backgrounds. Therefore, with only the local context, pixels are sometimes ambiguous to identify, while the global context or long-range dependency can provide extra information to determine the category, as illustrated in Figure 1. As a promising alternative to CNN, the transformer (Vaswani et al., 2017), a structure originally designed for natural language processing (NLP), adopts the self-attention mechanism to extract global interactions between contexts. Recently, the vision transformer (ViT), a variant of the transformer for computer vision, has shown huge potential in enhancing vision-related tasks (Dosovitskiy et al., 2021; Zhu et al., 2021; Liu et al., 2021, 2022; Wang et al., 2022b). Compared with content-independent convolutional operations, the attention weights of the self-attention blocks in the ViT are generated according to the relationship between contexts (Conde et al., 2023). Meanwhile, long-range dependency modelling is enabled by the shifted window mechanism embedded within the ViT (Liu et al., 2021, 2022).
Apparently, a remote sensing image with a higher resolution can provide more accurate global context information than a lower-resolution one, thereby guaranteeing more reliable building extraction performance. Even though more and more high-resolution images are becoming available with the rapid development of remote sensing technologies, constellations of satellites launched several years ago still continuously provide low-resolution but high-quality images. To fully utilize those valuable resources for building extraction, super-resolution (SR) technology, which aims at reconstructing high-resolution (HR) images from given low-resolution (LR) images, is an encouraging solution (Dong et al., 2022).
For super-resolution, revolutionary deep-learning-based methods have replaced traditional solutions such as prediction-based methods, patch-based methods, and edge-based methods (Wang et al., 2020). Since the pioneering Super-Resolution Convolutional Neural Network (SRCNN) proposed by Dong et al. (2015), a series of novel super-resolution models have been developed to improve both performance and efficiency (Li et al., 2019; Ji et al., 2020). In the wake of the successful application of the transformer in vision-related tasks, transformer-based models have already demonstrated their great potential to enhance super-resolution performance (Liang et al., 2021; Conde et al., 2023; Lei et al., 2021).
Currently, super-resolution technology has been adopted to boost the performance of image classification (Pang et al., 2019), object detection (Li et al., 2021a) and semantic segmentation (Zhang et al., 2021b). For building extraction, super-resolution-based methods can generally be divided into two kinds: the end-to-end approach (Zhang et al., 2021c; Xu et al., 2021) and the two-stage approach (Zhang et al., 2021a; Chen et al., 2023). However, all these methods are based on CNN structures; it is still unknown whether a transformer-based framework can perform well for super-resolution-based building extraction.
In this paper, we investigate the potential combination of transformer-based super-resolution and building extraction models. Specifically, a two-stage framework is developed using LSwinSR (Li and Zhao, 2023) for super-resolution and BuildFormer (Wang et al., 2022a) for building extraction. To comprehensively examine super-resolution-based building extraction, experiments at a ×4 upsampling scale are conducted, and CNN-based networks are also included in the comparison.

METHODOLOGY
The flowchart of the super-resolution-based building extraction framework is shown in Figure 2. The low-resolution images are first up-sampled by the super-resolution model, after which the up-sampled images are used to train, validate and test the building extraction model.
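For concreteness, the two-stage inference can be summarized by the minimal PyTorch sketch below. The `sr_model` and `seg_model` arguments are placeholders standing in for trained LSwinSR and BuildFormer networks; any modules with the same input/output shapes would fit, and the threshold of 0.5 is an assumption.

```python
import torch

@torch.no_grad()
def extract_buildings(lr_image: torch.Tensor,
                      sr_model: torch.nn.Module,
                      seg_model: torch.nn.Module) -> torch.Tensor:
    """Stage 1: super-resolve the low-resolution tile; stage 2: segment buildings."""
    sr_model.eval()
    seg_model.eval()
    hr_image = sr_model(lr_image)   # e.g. (1, 3, 128, 128) -> (1, 3, 512, 512) at x4
    logits = seg_model(hr_image)    # (1, 1, 512, 512) per-pixel building logits
    return (torch.sigmoid(logits) > 0.5).float()  # binary building mask
```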

Super-Resolution Models
This paper evaluates the Bicubic up-sampling method and a deep-learning super-resolution model, LSwinSR (Li and Zhao, 2023). LSwinSR is designed based on Swin Transformer (Liu et al., 2021) and SwinIR (Liang et al., 2021) and keeps the three-step structure of SwinIR, i.e. shallow feature extraction, deep feature extraction and image reconstruction. The first step utilizes a convolutional layer to process early visual features and expand the feature space. The shallow features are then processed by several Residual Linear Swin Transformer Blocks (RLSTB) and a convolutional layer with the shifted window mechanism to obtain deep features. Finally, the shallow features and deep features are aggregated and fed into the reconstruction module to generate the high-resolution images.
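The sketch below illustrates this three-step layout. It is not the LSwinSR implementation: `PlaceholderRLSTB` is a plain residual convolutional block standing in for the actual Residual Linear Swin Transformer Block, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class PlaceholderRLSTB(nn.Module):
    """Stand-in for a Residual Linear Swin Transformer Block (RLSTB)."""
    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),
                                  nn.GELU(),
                                  nn.Conv2d(dim, dim, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)  # residual connection, as in SwinIR

class SRSketch(nn.Module):
    def __init__(self, dim: int = 64, n_blocks: int = 4, scale: int = 4):
        super().__init__()
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)  # 1) shallow feature extraction
        self.deep = nn.Sequential(                       # 2) deep feature extraction
            *[PlaceholderRLSTB(dim) for _ in range(n_blocks)],
            nn.Conv2d(dim, dim, 3, padding=1))
        self.reconstruct = nn.Sequential(                # 3) image reconstruction
            nn.Conv2d(dim, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, lr):
        shallow = self.shallow(lr)
        deep = self.deep(shallow)
        return self.reconstruct(shallow + deep)  # aggregate shallow + deep features
```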

Building Extraction Models
In this paper, two deep-learning-based building extraction models are evaluated, i.e. ABCNet (Li et al., 2021b) and BuildFormer (Wang et al., 2022a). The success of building extraction algorithms depends on their ability to extract both local and global information. ABCNet utilizes an attentive bilateral network structure in its convolutional neural network to extract both types of features. This structure uses a spatial path to extract local features and a contextual path to extract global features, with an attention mechanism employed to capture rich contextual information. By integrating the low-level, detailed features from the spatial path and the high-level, semantic features from the contextual path, ABCNet realizes a CNN-based building extraction model with competitive performance. BuildFormer, on the other hand, is a dual-path variant of the Vision Transformer. It extracts spatial-detailed features through convolutional blocks and global context features through BuildFormer blocks with a window-based linear multi-head self-attention mechanism, as sketched below. Like ABCNet, it fuses the spatial-detailed and global context features to produce building extraction results. According to its authors, BuildFormer outperforms most CNN-based models for building extraction.
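To make the window-based attention idea concrete, the following sketch applies standard multi-head self-attention inside non-overlapping windows of a feature map. Note that BuildFormer and LSwinSR use a linear (memory-efficient) attention variant and shifted windows, both omitted here; this shows only the basic windowing mechanism.

```python
import torch
import torch.nn as nn

def window_attention(x: torch.Tensor, attn: nn.MultiheadAttention,
                     window: int = 8) -> torch.Tensor:
    """Apply self-attention independently inside non-overlapping windows.

    x: (B, C, H, W) feature map; H and W must be divisible by `window`.
    """
    b, c, h, w = x.shape
    # partition into (B * num_windows, window*window, C) token groups
    tokens = (x.view(b, c, h // window, window, w // window, window)
                .permute(0, 2, 4, 3, 5, 1)
                .reshape(-1, window * window, c))
    out, _ = attn(tokens, tokens, tokens)  # MHSA within each window
    # merge the windows back into a (B, C, H, W) feature map
    return (out.view(b, h // window, w // window, window, window, c)
               .permute(0, 5, 1, 3, 2, 4)
               .reshape(b, c, h, w))

# usage: 16 windows of 64 tokens each on a 32x32 feature map
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
feat = torch.randn(1, 64, 32, 32)
out = window_attention(feat, attn)  # (1, 64, 32, 32)
```
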
Dataset
The Inria Aerial Image Labeling Dataset (Maggiori et al., 2017) is used to evaluate the different combinations of super-resolution and building extraction methods. The Inria dataset collects 360 high-resolution aerial images at 0.3 m resolution from five cities (Austin, Chicago, Kitsap, Tyrol, and Vienna).
In our experiment, the images of Austin, Chicago and Kitsap are taken to train the super-resolution models, while tiles 1-5 of each city are held out to validate the performance and select the optimal model. After training and validation, the optimal model is used to up-sample the images of Tyrol and Vienna, where tiles 1-5 and 6-10 of each city are held out to test and validate the building extraction models, respectively. For both super-resolution and building extraction, the original 5000 × 5000 images are first padded to 5120 × 5120 pixels and then cropped into 512 × 512 pixel image tiles. It is noteworthy that the subsets for super-resolution and building extraction are collected from different cities. Therefore, the cross-domain capability of the super-resolution-based building extraction framework can be verified.
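A possible preprocessing routine is sketched below using the stated sizes; the padding mode is an assumption, as the paper does not state how the 120-pixel border is filled.

```python
import numpy as np

def pad_and_tile(image: np.ndarray, target: int = 5120, tile: int = 512) -> list:
    """Pad a (5000, 5000, 3) aerial image to (5120, 5120, 3) and cut 512x512 tiles."""
    h, w = image.shape[:2]
    padded = np.pad(image, ((0, target - h), (0, target - w), (0, 0)),
                    mode="reflect")  # assumption: reflect padding at the border
    return [padded[r:r + tile, c:c + tile]
            for r in range(0, target, tile)
            for c in range(0, target, tile)]  # 10 x 10 = 100 tiles per image
```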

Evaluation Metrics
Three frequently-used indexes are employed to evaluate the super-resolution results: the peak signal-to-noise ratio (PSNR), the structural similarity index measure (SSIM) and the mean absolute error (MAE). Thereafter, the building extraction model is trained, validated and tested on the up-sampled images generated by the corresponding super-resolution algorithms, where the performance is measured by overall accuracy (OA), precision, recall, F1 score and intersection over union (IoU):

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
IoU = TP / (TP + FP + FN)

where TP, FP, and FN represent the true positives, the false positives, and the false negatives, respectively.
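These definitions are straightforward to compute; a minimal NumPy implementation follows (SSIM is omitted here, since it is usually taken from a library such as `skimage.metrics.structural_similarity`). The function and variable names are illustrative.

```python
import numpy as np

def psnr(ref: np.ndarray, out: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a reference and an output image."""
    mse = np.mean((ref.astype(np.float64) - out.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def mae(ref: np.ndarray, out: np.ndarray) -> float:
    """Mean absolute error between a reference and an output image."""
    return float(np.mean(np.abs(ref.astype(np.float64) - out.astype(np.float64))))

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: binary building masks; assumes both classes are present."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {"OA": (tp + tn) / (tp + tn + fp + fn),
            "Precision": precision,
            "Recall": recall,
            "F1": 2 * precision * recall / (precision + recall),
            "IoU": tp / (tp + fp + fn)}
```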

Super-Resolution Experiments
Considering that the proposed model is a two-stage framework, the accuracy of the image super-resolution model in the first stage plays a crucial role in the performance of the entire model. In this paper, we evaluated two image super-resolution algorithms, Bicubic and LSwinSR, on the Inria Aerial Image Labeling Dataset with a super-resolution scale of ×4. The performance is evaluated with the image quality metrics PSNR, SSIM and MAE. It is also noted that, compared with Bicubic, LSwinSR maintains clear advantages on the test set: PSNR higher by 2.44, SSIM higher by 6.39%, and MAE lower by 1.13%. The similar gaps in the three evaluation metrics across the training, validation and test sets imply the strong generalization and predictive ability of LSwinSR.

As shown in Table 1, the PSNR of LSwinSR surpasses that of Bicubic by an obvious gap of about 2.73 and 2.81 on the training and validation sets, respectively. Additionally, LSwinSR exceeds Bicubic on SSIM by more than 6.86%, with lower MAE (at least 1.15% lower) on both sets. Higher PSNR and lower MAE indicate lower peak and mean errors respectively, while higher SSIM demonstrates a higher similarity between the super-resolution outputs and the reference high-resolution images. The outstanding performance of LSwinSR illustrates the effectiveness of both shallow and deep feature extraction through transformer-based deep-learning blocks for super-resolution.
The visual comparison in Figure 3 provides a more intuitive demonstration of the above analysis. As shown in Figure 3, LSwinSR produces clear boundaries between buildings and other background structures, while Bicubic only depicts buildings with indistinguishable contours. These experiments validate the outstanding super-resolution performance of LSwinSR.

Building Extraction Experiments
This section evaluates the performance of building extraction based on the high-resolution images obtained by the super-resolution models evaluated in the previous section. In addition to the high-resolution images generated by Bicubic and LSwinSR, the original high-resolution images are also directly fed into the building extraction models as a comparison reference. Two building extraction models, namely BuildFormer and ABCNet, are evaluated based on five metrics: Precision, Recall, F1, IoU and OA. The results show that BuildFormer consistently outperforms ABCNet in the building extraction tasks. For example, BuildFormer led ABCNet in IoU by 4.58%, 3.88%, and 4.22% on the images produced by Bicubic, the images produced by LSwinSR, and the original high-resolution images, respectively. Similarly, when using the same source of high-resolution images, BuildFormer exceeded ABCNet in Precision, Recall, OA, and F1 by at least 1.88%, 2.90%, 1.19%, and 2.42%, respectively. This phenomenon is mainly due to two reasons. First, it suggests that transformer-based methods have certain advantages over CNN-based methods. Second, ABCNet has fewer parameters than BuildFormer, which may lead to its worse performance.
The visual comparison in Figure 4 further demonstrates the importance of applying super-resolution. Judged against the original high-resolution image, the buildings extracted with LSwinSR inputs show less shape deformation. Also, BuildFormer performs better than ABCNet when processing the same input image. Among all results, the combination of LSwinSR and BuildFormer performs best, which validates the proposed two-stage framework using a transformer-based super-resolution model and a transformer-based building extraction model.

CONCLUSIONS
In this work, we investigated a two-stage building extraction framework based on LSwinSR and BuildFormer. The developed framework demonstrates the effectiveness of combining transformer-based super-resolution and building extraction models. Super-resolution and building extraction experiments were conducted separately on the Inria Aerial Image Labeling Dataset. The super-resolution experiments demonstrated the dramatic enhancement achieved by LSwinSR compared with simple Bicubic interpolation, and the building extraction experiments showed superior performance with LSwinSR inputs. Furthermore, the comparison between ABCNet and BuildFormer indicated that the transformer-based building extraction model consistently performed better than the CNN-based one in our experiments. In the future, we will further investigate the performance gaps at different super-resolution scales. In addition, the potential combination of other super-resolution models and building extraction models will be analyzed.

Figure 1. Illustration of the local context and the global context. The squares represent the receptive field of the convolution operation. The orange regions represent the ambiguous building pixels where the local context is indistinguishable.

Figure 2. Flowchart of the two-stage super-resolution-based building extraction framework.

Figure 4. Visual comparison of building extraction results. Inputs include the Bicubic and LSwinSR ×4 super-resolution images and the original high-resolution image. The building extraction models include ABCNet and BuildFormer.

Table 1. Super-resolution performance of Bicubic and LSwinSR in terms of PSNR, SSIM and MAE on the training, validation and test sets.