An Improved Mask R-CNN: Extraction of Door and Window Instances on Village Building Façade Images

Rapid access to the basic structure of village buildings is conducive to the investigation of the load-bearing bodies of village houses and provides data support for disaster assessment and post-disaster rescue and reconstruction. The development of computer vision technology provides new ideas and tools for identifying and extracting the basic structures of housing buildings. Considering that the original Mask R-CNN ignores the spatial association and relationship of door and window elements, an advanced deep learning model based on the Mask R-CNN network is proposed in this paper to detect and segment door and window structures from façade images. The improved network architecture integrates the attention mechanism with the original network and contains an improved Coordinate Attention (CA) module and a relation module-based head network. The experimental results show that the Average Precision (AP) value of the backbone combined with the improved CA module is increased by 0.7% and 0.7% on the regression and segmentation tasks, respectively, compared with the original Mask R-CNN network. In the head network based on the relation module, the calculation strategy proposed in this paper increases the AP values of detection and segmentation from 76.7% and 77.7% to 80.6% and 80.0%, respectively.


INTRODUCTION
Among natural disasters, earthquake damage is most closely related to housing construction. The areas with severe earthquake damage are mainly villages and towns; therefore, a survey of disaster-bearing body information for village and town housing can provide technical support and a decision basis for post-earthquake emergency rescue, decision-making command, and post-disaster reconstruction in the region (Xu et al., 2014).
In the building structure survey of the disaster-bearing body investigation, the size and location of openings in walls are key factors in understanding the building structure and carrying out damage assessment of disaster-bearing bodies, and window and door elements are among the most distinctive and numerous elements of the building façade. Compared with traditional extraction methods such as contour-based methods (Haugeard et al., 2009; Lee & Nevatia, 2004; Recky & Leberl, 2010), intensity-based methods (Čech & Šára, 2009), and machine learning-based methods (Jampani et al., 2015; Reznik & Mayer, 2008; Yang et al., 2012), extracting doors and windows with UAV data and deep learning-based methods would greatly reduce the difficulty of survey work and improve efficiency. First, rather than conducting large amounts of on-the-spot investigation, unmanned aerial vehicles (UAVs), aircraft, satellites, and other equipment are used to scan a specific area and obtain the corresponding data. Second, the quality and accuracy of the data are greatly improved by using aerial surveys and remote sensing technology. Third, deep learning has achieved state-of-the-art performance in a number of vision tasks (Dosovitskiy et al., 2020; Girshick, 2015; Long et al., 2015; Redmon et al., 2016; Ren et al., 2015). However, extracting components from building façades with UAV data and deep learning methods has barely been studied (Liu et al., 2020).
In this paper, a deep learning model based on Mask R-CNN (He et al., 2017) is proposed to detect and segment door and window elements from the building façade. For the detection of door and window structures of village houses, several improved architectures fused with Mask R-CNN were designed to improve the accuracy of door and window extraction. The doors and windows of houses in villages and towns are generally not as regularly arranged as those of houses in cities, and this irregularity is especially pronounced for houses in residential areas. Therefore, it is necessary to improve the network's attention to the window and door elements in feature extraction. Also, considering that the original Mask R-CNN only extracts the appearance features of individual targets without considering the spatial relationship of door and window elements and the association between objects, the improved network architectures in this paper integrate the attention mechanism with the original network model. The specific work is as follows: (1) An improved CA (Coordinate Attention) module is proposed to fuse coordinate attention and channel attention to optimize the features extracted by the backbone, enabling the network to better extract RoIs for object detection; (2) A relation module is embedded in the fully connected layers of the head network to integrate the appearance features and geometric features among different objects, and a novel strategy is proposed that takes a weighted sum of the attention of geometric features and appearance features to strengthen the geometric relationships learned among door and window objects.

Network
First, the ResNet (He et al., 2016) + FPN (Lin et al., 2017) feature extraction structure is adopted by the backbone network to extract multi-scale feature maps. Considering the balance between detection performance and speed, ResNet50 is selected. At the same time, an improved CA module is introduced: attention features are extracted from the feature maps output by ResNet by the improved CA module, and the pyramid feature structure is then generated by the FPN network.
Second, the RPN network is used to predict the classification score and the bounding box of the anchor boxes generated at each pixel. The anchor boxes with higher scores are then selected as the proposal regions passed to the head network.
In the head network, one branch performs classification and bounding box regression, and the other performs instance segmentation. The relation module is embedded after the two fully connected layers of the first branch and is used to learn the relationships between objects.

Backbone fused with Improved CA Module
Given an input $X \in \mathbb{R}^{C \times H \times W}$, each channel is first encoded along the horizontal and vertical coordinate directions using pooling kernels of dimensions $(H, 1)$ and $(1, W)$, respectively. Writing $x_c$ for the $c$-th channel of $X$, the output $z_c^h$ of the $c$-th channel at height $h$ can be expressed as follows:
$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$
Similarly, the output $z_c^w$ of the $c$-th channel at width $w$ can be expressed as:
$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$
The above two transformations aggregate features along each of the two spatial directions to obtain a pair of direction-aware feature maps. This allows the attention module to capture long-range dependencies along one spatial direction while preserving precise location information along the other, which helps the network to locate the region of interest more accurately.
In other words, the coordinate information embedding operation average-pools the features along the X and Y directions at each pixel location, yielding X AvgPool and Y AvgPool.
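The two directional average-pooling steps described above can be illustrated on a single channel. The following is a minimal sketch in plain Python, not the paper's implementation; the function names and the toy values are assumptions made for illustration only.

```python
from typing import List

Channel = List[List[float]]  # one channel stored as H rows of W values


def pool_along_width(channel: Channel) -> List[float]:
    """Average each row over its W columns: one value per height index h."""
    return [sum(row) / len(row) for row in channel]


def pool_along_height(channel: Channel) -> List[float]:
    """Average each column over its H rows: one value per width index w."""
    height = len(channel)
    return [sum(col) / height for col in zip(*channel)]


# Toy 2 x 3 channel (H = 2, W = 3); the values are illustrative only.
toy_channel = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
z_h = pool_along_width(toy_channel)   # length H
z_w = pool_along_height(toy_channel)  # length W
print(z_h, z_w)
```

Each output keeps the index of one spatial direction and averages over the other, which is why the pair of outputs is direction-aware.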
After embedding the position information, the two position features are concatenated and then passed through a linear transformation as follows:
$$f = \delta(F_1([z^h, z^w]))$$
where $z^h$ and $z^w$ are the coordinate information embeddings, $[\cdot, \cdot]$ denotes concatenation, $F_1(\cdot)$ is the 1 × 1 convolution, $\delta(\cdot)$ is the activation function (typically ReLU), $f \in \mathbb{R}^{C/r \times (H+W)}$ is the generated intermediate feature map for spatial information in the horizontal and vertical directions, and $r$ denotes the downsampling ratio that controls the size of the module.
Finally, the output $X' \in \mathbb{R}^{C \times H \times W}$ with coordinate attention features is obtained by multiplying the original features with the coordinate features using the broadcast mechanism.
Improved CA Module. The CA module only considers the importance of spatial location information and does not indicate which channels deserve more attention. Therefore, an improved CA module was designed. On the basis of the CA module, a channel attention branch is added so that the network can better detect objects by combining both their position information and channel information. Its structure is shown in the green dashed box in Fig. 2.
For the channel attention, inspired by the SE (J. Hu et al., 2018) and CBAM (Woo et al., 2018) modules, we perform global average pooling and global maximum pooling on the channel dimension of the original feature map to generate two different global features, $F_{avg}^{C}$ and $F_{max}^{C}$. Then, $F_{avg}^{C}$ and $F_{max}^{C}$ are simultaneously fed into a shared multilayer perceptron containing two fully connected layers $W_1$ and $W_2$. Finally, the channel attention feature $F^{C} \in \mathbb{R}^{1 \times 1 \times C}$ is obtained by element-wise sum as follows:
$$F^{C} = \sigma\left(W_2(\delta(W_1(F_{avg}^{C}))) + W_2(\delta(W_1(F_{max}^{C})))\right)$$
where $\sigma(\cdot)$ refers to the sigmoid function, $\delta(\cdot)$ refers to the ReLU function, and $W_1$ and $W_2$ are two fully connected layers with shared weights.
Given the $n$-th RoI object, the relationship between it and all other objects is characterized by $f_R(n)$ as follows:
$$f_R(n) = \sum_{m} \omega^{mn} \cdot (W_V \cdot f_a^m)$$
where $f_a^m$ denotes the appearance feature of the $m$-th object, $W_V$ is the linear transformation matrix, and $\omega^{mn}$ denotes the relational feature weight between the $m$-th object and the $n$-th object. The relation feature $f_R$ is added to the original image features $f_a$ and used as the input of the next layer.
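The calculation of the relational feature weight $\omega^{mn}$ consists of three parts:
(1) Calculating the appearance correlation $\omega_a^{mn}$ between two objects:
$$\omega_a^{mn} = \frac{\langle W_K f_a^m, W_Q f_a^n \rangle}{\sqrt{d_k}}$$
where $W_K$ and $W_Q$ are linear transformation matrices and $d_k$ is the number of feature dimensions.
(2) Calculating the geometric correlation $\omega_G^{mn}$ between two objects:
$$\omega_G^{mn} = \max\{0, W_G \cdot \varepsilon_G(f_G'^{mn})\}$$
where $f_G'^{mn}$ denotes the geometric feature after coordinate transformation, $W_G$ is the linear transformation matrix of the geometric feature, and $\varepsilon_G$ is an operator that embeds the 4-dimensional coordinate information into a 64-dimensional feature with sinusoidal position encoding. The coordinate transformation ensures translation invariance and scale invariance between different target features.
(3) Calculating the correlation $\omega^{mn}$ between the two targets:
$$\omega^{mn} = \frac{\omega_G^{mn} \cdot \exp(\omega_a^{mn})}{\sum_{k} \omega_G^{kn} \cdot \exp(\omega_a^{kn})}$$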
In calculating the relation module, the appearance features are divided into $N_r$ equal parts so that they pass through different relation modules, and the outputs are then summed with the appearance features $f_a$ as the input of the next layer:
$$f_a^n = f_a^n + \mathrm{Concat}[f_R^1(n), \ldots, f_R^{N_r}(n)]$$
The weights $W_Q$, $W_K$ and $W_V$ correspond to the Query and the Key-Value pair in the attention mechanism, respectively.
For each pair of relational features, $W_Q$, $W_K$, $W_V$ and $W_G$ are learnable weights, so the relation module can be integrated with the network for end-to-end training as shown in Fig. 5(a).
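To make the three-part weight calculation above concrete, the following is a minimal sketch in plain Python under stated assumptions. The relative-geometry form follows the relation module of Hu et al. (2018) and may differ in detail from the transformation used in this paper; the box layout `(x_center, y_center, width, height)`, the `eps` guard, the already-projected toy feature vectors, and the toy geometric weights (standing in for the learned $W_G$ and the sinusoidal embedding) are all hypothetical.

```python
import math
from typing import List, Sequence

Box = Sequence[float]  # assumed layout: (x_center, y_center, width, height)


def relative_geometry(box_m: Box, box_n: Box, eps: float = 1e-3) -> List[float]:
    """4-d relative geometry in the form given by Hu et al. (2018)."""
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    return [
        math.log(max(abs(xm - xn), eps) / wm),  # eps guards against log(0)
        math.log(max(abs(ym - yn), eps) / hm),
        math.log(wn / wm),
        math.log(hn / hm),
    ]


def appearance_weight(key_m: Sequence[float], query_n: Sequence[float]) -> float:
    """Scaled dot product of already-projected key and query vectors."""
    d_k = len(key_m)
    return sum(k * q for k, q in zip(key_m, query_n)) / math.sqrt(d_k)


def relation_weights(app: List[List[float]], geo: List[List[float]]) -> List[List[float]]:
    """Combine app[m][n] and non-negative geo[m][n]; normalize over m for each n."""
    count = len(app)
    omega = [[0.0] * count for _ in range(count)]
    for n in range(count):
        terms = [geo[m][n] * math.exp(app[m][n]) for m in range(count)]
        total = sum(terms)
        for m in range(count):
            omega[m][n] = terms[m] / total if total > 0 else 0.0
    return omega


# Toy inputs: three objects with 2-d projected features (illustrative only).
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
app = [[appearance_weight(features[m], features[n]) for n in range(3)]
       for m in range(3)]
geo = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]  # stand-in weights
omega = relation_weights(app, geo)

# The relative geometry is unchanged when both boxes are shifted or scaled.
base = relative_geometry((10.0, 10.0, 4.0, 2.0), (16.0, 11.0, 8.0, 4.0))
shifted = relative_geometry((110.0, 60.0, 4.0, 2.0), (116.0, 61.0, 8.0, 4.0))
scaled = relative_geometry((20.0, 20.0, 8.0, 4.0), (32.0, 22.0, 16.0, 8.0))
```

In this sketch a geometric weight of zero removes the corresponding object pair from the normalization, so the geometric term acts as a gate on the appearance similarity.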
In the above relation module, the geometric similarity between objects enters the calculation only as a weight coefficient of object similarity. We propose an alternative strategy that uses both appearance features and geometric features to calculate the relation features, enabling the network to learn the association between the location information of different objects.
In this strategy, the relation features are computed from $\omega_1^{mn}$ and $\omega_2^{mn}$, the appearance similarity and geometric similarity weights between the $m$-th and $n$-th RoI, which are obtained from $\omega_a^{mn}$ and $\omega_G^{mn}$ by softmax transformation.

EXPERIMENTS
In this section, AP (Average Precision) is used as the evaluation criterion for each category. The AP value is the area under the PR curve for a certain class of targets. The recognition accuracy of a model can be measured by analyzing the AP values of the network training results. The PR curve is plotted with Recall as the horizontal axis and Precision as the vertical axis for different threshold values. Usually, the larger the area under the PR curve, the better the model. In the COCO (Microsoft Common Objects in COntext) evaluation criteria (Lin et al., 2014), AP@[0.5:0.95] is the average of the 10 AP values obtained by varying $IoU_{threshold}$ from 0.5 to 0.95 in steps of 0.05, which allows for a more refined evaluation of the detection accuracy of the model. $AP_{50}$ and $AP_{75}$ represent the AP values when $IoU_{threshold}$ is 0.50 and 0.75, respectively, while $AP_S$, $AP_M$ and $AP_L$ are the AP values for different object sizes, with S, M and L representing object sizes of area < $32^2$, $32^2$ < area < $96^2$ and area > $96^2$, respectively.
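The threshold grid and size buckets described above can be written down directly. The following is a minimal sketch in plain Python, not the COCO evaluation code; the function names, the `(x1, y1, x2, y2)` box layout, and the handling of areas exactly on a bucket boundary are assumptions.

```python
from typing import List, Sequence


def iou_thresholds() -> List[float]:
    """The 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def size_bucket(area: float) -> str:
    """Assign S, M or L by pixel area (boundary handling is an assumption)."""
    if area < 32 ** 2:
        return "S"
    if area < 96 ** 2:
        return "M"
    return "L"


def mean_over_thresholds(ap_values: Sequence[float]) -> float:
    """AP@[0.5:0.95] as the plain average of the per-threshold AP values."""
    return sum(ap_values) / len(ap_values)


thresholds = iou_thresholds()
example_iou = box_iou((0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0))
print(thresholds, example_iou, size_bucket(50 * 50))
```

A detection counts as a match at a given threshold only if its IoU with a ground-truth object reaches that threshold, so the example pair above would not match even at the loosest threshold of 0.5.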

Datasets
The datasets for this experiment were obtained from 3D models of building façades reconstructed from UAV aerial imagery in selected villages and towns. Four wall façades were collected for each building, as our goal is to extract windows and doors of village houses quickly and in large batches.
Most of the buildings photographed are two- or three-story bungalows. Each image is 2000 × 3000 pixels and consists of red, green, and blue channels.

Three Fusion Methods in Improved CA Module
The three improved CA module fusion methods are as follows: (1) X/Y/C: The original feature map X is input in parallel to the X-AM, Y-AM, and C-AM branches, and the output attentional feature image is obtained by multiplying the branch outputs with the original features using the broadcast mechanism. (2) XY/C: The original feature map X is input in parallel to the XY-AM and C-AM branches, and the branch outputs are then multiplied with the original features to obtain the attentional feature image. (3) X+Y+C: The three branches are connected in series, and the input and output of each branch are multiplied via skip connections to obtain the attentional feature. Although the X/Y/C fusion method is slightly better than the original Mask R-CNN, it does not perform as well as XY/C. This is probably because the XY/C attention extraction method integrates the X- and Y-direction features when extracting coordinate attention, which is richer in information association and long-range feature dependency than separate X- and Y-direction input branches. As shown in Table 2 and Table 3, the X+Y+C method performs worse than the original Mask R-CNN, presumably because some of the detailed features are lost after the original features pass through the cascaded modules.
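The difference between the parallel and serial fusion patterns described above can be sketched in plain Python. This is only an illustration of the multiplication pattern, not the paper's modules: the gates here are fixed sigmoids of pooled means standing in for the learned X-AM, Y-AM and C-AM branches, and all function names and toy values are hypothetical.

```python
import math
from typing import List, Optional

Tensor = List[List[List[float]]]  # indexed as [channel][row][column]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def gate_h(x: Tensor) -> List[List[float]]:
    """Stand-in height gate: sigmoid of each row mean, shape [C][H]."""
    return [[sigmoid(sum(row) / len(row)) for row in ch] for ch in x]


def gate_w(x: Tensor) -> List[List[float]]:
    """Stand-in width gate: sigmoid of each column mean, shape [C][W]."""
    return [[sigmoid(sum(col) / len(col)) for col in zip(*ch)] for ch in x]


def gate_c(x: Tensor) -> List[float]:
    """Stand-in channel gate: sigmoid of each channel mean, shape [C]."""
    return [sigmoid(sum(map(sum, ch)) / (len(ch) * len(ch[0]))) for ch in x]


def scale(x: Tensor,
          gh: Optional[List[List[float]]] = None,
          gw: Optional[List[List[float]]] = None,
          gc: Optional[List[float]] = None) -> Tensor:
    """Multiply x by whichever gates are given, broadcasting over missing axes."""
    return [[[v
              * (gh[c][i] if gh else 1.0)
              * (gw[c][j] if gw else 1.0)
              * (gc[c] if gc else 1.0)
              for j, v in enumerate(row)]
             for i, row in enumerate(ch)]
            for c, ch in enumerate(x)]


def fuse_parallel(x: Tensor) -> Tensor:
    """All gates are computed from the same input, then applied together."""
    return scale(x, gate_h(x), gate_w(x), gate_c(x))


def fuse_series(x: Tensor) -> Tensor:
    """Each gate is computed from the previous stage's output."""
    y = scale(x, gh=gate_h(x))
    y = scale(y, gw=gate_w(y))
    return scale(y, gc=gate_c(y))


toy = [[[1.0, 2.0], [3.0, 4.0]]]  # C = 1, H = 2, W = 2, illustrative values
parallel = fuse_parallel(toy)
series = fuse_series(toy)
print(parallel[0][0][0], series[0][0][0])
```

In the serial variant, every later gate only sees features that earlier gates have already attenuated, which is one way detail could be lost through cascading.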

Two Strategies in Relation Module
In a relation module-based head network, two crucial hyperparameters exist in the model: the number of relations $N_r$ and the number of modules $\{r_1, r_2\}$. We set $r_1 = 1$ and $r_2 = 1$ on the basis of Hu's experiment (Hu et al., 2018), and compare the results for different values of $N_r$.
According to Hu's research, the AP is highest when $N_r$ is 16. However, we came to a different conclusion in our trials. Experiments on the head network were carried out with the XY/C-based improved CA module from the previous section. According to Table 4, the mAP values for the regression and segmentation tasks showed an overall increasing trend as $N_r$ increased. Specifically, when $N_r$ is less than 8, both $AP_{bbox}$ and $AP_{segm}$ tend to increase with $N_r$. When $N_r$ is equal to 16, the mAP values decreased slightly. When $N_r$ is equal to 32, the mAP value is the highest, and the accuracy increases by 2.6% and 3.9% in the regression task and 2.3% and 2.2% in the segmentation task, respectively. Considering the computational efficiency and computational complexity, $N_r = 32$ is chosen.
The two relation module computation strategies both perform well on our dataset. Considering that the output of the segmentation branch of Mask R-CNN is resized to the size of the RoI, the accuracy of the regression branch affects the accuracy of the post-segmentation processing. For the above reasons, the relation module computation strategy of Head+RM Eq. (15)+Eq. (18) is proposed.

Precision Analysis
The comparison between the detection results of the original Mask R-CNN and the improved Mask R-CNN reveals that the original Mask R-CNN performs less effectively on door elements and is prone to false and missed detections, as shown in Fig. 8. After the introduction of the attention mechanism, the number of false and missed detections is significantly reduced. This is presumably because the attention mechanism enables the network to learn the connections and differences between the appearance and location of the target object and those of other objects, and thus to identify the objects more accurately.
Specifically, some of the severely missed objects were mainly door elements.On one hand, door elements were less distributed and regularly arranged in a building façade than windows.On the other hand, the appearance, style and scale of door elements varied, making them easy to be missed in the original network.
In the improved network, combined with the coordinate information, the network is able to detect door elements based on the geometric relationships and the relative positions of the window elements in the façade.

CONCLUSION
In this paper, an improved Mask R-CNN network was proposed for extracting door and window instances from UAV building façade images. Two improvements have been made. First, with the combination of the attention features, the improved CA module enables the network to learn the attention features based on coordinate positions and channel attention features through coordinate encoding. Therefore, the RoIs containing the door and window objects can be extracted more accurately, which helps further tasks in the head network. Second, by combining the relation module with the head network, a set of objects can be processed simultaneously through the interaction of appearance features and geometric features, so that relationships between doors and windows can be learned during training, enhancing the network's ability to represent image features and geometric relationships between objects. Through these attention mechanisms, the network can learn the appearance, spatial, and inter-object relationships on a deeper level, resulting in greater accuracy.
Quantitatively, the average precisions of the improved CA module are higher than the baseline by 0.7% and 0.7% on the regression and segmentation tasks, respectively. The average precisions of the head with the relation module increased from 76.7% and 77.7% to 80.6% and 79.9% compared to those without the relation module.
Experiments show that the proposed method can fully utilize the spatial relationship features between doors and windows, and therefore can achieve better detection results.
The attention mechanism can be utilized to enhance the network's attention to the feature extraction, which reduces the occurrence of misdetections and under-detection in extracting the RoIs of the door and window elements.Besides, combined with the relation module, the network can model the appearance and geometric relationships between each object, learning the similarities between the same class of objects and the differences between different classes.
However, due to the complexity and irregularity of window and door elements in village buildings, the improvement of the network in attention modeling for window and door elements is still limited.Future work will continue to consider modeling the relative relationship between window and door elements and house facades to further investigate instance segmentation of facade elements.

Architecture
Figure 1. General framework of the proposed network.
The improved Mask R-CNN network structure is shown in Fig. 1. The network model consists of three parts (from left to right): (1)
Three different fusion methods. Considering that the CA module is essentially composed of an X-directional attentional feature module (X-AM) and a Y-directional attentional feature module (Y-AM), plus the channel attentional feature module (C-AM), different fusion methods may produce different effects, so three fusion methods are tried in this paper as shown in Fig. 3.
Figure 3. Three different fusion methods for improved CA module: (a) X/Y/C, (b) XY/C, (c) X+Y+C.
The first fusion method inputs the original feature map into the three branches X-AM, Y-AM, and C-AM in parallel and obtains the output attention feature map by multiplying the branch outputs with the original features using the broadcast mechanism, as shown in Fig. 3(a). The second fusion method inputs the original feature map in parallel to the XY-AM and C-AM branches, where the XY-AM branch concatenates the coordinate encodings in the X and Y directions, extracts features through a 1 × 1 convolution, and splits the result after the linear transformation. The XY-AM branch's feature is then multiplied with the original feature map and with the output of the C-AM branch to obtain the attention feature image, as shown in Fig. 3(b). The third fusion method connects the three branches in series and multiplies the input and output of each branch via skip connections to obtain the attentional feature image, as shown in Fig. 3(c).
Backbone with improved CA module. ResNet was used as the backbone network, and the outputs of ResNet, C1 to C5, are fed into the improved CA module. The final outputs, P2 to P6, at different scales are extracted by the CA module in combination with the FPN network as shown in Fig. 4.
In this way, the obtained relational features integrate the weighted sum of the appearance and geometric features between different targets after a linear transformation, as shown in Fig. 5(b). As a result, the network is able to fully consider the geometric relationships between different objects.
Head with Relation Module. The computational flow of the head network in the original Mask R-CNN is shown in Fig. 6(a). Given the $n$-th RoI object feature, it is input to two fully connected layers of 1024 dimensions, and the corresponding classification scores and bounding box regression values are then obtained by linear transformations. Each object is classified and regressed using only the features extracted by the preceding network. The computational flow of the head network with the relation module is shown in Fig. 6(b), where all RoI object features are used as inputs and the dimensions of the inputs and outputs are kept constant. Each target learns correlations between object features through its associations with the others, including appearance features and geometric features.
To prepare the input data for training, window and door objects are extracted from the annotated images and encoded in the COCO instance segmentation format. To maintain the consistency of the results, every tested model was trained for 300 epochs on a single GPU. Also, each model was trained and fine-tuned on the basis of a ResNet model pre-trained on the ImageNet-1k dataset.
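As an illustration of the COCO instance segmentation format mentioned above, the following minimal sketch builds one annotation from a polygon in plain Python. The field names follow the public COCO format; the file name, the ids, the category ids for window and door, and the polygon coordinates are hypothetical and not taken from the dataset.

```python
import json
from typing import Dict, List


def polygon_area(flat: List[float]) -> float:
    """Shoelace area of a polygon given as [x1, y1, x2, y2, ...]."""
    xs, ys = flat[0::2], flat[1::2]
    total = 0.0
    for i in range(len(xs)):
        j = (i + 1) % len(xs)
        total += xs[i] * ys[j] - xs[j] * ys[i]
    return abs(total) / 2.0


def polygon_bbox(flat: List[float]) -> List[float]:
    """COCO-style box [x_min, y_min, width, height] enclosing the polygon."""
    xs, ys = flat[0::2], flat[1::2]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


def make_annotation(ann_id: int, image_id: int, category_id: int,
                    polygon: List[float]) -> Dict:
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "segmentation": [polygon],  # one polygon per object part
        "area": polygon_area(polygon),
        "bbox": polygon_bbox(polygon),
        "iscrowd": 0,
    }


# Hypothetical window outline on a hypothetical facade image.
window_polygon = [100.0, 200.0, 150.0, 200.0, 150.0, 280.0, 100.0, 280.0]
dataset = {
    "images": [{"id": 1, "file_name": "facade_0001.jpg",
                "width": 2000, "height": 3000}],
    "categories": [{"id": 1, "name": "window"}, {"id": 2, "name": "door"}],
    "annotations": [make_annotation(1, 1, 1, window_polygon)],
}
encoded = json.dumps(dataset)
```

The resulting dictionary can be written to a JSON file and loaded by tools that read COCO-style annotations.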

Figure 8. Detection and segmentation results. (a)(b)(c) are the results of the original Mask R-CNN and (d)(e)(f) are the results of the improved Mask R-CNN. Red boxes mark under-detections and blue boxes mark misdetections.

Table 1. The mAP (%) of three different fusion methods of improved CA module.
Table 1 shows that the mAP values of both the X/Y/C and XY/C fusion methods improved compared to the baseline. The mAP value of the XY/C fusion method is the highest in the regression and segmentation tasks, with 0.7% and 0.7% improvement over the baseline model, respectively. The results show that adding the CA module attentional feature extraction to the backbone enables the network to better extract the door and window objects and deliver them as RoIs to the downstream tasks in the subsequent RPN network.

Table 2. The AP values (%) of three different fusion methods of improved CA module on regression task.

Table 3. The AP values (%) of three different fusion methods of improved CA module on segmentation task.

Table 4. The mAP (%) for evaluating the effect of $N_r$ on two strategies.

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume X-1/W1-2023. ISPRS Geospatial Week 2023, 2-7 September 2023, Cairo, Egypt. This contribution has been peer-reviewed. The double-blind peer-review was conducted on the basis of the full paper.