EEI-NET: EDGE-ENHANCED INTERPOLATION NETWORK FOR SEMANTIC SEGMENTATION OF HISTORICAL BUILDING POINT CLOUDS

In recent years, conservation research on historical buildings and cultural relics has received considerable attention from both the state and the public. This attention has deepened the understanding of their historical value and cultural significance and has pushed conservation research into three dimensions. In this context, semantic segmentation of historical building components is particularly important, because it provides basic support for applications such as the study, repair and protection, and fine 3D reconstruction of historical buildings. However, most current methods for semantic segmentation of historical building point clouds do not fully exploit the local neighborhood information of the point cloud and segment edges poorly. We therefore propose a new deep learning-based semantic segmentation approach, which we call EEI-Net. It is an end-to-end deep neural network built around an edge enhancement interpolation (EEI) module and an edge interaction classifier (EIC). The EEI module performs edge enhancement interpolation by fusing multi-layer features between the encoder and decoder. The EIC enables the interaction of edge information through information transfer between individual nodes. EEI-Net incorporates contextual features and better preserves and enhances the edge information of the point cloud. Experiments on our constructed historical architecture dataset show that EEI-Net outperforms the RandLA-Net baseline in OA, mAcc and mIoU.


INTRODUCTION
Chinese culture has a long history, and China's historical architecture, exquisite in style and scientific in construction, both embodies the wisdom of ancient craftspeople and constitutes a valuable heritage. However, with the passage of time and the impact of the environment, only a few ancient wooden structures have been preserved today (Chun et al., 2015). This makes it particularly urgent to strengthen research on the conservation of historical buildings in China.
With the rapid development of 3D sensors, the conservation of historical buildings is gradually shifting from traditional to digital methods. 3D point cloud technology captures realistic dimensions and architectural details of historical buildings and provides precise geometric coordinates (X, Y, Z) in the form of millions of points, which has made it one of the most effective ways to document the shape of cultural heritage (Yang et al., 2020). Airborne laser scanning (ALS) (Elsner et al., 2018), mobile laser scanning (MLS) (Zhang et al., 2019), terrestrial laser scanning (TLS) (Zhu et al., 2021) and unmanned aerial vehicle (UAV) photogrammetry (Poli and Caravaggi, 2013) have become the most popular methods for collecting urban 3D point clouds and can be applied to both indoor and outdoor scenes. In this paper, we construct a point cloud dataset of traditional historical buildings using MLS. Compared with images, the point cloud provides the real coordinates of historical buildings and is not affected by lighting conditions or image distortion. In addition, the dataset shows the spatial relationships between the components of historical buildings more clearly.
In recent years, researchers have proposed many segmentation methods for 3D point clouds. The irregular and unordered structure of 3D point clouds remains one of the biggest challenges for 3D feature extraction and subsequent semantic segmentation (Xie et al., 2020; Bello et al., 2020; Cheng et al., 2021; Chen et al., 2021a), so researchers continue to investigate more effective feature extraction methods.
Traditional point cloud semantic segmentation methods are based on machine learning techniques such as support vector machines and random forests. These methods use hand-designed features to identify semantic information in the point cloud. While they perform well on specific tasks, the limitations of hand-designed features make it difficult for them to produce good results on larger and more complex datasets. As a result, researchers have turned their attention to deep learning-based approaches. Currently, most point cloud feature extraction methods and their corresponding semantic segmentation methods fall into three types: projection-based methods (Milioto et al., 2019; Lyu et al., 2020), voxel-based methods (Le et al., 2018; Meng et al., 2019) and point-based methods (Triess et al., 2020). Since both projection-based and voxel-based methods may lose information during projection or voxelization, most researchers have focused on point-based methods (Chen et al., 2021b; Qian et al., 2021; Qian et al., 2022). Among them, PointNet (Qi et al., 2017a) and PointNet++ (Qi et al., 2017b) are considered pioneering works. PointNet operates directly on the point cloud, but because it learns per-point features independently, it ignores the local contextual information between points. Subsequent work has extended it in various ways to improve the acquisition of local information from point clouds. For example, DGCNN (Wang et al., 2019) proposes an edge convolution (EdgeConv) for learning edge features, which extracts features from each center point and from the edge vectors between the center point and its K nearest neighbors (KNN). While these methods achieve better performance in semantic segmentation, most of them are limited to very small 3D point clouds as network input and do not scale directly to larger point clouds. RandLA-Net (Hu et al., 2020) was therefore proposed to accommodate large-scale point cloud scenes. It uses an efficient random sampling strategy and local spatial location encoding, which achieve high segmentation accuracy and processing speed, but it may lose critical contextual information during downsampling. Consequently, most researchers have turned to new backbone structures and to different ways of improving the semantic segmentation accuracy of point clouds.
In recent years, attention-based approaches have flourished (Feng et al., 2020), and most backbones employ some type of attention mechanism, which automatically learns important local features by assigning larger weights to key information. Existing attention mechanisms include channel attention, spatial attention, self-attention and multi-attention. In particular, self-attention has shown excellent performance in image analysis (Hu et al., 2019) and point cloud processing (Guo et al., 2021). In point cloud processing (Guo et al., 2021), the self-attention mechanism is used to establish the relationship between the center point and all points in the global space. For historical architectural heritage, we want to separate the upper parts of windows, squares and columns, and especially their edges, cleanly from the walls, so more attention must be paid to learning local neighborhood features.
More semantic segmentation methods based on deep learning have also been proposed in the field of historical architectural heritage. Dong et al. (Ji et al., 2021) modified DGCNN to segment MQDOA roofs. Matrone et al. (2020) compared machine learning methods with deep learning methods for large 3D artifact classification and proposed a cultural heritage point cloud semantic segmentation architecture, DGCNN-Mod+3Dfeat, that combines the advantages of both. Pierdicca et al. (2020) improved DGCNN by adding meaningful features such as normal vectors and colors, but the framework was unable to evaluate the accuracy performance of the acquisition technique and needs further improvement. Although all these methods improve the segmentation accuracy of historical building point clouds, the accuracy for column components remains lower than for the other components, and all of them acquire local features and edge information unsatisfactorily.
Our major contributions in this work are as follows:
(1) To improve the performance of semantic segmentation of 3D point clouds, we propose a new semantic segmentation method that redesigns the framework structure of 3D point cloud semantic segmentation. EEI-Net extracts differentiated semantic features and predicts smoother results.
(2) We design an Edge Enhancement Interpolation Encoder-Decoder (EEI-ED) for efficient feature extraction, which reduces the semantic gap between the encoder and decoder through edge enhancement interpolation. With this design, multi-layer features from the encoder and features in the decoder are interpolated with edge enhancement, which strengthens the interaction among multi-layer features.
(3) We design an Edge interactive classifier (EIC), which enhances the context-awareness of points through information transfer between nodes for better label prediction.

METHODS
In this paper, we propose an edge-enhanced interpolation network for semantic segmentation of 3D point clouds, which consists of an Edge Enhancement Interpolation Encoder-Decoder (EEI-ED) and an Edge interactive classifier (EIC). The model uses a U-shaped encoder-decoder structure. The encoder reduces the number of points by a factor of 4 through Random Sampling (RS) and expands the dimensionality of each point feature in five consecutive layers through the Local Feature Aggregation (LFA) module (Hu et al., 2020). The corresponding decoder increases the number of points through upsampling (US) and compresses the dimensionality of the point features in five connected layers using MLPs. To handle the semantic gap between encoder and decoder, we design an edge enhancement interpolation module in the decoder to obtain more discriminative features. The overall structure is shown in Figure 1.
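The downsampling schedule of the encoder can be illustrated with a short sketch. It assumes that random sampling with ratio 4 follows each of the five encoder layers and uses the 40960-point training input stated in the experiment details; the function names are hypothetical and the LFA module itself is not modeled.

```python
import numpy as np

def encoder_point_counts(num_points, num_layers=5, ratio=4):
    """Number of points kept after each random-sampling step (assumed: one per layer)."""
    counts = [num_points]
    for _ in range(num_layers):
        counts.append(counts[-1] // ratio)
    return counts

def random_sample(points, ratio, rng):
    """Keep a uniformly random subset containing 1/ratio of the points."""
    keep = points.shape[0] // ratio
    idx = rng.choice(points.shape[0], size=keep, replace=False)
    return points[idx]

counts = encoder_point_counts(40960)
print(counts)  # [40960, 10240, 2560, 640, 160, 40]

rng = np.random.default_rng(0)
cloud = rng.random((40960, 3))          # placeholder coordinates, not real data
sampled = random_sample(cloud, 4, rng)  # shape (10240, 3)
```

Under these assumptions the deepest encoder layer sees only 40 points per sample, which illustrates why information lost during sampling has to be recovered in the decoder.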

Edge enhancement interpolation module
The purpose of edge enhancement interpolation is to strengthen the exchange of information among multi-layer features between the encoder and the decoder. As shown in Figure 2, the features of each decoder layer are augmented with multi-layer features from the encoder. We first apply a shared MLP to the input features to compress their dimension to D1 and perform weighted interpolation on the points of the intermediate features. We then obtain the output features with a convolution operation and fuse and enhance the features of different layers by point-wise multiplication. Next, we compute the absolute difference between the fused features and the output features of the (j-1)-th encoder layer, and finally multiply the high-resolution features by this difference element by element to enhance their expressiveness. This introduces more high-frequency information while keeping the feature space consistent. The enhanced features f_enhance and the original features f_encoder_list[-j-2] are then added to form the final enhanced feature representation. Overall, the module combines edge-aware upsampling with edge enhancement, which better preserves and enhances the edge information of the point cloud and improves segmentation quality. The full operation is expressed in Eq. (1), whose modulation feature links the l-th encoder layer and the l-th decoder layer. Because the method uses multiple layers of encoder features together with the decoder features, the process is called edge-enhanced interpolation (EEI). EEI captures the interaction information of multi-layer edge features and achieves local contrast enhancement of the features using convolution, weighted interpolation and a residual connection. To further attenuate the effect of spatial indistinguishability, we process the intermediate features with a Feature Enhancement (FE) module, implemented as a simple shared MLP. Finally, we fuse the modulated encoder features with the decoder features. This edge enhancement interpolation mechanism is expressed in Eq. (2).
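The sequence of operations described for the edge enhancement interpolation module can be sketched in NumPy. This is only one possible reading of the prose, not the authors' implementation: it assumes the shared MLP is applied to the coarse decoder features, that the weighted interpolation is inverse-distance weighting over the 3 nearest coarse points, that the convolution is a 1x1 (per-point linear) layer, and that the encoder features already have D1 channels. All function and parameter names are hypothetical.

```python
import numpy as np

def shared_mlp(x, w, b):
    """Per-point linear layer followed by ReLU (weights shared across points)."""
    return np.maximum(x @ w + b, 0.0)

def idw_interpolate(coarse_xyz, coarse_feat, fine_xyz, k=3, eps=1e-8):
    """Assumed weighted interpolation: inverse-distance weights over k nearest coarse points."""
    d = np.linalg.norm(fine_xyz[:, None, :] - coarse_xyz[None, :, :], axis=-1)  # (Nf, Nc)
    idx = np.argsort(d, axis=1)[:, :k]                                          # (Nf, k)
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + eps)
    w = w / w.sum(axis=1, keepdims=True)
    return (coarse_feat[idx] * w[..., None]).sum(axis=1)                        # (Nf, C)

def edge_enhanced_interpolation(fine_xyz, f_encoder, coarse_xyz, f_decoder, params):
    f = shared_mlp(f_decoder, params["w_compress"], params["b_compress"])  # compress to D1
    f_up = idw_interpolate(coarse_xyz, f, fine_xyz)                        # upsample to fine points
    f_out = f_up @ params["w_conv"] + params["b_conv"]                     # 1x1 convolution
    f_fused = f_out * f_encoder                                            # point-wise fusion
    f_diff = np.abs(f_fused - f_encoder)                                   # absolute difference
    f_enhance = f_encoder * f_diff                                         # enhance high-res features
    return f_enhance + f_encoder                                           # residual addition

rng = np.random.default_rng(0)
fine_xyz, coarse_xyz = rng.random((64, 3)), rng.random((16, 3))
f_encoder, f_decoder = rng.standard_normal((64, 8)), rng.standard_normal((16, 32))
params = {
    "w_compress": rng.standard_normal((32, 8)) * 0.1, "b_compress": np.zeros(8),
    "w_conv": rng.standard_normal((8, 8)) * 0.1, "b_conv": np.zeros(8),
}
out = edge_enhanced_interpolation(fine_xyz, f_encoder, coarse_xyz, f_decoder, params)
print(out.shape)  # (64, 8)
```

The absolute difference acts as a local contrast term: where the fused features deviate strongly from the encoder features, the encoder features are amplified before the residual addition.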

Edge interactive classifier
In previous work, the classifier generates per-point semantic labels independently through MLPs implemented as fully connected layers. A fully connected layer consists of a linear transformation and a nonlinear activation function; however, the nonlinear activation can lead to inconsistent predictions among neighboring points. We therefore introduce the edge interaction module, which uses the neighboring-node information of each node and passes information from the current node to each of its neighbors, enabling edge-to-edge interaction. The edge interactive classifier module is illustrated in Figure 3. Edge features are obtained by matching each point's neighbors with the feature maps, as expressed in Eq. (3), where W is the learnable weight, ReLU denotes the ReLU activation function, and the edge feature is taken between the i-th point and its j-th neighboring point. An attention matrix is then computed for each node from the original features and the edge features, capturing the relationship between the node and its neighboring nodes (Eq. (4)), and the attention weight matrix is derived from it. Edge-to-edge interaction is achieved by multiplying the attention weights by f_neigh, which yields the neighbor representation of each node and transfers information from the current node to each neighboring node (Eq. (5)). Eq. (5) involves the aggregated features of node i, the feature matrix of neighbor node j, and the attention coefficient between node i and neighbor node j. Finally, we apply max pooling to select the most important features and add them to the outputs of the preceding convolutional layers to form the fused features. This module is embedded in the last two MLPs to capture edge-to-edge interaction features and improve contextual information. The cross-entropy loss between the predicted and true labels is used as the loss function of EIC.
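A minimal NumPy sketch of this attention-weighted neighbor interaction follows. Because the exact forms of the edge feature and attention score are not recoverable from the text, the sketch makes explicit assumptions: the edge feature is ReLU applied to a learned projection of the feature difference between a neighbor and the center point, the attention score is the dot product of the center feature with the edge feature, and the weights are normalized with a softmax over the K neighbors. The names `knn_indices`, `edge_interaction` and `w_edge` are hypothetical.

```python
import numpy as np

def knn_indices(xyz, k):
    """Indices of the k nearest neighbors of every point (brute force, self excluded)."""
    d = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, 1:k + 1]

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def edge_interaction(xyz, feat, w_edge, k=16):
    idx = knn_indices(xyz, k)                                      # (N, k)
    f_neigh = feat[idx]                                            # (N, k, C)
    # Assumed edge feature: ReLU(W (f_j - f_i)).
    edge = np.maximum((f_neigh - feat[:, None, :]) @ w_edge, 0.0)  # (N, k, C)
    # Assumed attention score: dot product of center feature and edge feature.
    scores = (edge * feat[:, None, :]).sum(axis=-1)                # (N, k)
    alpha = softmax(scores, axis=1)                                # attention coefficients
    messages = alpha[..., None] * f_neigh                          # weighted neighbor features
    pooled = messages.max(axis=1)                                  # max pooling over neighbors
    return feat + pooled, alpha                                    # add to previous features

rng = np.random.default_rng(0)
xyz = rng.random((32, 3))
feat = rng.standard_normal((32, 8))
w_edge = rng.standard_normal((8, 8)) * 0.1
fused, alpha = edge_interaction(xyz, feat, w_edge, k=16)
print(fused.shape, alpha.shape)  # (32, 8) (32, 16)
```

Because each point's output depends on its K neighbors, adjacent points receive correlated features, which is the property the classifier relies on to produce consistent labels along component boundaries.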

Experiment details
Our experiments were conducted on a single NVIDIA GeForce RTX 3060 Ti GPU and an Intel Core i7-10700K CPU @ 3.80 GHz × 16, and were implemented in TensorFlow on a server configured with CUDA 11.3 and cuDNN 8.2.1. We use the Adam optimizer with default parameters. The initial learning rate is set to 0.01 and decreases by 10% after each epoch. We trained for 100 epochs on our own dataset, with the batch size set to 2 and the number of nearest points K set to 16. During training, a fixed number of points (40960) is sampled from each point cloud and fed into the network, while the entire point cloud is fed in during testing.
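The learning rate schedule and the fixed-size training input can be sketched as follows. The sketch assumes the 10% decrease is multiplicative (each epoch's rate is 0.9 times the previous one) and that clouds smaller than 40960 points are sampled with replacement; both are assumptions, and the helper names are hypothetical.

```python
import numpy as np

def learning_rate(epoch, initial_lr=0.01, decay=0.9):
    """Learning rate after `epoch` completed epochs, assuming multiplicative decay."""
    return initial_lr * decay ** epoch

def sample_fixed(points, num=40960, rng=None):
    """Draw exactly `num` points; sample with replacement only if the cloud is smaller."""
    if rng is None:
        rng = np.random.default_rng()
    n = points.shape[0]
    idx = rng.choice(n, size=num, replace=n < num)
    return points[idx]

lr_first = learning_rate(0)    # 0.01 at the start of training
lr_last = learning_rate(99)    # rate used in the 100th epoch, below 1e-6
batch = sample_fixed(np.zeros((50000, 6)), rng=np.random.default_rng(0))  # (40960, 6)
```

Under the multiplicative reading, the rate in the final epoch is several orders of magnitude smaller than the initial rate, so most of the learning happens in the early epochs.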

Dataset
This study relies on actual historical architectural heritage and uses mature techniques such as UAV tilt photography and 3D laser scanning to obtain complete internal and external mapping data of historical buildings. We reviewed relevant literature to understand the basic composition of historical building structures and used a variety of software to process and label the data, establishing the historical building point cloud dataset.
Two measurement areas were selected for this study. The building in Experimental Data 1 was built in the Ming Dynasty and is located in Tucheng, outside Deshengmen, Chaoyang District. It is both a landmark on the northern extension of the central axis of Beijing and a Beijing-level cultural relics protection unit. The building in Experimental Data 2 was built in early 1908 and is located in the southwest of the Qing agricultural test site; it is a typical Chinese classical building.
We use UAV tilt photography and 3D laser scanning to acquire the interior and exterior point cloud data and the roof data, and thereby obtain the complete historical building scene. CloudCompare is used to align the generated roof point cloud with the scanned point cloud and fuse them into a complete historical building point cloud that includes both interior and exterior structures. To control the amount of data, we apply spatial subsampling at 0.01 m, following a study on European architectural heritage; this density is sufficient for the neural network to learn effective structural features from the point cloud. The final dataset contains 20,182,330 points, each with coordinates and colors. The data within the required experimental range is then cropped for subsequent annotation. The complete point cloud after fusion is shown in Figure 4, and the manual annotation of the historical building point clouds is shown in Figure 5 and Figure 6. To evaluate the generalization ability of the segmentation network on historical building data, we cut both historical buildings along the central axis of the scene into a left part and a right part, giving four parts in total for a four-fold cross-validation experiment. Each point cloud scene is a complete point cloud containing both external and internal structures, labelled with six semantic categories: roof, column, square, door, window, and wall.
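The preprocessing steps above can be approximated in a few lines of NumPy. Note the substitution: CloudCompare's spatial subsampling enforces a minimum distance between retained points, whereas this sketch uses a simpler voxel grid that keeps one point per 0.01 m cell. The split assumes the central axis is a plane of constant x, which is an illustrative simplification, and all function names are hypothetical.

```python
import numpy as np

def voxel_subsample(points, voxel=0.01):
    """Keep the first point that falls into each voxel of side `voxel` (in metres)."""
    keys = np.floor(points[:, :3] / voxel).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(first)]

def split_along_axis(points, axis_x):
    """Split a building into left and right parts at the plane x = axis_x (assumed)."""
    mask = points[:, 0] < axis_x
    return points[mask], points[~mask]

def four_fold_splits(parts):
    """Yield (training parts, test part), using each part once as the test set."""
    for i, test in enumerate(parts):
        yield [p for j, p in enumerate(parts) if j != i], test

demo = np.array([[0.000, 0.000, 0.000],
                 [0.001, 0.002, 0.003],   # same 0.01 m voxel as the first point
                 [0.500, 0.000, 0.000]])
thinned = voxel_subsample(demo)                 # two points remain
left, right = split_along_axis(thinned, 0.25)   # one point on each side
folds = list(four_fold_splits(["A_left", "A_right", "B_left", "B_right"]))
```

Each of the four folds trains on three half-buildings and tests on the remaining one, so every point in the dataset is used for testing exactly once.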

Semantic Segmentation Results
In this paper, we compare RandLA-Net with our network on the historical architecture dataset using four-fold cross-validation, in order to fully evaluate the semantic segmentation performance of EEI-Net. We used two evaluation settings: (a) region 1 for testing and regions 2 to 4 for training, then each remaining region in turn for testing with the other regions for training; (b) k-fold cross-validation (k=4). We used overall accuracy (OA), mean class accuracy (mAcc) and mean intersection over union (mIoU) as standard metrics, defined as follows:
$$\mathrm{OA} = \frac{1}{N}\sum_{i=1}^{k} TP_i \tag{6}$$
$$\mathrm{mAcc} = \frac{1}{k}\sum_{i=1}^{k} \frac{TP_i}{TP_i + FN_i} \tag{7}$$
$$\mathrm{mIoU} = \frac{1}{k}\sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i + FN_i} \tag{8}$$
where $k$ is the number of categories, $TP_i$ is the number of correctly classified positive samples of category $i$, $FP_i$ is the number of incorrectly classified positive samples, $FN_i$ is the number of incorrectly classified negative samples, and $N$ is the total number of points. We adopted four-fold cross-validation to evaluate our method, and the experimental results are shown in Table 1 and Figure 7.
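The three metrics can be computed from a confusion matrix, as in the sketch below. It uses the standard definitions of OA, mAcc and mIoU, with rows as ground-truth categories and columns as predicted categories; the function name and the toy matrix are illustrative assumptions, not results from the paper.

```python
import numpy as np

def segmentation_metrics(conf):
    """Return (OA, mAcc, mIoU) from a k x k confusion matrix (rows: truth, cols: prediction)."""
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)              # correctly classified points per category
    fn = conf.sum(axis=1) - tp      # points of the category assigned elsewhere
    fp = conf.sum(axis=0) - tp      # points wrongly assigned to the category
    oa = tp.sum() / conf.sum()
    macc = np.mean(tp / (tp + fn))
    miou = np.mean(tp / (tp + fp + fn))
    return oa, macc, miou

# Toy two-category example: 20 points, 17 of them classified correctly.
oa, macc, miou = segmentation_metrics([[8, 2],
                                       [1, 9]])
# oa = 0.85, macc is approximately 0.85, miou is approximately 0.7386
```

The sketch assumes every category occurs in the ground truth; a category with no points would produce a division by zero and needs to be masked out before averaging.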

CONCLUSIONS
Historical building point clouds have complex geometry, yet point cloud datasets of historical building scenes are very limited, which is a key problem to be solved. We therefore construct a dataset of historical buildings in this paper. We also propose EEI-Net to address semantic segmentation under the particular characteristics of historical building point clouds. The network consists of an edge-enhanced interpolation encoder-decoder and an edge interaction classifier designed to fully explore local neighborhood and edge features. The edge enhancement interpolation module performs edge enhancement interpolation by fusing multi-layer features between the encoder and decoder. The edge interaction classifier enables the interaction of edge information through information transfer between individual nodes. EEI-Net incorporates contextual features and better preserves and enhances the edge information of point clouds. The framework makes the features more discriminative and outperforms RandLA-Net on the historical architecture dataset.
However, some problems remain, such as incomplete and imbalanced categories. In addition, the work relies on fully labelled data, and as large point cloud datasets become more common, annotation costs rise accordingly. In future work, we will therefore continue to improve accuracy while reducing the proportion of annotated points, to achieve better performance with lower annotation cost and effort.

Figure 1. Network architecture of the EEI-Net.

Figure 3. The schematic diagram of the edge enhancement interpolation module.

Figure 7. Qualitative results of this method on the historical architecture dataset.

Table 1 shows that the proposed EEI-Net outperforms the original RandLA-Net in OA, mAcc and mIoU. As shown in Figure 7, the regions marked with red boxes indicate that our method segments the scene more smoothly, especially at object boundaries.

Table 1. Comparison of edge contour feature extraction methods.