PAN-SUNET: UTILITY CORRIDOR UNDERSTANDING USING SPATIAL LAYOUT CONSISTENCY

ABSTRACT: The article addresses the need for a dependable and efficient computer vision system to examine utility networks with minimal human intervention, given the deteriorating state of these networks. To classify the dense and irregular point clouds obtained from the airborne laser terrain mapping (ALTM) system used for data collection, we propose a deep learning network named Panoptic-Semantic Utility Network (Pan-SUNet). The proposed network incorporates three networks to achieve voxel-based semantic segmentation and 3D object detection of the point clouds at various resolutions, including object categories in three dimensions, and predicts two-dimensional regional labels to differentiate utility and corridor regions from non-corridor regions. The network also ensures spatial layout consistency in the prediction of the voxel-based 3D network using regional segmentation. By testing the proposed approach on 67 km² of utility corridor data with an average density of 5 pts/m², the paper demonstrates the effectiveness of the technique. The proposed network outperforms the state-of-the-art baseline network, achieving an F1 score of 94% for the pylon class, 99% for the ground class, 99% for the vegetation class, and 99% for the powerline class. It also shows high performance in 3D object detection for pylons and spans, achieving average precision of 99% and 92%, respectively.


INTRODUCTION
To protect and enhance the economy, it is necessary to ensure the durability and resilience of the utility grid. This requires conducting safe, effective, and precise inspections of the utility network. Traditionally, ground crews have been deployed to conduct these inspections, which involve managing vegetation encroachment and inspecting the physical condition of the infrastructure (Kim and Sohn, 2010). In recent years, unmanned aerial vehicles (UAVs) have emerged as a cost-effective alternative to traditional data acquisition methods, but there remain significant challenges in using UAVs to collect data across entire utility corridors due to strict flying regulations, limited flight time, and constrained spatial coverage. As a result, Airborne Laser Terrain Mapping (ALTM) is still used as the primary data collection platform (Zhou et al., 2019, Pu et al., 2019). However, the process of labeling semantic features in point clouds using visual perception tasks is still challenging, expensive, and prone to errors (Kim and Sohn, 2013). Therefore, there is a significant need to automate post-data acquisition procedures to reduce user involvement and improve efficiency (Jwa et al., 2009, Wang et al., 2017).
Recent advances in deep neural networks (DNNs) have shown significant improvement in computer vision tasks. There have been successful designs of DNNs for semantic segmentation using point clouds, such as PointNet (Qi et al., 2016), PointNet++ (Qi et al., 2017), and KPConv (Thomas et al., 2019), and for object detection using point clouds and 2D images. These vision tasks have been integrated into panoptic frameworks to better interpret all visual features of the complete scene. However, these networks have not fully exploited the spatial arrangement of infrastructure, especially for utility corridors, nor have they embedded spatial layout consistency for global context. Previous work has not modeled the 3D transmission line network as a span-to-pylon relationship. Therefore, this research proposes a network with hierarchical spatial regularity that can be generalized to standard-layout panoptic segmentation problems and generate a network of object-to-object relationships. This research carefully examines utility corridors and unravels spatial layout consistency, identifying hierarchies of regions (utility, corridor, and non-corridor) and object classes such as ground, towers, power lines, and vegetation, as shown in figure 1. The proposed Pan-SUNet extends SUNet (Jameela and Sohn, 2023), which dealt with only two regions, by dividing the task of extracting three regions into a 2D pipeline. This pipeline segments the 2D bird's-eye view (BEV) into two regions (corridor and non-corridor) and detects the pylons and spans in the utility region. Three regions are generated by fusing the outputs from the segmentation and object detection heads. The three-region fusion facilitates 3D segmentation through a loss-based late fusion module. The work also uses 2D object detection of pylons and spans to project them into the third dimension using segmentation results, refine the segmentation based on the detected 3D objects (pylons and spans), and assign them instance IDs. In the upcoming section, we will delve into the related work on spatial layout consistency, utility corridors, and computer vision tasks. The methodology provides a detailed explanation of the proposed system, while the experiments showcase the novelty and effectiveness of the concept.

LITERATURE REVIEW
This section discusses the spatial layout consistency in utility corridors and reviews the techniques of semantic segmentation, object detection and panoptic semantic segmentation.

Spatial Layout
In a number of disciplines, including cognitive science, architecture, and civil engineering, the idea of how objects are arranged in a scene and their interrelationships has been investigated. Decision-making is aided by the relationships between objects, which give contextual information. In order to learn global context, it is essential to include spatial recurring patterns. The performance of tasks like railway lane extraction, road lane recognition, and 3D building modelling has been enhanced by the use of spatial arrangement (Jeon and Kim, 2019). Spatial interactions are crucial for identifying small objects that can be overlooked, according to previous studies (Rosman and Ramamoorthy, 2011). This serves as the foundation for our investigation, in which we use the network to highlight the significance of embedding spatial consistency.

Utility Corridor Layout
Electric hydro companies around the world follow guidelines for setting utility transmission zones, which are designed to address safety concerns related to infrastructure, residential areas, and vegetation (National Grid Transco UK, Wales, and USA, 11 12 2022). These zones consist of the utility zone, the corridor zone, and the non-corridor zone, each with specific regulations on the size and type of vegetation allowed (Electric Power Research Institute (EPRI), 2012). Existing literature on hierarchical relationships in visual perception, such as detecting human motion (Toshev and Szegedy, 2014), was examined to establish a baseline for the neural network.

Semantic Segmentation
In recent years, deep learning has played a significant role in semantic segmentation for 3D point clouds, with research focusing on intrinsic, extrinsic, and deep features to classify each point with an enclosing object (Liu et al., 2009). Traditional approaches for utility corridor segmentation have been purely geometric, but they have limitations such as the need for extensive preprocessing and domain expertise (Jung et al., 2020, Jwa et al., 2009). Machine learning algorithms such as support vector machines and decision trees have been used to classify utility objects, but they have limitations when applied to large-scale datasets (Jwa and Sohn, 2010, Wang et al., 2017, Kim and Sohn, 2010, Kim and Sohn, 2013, Pu et al., 2019). Deep learning, on the other hand, offers the ability to learn features automatically, which has allowed the development of generalizable solutions. Various deep learning-based segmentation networks have been proposed, including PointNet (Qi et al., 2016), PointNet++ (Qi et al., 2017), KPConv (Thomas et al., 2019), and RandLA (Hu et al., 2019). However, none of these methods has taken advantage of the spatial regularity found in utility infrastructure. Pan-SUNet, an extension of SUNet, has been proposed to address this limitation by fusing regions from the regional networks, providing spatial guidelines that improve performance.

2D Object Detection
Object detection is a process of classifying and locating objects in a given scene. Traditional techniques have been replaced by deep learning-based neural networks, with two main types of architectures: single-stage (Redmon and Farhadi, 2018) and two-stage detectors. While single-stage detectors are computationally efficient, they can be less accurate than two-stage detectors due to the coarse-to-fine process. Most object detection networks can only handle horizontal bounding boxes, which is problematic for real-world scenes with non-horizontal objects. To address this, rotated object detection methods (Lang et al., 2021), such as Oriented R-CNN (Xie et al., 2021) and Rotated Faster R-CNN (Yang et al., 2020), have been developed to use oriented bounding boxes, and have shown comparable performance to horizontal bounding box detectors while reducing background errors.

Panoptic Segmentation
Our work is based on panoptic segmentation, which separates scenes into "stuff" (e.g., ground, sky, vegetation) and "things" (e.g., objects like cars, pylons, and spans) using a combination of semantic segmentation and object detection. EfficientLPS (Sirohi et al., 2021) is an extension of a 2D panoptic segmentation network that fuses instance and semantic segmentation across the entire scene. However, this method lacks fusion across different dimensions, specifically for airborne LiDAR point clouds.
Our research proposes a 2D panoptic segmentation approach, tailored to utility cases, that focuses on detecting pylons and spans as well as segmenting regions. Additionally, it projects the predictions into a 3D pipeline to obtain pylon instances and segmentation labels to model the utility corridor.

METHODOLOGY
Pan-SUNet is a novel network that takes multi-dimensional input and uses multi-resolution deep neural networks embedding spatial regularities between the regions and objects of interest. It comprises three different classifiers: a two-dimensional regional prediction network (Ronneberger et al., 2015) and a two-dimensional object detection network (Xie et al., 2021) collaborate to generate utility-region-based segmentation masks and detect objects that constrain and refine the predictions of a three-dimensional network through regional fusion, loss-based late fusion, and panoptic fusion.

3D Pipeline
The 3D semantic segmentation pipeline is a major component of our system design. It can use any existing deep learning semantic segmentation network as a baseline. For our network, we utilized a 3D-convolution-based multi-resolution encoder-decoder network with skip connections and an additive attention module in the decoder. It facilitates incremental learning on feature maps of size H/2^l × W/2^l × D/2^l × 32l at resolution level l and aggregates features to output the probability of semantic labels for the 3D object classes. The segmentation head generates a confidence score for each class, which is then passed through a loss-based late fusion module to refine and constrain the predictions using spatial layout consistency and back-propagate the loss to better learn deep features.
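The additive attention in the decoder can be sketched as follows; this is a minimal illustration of an attention gate (in the spirit of attention U-Nets), not the exact Pan-SUNet implementation, and all weight names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def additive_attention_gate(x, g, W_x, W_g, psi):
    """Additive attention gate: decoder features x are re-weighted by
    coefficients computed from x and the coarser gating signal g."""
    q = np.maximum(x @ W_x + g @ W_g, 0.0)  # ReLU(W_x x + W_g g)
    alpha = sigmoid(q @ psi)                # (N, 1) coefficients in (0, 1)
    return x * alpha                        # element-wise re-weighting

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                 # 4 voxels, 8 channels (toy sizes)
g = rng.normal(size=(4, 8))
W_x, W_g = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
psi = rng.normal(size=(8, 1))
out = additive_attention_gate(x, g, W_x, W_g, psi)
```

Because the coefficients lie strictly in (0, 1), the gate can only attenuate features, letting the decoder suppress responses inconsistent with the coarser context.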

Panoptic Segmentation
The 3D semantic segmentation network's output and the object detections of pylons and spans from the 2D pipeline are combined by this module, which leverages semantic spatial layout consistency. Our work focuses on utility regions and key objects like pylons and spans, which are detected in the 2D BEV and projected onto the 3D voxel space using a simple process. The panoptic segmentation module estimates the height of pylons and spans based on the 3D semantic segmentation prediction, enabling the conversion of 2D bounding boxes to 3D by adding the z-axis. While our work emphasizes semantic segmentation, panoptic segmentation generates instance IDs for pylons and spans and fine-tunes segmentation results by adjusting misclassifications within the 3D bounding boxes. In figure 2, pylon and span instances are color-coded according to their instance IDs.

2D BEV Pipeline
Making spatially accurate predictions requires integrating a broader receptive field and global context into a semantic segmentation network, which is a considerable challenge. Human semantic perception of a scene strongly depends on our capacity for understanding the larger context. Our three-dimensional segmentation network is only capable of encoding the local context and does not have the coarser scene information required to encode the spatial layout and global object associations. In order to solve this problem, we employ a 2D BEV pipeline that fuses in the missing information using a loss-based late fusion module. The 2D pipeline consists of two classifiers: a regional semantic segmentation network and a two-dimensional object detection network.

Regional Semantic Segmentation
We can use any simple 2D segmentation network for regional segmentation. In our specific design, we use a U-shaped encoder-decoder that takes a bird's-eye view (BEV) representation of the complete 3D point cloud and generates regional class probabilities of shape W × H × Cp. Regional segmentation and object detection share an encoder to learn shared features.

Object Detection
To take advantage of the regional semantic layout of utility corridors, we added a separate module for 2D object detection with a shared encoder. The utility community divides layouts into three regions: a utility region with pylons and wires, a corridor region with a buffer around the utility region that can contain 3-5 meter tall trees, and a non-corridor region that can have tall trees and buildings. Differentiating between the corridor and non-corridor regions can be difficult when non-corridor regions do not have extremely tall trees. Therefore, we merged the corridor and utility regions and performed regional segmentation to generate segmentation masks. We detected span and pylon objects using a shared encoder, and these detections were used to update the regional segmentation mask with the utility corridor class. These regionally fused segmentation masks can improve the prediction of the 3D objects of interest. Our oriented object detection network is used to detect span and pylon objects. It is a two-stage object detection network that uses a shared encoder and consists of an oriented region proposal network and a regional classifier and regressor. The region proposal network takes an encoded feature map and generates a regional proposal vector (x, y, h, w, δα, δβ) with a rotated RoI that is passed to stage II, where the R-CNN produces a classification label and the spatial location of the object as a rotated bounding box. The network is shown in figure 2.
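For illustration, decoding an oriented box into its corner points can be sketched as below. Note this uses the common (cx, cy, w, h, θ) parameterization rather than the midpoint-offset vector (x, y, h, w, δα, δβ) that the network regresses, and the function name is hypothetical:

```python
import numpy as np

def obb_corners(cx, cy, w, h, theta):
    """Corner points of an oriented bounding box, counter-clockwise,
    obtained by rotating the axis-aligned corners about the centre."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])                      # 2D rotation matrix
    local = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                      [w / 2, h / 2], [-w / 2, h / 2]])  # box-local corners
    return local @ R.T + np.array([cx, cy])

# axis-aligned sanity check: a 4 x 2 box centred at (10, 5)
corners = obb_corners(10.0, 5.0, 4.0, 2.0, 0.0)
```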

Fusion
The fusion module has four key components: a) regional fusion, b) logits interpolation, c) a hierarchical layout consistency loss function, and d) panoptic fusion.
3.3.1 Regional Fusion: As discussed in the section above, we have two classifiers; the segmentation classifier outputs a 2D mask with two regional classes, corridor and non-corridor, where the corridor region also has the utility region merged into it. The fusion performs the following steps to generate a three-class segmentation mask using the predicted oriented bounding boxes and the predicted regional segmentation.
• Step 3: Fuse the labels from the regional segmentation and the utility region label generated in the previous step, as shown in figure 4.
3.3.2 Logits Interpolation Module: A simple module that takes the projection matrix between the 3D voxels and the 2D BEV and converts the 2D logits of the regional network into a 3D H × W × D × CP tensor by exploiting the one-to-many relationship between the two representations.
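A minimal sketch of the regional fusion step is given below; for simplicity the detected boxes are treated as axis-aligned pixel bounds, whereas the actual system rasterizes oriented boxes, and the class codes are assumptions:

```python
import numpy as np

NON_CORRIDOR, CORRIDOR, UTILITY = 0, 1, 2

def fuse_regions(two_class_mask, boxes):
    """Overlay detected pylon/span boxes on the 2-class regional mask
    to recover the third, utility, class. Boxes are (x0, y0, x1, y1)
    pixel bounds here for illustration only."""
    fused = two_class_mask.copy()
    for x0, y0, x1, y1 in boxes:
        fused[y0:y1, x0:x1] = UTILITY   # box footprint becomes utility region
    return fused

mask = np.full((8, 8), NON_CORRIDOR)
mask[:, 2:6] = CORRIDOR                 # corridor band (utility merged in)
fused = fuse_regions(mask, [(3, 2, 5, 6)])
```

The fused three-class mask is what the logits interpolation module then lifts into the 3D voxel grid.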

Panoptic Fusion
The module takes as input the 2D object detections of pylons and spans, W × H × C and W × H × 6, uses the 3D semantic segmentation results W × H × D × C, and adjusts the z-axis to obtain 3D object detections of pylons and spans.
The semantic segmentation results for the pylon and powerline classes help in estimating the minimum and maximum heights of pylon and powerline points to adjust the 2D bounding boxes. The four corners p1, p2, p3, p4 and the height information zmin and zmax are used to estimate the 3D bounding boxes. These 3D bounding boxes then help in adjusting predictions and misclassifications within the 3D box, refining the semantic segmentation results, as seen in section 4.2.
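The 2D-to-3D lifting and in-box refinement described above can be sketched as follows, assuming a dense (H, W, D) label grid; the helper names are hypothetical:

```python
import numpy as np

def lift_box_to_3d(box2d, sem_labels, class_id):
    """Estimate the z-extent of a 2D box from 3D semantic predictions:
    collect voxels of the target class inside the box footprint and
    take their min/max z index as zmin/zmax."""
    x0, y0, x1, y1 = box2d
    region = sem_labels[y0:y1, x0:x1, :]          # (h, w, D) label block
    zs = np.nonzero(region == class_id)[2]
    if zs.size == 0:
        return None                               # no supporting voxels
    return (x0, y0, x1, y1, int(zs.min()), int(zs.max()))

def refine_inside_box(sem_labels, box3d, class_id):
    """Relabel everything inside the lifted 3D box to the detected
    class, suppressing isolated misclassifications."""
    x0, y0, x1, y1, z0, z1 = box3d
    out = sem_labels.copy()
    out[y0:y1, x0:x1, z0:z1 + 1] = class_id
    return out

labels = np.zeros((8, 8, 10), dtype=int)
labels[2:5, 2:5, 3:7] = 1                         # a toy "pylon" blob
box3d = lift_box_to_3d((2, 2, 5, 5), labels, 1)
refined = refine_inside_box(labels, box3d, 1)
```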
Figure 4: Regional fusion from 2D regional segmentation and utility segmentation mask.

Voxelization and BEV Projection
A voxel grid is produced by pre-processing the raw point cloud and computing a mean value of all the points that lie within each 3D voxel as the input representation for our segmentation network. Depending on the chosen voxel size, this voxel grid strikes a balance between efficiency and effectiveness. To make label projection simple, the network also keeps a projection matrix from the voxel grid to the unprocessed point cloud. A bird's-eye view (BEV), on the other hand, is a 2D depiction of a 3D point cloud. Our 2D BEV pipeline makes use of a 3D scene's XY-projection, in which each pixel corresponds to the points at a location. The XY-projection produces the best BEV for obtaining global context for region and object prediction. Our projection matrix between the 3D voxel grid and the BEV allows for projection compatibility between feature spaces to combine the 2D and 3D predictions, improving the use of spatial layout consistency.
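A minimal mean-point voxelization that keeps the point-to-voxel projection for label back-projection might look like this (an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def voxelize_mean(points, voxel_size):
    """Mean-point voxelization: group points into voxels and keep the
    point-to-voxel index so voxel labels can be pushed back to points."""
    idx = np.floor(points / voxel_size).astype(int)
    idx -= idx.min(axis=0)                              # shift grid to origin
    keys, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                       # flat point -> voxel map
    counts = np.bincount(inverse).astype(float)
    means = np.stack([np.bincount(inverse, weights=points[:, d]) / counts
                      for d in range(points.shape[1])], axis=1)
    return keys, means, inverse

pts = np.array([[0.2, 0.2, 0.1],
                [0.8, 0.4, 0.3],
                [1.5, 0.2, 0.2]])
keys, means, inv = voxelize_mean(pts, voxel_size=1.0)
```

The `inv` array plays the role of the projection matrix: predicting a label for voxel `k` labels every point `i` with `inv[i] == k`.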

EXPERIMENTS AND RESULTS
In our study, we performed a comparative analysis to highlight the importance of the hierarchical layout consistency. We conducted experiments on a test set that assessed the performance of our proposed method on four important classes for the utility industry: ground, pylon, powerline, and vegetation. These classes are crucial for predictive maintenance of utility networks.

Dataset
We used a Riegl Q560 laser scanner to gather data across a 67 km² region in Steamboat Springs, Colorado, in the United States. The acquired data were later split into train and test sets in order to conduct experiments. The first 8 km² of the dataset served as the test set, and the remaining data served for network training. The collection consisted of 67 non-overlapping scenes in total, each containing millions of points at an average density of 5 pts/m². Using Terrasolid's point cloud processing software (Team, 2023b), we manually labelled the data to produce our ground truth, which requires technical and industry experience. The training dataset contained five classes: ground, powerline, low vegetation, pylon, and medium-high vegetation. We combined the low vegetation class with the ground class since low vegetation covered most of the ground. We also named our regional classes based on the literature of the utility community.

2D Object Detection Groundtruth
To produce object detection labels for pylons and spans as horizontal bounding boxes, the MATLAB image labeling tool was used; for oriented bounding boxes, Labelme (Wada, 2021) was utilised. The performance of Pan-SUNet for detecting 2D pylon and span objects, as well as projecting them into 3D, was evaluated using average precision (AP), average recall (AR), pylon AP, and span AP at IoU=0.5. We selected state-of-the-art networks to compare performance on horizontal and oriented bounding boxes across two-stage and one-stage systems, such as Faster R-CNN, Rotated Faster R-CNN, Oriented R-CNN, DAFNe, and YOLOv3, and selected Oriented R-CNN as our baseline due to the smoothness and effectiveness of its results.
4.3.2 2D Regional Semantic Segmentation: We pre-trained our 2D regional prediction network on half of the scenes for global regional prediction of the spatial layout in order to prepare for our trials. According to the GPS time of the flight line, each scene was separated into four smaller scenes, each of which was then projected onto a 640 × 640 2D BEV grid with pixels measuring 1 m² and three feature channels per pixel: the standard deviation of elevation normalized to [0, 1], the standard deviation of elevation normalized to [0, 255], and the binarized standard deviation of elevation. The 2D network's input size was 640 × 640 × 3, while the output, which represented the confidence scores for the regional classes, was 640 × 640 × 3. In order to prevent overfitting, we pre-trained the network using k-fold cross-validation with a batch size of 1 for a total of 100 epochs. Random rotation, horizontal flip, and vertical flip were all used to augment the data. Training was done on two RTX 6000 GPUs and took between 4-5 hours, with inference taking about 30 seconds.

3D Semantic Segmentation
We constructed a voxel grid with a size of 640 × 640 × 448 and a voxel size of 1 m³ over each subscene. Each batch of 32 × 32 × 448 × 4 contained the highest elevation of the entire scene to give the network a comprehensive view and help it cope with vertical context more effectively. The number of returns, the number of occupancy points, and the absolute and relative elevation were all included in the feature channels. To choose these features, we used the feature engineering study presented by SUNet. A confidence score for the 3D classes is produced by Pan-SUNet as 32 × 32 × 448 × 5 (background, pylon, powerline, vegetation, and ground). The final prediction assigns the label with the greatest confidence score and projects voxel labels onto points using the projection matrix. Using two RTX 6000 GPUs, Pan-SUNet was trained for 100 epochs over the course of 48-60 hours, with inference requiring only a few minutes.

Evaluation Metrics
4.4.1 Object Detection: Two metrics, Average Precision (AP) and Average Recall (AR), are used to assess how efficiently object detection networks locate and identify objects in an image.
The area under a precision-recall or recall-precision curve, which accounts for the IoU threshold to separate true positives from false positives, is computed for both metrics.
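The AP computation described above can be sketched as follows, assuming detections have already been matched to ground truth at IoU ≥ 0.5; this is a plain area-under-the-PR-curve variant rather than any specific benchmark's interpolation scheme:

```python
import numpy as np

def average_precision(scores, matched, n_gt):
    """Area under the precision-recall curve. `matched[i]` is True when
    detection i hits an unclaimed ground-truth box at IoU >= 0.5;
    `n_gt` is the total number of ground-truth boxes."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # sort by confidence
    tp = np.asarray(matched, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / n_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # accumulate precision over each increment in recall
    ap = precision[0] * recall[0]
    for i in range(1, len(tp)):
        ap += precision[i] * (recall[i] - recall[i - 1])
    return float(ap)

# 3 detections, 2 ground-truth boxes, middle detection is a false positive
ap = average_precision([0.9, 0.8, 0.7], [True, False, True], n_gt=2)
```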

Pan-SUNet
In the paper, the performance of the semantic segmentation model is evaluated using F1 scores for four different classes: vegetation, powerline, pylon, and ground. The F1 score is calculated by taking the harmonic mean of the precision and recall for each class. Recall measures the proportion of true positives that are correctly identified, while precision measures the proportion of predicted positives that are actually true positives. By considering both precision and recall, the F1 score provides an overall measure of the model's performance that accounts for both accuracy and completeness.
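The F1 computation reduces to a few lines; the counts in the example are made up for illustration:

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall, computed from
    true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 90 correctly labelled pylon voxels, 10 spurious, 5 missed
score = f1_score(90, 10, 5)
```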

Results
We conducted a comparative study of 2D object detection, as shown in table 1, for selecting the baseline for 2D object detection. Our object detection results in table 2 show promising results in 2D and when projecting those results onto the third dimension. The results indicate comparable performance in 2D and 3D, facilitating the modeling of the utility corridor through the pylon-to-span relationship. We also evaluated Pan-SUNet's performance for semantic segmentation against various SUNet versions, Attention 3D, and pre-trained RandLA. Our results demonstrate that the regional fusion module in Pan-SUNet offers significant benefits in terms of higher recall and F1 score for the pylon class in table 3, highlighting the advantages of including spatial arrangement context on a global scale. We used a voxel-based network, which delivers comparable quality to point-based networks but with faster inference times (10x faster). Our network outperforms currently available commercial software tools that depend on network maps for predictive analysis or human labeling of powerlines and pylons. The visualization in figure 7 shows the contribution of each stage in the Pan-SUNet framework to modeling the 3D utility transmission line infrastructure.

CONCLUSION
Our research has shown that ambiguous regions can be given global context by using pre-tasks like object detection. We have also demonstrated that pylon and span objects can be identified using a basic 2D BEV map. This information can then be utilised to extract a utility zone for segmentation, 2D powerline modelling, and investigation of vegetation encroachment. By transforming utility objects into utility zones and employing them to impose a hierarchy on the 3D semantic segmentation of objects of interest, our Pan-SUNet network makes use of this module. Our work has made a major contribution in converting 2D object detections into 3D objects and adjusting misclassifications. We intend to concentrate on real-time semantic segmentation of the utility corridor in the future.

Figure 1 :
Figure 1: Pan-SUNet is a multi-dimensional and multi-resolution network that imposes spatial layout consistency (a) through a 2D bird's-eye view (BEV) of utility regions on the outcomes of 3D panoptic segmentation via loss-based late fusion and panoptic fusion (b).

Figure 2 :
Figure 2: Pan-SUNet generates 3D voxel grids and 2D BEVs from point clouds, combining multi-resolution 3D semantic segmentation and shared encoders for corridor/non-corridor detection, spatial consistency, and label adjustment onto 3D objects.

Figure 3 :
Figure 3: Regional fusion generation of segmentation shows a) polygon generated from step 1 and b) segmentation mask generated from polygon.
It uses the hierarchical layout consistency loss function from SUNet, shown in equation 5, to impose spatial layout consistency on the outcome. This loss function now uses three classes instead of two to impose better spatial regularities:

L = − Σ_m Σ_c P_c^m × log(hΘ(xm, c, p))    (5)

where P_c^m is the regional prior for class c at location m and hΘ(xm, c, p) is the predicted class probability.
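A sketch of such a region-weighted cross-entropy is shown below; this is an illustrative reading of equation 5, not SUNet's exact implementation:

```python
import numpy as np

def hierarchical_consistency_loss(probs, prior):
    """Region-weighted cross-entropy: per-location class probabilities
    `probs` (M, C) are penalised under the regional prior `prior` (M, C),
    down-weighting classes implausible for a region (e.g. a pylon
    predicted in a non-corridor region)."""
    eps = 1e-9                       # numerical guard for log(0)
    return float(-np.mean(np.sum(prior * np.log(probs + eps), axis=1)))

# one location, two classes, prior fully on class 0
loss = hierarchical_consistency_loss(np.array([[0.9, 0.1]]),
                                     np.array([[1.0, 0.0]]))
```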

Table 1 :
Table 1: Comparative study for 2D object detection for pylon and span objects: average precision (AP), average recall (AR), pylon average precision (AP), and span average precision (AP) at IoU=0.5. Our experiments were conducted on two different types of bounding boxes; horizontal bounding boxes in oriented-object cases usually include a lot of background, hence we compared the performance using both systems, as shown in figure 5.