LEARNING ON THE EDGE: BENCHMARKING ACTIVE LEARNING FOR THE SEMANTIC SEGMENTATION OF ALS POINT CLOUDS

While most research in automatic semantic segmentation of 3D geospatial point clouds is concerned with enhancing respective Machine Learning (ML) models, we aim to shift the focus to be more of a data-centric nature. That is, we consider the creation of the data sets that ML models learn from as a key component, since even the most sophisticated model performs poorly when learning from suboptimal data. In this regard, the straightforward approach of providing labeled data abundantly is prohibitively expensive and simply not scalable in times of high-frequency data acquisition cycles, where a dedicated training set should be available for each new epoch, as ML models often lack generalizability. As a remedy, we rely on Active Learning (AL), which is a cost-efficient and quick method to generate required training data at scale. Although AL has been (scarcely) applied in the geospatial domain before, a comprehensive evaluation of its capabilities, including benchmarking of achievable accuracies, is lacking. Therefore, we apply the AL concept to both of ISPRS' current point cloud benchmark data sets as well as to a third, large-scale National Mapping Agency point cloud. Respective experiments are conducted with both a feature-driven Random Forest classification approach and a data-driven Submanifold Sparse Convolutional Neural Network classifier. Our experiments verify that by labeling only a fraction of available training points (typically ≪ 1 %), we can still reach accuracies that are at most about 5 percentage points worse than leading benchmark contributions.


INTRODUCTION
Being able to automatically interpret (geo)spatial 3D data enables a plethora of different applications, such as safe autonomous vehicle navigation through surrounding awareness, derivation of digital terrain models (Hui et al., 2019) or detection of significant changes in monitoring applications (Haala et al., 2020). To this end, supervised Machine Learning (ML) methods are often employed and have drawn considerable attention in research over the last decade. While conventional feature-driven classification approaches have achieved a rather mature state, the branch of data-driven Convolutional Neural Network (CNN) approaches, triggered by the introduction of PointNet (Qi et al., 2017), is currently a hot research topic. However, the main focus of ML, especially in the geospatial domain, has always been on the classification model rather than the careful generation of the training data set the model is supposed to learn from. For the latter, the long-held standard is that providing a sufficient training data set is an expert's burden and has to be completed before any model can be employed (Waldhauser et al., 2014). Only recently, Ng (2021) stressed that more emphasis should be given to the creation of training data sets, and recommended that ML system development should be data-centric rather than model-centric.
One scheme following this mindset is Active Learning (AL) (Settles, 2009). In this iterative supervised ML approach, data annotation and training a respective model are no longer seen as two self-contained steps, but the machine represented by the ML model is actively involved in constructing the training set and is allowed to request labels for specific instances from one or more human labelers, known as the oracle. The basic idea behind determining such points is that the predictive uncertainty of the model is directly correlated with informativeness and that by adding such points, the model improves, i.e., we seek to minimize epistemic uncertainty. With AL, the labeling effort can be focused only on those points that actually justify human involvement. Therefore, when realizing an AL framework, we build what we call hybrid intelligence systems (Vaughan, 2018), in which humans or human processing units work together with electronic processing units, so that both parties perform the tasks they are best at, i.e., human interpretation capabilities for data annotation and machine-based scanning through data highlighting potentially valuable instances.
Despite the great potential to perform cost-efficient data interpretation, for 3D point clouds, especially Airborne Laser Scanning (ALS) point clouds, respective AL-based approaches are scarce. One of the first methods to this end was proposed by Luo et al. (2018), who perform semantic segmentation of mobile laser scanning point clouds by means of a pair-wise conditional random field built upon a Random Forest (RF) classifier (Breiman, 2001) integrated into an AL loop. Also for the classification of terrestrial point clouds, AL-based solutions are presented by Wu et al. (2021), Shi et al. (2021) and Shao et al. (2022), each relying on superpoint regions as AL primitives (instead of single points) but differing in the sampling procedure.
An AL approach actually developed for ALS point clouds is presented by Hui et al. (2019), who formulate the generation of a Digital Terrain Model (DTM) as a binary classification problem where points are to be assigned to class ground or non-ground, but utilize an automated oracle based on both the current prediction and the distance to the approximated DTM level. To predict a more extensive class catalog, Li and Pfeifer (2019) combine AL built around an RF classifier with a semi-supervised learning scheme in which labels of an initially provided coarse training set are each propagated to the point in an optimal neighborhood that exhibits the highest sampling score. A more typical AL scheme for semantic segmentation of ALS point clouds is pursued by Lin et al. (2020a,b). In this approach, AL operates on a regularly tiled point cloud and is designed to identify most-informative tiles, with tile scores obtained by averaging either point-based or segment-based sampling scores derived from a PointNet++ classifier (the segments are obtained by a preceding unsupervised segmentation). Although this approach greatly reduces labeling effort to most-informative tiles, costly full annotations of those are still expected. Kölle et al. (2020) and Kölle et al. (2021b) mitigate this issue by only requesting labels of single most-informative points, which are identified based on both RF and Submanifold Sparse Convolutional Neural Network (SCN) prediction scores. Furthermore, the assumption of an error-free Ground Truth (GT) oracle is lifted, as labels are provided directly by crowdworkers, thus completely excluding experts from the annotation process and actually forming a hybrid intelligence system.
While the aforementioned approaches have demonstrated great potential for cost-efficiently building ML models for a given data set as an alternative to the conventional Passive Learning (PL) approach, they lack a comprehensive comparison of results against the current state of the art in semantic point cloud segmentation that would allow for a fair ranking of AL. In this work, we aim to address this limitation and hope to thereby foster a wider dissemination of AL in the geospatial domain in the spirit of data-centric ML (Ng, 2021). Our contribution can thus be summarized as follows: i) we give a brief overview of AL, but particularly illustrate its working principle for ALS point clouds, followed by ii) a discussion of versatile add-ons for the key component of AL, namely the definition of a query function to identify most-informative samples, suitable for both data-driven and feature-driven classification approaches, and iii) we benchmark AL results for both an RF and an SCN classifier by applying the respective methods to both of ISPRS' semantic labeling challenges for 3D point clouds as well as to a typical National Mapping Agency (NMA) ALS cloud.

METHODOLOGY
Our system to efficiently train ML models consists of three main components, namely the query function for sampling most-informative samples in context of the AL loop (Section 2.1), an appropriate ML model (Section 2.2), and our oracle capable of returning labels for selected instances (Section 2.3).

Setting-up the AL Loop
To initialize our pipeline (cf. Algorithm 1), we present a given unlabeled point cloud U to our oracle O, which can either be a simulated machine oracle or, more realistically, can be represented by human operators. The first task of the oracle is then to generate an initial (coarse) training set with nj samples for each of our nΩ classes. When humans are asked to perform this task, they will naturally select samples that are fairly easy to label, well away from respective class borders in object space. Using Linit (cf. Algorithm 1), we can then train a respective ML model M capable of performing semantic segmentation of 3D point clouds, and rely on it to derive predictions on the remaining unlabeled data set U, so that the loop can theoretically be terminated already at this point. However, if we aim at high-accuracy results, the loop is to be continued, and thus the second and even more important task of the classifier is to estimate the model's confidence by means of the predicted posterior probabilities p(c|x) for each point x ∈ U. To actually select most-informative points from this unlabeled pool, we rely on entropy sampling defined as:
$$x_{E} = \underset{x \in U}{\operatorname{argmax}} \left( -\sum_{c} p(c \mid x) \, \log p(c \mid x) \right) \qquad (1)$$
Generally speaking, this measure is designed to sample points in the vicinity of the current (perhaps suboptimal) class borders (cf. Figure 1(a)), i.e., we score aleatoric uncertainty, but especially in early iteration steps, epistemic uncertainty will also have a significant impact. This can be interpreted as mimicking the core idea of Support Vector Machines (SVMs), that is, essentially building separation hypotheses solely based on samples situated near class borders. However, when only uncertainty is scored, for ALS point clouds where we are typically confronted with heavily class-imbalanced data sets (e.g., consider the relative frequency of class Car vs. Impervious Surface), it is likely that classes that are underrepresented in the underlying data set are all the more underrepresented in our sampled training set. This is because (most likely) regions of class borders are populated by proportionally fewer representatives of such smaller classes. Thus, refinement of class borders with respect to these classes is likely to be neglected, eventually resulting in suboptimal separability. As a remedy, in each iteration step i of our loop (cf. Algorithm 1), we compute dynamic class weights wc based on the relative frequency of the nc samples of a specific class c among the nL points of our current training set L.
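The scoring step can be illustrated with a minimal sketch. The text only states that the class weights are based on the relative frequency of each class in L; the inverse-frequency form nL / nc used below is an assumption, as are the function names `class_weights` and `weighted_entropy`. The weights are multiplied by the posteriors and re-normalized before the entropy is evaluated.

```python
import numpy as np

def class_weights(labels, n_classes):
    """Assumed weighting: inverse relative frequency n_L / n_c of each class in L."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    counts = np.maximum(counts, 1.0)  # guard against classes without any sample yet
    return labels.size / counts

def weighted_entropy(proba, weights=None, eps=1e-12):
    """Entropy of the (optionally re-weighted, re-normalized) posteriors per point."""
    p = np.asarray(proba, dtype=float)
    if weights is not None:
        p = p * weights
        p = p / p.sum(axis=1, keepdims=True)
    return -np.sum(p * np.log(p + eps), axis=1)

# Toy example: three unlabeled points, two classes
proba = np.array([[0.5, 0.5], [0.9, 0.1], [0.99, 0.01]])
train_labels = np.array([0, 0, 0, 1])        # class 1 is underrepresented in L
w = class_weights(train_labels, n_classes=2)
scores = weighted_entropy(proba, w)
query_order = np.argsort(-scores)            # most informative points first
```

Without weights, the score reduces to the plain entropy of Equation 1; with weights, points with a noticeable posterior for an underrepresented class move up in the ranking.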
The class weights wc are then multiplied by the predicted posterior probabilities, normalized, and inserted into the entropy formula in Equation 1. However, such AL sampling strategies are designed to add only one instance at a time, but re-training an ML model each time only one sample is added is both inefficient and statistically questionable (especially in case of a CNN model). Thus, AL is usually applied in batch-mode, where multiple n+ samples are selected and presented to the oracle O for labeling. But in this case, it is likely that queried points are too similar to each other with respect to their representation in feature space (cf. Figure 1(b)). Thus, sampling such quasi-duplicates essentially wastes labeling resources. To get the most out of a fixed labeling budget, we therefore follow the recommendation of Zhdanov (2019) and compute a weighted k-means clustering with n+ clusters according to:
$$\sum_{j=1}^{n^{+}} \sum_{x \in C_j} s(x) \, \lVert x - \mu_j \rVert^{2} \rightarrow \min \qquad (3)$$
where µj are the cluster centers, Cj is the set of points assigned to cluster j, and s(x) is the (weighted) entropy score of point x. By explicitly considering the (weighted) entropy scores s in clustering, we can guarantee that in this Diversity in Feature Space (DiFS) method, we sample a batch of points that is both as informative and as diverse as possible in order to boost the convergence of the loop (cf. Figure 1(c)). After determining the points to be added to the training set, the oracle O is asked to annotate these points, so that the ML model can be re-trained based on this expanded training set to complete the first training cycle. This iteration continues until a certain stopping criterion is reached (e.g., a fixed labeling budget, a certain number of iteration steps, or a more sophisticated stopping criterion as discussed by Bloodgood and Vijay-Shanker (2009)) and eventually results in an optimal training set tailored to the specific ML model M.
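A minimal sketch of such a diversity-aware batch selection is given below. It implements score-weighted k-means with plain Lloyd iterations in NumPy and then picks, for every cluster, the member closest to the cluster center. The "closest member" rule, the random initialization, and the names `weighted_kmeans` and `select_batch` are assumptions made for illustration; the brute-force distance matrix is only practical for a pre-filtered candidate set.

```python
import numpy as np

def weighted_kmeans(x, s, n_clusters, n_iter=50, seed=0):
    """Lloyd iterations for sum_j sum_{x in C_j} s(x) * ||x - mu_j||^2.

    Assumes n_clusters <= len(x); x has shape (n_points, n_features).
    """
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # squared distances of all points to all centers: (n_points, n_clusters)
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        previous = centers.copy()
        for j in range(n_clusters):
            member = assign == j
            if member.any() and s[member].sum() > 0:
                # score-weighted mean pulls the center towards informative points
                centers[j] = np.average(x[member], axis=0, weights=s[member])
        if np.allclose(previous, centers):
            break
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1), d2

def select_batch(features, scores, n_batch, seed=0):
    """Return one point index per non-empty cluster (the member nearest to its center)."""
    _, assign, d2 = weighted_kmeans(features, scores, n_batch, seed=seed)
    picked = []
    for j in range(n_batch):
        member = np.flatnonzero(assign == j)
        if member.size:  # empty clusters are skipped
            picked.append(member[d2[member, j].argmin()])
    return np.array(picked)

# Toy example: 200 candidate points with 4 features and random sampling scores
rng = np.random.default_rng(42)
features = rng.normal(size=(200, 4))
scores = rng.random(200)
batch = select_batch(features, scores, n_batch=8)
```

Because every selected point belongs to a different cluster, the batch cannot contain two points that fall into the same compact region of feature space, which is the intent of DiFS.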
To get an intuition of the working principle of AL, it is worthwhile to examine samples that have been identified as informative within the loop. As humans, we tend to utilize the object space to this end. However, AL queries are based on the representation of instances in a high-dimensional feature space, which should then be the focus of such an analysis. But since such a representation is hard for humans to interpret, we should apply a re-mapping of high-dimensional spaces to 2D for visualization. For this, we rely on the non-linear t-SNE mapping (van der Maaten and Hinton, 2008) that aims to keep relative distances between samples based on their similarity. As an example, this is applied to the feature description of the ISPRS Vaihingen 3D (V3D) data set (the features used and the data set are briefly introduced in Sections 2.2 and 3, respectively) and yields the 2D feature space visualization in Figure 2. For exemplary points selected within an AL loop launched for this data set, we trace back the respective points in feature space to object space. In this regard, Figure 2 corresponds well to our expectation that human-selected points in the initialization phase are typically easy for the machine to interpret, as they are situated well away from class borders in feature space and populate centers of rather homogeneous regions (cf. Figure 2(a) & (f)). Points sampled within the loop, on the other hand, naturally stem from inhomogeneous regions in feature space that correspond to spots near class borders in object space (cf. Figure 2(b)-(e)). Thus, as previously mentioned, AL can in fact be interpreted as emulating the working principle of SVMs, but focusing not only on selecting most-informative points but also on avoiding the sampling of quasi-duplicates. This typically reduces labeling effort to only a very small fraction of available training points (Mackowiak et al., 2018; Kellenberger et al., 2019).
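To summarize the control flow of this section, the loop can be sketched as follows. This is only a skeleton under stated assumptions: the classifier is a toy nearest-centroid model standing in for the RF or SCN, the query function uses plain entropy without class weights or DiFS, the oracle is error-free, and all function names are hypothetical.

```python
import numpy as np

def active_learning_loop(features, oracle, train, predict_proba, query,
                         init_idx, n_iter, n_batch):
    """Initialize L via the oracle, then alternate between querying and re-training."""
    labeled = {int(i): oracle(int(i)) for i in init_idx}

    def fit():
        idx = np.array(sorted(labeled))
        return train(features[idx], np.array([labeled[i] for i in idx]))

    model = fit()
    for _ in range(n_iter):
        pool = np.setdiff1d(np.arange(len(features)), np.array(sorted(labeled)))
        if pool.size == 0:
            break
        proba = predict_proba(model, features[pool])
        for i in pool[query(proba, n_batch)]:   # indices of most-informative points
            labeled[int(i)] = oracle(int(i))    # ask the oracle for their labels
        model = fit()                           # re-train on the expanded set
    return model, labeled

# --- toy stand-ins (not the classifiers used in this work) ---
def train_centroids(x, y):
    classes = np.unique(y)
    return classes, np.stack([x[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, x):
    _, centers = model
    d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    e = np.exp(-d)                              # closer centroid, higher pseudo-posterior
    return e / e.sum(axis=1, keepdims=True)

def entropy_query(proba, n_batch):
    h = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    return np.argsort(-h)[:n_batch]

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y_true = np.repeat([0, 1], 100)
model, labeled = active_learning_loop(
    x, oracle=lambda i: int(y_true[i]), train=train_centroids,
    predict_proba=predict_centroids, query=entropy_query,
    init_idx=[0, 1, 100, 101], n_iter=3, n_batch=5)
```

In this toy run, the labeled set grows from 4 initial points by 5 points in each of the 3 iteration steps; swapping in a weighted or diversity-aware query function only changes the `query` argument.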

The ML Model
Although the basic assumption is that even the simplest classifier can perform well just by tailoring an appropriate training set to it (Ho and Baird, 1997; Stork, 1999), the achievable performance will still be partly determined by the suitability of the employed model. To demonstrate generalizability of results, we thus rely on both a representative of the feature-driven domain, an RF classifier, and a representative of the data-driven domain, a 3D-convolution-approximating, voxel-based SCN classifier, which is based on the work of Schmohl and Sörgel (2019).
For an ML model to be successfully incorporated into AL, it i) needs to be capable of learning from sparsely labeled data, ii) must be suitable for reliably assessing its uncertainty (especially its epistemic uncertainty, which we seek to minimize), and iii) has to be provided with, or needs to be capable of inferring, explicit point-wise feature vectors to guarantee diversity within sampled batches.
For the RF classifier, the latter requirement is met by design, as we utilize hand-crafted features. Precisely, we use a set of both geometric (structural tensor features, orientation of fitted plane, roughness, height above ground etc.) and radiometric features (LiDAR inherent features and color information) evaluated for multi-scale spherical neighborhoods, as described in the work of Haala et al. (2020). Also, learning from sparsely labeled data (challenge i)) can be straightforwardly implemented for the RF, as we simply reduce the list of samples provided for training. Furthermore, we argue that the predicted (pseudo) posterior probability of the RF is well suited to assess epistemic uncertainty, as it is the result of averaging over multiple bagging ensemble members and thus satisfies condition ii).
As for the representative of the data-driven domain, the aforementioned challenges are more complex to overcome. Usually, ML models compute the loss over all labeled instances (or voxels in our case). However, dealing with sparse annotations, not every voxel carries a label, but should still be presented to the network to enable it to derive meaningful geometric descriptors (at least if it lies within the receptive field of one of the few labeled voxels, i.e., if it describes the neighborhood of labeled cells). Thus, to address i), we modify the loss function so that unlabeled "background" voxels are ignored in loss calculation, but still contribute in training due to their passive presence. To address ii), we employ a so-called deep ensemble, where each ensemble member is trained on the same training set but they differ in the randomly initialized weight values. In inference, we then compute the average over all ensemble-wise posterior probabilities to reliably estimate epistemic uncertainty (Jospin et al., 2022).
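Both modifications can be illustrated independently of any deep learning framework. The sketch below assumes that unlabeled voxels carry a hypothetical label value of -1 and that a plain cross-entropy loss is used; it is meant to show the masking and averaging logic, not the actual network implementation.

```python
import numpy as np

IGNORE = -1  # hypothetical label value marking unlabeled "background" voxels

def masked_cross_entropy(proba, labels, eps=1e-12):
    """Mean cross-entropy over labeled voxels only; unlabeled voxels do not enter the loss."""
    labeled = labels != IGNORE
    if not labeled.any():
        return 0.0
    p_true = proba[labeled, labels[labeled]]  # posterior of the annotated class
    return float(-np.mean(np.log(p_true + eps)))

def ensemble_posterior(member_probas):
    """Average posteriors over ensemble members, input shape (n_members, n_voxels, n_classes)."""
    return np.mean(np.asarray(member_probas, dtype=float), axis=0)

# Toy example: three voxels, the second one is unlabeled
proba = np.array([[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]])
labels = np.array([0, IGNORE, 1])
loss = masked_cross_entropy(proba, labels)

# Two ensemble members disagreeing on one voxel
mean_proba = ensemble_posterior([[[0.9, 0.1]], [[0.5, 0.5]]])
```

The averaged posterior is flatter than that of the confident member, so disagreement between members raises the entropy score of the voxel.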
Although the network implicitly utilizes self-taught features, for iii), we need to find a way to explicitly output point-wise feature vectors. To do so, we concatenate filter responses of the different levels of our 3-level U-Net-like architecture from both the encoding and decoding branch to obtain a multi-scale description of our input points. However, at deeper levels, the original input voxel cloud is represented in a more abstract manner at a lower resolution than the input. As a remedy, we assign respective features of deeper levels to all voxels at the original resolution that have been aggregated into this specific cell. As can be seen from Figure 3, this often leads to a voxelated representation where upsampled filter responses from deeper encoding levels are smoother than their counterparts from decoding levels (although stemming from the same lower resolution). This is due to retrieving features in the decoding branch directly at the deconvolutional layer, essentially incorporating the resolution of the previous deeper level, which is contrary to the encoding branch where features are retrieved after a series of 3D convolutions at the last layer of an encoding level.
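The assignment of coarse-level responses to voxels of the original resolution amounts to an index lookup followed by a concatenation, as the following sketch shows. The per-level index arrays (`parent_idx`) mapping each original voxel to its containing coarse cell are an assumed data layout, and the function name is hypothetical.

```python
import numpy as np

def multiscale_features(level_feats, parent_idx):
    """Concatenate per-level filter responses at the original voxel resolution.

    level_feats[l]: array of shape (n_cells_l, n_filters_l), e.g. one entry per
    encoding or decoding level; parent_idx[l]: for every voxel of the original
    resolution, the index of the level-l cell it has been aggregated into.
    """
    upsampled = [feats[idx] for feats, idx in zip(level_feats, parent_idx)]
    return np.concatenate(upsampled, axis=1)

# Toy example: 4 original voxels, 2 cells at the next level, 1 cell at the deepest level
f0 = np.arange(8.0).reshape(4, 2)
f1 = np.array([[10.0], [20.0]])
f2 = np.array([[100.0, 200.0]])
idx = [np.arange(4), np.array([0, 0, 1, 1]), np.zeros(4, dtype=int)]
feats = multiscale_features([f0, f1, f2], idx)  # shape (4, 5)
```

All voxels that share a coarse cell receive identical values for that level, which explains the blocky appearance of upsampled responses from deeper levels.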
Obtained filter responses of the encoding branch in Figure 3 often resemble typical features utilized by feature-driven classifiers. For instance, Figure 3(a) is reminiscent of a verticality measure and Figure 3(c) seems to score flatness. However, both responses also appear to be impacted by radiometric features, as convolutions are performed over all available input channels. Also, the model tries to gradually enhance its context awareness with Figure 3(e) resembling height above ground, which can only be inferred from a wider spatial context. Contrary to the encoding branch, where the data is solely described by deriving descriptive features, in the decoding branch the model progressively develops its ability to recognize individual classes. In this regard, Figure 3(f) attempts to accentuate buildings, but also lower parts of high vegetation that are often geometrically similar (both are vertically oriented and noisy, either due to façade furniture or detailed branch structures), but are already far less emphasized in Figure 3(d). Eventually, Figure 3(b) is clearly suited to extract points of a specific class, in this case class Car.

The AL Oracle
Another key component of AL is the formulation of an oracle capable of providing labels for selected points. In literature, an omniscient GT oracle OO is often assumed, but this is unrealistic in real-world scenarios where humans are tasked with point annotation. Thus, labeling errors should also be taken into account when simulating oracles. Respective errors can be either purely random or systematic in nature (Lockhart et al., 2020).
A noisy oracle ON will always assign a fraction of points to any class, except the correct one. But more severely, a confused oracle will follow some distinct mapping function (e.g., always labeling façades as class Roof), which can be particularly harmful for classification approaches (Kölle et al., 2021b).
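Both error types are easy to simulate on top of GT labels. In the sketch below, the noisy oracle flips a fixed fraction of labels to a uniformly drawn wrong class, and the confused oracle applies a fixed class mapping; the uniform choice among wrong classes and the function names are assumptions.

```python
import numpy as np

def noisy_oracle(true_labels, n_classes, error_rate, rng):
    """Relabel a fixed fraction of points with a random class other than the true one."""
    labels = np.array(true_labels, copy=True)
    n_flip = int(round(error_rate * labels.size))
    flip = rng.choice(labels.size, size=n_flip, replace=False)
    # an offset in 1..n_classes-1 (mod n_classes) never returns the true class
    offset = rng.integers(1, n_classes, size=n_flip)
    labels[flip] = (labels[flip] + offset) % n_classes
    return labels

def confused_oracle(true_labels, confusion_map):
    """Apply a systematic mapping, e.g. {facade_id: roof_id}; other classes stay correct."""
    return np.array([confusion_map.get(int(c), int(c)) for c in true_labels])

rng = np.random.default_rng(1)
gt = np.zeros(100, dtype=int)
noisy = noisy_oracle(gt, n_classes=5, error_rate=0.1, rng=rng)
confused = confused_oracle(np.array([0, 1, 2]), {1: 2})
```

With an error rate of 0.1, exactly 10 of the 100 toy labels deviate from GT, while the confused oracle mislabels every instance of the affected class.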
Especially in AL, this becomes problematic since we sample points from class borders (both in feature and object space, cf. Figure 2), where selected points are often ambiguous and thus systematic errors can be the result of different class understanding. To avoid such errors, as recommended by Kölle et al. (2021b), we modify our sampling strategy slightly and consider the point originally queried by the machine (cf. Section 2.1) as seed point only, but instead select the neighboring point in a spherical neighborhood of radius dRIU with the lowest sampling score. This strategy, referred to as Reducing Interpretation Uncertainty (RIU), assumes that the distance to the class border correlates directly with annotation complexity and has proven an efficient means of minimizing systematic labeling errors (Kölle et al., 2021b).
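The RIU replacement step can be sketched with a brute-force neighborhood search. The function name `riu_select` is hypothetical, and for large point clouds a spatial index would be used instead of computing all distances.

```python
import numpy as np

def riu_select(seed_idx, coords, scores, d_riu):
    """Replace a queried seed point by the lowest-scoring point within radius d_riu."""
    dist = np.linalg.norm(coords - coords[seed_idx], axis=1)
    neighbors = np.flatnonzero(dist <= d_riu)  # always contains the seed itself
    return int(neighbors[np.argmin(scores[neighbors])])

# Toy example: three points along the x axis with decreasing sampling scores
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
scores = np.array([0.9, 0.2, 0.1])
chosen = riu_select(0, coords, scores, d_riu=1.5)
```

Here the seed (index 0) is replaced by its neighbor at 1 m distance, whereas the point at 3 m is outside the radius and is not considered despite its lower score.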

DATA SETS
As our main goal is to benchmark AL in the domain of geospatial ALS point cloud semantic segmentation, we rely on both of ISPRS' current benchmark data sets. These are the Vaihingen 3D Semantic Labeling Contest (V3D) as a typical ALS point cloud (Niemeyer et al., 2014) and the high-resolution Hessigheim 3D Benchmark (H3D) captured from a UAV (Kölle et al., 2021a).
Although both data sets incorporate rich and challenging class catalogs, they cover only limited spatial regions. Therefore, as a third data set, we utilize an NMA ALS point cloud depicting the city center of Stuttgart (S3D), which is about 30 times larger in extent than the V3D data set, but contains a comparably small class catalog, as can be seen from Table 3. Nevertheless, it is well suited for evaluating the scalability of AL.

EXPERIMENTS
To assess the capabilities of AL for semantic point cloud classification, we derive a series of solutions for our three data sets that incorporate the different strategies and classifiers described in Sections 2.1 and 2.2. We report results of pure weighted entropy sampling (wE) as well as the adapted variant with the DiFS sampling add-on. But to also give realistic estimates of accuracies to be expected in an AL scenario where human processing units are employed for labeling the queried points, we i) augment sampling with RIU, to reduce the chances of encountering an oracle following a systematic error behavior, and ii) incorporate a noisy oracle ON where 10 % of labels are randomly misclassified in each iteration step. In each of our AL runs, the initial data sets consist of nj = 10 samples per class. Unless stated otherwise, we report AL results after 30 iteration steps with 300 points queried in each step, exclusively from the dedicated training set, predicting on the respective test splits (i.e., we adhere to the official data splits for the benchmark data sets). As for the incorporated ML models, the RF is parametrized by 100 binary decision trees with a maximum depth of 18 and a minimum of 7 samples at a node to justify a new split. Respective features are computed for spherical neighborhoods of r ∈ {1, 2, 3, 5} m. For the SCN classifier, we employ a deep ensemble of 5 networks, each operating on a 0.5 m voxelized input point cloud. To reduce computation time, networks of each iteration step start their training cycle based on the result of the previous iteration step and use the current decayed learning rate. Apart from these AL runs, we rely on both the PL results of our classifiers using the fully labeled training set and the PL result of the respective benchmark leader (for V3D & H3D) as baseline solutions.
As for the results for the V3D data set, we can firstly conclude from Table 1 that both our classifiers are well suited for the task at hand, as our PL results are on a level comparable to the top-performing benchmark submission, and are only worse by about 1 percentage point (pp) in Overall Accuracy (OA). However, we prefer comparing our AL-based runs to the PL result obtained with our classifiers, as these can be considered the limit of achievable accuracy for the specific model. Regarding the AL runs, it is evident that the DiFS sampling add-on contributes significantly to the improvement of the classification accuracy, so that the wE + DiFS strategy can be considered the optimal result from the point of view of the machine, performing less than 3 pp worse in OA compared to PL for both the RF and SCN classifier. However, in a realistic scenario with imperfect human operators as oracle, these accuracies are unlikely to be achieved. Thus, we add the RIU technique with dRIU = 1.5 m to minimize chances of systematic errors and consequently simulate only the effect of a noisy oracle ON. Such more realistic AL runs perform only marginally worse with a final loss of < 5 pp in OA compared to the best-performing PL benchmark submissions, but are far more cost-efficient since only 1.15 % of points from the training set require labeling.
With respect to the performance of individual classes, underrepresented categories such as Powerline or Car tend to perform better in AL than in their PL counterparts. This effect can be traced back to the generation of a training set in AL, which, thanks to the weighted sampling scheme (cf. Section 2.1), has a class distribution that is close to uniform, as clearly visible from Figure 4.
As for the RF classifier vs. the SCN classifier, results are rather similar, with the RF slightly outperforming the SCN. However, the two models differ significantly in computational complexity, which is due to their basic working principle. With the RF, features of each point only need to be computed once and can be kept throughout the iteration. But for the SCN, whenever new labels become available, we need to recompute or at least refine features of all points (voxels), which is inevitably computationally more expensive. Precisely, an RF-based AL iteration step can be completed in about 1 minute, whereas such a training cycle for the SCN takes about 50 times as long. Therefore, for AL, CNN-based approaches are a suboptimal choice, at least from a purely economic point of view.
Hence, for the high-resolution H3D data set incorporating a significantly larger voxel volume, we are compelled to ease the computational load by reducing the number of training cycles to 10 iteration steps, but then sampling 600 points in each step. We also slightly adapt our RF classifier to H3D's resolution and compute features for neighborhoods of r ∈ {0.125, 0.25, 0.5, 0.75, 1, 2, 3, 5} m. Generally, results on H3D confirm our observations on V3D with final classification accuracies for wE + DiFS + RIU with an ON oracle that are less than 3 pp worse compared to our classifiers' optimal PL results and only require 0.12 ‰ (RF) and 0.08 ‰ (SCN) of available training points. We would like to emphasize that in such an ultra-high-resolution data set, due to spatial proximity of neighboring points, we always face a significant number of quasi-duplicates with respect to the representation of these points in feature space. This underlines the significance of DiFS, which is capable of improving OA values by > 4 pp and mF1 values by > 5 pp for both classifiers.
Since our two classifiers lead to similar accuracy levels for V3D and H3D, due to the aforementioned advantages in time complexity, we restrict ourselves to reporting solely RF-based AL runs for the large-scale S3D data set. As this data set depicts a significantly larger scene with a plethora of representatives for each class, we are dealing with a much greater intra-class variety, which is further amplified by generalization through the rather coarse class catalog. Thus, the highest accuracies are achieved for S3D in the PL run. Especially class Urban Furniture suffers when learning from only limited training sets, as those fail to truthfully characterize the large variety of this quasi-class Other. Nevertheless, with the optimal configuration from the machine's point of view (wE + DiFS), we obtain a result that is less than 1 pp worse in OA than in PL, but only utilizing 0.23 ‰ of available training points (please note that the effect of boosting convergence by DiFS is not visible at this saturated state of the iteration after 30 iteration steps, but improves OA by > 2 pp at iteration step 10, for instance).

CONCLUSION
This work represents a first attempt to benchmark AL in the domain of ALS point cloud classification and underlines its great potential to minimize labeling effort and thus make ML methods broadly applicable. Although the accuracy of AL approaches is slightly worse compared to corresponding PL approaches, models can be flexibly set up for a given (new) data set with minimal labeling overhead, which is an important property in times of rapid data acquisition cycles. More significantly, our AL-based results emphasize that the long-held understanding of ML models requiring vast annotated data sets is not the key to success, but rather building a versatile (small) training set with most-informative (actively queried) samples. In this spirit, the geospatial community can benefit from the recommendation of Ng (2021) to focus more on the data-centric branch of ML research to really enable its true capabilities. This is especially the case, as the community lacks annotated data sets at a level comparable to that of the computer vision community.

Figure 1. Comparison of different sampling strategies to select most-informative points. Transparent points represent the current training data defining the decision boundary. Yellow border lines indicate samples with highest scores.

Figure 2. t-SNE embedding of feature vectors of the V3D data into 2D space. Exemplary regions where AL points originate from are indicated by blue circles. For each region, a representative is traced back to object space and colored blue. While points (a) and (f) were selected by human operators in the initialization step, the remaining examples were actively queried in the course of the AL loop.

Figure 3. Filter responses from selected filters at each level of our SCN, arranged in an order to match the SCN's U shape.

Figure 4. Comparison between the class distributions in the original V3D training data set (a) vs. the one obtained by AL after 30 iteration steps (b).
Algorithm 1.
• number of samples n+ to be labeled in each iteration step i
• definition of desired class catalog containing nΩ classes
• access to oracle O
1: initialize labeled training set L = {}
2: query O to generate initialization data set Linit = …

Table 1. Comparison of reachable accuracies [%] for different training approaches and oracles using RF and SCN for the V3D data set after 30 iteration steps. TP represents the result of the top-performing model of the benchmark challenge.

Table 2. Comparison of reachable accuracies [%] for different training approaches and oracles using RF and SCN for the H3D data set after 30 iteration steps (RF) and 10 iteration steps (SCN), respectively. Furthermore, we report the result of the (at the time of writing this paper) top-performing TP model of the still ongoing benchmark challenge.

Table 3. Comparison of reachable accuracies [%] for different training approaches and oracles using RF for the S3D data set after 30 iteration steps.