CONTEXTUAL CLASSIFICATION OF POINT CLOUD DATA BY EXPLOITING INDIVIDUAL 3D NEIGHBOURHOODS

ABSTRACT: The fully automated analysis of 3D point clouds is of great importance in photogrammetry, remote sensing and computer vision. For reliably extracting objects such as buildings, road inventory or vegetation, many approaches rely on the results of a point cloud classification, where each 3D point is assigned a respective semantic class label. Such an assignment, in turn, typically involves statistical methods for feature extraction and machine learning. Whereas the different components in the processing workflow have been investigated extensively, but separately, in recent years, the respective connection by sharing the results of crucial tasks across all components has not yet been addressed. This connection not only encapsulates the interrelated issues of neighborhood selection and feature extraction, but also the issue of how to involve spatial context in the classification step. In this paper, we present a novel and generic approach for 3D scene analysis which relies on (i) individually optimized 3D neighborhoods for (ii) the extraction of distinctive geometric features and (iii) the contextual classification of point cloud data. For a labeled benchmark dataset, we demonstrate the beneficial impact of involving contextual information in the classification process and that using individual 3D neighborhoods of optimal size significantly increases the quality of the results for both pointwise and contextual classification.


INTRODUCTION
The fully automated analysis of 3D point clouds has become a topic of major interest in photogrammetry, remote sensing and computer vision. Recent research addresses a variety of topics such as object detection (Pu et al., 2011; Velizhev et al., 2012; Bremer et al., 2013; Serna and Marcotegui, 2014), extraction of curbstones and road markings (Zhou and Vosselman, 2012; Guan et al., 2014), urban accessibility analysis (Serna and Marcotegui, 2013), or the creation of large-scale city models (Lafarge and Mallet, 2012). A crucial task for many of these applications is point cloud classification, which aims at assigning a semantic class label to each 3D point of a given point cloud. Due to the complexity of 3D scenes caused by the irregular sampling of 3D points, varying point density and very different types of objects, point cloud classification has also become an active field of research, e.g. (Guo et al., 2014; Niemeyer et al., 2014; Schmidt et al., 2014; Weinmann et al., 2014; Xu et al., 2014).
Most of the approaches for point cloud classification consider the different components of the classification process (i.e. neighborhood selection, feature extraction and classification) independently from each other. However, it would seem desirable to connect these components by sharing the results of crucial tasks across all of them. Such a connection would not only be relevant for the interrelated problems of neighborhood selection and feature extraction, but also for the question of how to involve spatial context in the classification task.
In this paper, we focus on the combination of (i) feature extraction from individual 3D neighborhoods and (ii) contextual classification of point cloud data. This is motivated by the fact that such a combination provides further important insights into the interrelated issues of neighborhood selection, feature extraction and contextual classification. Using features extracted from individual neighborhoods has a significantly beneficial impact on the individual classification of points (Weinmann et al., 2014). On the other hand, using contextual information might have even more influence on the classification accuracy, because it takes into account that class labels of neighboring 3D points tend to be correlated. Consequently, this paper addresses the question of whether the use of features extracted from neighborhoods of individual size still improves the classification accuracy when contextual classification is applied, and whether it is beneficial to use the same neighborhood definition for contextual classification. We propose a novel and generic approach for 3D scene analysis which relies on individually optimized 3D neighborhoods for both feature extraction and contextual classification. Considering different neighborhood definitions as the basis for feature extraction, we use a Conditional Random Field (CRF) (Lafferty et al., 2001) for contextual classification and compare the respective classification results with those obtained when using a Random Forest classifier (Breiman, 2001). As the unary terms of the CRF are also based on a Random Forest classifier, we can quantify the influence of the context model on the classification results.
After reviewing related work in Section 2, we explain the different components of our methodology in Section 3. Subsequently, in Section 4, we evaluate the proposed methodology on a labeled point cloud dataset representing an urban environment and discuss the derived results. Finally, in Section 5, concluding remarks and suggestions for future work are provided.

RELATED WORK
When focusing on point cloud classification, different strategies may be involved for each component of the processing workflow.

Fixed vs. Individual 3D Neighborhoods
In order to describe the local 3D structure at a given 3D point, the spatial arrangement of 3D points within the local neighborhood is typically taken into consideration. The respective local neighborhood may be defined as a spherical (Lee and Schenk, 2002) or cylindrical (Filin and Pfeifer, 2005) neighborhood with fixed radius. Alternatively, the local neighborhood can be defined to consist of the k ∈ ℕ nearest neighbors either on the basis of 3D distances (Linsen and Prautzsch, 2001) or 2D distances (Niemeyer et al., 2014). The latter definition based on the k nearest neighbors offers more flexibility with respect to the absolute neighborhood size and is more adaptive to varying point density. All these neighborhood definitions, however, rely on a scale parameter (i.e. either a radius or k), which is commonly selected to be identical for all 3D points and determined via heuristic or empirical knowledge of the scene. As a result, the derived scale parameter is specific for each dataset.
In order to obtain a solution taking into account that the selection of a scale parameter depends on the local 3D structure as well as the local point density, an individual neighborhood size can be determined for each 3D point. In this context, most approaches rely on a neighborhood consisting of the k nearest neighbors and thus focus on optimizing k for each individual 3D point. This optimization may for instance be based on the local surface variation (Pauly et al., 2003; Belton and Lichti, 2006), iterative schemes relating neighborhood size to curvature, point density and noise of normal estimation (Mitra and Nguyen, 2003; Lalonde et al., 2005), dimensionality-based scale selection (Demantké et al., 2011) or eigenentropy-based scale selection (Weinmann et al., 2014). In particular, the latter two approaches have proven to be suitable for point cloud data acquired via mobile laser scanning, and a significant improvement of classification results can be observed in comparison to the use of fixed 3D neighborhoods with identical scale parameter (Weinmann et al., 2014).

Single-Scale vs. Multi-Scale Features
Given a 3D point and its local neighborhood, geometric features may be derived from the spatial arrangement of all 3D points within the neighborhood. For this purpose, it has been proposed to sample geometric relations such as distances, angles and angular variations between 3D points within the local neighborhood (Osada et al., 2002; Rusu et al., 2008; Blomley et al., 2014). However, the individual entries of the resulting feature vectors are hardly interpretable, and consequently, other investigations focus on deriving interpretable features. Such features may for instance be obtained by calculating the 3D structure tensor from the 3D coordinates of all points within the local neighborhood (Pauly et al., 2003). The eigenvalues of the 3D structure tensor may directly be applied for characterizing specific shape primitives (Jutzi and Gross, 2009). In order to obtain more intuitive features which also indicate linear, planar or volumetric structures, a set of features derived from these eigenvalues has been presented (West et al., 2004) which is nowadays commonly applied in lidar data processing. This standard feature set may be complemented by further geometric features derived from angular statistics (Munoz et al., 2009), height and local plane characteristics (Mallet et al., 2011), height characteristics and curvature properties (Schmidt et al., 2012; Schmidt et al., 2013), or basic properties of the neighborhood and characteristics of a 2D projection (Weinmann et al., 2013; Weinmann et al., 2014). Furthermore, the combination with full-waveform and echo-based features has been proposed (Chehata et al., 2009; Mallet et al., 2011; Niemeyer et al., 2011).
When deriving features at a single scale, one has to consider that a suitable scale (in the form of either fixed or individual 3D neighborhoods) is required in order to obtain an appropriate description of the local 3D structure. As an alternative to selecting such an appropriate scale, we may also derive features at multiple scales and subsequently involve a classifier in order to define which combination of scales allows the best separation of different classes (Brodu and Lague, 2012). In this context, features may even be extracted by considering different entities such as points and regions (Xiong et al., 2011; Xu et al., 2014) or by involving a hierarchical segmentation based on voxels, blocks and pillars (Hu et al., 2013). However, multi-scale approaches result in feature spaces of higher dimension, so that it may be advisable to use appropriate feature selection schemes in order to gain predictive accuracy while at the same time reducing the extra computational burden in terms of both time and memory consumption (Guyon and Elisseeff, 2003).

Individual vs. Contextual Classification
Based on the derived feature vectors, classification is typically conducted in a supervised way, where the straightforward solution consists of an independent classification of each 3D point relying only on its individual feature vector. The list of respective classification methods that have been used for lidar data processing includes classical Maximum Likelihood classifiers based on Gaussian Mixture Models (Lalonde et al., 2005), Support Vector Machines (Secord and Zakhor, 2007), AdaBoost (Lodha et al., 2007), a cascade of binary classifiers (Carlberg et al., 2009), Random Forests (Chehata et al., 2009) and Bayesian Discriminant Classifiers (Khoshelham and Oude Elberink, 2012). Such an individual point classification may be carried out very efficiently, but there is a severe drawback, namely the noisy appearance of the classification results.
In order to account for the fact that the class labels of neighboring 3D points tend to be correlated, contextual classification approaches may be applied which also involve a model of the relations between 3D points in a local neighborhood. For that purpose, statistical models of context have been increasingly used for point cloud classification, e.g. Associative and non-Associative Markov Networks (Munoz et al., 2009; Shapovalov et al., 2010), Conditional Random Fields (Lim and Suter, 2009; Schmidt et al., 2012; Niemeyer et al., 2014), Simplified Markov Random Fields (Lu and Rasmussen, 2012), multi-stage inference procedures focusing on point cloud statistics and relational information over different scales (Xiong et al., 2011), and spatial inference machines modeling mid- and long-range dependencies inherent in the data (Shapovalov et al., 2013). Some methods are based on point cloud segments, e.g. (Shapovalov et al., 2010), whereas others directly classify points, e.g. (Niemeyer et al., 2014). As segment-based methods heavily depend on the quality of the results of the segmentation algorithm, we prefer point-based techniques.
Typically, statistical models for context, e.g. in a Conditional Random Field (CRF), are based on interactions between neighboring point pairs, and the considerations made about the size of a local neighborhood (Section 2.1) also apply to the selection of the set of points interacting with a given point. However, existing investigations are usually based on a radius search or on the k nearest neighbors either in 2D or in 3D, involving either a fixed radius or a fixed value for k. In (Niemeyer et al., 2011), the impact of varying the radius of a cylindrical neighborhood for defining the set of neighbors is investigated. The results indicate a saturation effect when increasing that radius, so that the average number of involved neighbors is 7, but in each experiment the radius is fixed. In this paper, we want to investigate the effect of using individual 3D neighborhoods of optimal size for defining the edges of a CRF.

METHODOLOGY
The proposed methodology for point cloud classification consists of (i) neighborhood selection, (ii) feature extraction and (iii) contextual classification. Instead of treating these components separately, we focus on sharing the result of the crucial task of neighborhood selection across all components. Details are explained in the subsequent sections.

Estimation of Optimal Neighborhoods
We start from a point cloud consisting of NP points Xi ∈ ℝ³ with i ∈ {1, ..., NP}. In order to obtain flexibility with respect to the absolute neighborhood size, we employ neighborhoods consisting of the k ∈ ℕ nearest neighbors. As we intend to avoid an empirical selection of an appropriate fixed scale parameter k which is identical for all points, we focus on the generic selection of individual neighborhoods described by an optimized scale parameter k for each 3D point Xi, where the optimization relies on a specific energy function. This strategy is motivated by the fact that the distinctiveness of geometric features calculated from the neighboring points is increased when involving individually optimized neighborhoods (Weinmann et al., 2014).
The energy functions used to define the optimal neighborhood size are based on the covariance matrix calculated from the 3D coordinates of a given 3D point Xi and its k nearest neighbors. This covariance matrix is also referred to as the 3D structure tensor. Denoting the eigenvalues of the 3D structure tensor by λ1,i, λ2,i, λ3,i ∈ ℝ, where λ1,i ≥ λ2,i ≥ λ3,i ≥ 0, two recent approaches for selecting individual neighborhoods can be applied. On the one hand, the dimensionality features of linearity Lλ,i, planarity Pλ,i and scattering Sλ,i with
\[ L_{\lambda,i} = \frac{\lambda_{1,i} - \lambda_{2,i}}{\lambda_{1,i}}, \qquad P_{\lambda,i} = \frac{\lambda_{2,i} - \lambda_{3,i}}{\lambda_{1,i}}, \qquad S_{\lambda,i} = \frac{\lambda_{3,i}}{\lambda_{1,i}} \tag{1} \]
sum up to 1 and may be used in order to derive the Shannon entropy (Shannon, 1948) representing the energy function Edim,i for dimensionality-based scale selection (Demantké et al., 2011):
\[ E_{\mathrm{dim},i} = - L_{\lambda,i} \ln L_{\lambda,i} - P_{\lambda,i} \ln P_{\lambda,i} - S_{\lambda,i} \ln S_{\lambda,i} \tag{2} \]
Alternatively, we may normalize the three eigenvalues by their sum Σj λj,i in order to obtain the normalized eigenvalues εj,i with εj,i = λj,i / Σj λj,i for j ∈ {1, 2, 3}, summing up to 1, and we can use the Shannon entropy of these normalized eigenvalues as the basis of the energy function Eλ,i for eigenentropy-based scale selection (Weinmann et al., 2014):
\[ E_{\lambda,i} = - \varepsilon_{1,i} \ln \varepsilon_{1,i} - \varepsilon_{2,i} \ln \varepsilon_{2,i} - \varepsilon_{3,i} \ln \varepsilon_{3,i} \tag{3} \]
For each 3D point Xi, the energy functions Edim,i and Eλ,i are calculated for varying values of k, and the value yielding the minimum entropy is selected to define the optimal neighborhood size. Note that minimizing Edim,i corresponds to favoring dimensionality features which are as dissimilar as possible from each other, whereas minimizing Eλ,i corresponds to minimizing the disorder of points within the neighborhood. Similarly to (Weinmann et al., 2014), we vary the scale parameter k between kmin = 10 and kmax = 100 with ∆k = 1.
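The two energy functions and the per-point choice of k can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the eigenvalues for each candidate k are assumed to have been computed beforehand from the 3D structure tensor, and the dimensionality features are assumed to take the common form L = (λ1 − λ2)/λ1, P = (λ2 − λ3)/λ1, S = λ3/λ1.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy with natural logarithm; zero entries contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def dimensionality_energy(l1, l2, l3):
    """E_dim: entropy of linearity, planarity and scattering (needs l1 > 0)."""
    linearity = (l1 - l2) / l1
    planarity = (l2 - l3) / l1
    scattering = l3 / l1
    return shannon_entropy([linearity, planarity, scattering])


def eigenentropy_energy(l1, l2, l3):
    """E_lambda: entropy of the eigenvalues normalized by their sum."""
    total = l1 + l2 + l3
    return shannon_entropy([value / total for value in (l1, l2, l3)])


def select_optimal_k(eigenvalues_by_k, energy):
    """Return the candidate k whose eigenvalue triple minimizes the energy."""
    return min(eigenvalues_by_k, key=lambda k: energy(*eigenvalues_by_k[k]))


# Hypothetical eigenvalues (l1 >= l2 >= l3) for three candidate scales.
candidates = {10: (1.0, 0.9, 0.8), 20: (1.0, 0.05, 0.01), 30: (1.0, 0.5, 0.4)}
best_k_dim = select_optimal_k(candidates, dimensionality_energy)
best_k_lambda = select_optimal_k(candidates, eigenentropy_energy)
```

In this toy input, the scale k = 20 has one clearly dominant eigenvalue (a nearly linear neighborhood), so both criteria select it; on real data the two criteria may disagree.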

Feature Extraction
We involve the same feature set as (Weinmann et al., 2014). Particularly in urban environments, we may face a variety of man-made objects which, in turn, are characterized by almost perfectly vertical structures (e.g. building façades, walls, poles, traffic signs or curbstone edges). For this reason, we also involve features based on a 2D projection of a given 3D point Xi and its k nearest neighbors onto a horizontal plane P. Exploiting the projected 3D points, we may easily obtain the respective radius rk-NN,2D,i and point density D2D,i in 2D. Furthermore, we derive the covariance matrix of the 2D coordinates of these points in the projection plane, i.e. the 2D structure tensor, whose eigenvalues provide additional features, namely their sum Σλ,2D,i and their ratio Rλ,2D,i. Finally, we derive features resulting from a 2D projection of all 3D points onto P and a subsequent spatial binning. For that purpose, we discretize the projection plane and define a 2D accumulation map with discrete, square bins with a side length of 0.25 m as proposed in (Weinmann et al., 2013).
The additional features for describing a given 3D point Xi are represented by the number NB,i of points as well as the maximum difference ∆HB,i and standard deviation σH,B,i of height values within the respective bin.
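As a rough sketch of the accumulation-map features, the following Python function bins points on a horizontal grid and returns, for every point, the number of points, the maximum height difference and the standard deviation of heights in its bin. The function name is hypothetical, and the use of the population standard deviation and of `floor` for bin indexing are assumptions that the text does not specify.

```python
import math
from collections import defaultdict


def bin_features(points, bin_size=0.25):
    """Return (N_B, delta_H_B, sigma_H_B) per point for (x, y, z) tuples."""
    heights_by_bin = defaultdict(list)
    keys = []
    for x, y, z in points:
        # Index of the square bin that contains the 2D projection of the point.
        key = (math.floor(x / bin_size), math.floor(y / bin_size))
        keys.append(key)
        heights_by_bin[key].append(z)

    stats = {}
    for key, heights in heights_by_bin.items():
        mean = sum(heights) / len(heights)
        variance = sum((h - mean) ** 2 for h in heights) / len(heights)
        stats[key] = (len(heights), max(heights) - min(heights), math.sqrt(variance))

    # Every point inherits the statistics of the bin it falls into.
    return [stats[key] for key in keys]


# Two points share one bin, the third lies alone in another bin.
example = bin_features([(0.1, 0.1, 0.0), (0.2, 0.2, 2.0), (1.0, 1.0, 5.0)])
```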
All the extracted features are concatenated to a feature vector and, since the geometric features describe different quantities, a normalization [·]n across all feature vectors is involved which normalizes the values of each dimension to the interval [0, 1]. Thus, the 3D point Xi is characterized by a 21-dimensional feature vector fi (Equation 4), which is used as input for the classification of that point.
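One simple way to map every feature dimension to [0, 1] is min-max scaling over all feature vectors. The sketch below assumes this particular scheme (the text only states the target interval), uses a hypothetical function name, and maps constant dimensions to 0.

```python
def normalize_features(feature_vectors):
    """Min-max scale each dimension of a list of equally long feature vectors."""
    dims = len(feature_vectors[0])
    lows = [min(f[d] for f in feature_vectors) for d in range(dims)]
    highs = [max(f[d] for f in feature_vectors) for d in range(dims)]
    normalized = []
    for f in feature_vectors:
        normalized.append([
            (f[d] - lows[d]) / (highs[d] - lows[d]) if highs[d] > lows[d] else 0.0
            for d in range(dims)
        ])
    return normalized


scaled = normalize_features([[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]])
```

When training and test data are normalized separately, the extrema differ between the two sets; whether the extrema of the training set are reused for the test set is a design decision that this sketch leaves open.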

Classification Based on Conditional Random Fields
We use a Conditional Random Field (CRF) (Lafferty et al., 2001; Kumar and Hebert, 2006) for classification. CRFs are undirected graphical models that allow modeling interactions between neighboring objects to be classified, and, thus, modeling local context. The underlying graph G(n, e) consists of a set of nodes n and a set of edges e, the latter being responsible for the context model.
In our case, similarly to (Niemeyer et al., 2014), the nodes ni ∈ n correspond to the 3D points Xi of the point cloud, whereas the edges eij ∈ e connect neighboring pairs of nodes (ni, nj). Consequently, the number of nodes in the graph is identical to the number NP of points to be classified. It is the goal of classification to assign a class label ci ∈ {c^1, ..., c^L} to each 3D point Xi (and thus to each node ni of the graph), where L is the number of classes, superscripts indicate specific class labels corresponding to an object type, and subscripts indicate the class label of a given point. Due to the mutual dependencies between the class labels at neighboring points induced by the edges of the graph, the class labels of all points have to be determined simultaneously. We collect the class labels of all points in a vector C = [c1, ..., ci, ..., cNP]^T. Denoting the combination of all input data by x, we want to determine the configuration of class labels that maximizes the posterior probability p(C|x) (Kumar and Hebert, 2006):
\[ p(C \mid x) = \frac{1}{Z(x)} \prod_{i \in n} \left( \phi(x, c_i) \prod_{j \in N_i} \psi(x, c_i, c_j) \right) \tag{5} \]
Here, Z(x) is a normalization constant called the partition function. As it does not depend on the class labels, it can be neglected in classification. The functions φ(x, ci) are called association potentials; they provide local links between the data x and the local class labels ci. The functions ψ(x, ci, cj), referred to as interaction potentials, are responsible for the local context model, providing the links between the class labels (ci, cj) of the pair of nodes connected by the edge eij and the data x. Ni denotes the set of neighbors of node ni that are linked to ni by an edge. Details about our definitions of the individual terms and the local neighborhood are given in the subsequent subsections.
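To make the role of the two potentials concrete, the following sketch scores label configurations on a tiny graph. It assumes that the posterior factorizes into a product of association and interaction potentials over nodes and their neighbors, so that the logarithm of the unnormalized posterior is a sum; the partition function is omitted because it is the same for every configuration. All names and numbers are hypothetical toy values, and the exhaustive search is only feasible for such a small example.

```python
import itertools
import math


def log_score(labels, neighbors, association, interaction):
    """Unnormalized log-posterior of one label configuration (Z(x) omitted)."""
    score = 0.0
    for i, c_i in enumerate(labels):
        score += math.log(association(i, c_i))
        for j in neighbors[i]:
            score += math.log(interaction(i, j, c_i, labels[j]))
    return score


# Toy data: two mutually connected points and two classes ("g" and "v").
unary = [{"g": 0.6, "v": 0.4}, {"g": 0.45, "v": 0.55}]
neighbors = {0: [1], 1: [0]}


def association(i, c):
    return unary[i][c]


def interaction(i, j, c_i, c_j):
    # Simple smoothing potential that rewards identical labels on an edge.
    return math.e if c_i == c_j else 1.0


# Pointwise decision: each node takes its own most probable label.
pointwise = tuple(max(u, key=u.get) for u in unary)
# Contextual decision: exhaustive search over all label configurations.
contextual = max(
    itertools.product("gv", repeat=2),
    key=lambda labels: log_score(labels, neighbors, association, interaction),
)
```

In this example the pointwise decision labels the second point as "v", whereas the joint configuration with the highest score assigns "g" to both points, which illustrates the smoothing effect of the context model.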

Association Potentials:
Any local discriminative classifier whose output can be interpreted in a probabilistic way can be used to define the association potentials φ(x, ci) in Equation 5. Note that the data x appear without an index in the argument list, which means that the association potential for node ni may depend on all the data (Kumar and Hebert, 2006). This is usually considered by defining site-wise feature vectors fi(x), in our case one such vector per 3D point Xi to be classified. We use the feature vectors fi defined according to Equation 4 as site-wise vectors fi(x), whose components are functions of the data within a neighborhood of point Xi. In our experiments, we will compare different variants of these feature vectors based on different definitions of the local neighborhood used for computing the features as defined in Section 3.1. The association potential can be defined as the posterior probability of a local discriminative classifier based on fi(x) (Kumar and Hebert, 2006):
\[ \phi(x, c_i) = p(c_i \mid f_i(x)) \tag{6} \]
For individual point classification, a good trade-off between classification accuracy and computational effort can be achieved by using a Random Forest classifier (Breiman, 2001). Such a Random Forest consists of a pre-defined number NT of random decision trees which are trained independently on different subsets of the given training data, where the subsets are randomly drawn with replacement. The random sampling results in randomly different decision trees and thus in diversity in terms of de-correlated hypotheses across the individual trees. In the classification, the site-wise feature vectors fi(x) are classified by each tree. Each tree casts a vote for one of the class labels c^l. Usually, the majority vote over all class labels is used as the classification output, because it can be expected to result in improved generalization and robustness. In order to use the output of a Random Forest for the association potential, we define the posterior of each class label c^l to be the ratio of the number N_l of votes cast for that class and the number NT of involved decision trees:
\[ p(c_i = c^l \mid f_i(x)) = \frac{N_l}{N_T} \tag{7} \]
The most important parameters of a Random Forest are the number NT of trees to be used for classification, the minimum allowable number nmin of training points for a tree node to be split, the number of active variables na to be used for the test in each tree node, and the maximum depth dmax of each tree. For our experiments, we use the Random Forest implementation of OpenCV.
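The vote ratio used as class posterior can be illustrated without any particular Random Forest library. In the sketch below, `tree_votes` is a hypothetical list holding the label predicted by each tree for one point; how these votes are obtained from a concrete implementation is deliberately left out.

```python
from collections import Counter


def vote_posterior(tree_votes, class_labels):
    """Posterior per class as the fraction of trees voting for that class."""
    counts = Counter(tree_votes)
    n_trees = len(tree_votes)
    return {label: counts[label] / n_trees for label in class_labels}


# Ten trees vote for one point; class "w" receives no vote at all.
votes = ["g"] * 6 + ["v"] * 3 + ["f"]
posterior = vote_posterior(votes, ["w", "p/t", "f", "g", "v"])
```

A class without any vote receives a posterior of exactly zero, so a small floor value would be needed before taking logarithms in a CRF; whether and how this is handled is not stated in the text.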

Interaction Potentials:
Just as the association potentials, the interaction potentials can be based on the output of a discriminative classifier (Kumar and Hebert, 2006). In (Niemeyer et al., 2014), a Random Forest is used as discriminative classifier delivering a posterior p(ci, cj | µij(x)) for the occurrence of the class labels (ci, cj) at two neighboring points given an observed interaction feature vector µij (the concatenated node feature vectors). Thus ψ(x, ci, cj) = p(ci, cj | µij(x)) is used to define the interaction potential. The derived results show that such a model delivers a better classification performance for classes having a relatively small number of instances in a point cloud. However, in order to apply such an approach, it is a prerequisite to have a sufficient number of training samples for each type of class transition; if the original number of classes is L, one would need enough training samples for L × L such transitions, which may be prohibitive. Consequently, we use a simpler model, namely a variant of the contrast-sensitive Potts model (Boykov and Jolly, 2001) for the interaction potentials:
\[ \log \psi(x, c_i, c_j) = \delta_{c_i c_j} \cdot w_1 \cdot \frac{N_a}{N_{k_i}} \cdot \left( w_2 + (1 - w_2) \, e^{-\frac{d_{ij}(x)^2}{2 \sigma^2}} \right) \tag{8} \]
In this equation, dij(x)² = ‖fi(x) − fj(x)‖² is the square of the Euclidean distance between the node feature vectors fi(x) and fj(x) of the two nodes connected by the edge eij. Furthermore, δcicj represents the Kronecker delta returning 1 if the class labels ci and cj are identical and 0 otherwise. The parameter σ is the average square distance between the feature vectors at neighboring training points, Na is the average number of edges connected to a node in the CRF and Nki is the number of neighbors of node ni. The weight parameter w1 influences the impact of the interaction potential on the classification results. The normalization of the interaction potential by the ratio Na/Nki is required for the interaction potentials to have an equal total impact on the classification of all nodes (Wegner et al., 2011). The model in Equation 8 will result in a data-dependent smoothing of the classification results. The second weight parameter w2 ∈ [0, 1] describes the degree to which smoothing will depend on the data.
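A contrast-sensitive Potts term of this kind might be coded as follows. The exact functional form is an assumption chosen to be consistent with the parameter description: a reward for identical labels, scaled by w1 and by Na/Nki, that blends a constant part (w2) with a part decaying in the squared feature distance (1 − w2). The function and argument names are hypothetical.

```python
import math


def log_interaction_potential(f_i, f_j, c_i, c_j, sigma, n_avg, n_i, w1, w2):
    """Log of an assumed contrast-sensitive Potts interaction potential."""
    if c_i != c_j:
        return 0.0  # Kronecker delta: no reward for different labels
    d_sq = sum((a - b) ** 2 for a, b in zip(f_i, f_j))
    data_term = math.exp(-d_sq / (2.0 * sigma ** 2))
    return w1 * (n_avg / n_i) * (w2 + (1.0 - w2) * data_term)


same = log_interaction_potential([0.2, 0.4], [0.2, 0.4], "f", "f", 1.0, 15, 15, 5.0, 0.5)
far = log_interaction_potential([0.0, 0.0], [1000.0, 0.0], "f", "f", 1.0, 15, 15, 5.0, 0.5)
diff = log_interaction_potential([0.2, 0.4], [0.2, 0.4], "f", "g", 1.0, 15, 15, 5.0, 0.5)
```

With w2 = 0.5, neighbors with identical features receive the full reward w1, while neighbors with very different features still receive half of it, so the smoothing is only partly data-dependent.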

Definition of the Neighborhood:
An important question in the application of a CRF is the definition of an appropriate neighborhood Ni for each node ni. For images, one can for instance use the four neighbors defined on the image grid (Kumar and Hebert, 2006). For point clouds, such a simple definition is impossible. Typically, the definition of the local neighborhood is based on the k nearest neighbors or on all neighbors within a fixed radius of the node ni. In both cases, a cylindrical or a spherical neighborhood can be used, i.e. the search for neighbors can be carried out using a 2D or a 3D neighborhood. In case of airborne laser scanning data, it has been shown that a 2D neighborhood is to be preferred, because in an urban area building façades will only receive a relatively small number of laser points, and the height differences between neighboring points (in 2D) carry a lot of information (Niemeyer et al., 2014). The method described in this paper is designed for data acquired by laser scanners on mobile mapping devices, where one has to deal with many points on building façades, in which case a cylindrical neighborhood does not make much sense. Consequently, we use the k nearest neighbors in 3D of each point to define the edges of the graph. However, selecting a single value for k may not be appropriate in case of varying point density. Hence, we use the neighborhood size as defined in Section 3.1 for spatially varying definitions of the local neighborhood. For performance reasons, we have to apply stricter limits to the size of the local neighborhood than for the size of the local neighborhood used to extract the features.
Thus, if the neighborhood size determined according to one of the methods defined in Section 3.1 is larger than a threshold kmax,CRF, it will be set to kmax,CRF. In our experiments, we will compare several such definitions of the neighborhood size, some of them using a neighborhood with fixed scale parameter k. For variants with variable k, the average number Na of neighbors in Equation 8 will only be based on the actual number of neighbors per node (that is, after enforcing the threshold kmax,CRF).
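The thresholding of the per-point neighborhood size and the resulting average number of neighbors can be written compactly. This is a minimal sketch with a hypothetical function name; `optimal_k` is assumed to hold the neighborhood size selected for each point during feature extraction.

```python
def crf_neighborhood_sizes(optimal_k, k_max_crf):
    """Cap each per-point neighborhood size and return the sizes and their mean."""
    capped = [min(k, k_max_crf) for k in optimal_k]
    n_avg = sum(capped) / len(capped)  # average taken after enforcing the cap
    return capped, n_avg


sizes, n_avg = crf_neighborhood_sizes([10, 40, 25, 100], k_max_crf=25)
```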

Training and Inference:
In order to determine the parameters of our classifier, we need training data, i.e. a set of 3D points with known class labels. The parameters of the two types of potentials are trained independently from each other. In case of the association potentials, this involves the training of a Random Forest classifier, where we randomly select an identical number NS of training samples per class. This is required because otherwise a class with many samples might lead to a bias towards that class in training (Chen et al., 2004). Note that for classes with a small number of training samples, this might result in a duplication of training samples. For the interaction potentials, the parameter σ is determined as the average square distance between neighboring points in the training data based on the same local neighborhood that is used for the definition of the graph in classification. The weight parameters w1 and w2 could be set based on a technique such as cross validation (Shotton et al., 2009). Here, they are set to values that were found empirically.
For inference, i.e. for the determination of the label configuration C maximizing the posterior in Equation 5 once the parameters of the potentials are known, we use Loopy Belief Propagation (Frey and MacKay, 1998), a standard optimization technique for graphs with cycles.
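Drawing the same number of training samples from every class can be sketched as below. The policy shown here is an assumption: classes with enough samples are drawn without replacement, and smaller classes are drawn with replacement, which produces the duplicates mentioned in the text. The function name and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(labels, n_per_class, seed=0):
    """Return indices of a training subset with n_per_class samples per class."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    selected = []
    for label in sorted(indices_by_class):
        indices = indices_by_class[label]
        if len(indices) >= n_per_class:
            selected.extend(rng.sample(indices, n_per_class))
        else:
            # Too few samples: draw with replacement, which duplicates points.
            selected.extend(rng.choices(indices, k=n_per_class))
    return selected


labels = ["g"] * 50 + ["w"] * 3
subset = balanced_sample(labels, n_per_class=10)
```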

EXPERIMENTAL RESULTS
In the following, we present the involved dataset, describe the conducted experiments and discuss the derived results.

Dataset
A benchmark point cloud dataset representing an urban environment has been released with the Oakland 3D Point Cloud Dataset (Munoz et al., 2009). The data have been collected in the vicinity of the CMU campus in Oakland, USA, with a mobile laser scanning system. This system captures the local 3D geometry with side-looking SICK LMS laser scanners used in push-broom mode. After acquisition, the dataset has been split into a training set consisting of approximately 37k points and a test set with about 1.3M points. The reference class labels were assigned to the points in a semi-automatic annotation process. Thus, the classification task consists of assigning each 3D point a semantic label from the set {wire (w), pole/trunk (p/t), façade (f), ground (g), vegetation (v)}. The distribution of the classes in the test set is very inhomogeneous, with 70.5% and 20.2% of the data belonging to classes g and v, respectively. Class f constitutes 8.4% of the points, whereas the two remaining classes (w and p/t) only consist of 0.3% and 0.6% of the points, respectively.

Experiments
For our experiments, we use five different variants of the definition of the neighborhood for computing the features described in Section 3.2.Three variants (denoted by N10, N50 and N100) are based on fixed scale parameters (thus a fixed neighborhood) of k = 10, 50 and 100, respectively, for all points of the point cloud.For variant Nopt,dim the optimal neighborhood derived via dimensionality-based scale selection is used, whereas for variant N opt,λ the optimal neighborhood is derived via eigenentropybased scale selection (cf.Section 3.1).For each variant of the feature vectors, two variants of the Random Forest classifier based on different settings are compared.In variant RF100 the Random Forest consists of 100 trees with a maximum tree depth of dmax = 4 which are trained on 1,000 training samples per class (NS = 1,000), whereas in variant RF200 we train 200 trees with a maximum tree depth of dmax = 15 on 10,000 training samples per class.In both variants, a node is only split if it is reached by at least nmin = 20 training samples, and the number of features for each test (na) is set to the square root of the number of features, following the recommendations of the openCV implementation.The first setting is a standard one, whereas the second one is expected to lead to a slightly improved performance due to the larger number of training samples and to the larger number of trees, though at the cost of a higher computational effort.
First, we apply a classification solely based on the association potentials to the dataset, i.e. on the results of the two variants of the Random Forest classifier; the respective classification variants are denoted by RF100 and RF200, respectively.After that, we apply the contrast-sensitive Potts model in a CRF-based classification.We use w2 = 0.5, a value found empirically; in a set of experiments not reported here for lack of space, we found that changes of that parameter had very little influence on the results.The chosen value gives equal influence of the data-dependent and the data-independent terms of the interaction potential.We compare three different values of the weight w1 (w1 = 1.0, w1 = 5.0 and w1 = 10.0) to show its impact on the classification results; the respective classification variants are referred to as CRF 1 N T , CRF 5 N T and CRF 10 N T , respectively, where NT is either 100 or 200, depending on whether the association potential was based on RF100 or on RF200.The size of the neighborhood for each node of the graph is based on the one for the definition of the features, but thresholded by a parameter kmax,CRF.For variant N10, we connect each point to its 10 nearest neighbors, whereas for N50 and N100 the number of neighbors is set to kmax,CRF = 15.For the other variants, we use kmax,CRF = 25, but vary the size of the neighborhood according to the one used for the definition of the features.This results in an average number of Na = 21 neighbors for Nopt,dim and Na = 15 neighbors for N opt,λ .
As a consequence of these definitions, we carry out 40 experiments. In each case, the test set is classified, and the resulting labels are compared to the reference labels on a per-point basis. We determine the confusion matrices and derive the overall accuracy (OA), completeness (cmp), correctness (cor) and quality (q) of the results. For most experiments, we only report OA and q, the latter being a compound metric indicating a good trade-off between omission and commission errors (Heipke et al., 1997).
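The number of experiments and the evaluation metrics can be made explicit with a small sketch. It assumes the usual definitions based on true positives (TP), false positives (FP) and false negatives (FN), i.e. cmp = TP / (TP + FN), cor = TP / (TP + FP) and q = TP / (TP + FP + FN), and a confusion matrix whose rows refer to the reference classes; the variable and function names are illustrative.

```python
from itertools import product

NEIGHBORHOODS = ["N10", "N50", "N100", "Nopt,dim", "Nopt,lambda"]
FORESTS = ["RF100", "RF200"]
# None denotes the classification without interactions (Random Forest only).
W1_VALUES = [None, 1.0, 5.0, 10.0]

# 5 neighborhoods x 2 forests x (1 pointwise + 3 CRF variants) = 40 experiments
experiments = list(product(NEIGHBORHOODS, FORESTS, W1_VALUES))


def evaluate(confusion):
    """Derive OA and per-class cmp, cor and q from a confusion matrix.

    confusion[r][c] is assumed to be the number of points with reference
    class r that were assigned to class c.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    oa = sum(confusion[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                        # omission errors
        fp = sum(confusion[r][i] for r in range(n)) - tp   # commission errors
        cmp_ = tp / (tp + fn) if tp + fn else 0.0
        cor = tp / (tp + fp) if tp + fp else 0.0
        q = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        per_class.append({"cmp": cmp_, "cor": cor, "q": q})
    return oa, per_class
```

Since q penalizes both false positives and false negatives, it can never exceed the smaller of cmp and cor, which makes it a conservative summary for the small classes.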

Results and Discussion
The overall accuracy achieved in all experiments is summarized in Table 1, whereas the quality q for the five classes is shown in Tables 2-6. Some results are visualized in Figure 1. Looking at the numbers in Table 1, one can get the impression that the classification performs reasonably well in all cases, the lowest value of overall accuracy being 85.3% (RF100, N10). The best overall accuracy exceeds this value by about 10 percentage points (95.5% for CRF^5_200, Nopt,λ). However, these results are dominated by the excellent discrimination of class g from the others, which is expressed by a quality of 92.3% - 98.4% for that class (cf. Table 5). As mentioned above, this class accounts for 70.5% of all points in the test set. The quality is still reasonable for class v, which contains the second largest part of the data (20.2%), though the variation is much larger (61.8% - 88.7%; cf. Table 6). For the other classes, in particular for w, the quality is very low. Whereas for p/t and f it can be improved considerably by varying the neighborhood definitions and the classifier, for class w the best result is q = 11.7%, with a variation of about 8% between variants (cf. Table 2). The main reason for the poor quality numbers of classes w and p/t is a low correctness for these classes, i.e. there are many false positives (for an example, cf. Table 7). In both cases, this is due to a relatively large number of misclassified points that actually correspond to class f. In the case of poles/trunks, structures appearing like semicolumns in the façades are frequently misclassified as p/t. Misclassifications between f and w frequently occur at façades that are orthogonal to the road, so that they show a sparser point distribution than those facing the road. In any case, we have to put the relatively high values for overall accuracy into perspective: some classes can be differentiated well, independently of the classification setup, whereas wires of power lines (w) cannot be differentiated using any of the methods compared here, and the main difference between the individual experiments is in the quality of the differentiation of the classes p/t and f.
Comparing the results based on a Random Forest classifier consisting of 100 trees (RF100, CRF^1_100, CRF^5_100, CRF^10_100) to those based on 200 trees, it is obvious that using more trees and more training data leads to a slightly better classification performance. The increase in OA by using 200 trees is in the order of 0.2% - 3.6% for all variants (cf. Table 1). The difference in q is largest for the variants based on a fixed neighborhood. This is particularly the case for class f for variants N10 and N50. Here, the ordering is reversed, and the variants based on RF100 achieve a considerably better performance (cf. Table 4), though at the price of other misclassifications. However, these versions are not the best-performing ones for that class, and for the variants based on a variable neighborhood the differences in q in Table 4 are smaller, in particular for the versions based on a CRF.
Of the variants using a fixed neighborhood, N50 performs best in nearly all indices. N10 performs considerably worse in OA and particularly in the quality of classes p/t and f. This also holds for the largest constant neighborhood, N100, though to a lesser degree. A neighborhood size of 50 points seems to give a relatively good trade-off between smoothing and allowing changes at class boundaries. If no interactions are considered (RF100 and RF200), the variants based on a variable neighborhood perform slightly worse than N50 in overall accuracy, with Nopt,λ performing slightly better than Nopt,dim in quality for the "small" classes (w, p/t, f) if RF200 is used as the base classifier.
Involving contextual information in the classification process improves nearly all classification indices. The improvement in overall accuracy varies between about 1% and 5% (cf. Table 1). It is most pronounced for the variant having the poorest OA if no interactions are considered (RF100, N10). Apart from this single example, it is in general better for the variants having an adaptive neighborhood than for N50, in the order of 2% for the first and of 1% for the latter if 200 trees are used for the association potential. Consequently, the variant Nopt,λ performs better than N50 in all cases, the margin being in the order of 1%. If RF100 is used for the association potential, this also holds for Nopt,dim, whereas if 200 trees are used, Nopt,dim performs similarly to N50. Again, the differences in quality for the classes w, p/t and f show higher variations. It becomes obvious that if the better base classifier (RF200) is used, these classes are differentiated best by using an adaptive neighborhood as in variant Nopt,λ, in the case of class p/t by a large margin. The weight of the interaction potential does have an impact on the results, but at least in those cases where 200 trees are used for the association potentials, the effect of changing the weight in the range tested here is relatively low compared to the impact of using the interactions in the first place. The value w1 = 5.0 seems to be a good trade-off in this application.
One can see from our results that the main impact of using interactions in classification consists of a considerable improvement in the classification performance of classes that are not dominant in the data, which is consistent with the findings in (Niemeyer et al., 2014) for airborne laser scanning data. In the case of mobile laser scanning data, it might in fact be those classes one is mainly interested in. The most dominant class g can easily be distinguished from the remaining data by simply considering height, and the respective completeness and correctness numbers do not vary much. In contrast, p/t might for instance be a class of major interest for mapping urban infrastructure. When using a fixed neighborhood N50 and a Random Forest without interactions (variant RF200), the completeness and the correctness of the results are 52.5% and 42.0%, respectively, resulting in a quality of 30.4% (Table 3). Nearly half of the points on poles or trunks are not correctly detected, and more than half of the points classified as p/t are in fact not situated on poles or trunks. Using the neighborhood Nopt,λ and a CRF (CRF^5_200), these numbers are increased to a completeness of 78.8% and a correctness of 59.7% (cf. Table 7), which results in a quality of 51.4% and certainly provides a better starting point for subsequent processes.
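The quality values quoted above follow directly from the completeness and correctness values. The short derivation below assumes the usual definitions in terms of true positives (TP), false positives (FP) and false negatives (FN); the symbols TP, FP and FN are introduced here for illustration only.

```latex
\begin{align}
  \mathrm{cmp} &= \frac{TP}{TP + FN}, \qquad
  \mathrm{cor} = \frac{TP}{TP + FP}, \qquad
  q = \frac{TP}{TP + FP + FN} \\
  \frac{1}{\mathrm{cmp}} + \frac{1}{\mathrm{cor}} - 1
    &= \frac{(TP + FN) + (TP + FP) - TP}{TP}
     = \frac{TP + FP + FN}{TP}
     = \frac{1}{q} \\
  \text{RF200, N50:} \quad
  q &= \left( \frac{1}{0.525} + \frac{1}{0.420} - 1 \right)^{-1}
     \approx 0.304 \\
  \text{CRF}^{5}_{200},\ N_{\mathrm{opt},\lambda}\text{:} \quad
  q &= \left( \frac{1}{0.788} + \frac{1}{0.597} - 1 \right)^{-1}
     \approx 0.514
\end{align}
```

Both values agree with the reported qualities of 30.4% and 51.4% for class p/t.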

CONCLUSIONS
In this paper, we have presented a generic approach for automated 3D scene analysis. The approach addresses the interrelated issues of (i) neighborhood selection, (ii) feature extraction and (iii) contextual classification, and its novelty consists of using individual 3D neighborhoods of optimal size for the subsequent steps of feature extraction and contextual classification. The results derived on a standard benchmark dataset clearly indicate the beneficial impact of involving contextual information in the classification process and that using individual 3D neighborhoods of optimal size significantly increases the quality of the results for both pointwise and contextual classification.
For future work, we want to carry out deeper investigations concerning the influence of the amount of training data as well as the influence of the number of different classes on the classification results for different datasets. Moreover, we intend to exploit the results of contextual point cloud classification for extracting single objects in a 3D scene such as trees, cars or traffic signs.

Figure 1. Classified 3D point clouds for the neighborhoods {N50, Nopt,λ} (left and right column) and the classifiers RF200, CRF^5_200 (top and bottom row) when using a standard color encoding (wire: blue; pole/trunk: red; façade: gray; ground: brown; vegetation: green). Note the noisy appearance of the results for individual point classification (top row).
[...], which has been shown to give good results in point cloud classification. This feature set consists of both 3D features and 2D features. A group of 3D features represents basic properties of the neighborhood such as the absolute height Hi of the center point Xi, the radius rk-NN,i of the neighborhood, the maximum difference ΔHk-NN,i and the standard deviation σH,k-NN,i of height values within the neighborhood, the local point density Di, and the verticality Vi. Further 3D features are based on the normalized eigenvalues of the 3D structure tensor and consist of linearity Lλ,i, planarity Pλ,i, scattering Sλ,i, omnivariance Oλ,i, anisotropy Aλ,i, eigenentropy Eλ,i, the sum Σλ,i of eigenvalues and the change of curvature Cλ,i.
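The eigenvalue-based 3D features listed above can be sketched as follows. The formulas are the definitions commonly used in the literature for these feature names and are given here as an assumption, since the exact definitions are not part of this excerpt. The function name is hypothetical, and computing the sum from the raw (not normalized) eigenvalues is also an assumption of this sketch.

```python
import math


def eigen_features(lambdas):
    """Eigenvalue-based features from the three eigenvalues of the
    3D structure tensor of a local neighborhood (commonly used definitions)."""
    # Clamp tiny negative values caused by numerical noise and sort descending.
    l1, l2, l3 = sorted((max(l, 0.0) for l in lambdas), reverse=True)
    total = l1 + l2 + l3
    if total <= 0.0:
        raise ValueError("at least one eigenvalue must be positive")
    # Normalized eigenvalues e1 >= e2 >= e3 >= 0 with e1 + e2 + e3 = 1.
    e1, e2, e3 = l1 / total, l2 / total, l3 / total
    return {
        "linearity": (e1 - e2) / e1,
        "planarity": (e2 - e3) / e1,
        "scattering": e3 / e1,
        "omnivariance": (e1 * e2 * e3) ** (1.0 / 3.0),
        "anisotropy": (e1 - e3) / e1,
        # Shannon entropy of the normalized eigenvalues; 0 * ln(0) is taken as 0.
        "eigenentropy": -sum(e * math.log(e) for e in (e1, e2, e3) if e > 0.0),
        "sum": total,  # assumption: sum of the raw eigenvalues
        "change_of_curvature": e3,  # e3 / (e1 + e2 + e3) with e1 + e2 + e3 = 1
    }
```

By construction, linearity, planarity and scattering sum to one, so they can be interpreted as the degree to which a neighborhood is linear, planar or volumetric.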

Table 1. Overall accuracy OA [%] achieved in all experiments.

Table 2. Quality q [%] for class w achieved in all experiments.

Table 3. Quality q [%] for class p/t achieved in all experiments.

Table 4. Quality q [%] for class f achieved in all experiments.

Table 5. Quality q [%] for class g achieved in all experiments.

Table 6. Quality q [%] for class v achieved in all experiments.