FINDING A GOOD FEATURE DETECTOR-DESCRIPTOR COMBINATION FOR THE 2D KEYPOINT-BASED REGISTRATION OF TLS POINT CLOUDS

ABSTRACT: The automatic and accurate registration of terrestrial laser scanning (TLS) data is a topic of great interest in the domains of city modeling, construction surveying or cultural heritage. While many recent approaches focus on keypoint-based point cloud registration relying on forward-projected 2D keypoints detected in panoramic intensity images, little attention has been paid to the selection of appropriate keypoint detector-descriptor combinations. Instead, keypoints are commonly detected and described by applying well-known methods such as the Scale Invariant Feature Transform (SIFT) or Speeded-Up Robust Features (SURF). In this paper, we present a framework for evaluating the influence of different keypoint detector-descriptor combinations on the results of point cloud registration. For this purpose, we involve five different approaches for extracting local features from the panoramic intensity images and exploit the range information of putative feature correspondences in order to define bearing vectors which, in turn, may be used to transfer the task of point cloud registration from the object space to the observation space. With an extensive evaluation of our framework on a standard benchmark TLS dataset, we clearly demonstrate that replacing SIFT and SURF detectors and descriptors by more recent approaches significantly improves point cloud registration in terms of accuracy, efficiency and robustness.


INTRODUCTION
In order to obtain a highly accurate and detailed acquisition of local 3D object surfaces within outdoor environments, terrestrial laser scanning (TLS) systems are used for a variety of applications in the domains of city modeling, construction surveying or cultural heritage. Each captured TLS scan is represented in the form of a point cloud consisting of a large number of scanned 3D points and, optionally, additional attributes for each point such as intensity information. In order to provide a dense and (almost) complete 3D acquisition of interesting parts of a scene, multiple scans typically have to be captured from different locations and, since the spatial 3D information is only measured w.r.t. the local coordinate frame of the laser scanner, it is desirable to automatically estimate the respective transformations in order to align all captured scans in a common coordinate frame. The estimation of these transformations is commonly referred to as point cloud registration, and nowadays most approaches rely on specific features extracted from the point clouds instead of using the complete point clouds. Despite a variety of features which may be extracted from point clouds and facilitate point cloud registration (e.g. planes or lines), we focus on the use of keypoints, since these may efficiently be extracted as local features. While approaches for detecting and describing 3D keypoints have recently been involved for point cloud registration (Theiler et al., 2013; Theiler et al., 2014), such a strategy typically relies on a subsampling of the original point cloud (e.g. via voxel grid filtering) in order to obtain an approximately homogeneous point density. The alternative of directly working on the captured data consists of exploiting the discrete (spherical or cylindrical) scan pattern and deriving 2D image representations for either range or intensity information. Particularly in the intensity images, distinctive 2D keypoints may efficiently be derived and, in case they are part of a feature correspondence between different images, subsequently projected to 3D space by exploiting the corresponding range information which, in turn, yields sparse point sets of corresponding 3D points.
Concerning the use of 2D keypoints, the keypoint detectors and descriptors presented with the Scale Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF) are typically exploited. In the absence of noise, a low number of feature correspondences may already be sufficient to obtain a suitable estimate for the relative orientation between two 3D point sets. However, the range measurements of a TLS system are typically corrupted with a certain amount of noise, which requires additional effort. Consequently, it has been proposed to increase the reliability of the estimate by assigning each keypoint a quality measure which indicates the reliability of the respective range information and allows discarding unreliable keypoints (Barnea and Filin, 2008; Weinmann and Jutzi, 2011). Furthermore, the feature matching may result in a certain amount of wrong feature correspondences, which may be identified by involving the well-known RANdom SAmple Consensus (RANSAC) algorithm (Fischler and Bolles, 1981) for robustly estimating a given transformation model (Barnea and Filin, 2007; Boehm and Becker, 2007). However, little attention has been paid to the fact that more recent approaches for extracting local features seem to overcome the main limitations of SIFT and SURF by increasing the computational efficiency while simultaneously delivering even more feature correspondences which are still reliable and may thus significantly contribute to obtaining a suitable estimate.
In this paper, we present a framework involving different modern 2D keypoint detectors and descriptors for finding corresponding image contents in the panoramic intensity images derived for the single scans. The respective forward-projection to 3D space yields sparse point sets of corresponding 3D points. Instead of directly exploiting these corresponding 3D points for point cloud registration, however, we exploit the bearing vectors defined by the origin of the local coordinate frame and the sparse 3D point sets, since these bearing vectors may be determined with a higher reliability in comparison to range measurements. This allows us to transfer the task of point cloud registration to the task of finding the relative orientation between sets of bearing vectors, which may efficiently be handled by well-known algorithms for estimating the pose of omnivision cameras. As main contributions,

• we involve different approaches for extracting local features from the panoramic intensity images,
• we exploit the local features and the corresponding range information to define the respective bearing vectors,
• we transfer the task of point cloud registration from the object space (i.e. the direct alignment of 3D point sets) to the observation space (i.e. the direct alignment of sets of bearing vectors), and
• we investigate the influence of the feature extraction methods on the results of point cloud registration.
After briefly reviewing related work w.r.t. keypoint extraction and matching and w.r.t. keypoint-based point cloud registration in Section 2, we present the different methods involved in our framework in Section 3. Subsequently, in Section 4, we provide an extensive evaluation of our framework on a standard benchmark TLS dataset and discuss the derived results w.r.t. performance, efficiency and robustness. Finally, in Section 5, we draw conclusions and outline ideas for future research.

RELATED WORK
In our work, we focus on keypoint-based point cloud registration, where the keypoints are derived from 2D imagery. In the following, we hence reflect different approaches to detect and describe such keypoints, which represent the basis for deriving sparse 3D point sets (Section 2.1), and, subsequently, we discuss how the corresponding sparse 3D point sets may be aligned in a common coordinate frame (Section 2.2).

Keypoint Extraction and Matching
Generally, different types of visual features may be extracted from images in order to detect corresponding image contents (Weinmann, 2013). However, local features such as corners, blobs or small image regions offer significant advantages. Since such local features (i) may be extracted very efficiently, (ii) are accurately localized, (iii) remain stable over reasonably varying viewpoints and (iv) allow an individual identification, they are well-suited for a variety of applications such as object recognition, autonomous navigation and exploration, image and video retrieval, image registration or the reconstruction, interpretation and understanding of scenes (Tuytelaars and Mikolajczyk, 2008; Weinmann, 2013).
Generally, the extraction of local features consists of two steps represented by feature detection and feature description.
For feature detection, corner detectors such as the Harris corner detector (Harris and Stephens, 1988) or the Features from Accelerated Segment Test (FAST) detector (Rosten and Drummond, 2005) are widely used. The detection of blob-like structures is typically solved with a Difference-of-Gaussian (DoG) detector, which is integrated in the Scale Invariant Feature Transform (SIFT) (Lowe, 1999; Lowe, 2004), or a Determinant-of-Hessian (DoH) detector, which is the basis for deriving Speeded-Up Robust Features (SURF) (Bay et al., 2006; Bay et al., 2008). Distinctive image regions are for instance detected with a Maximally Stable Extremal Region (MSER) detector (Matas et al., 2002). To account for non-incremental changes between images with similar content, and thus possibly significant changes in scale, a scale-space representation as introduced for the SIFT and SURF detectors is indispensable. While the SIFT and SURF detectors rely on a Gaussian scale-space, the use of a non-linear scale-space has been proposed for detecting KAZE features (Alcantarilla et al., 2012) or Accelerated KAZE (A-KAZE) features (Alcantarilla et al., 2013).
For feature description, the main idea consists of deriving keypoint descriptors that allow the extracted keypoints to be discriminated reliably. Inspired by investigations on biological vision, the descriptor presented as the second part of the Scale Invariant Feature Transform (SIFT) (Lowe, 1999; Lowe, 2004) is one of the first and still one of the most powerful feature descriptors.
Since, for applications focusing on computational efficiency, the main limitation of deriving SIFT descriptors consists of the computational effort, a more efficient descriptor has been presented with the Speeded-Up Robust Features (SURF) descriptor (Bay et al., 2006; Bay et al., 2008). In contrast to these descriptors, which consist of a vector representation encapsulating floating-point numbers, a significant speed-up is typically achieved by involving binary descriptors such as the Binary Robust Independent Elementary Feature (BRIEF) descriptor (Calonder et al., 2010).
For many applications, it is important to derive stable keypoints and keypoint descriptors which are invariant to image scaling and image rotation, and robust w.r.t. image noise, changes in illumination and small changes in viewpoint. Satisfying these constraints, SIFT features are commonly applied in a variety of applications, which becomes visible in more than 9.2k citations of (Lowe, 1999) and more than 29.5k citations of (Lowe, 2004), while the use of SURF features has been reported in more than 5.4k citations of (Bay et al., 2006) and more than 5.9k citations of (Bay et al., 2008) (citation counts assessed via Google Scholar on 30 April 2015). Both SIFT and SURF features are also typically used for detecting feature correspondences between intensity images derived for terrestrial laser scans. For each feature correspondence, the respective keypoints may be projected to 3D space by considering the respective range information. This, in turn, yields sparse point sets of corresponding 3D points.

Keypoint-Based Point Cloud Registration
Once sparse point sets of corresponding 3D points have been derived for two scans, the straightforward solution consists of estimating a rigid-body transformation in the least squares sense (Arun et al., 1987; Umeyama, 1991). However, the two 3D point sets may also contain some point pairs resulting from incorrect feature correspondences and, consequently, it is advisable to involve the well-known RANSAC algorithm (Fischler and Bolles, 1981) for obtaining an increased robustness (Barnea and Filin, 2007; Boehm and Becker, 2007).
In case of two coarsely aligned 3D point sets, the well-known Iterative Closest Point (ICP) algorithm (Besl and McKay, 1992) and its variants (Rusinkiewicz and Levoy, 2001; Gressin et al., 2013) may be applied. The main idea of such an approach is to iteratively minimize a cost function representing the difference between the respective sparse 3D point sets. However, if the coarse alignment between the considered 3D point sets is not good enough, the ICP algorithm may fail to converge or get stuck in a local minimum instead of the global one. Consequently, such an approach is mainly applied for fine registration.
A further alternative consists of exploiting the spatial information of the derived 3D point sets for a geometric constraint matching based on the 4-Points Congruent Sets (4PCS) algorithm (Aiger et al., 2008), which has recently been adapted to keypoints with the Keypoint-based 4-Points Congruent Sets (K-4PCS) algorithm (Theiler et al., 2013; Theiler et al., 2014). While the K-4PCS algorithm provides a coarse alignment which is good enough to proceed with an ICP-based fine registration, the processing time for the geometric constraint matching increases significantly with the number of points in the 3D point sets due to an evaluation of best matching candidates among point quadruples of both 3D point sets.
A different strategy has been presented by transferring the task of point cloud registration to the task of solving the Perspective-n-Point (PnP) problem which, in turn, may be achieved by introducing a virtual camera and backprojecting the sparse 3D point sets onto its image plane (Weinmann et al., 2011; Weinmann and Jutzi, 2011). Thus, 3D/2D feature correspondences are derived and provided as input for an efficient RANSAC-based scheme solving the PnP problem. Being robust due to accounting for both 3D and 2D cues, and being efficient due to involving a non-iterative method with only linear complexity, such an approach is still among the most accurate and most efficient approaches for registering sparse 3D point sets.

METHODOLOGY
As shown in Figure 1, our framework for automatically aligning TLS point clouds consists of two major steps: (i) feature extraction and matching and (ii) point cloud registration. The respective methods involved in these components are provided as well and described in the following subsections.

Keypoint Extraction and Matching
Generally, the performance of keypoint matching is an interplay of the applied keypoint detector and descriptor (Dahl et al., 2011). Hence, different keypoint detector-descriptor combinations may be applied, and these may differ in their suitability depending on the requirements of the respective application. Focusing on scan representations in the form of panoramic intensity images, where keypoint descriptors have to cope with significant changes in rotation and scale for changes in the scanner position, we only involve scale- and rotation-invariant keypoint representations as listed in Table 1. More details on the respective keypoint detectors and descriptors are provided in the following.

3.1.1 SIFT: The Scale Invariant Feature Transform (SIFT) (Lowe, 1999; Lowe, 2004) relies on convolving the image I and subsampled versions of I with Gaussian kernels of variable scale in order to derive the Gaussian scale-space. Subtracting neighboring images in the Gaussian scale-space results in the Difference-of-Gaussian (DoG) pyramid, where extrema in a (3 × 3 × 3) neighborhood correspond to keypoint candidates. These keypoint candidates are improved by an interpolation based on a 3D quadratic function in the scale-space in order to obtain subpixel-accurate locations in image space. Furthermore, keypoint candidates with low contrast, which are sensitive to noise, as well as keypoint candidates located along edges, which can hardly be distinguished from each other, are discarded.
In the next step, each keypoint is assigned its dominant orientation, which is derived at the respective scale by considering the local gradient orientations weighted by their respective magnitudes as well as by a Gaussian centered at the keypoint. Subsequently, the local gradient information is rotated according to the dominant orientation in order to achieve a rotation-invariant keypoint descriptor. The descriptor itself is derived by splitting the local neighborhood into 4 × 4 subregions. For each of these subregions, an orientation histogram with 8 angular bins is derived by accumulating the gradient orientations weighted by the respective magnitude as well as by a Gaussian centered at the keypoint.
The concatenation of all histogram bins and a subsequent normalization yield the final 128-dimensional SIFT descriptor. For deriving feature correspondences, SIFT descriptors are typically compared by considering the ratio of the Euclidean distances of a SIFT descriptor belonging to a keypoint in one image to the nearest and second-nearest SIFT descriptors in the other image. This ratio indicates the degree of similarity and thus the distinctiveness of matched features.
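As an illustration, the following minimal sketch performs this detection and ratio-test matching with OpenCV's SIFT implementation. Note that the experiments in this paper used OpenCV 2.4, whose API differs slightly; the modern cv2.SIFT_create interface is shown here, the image file names are placeholders, and the 0.8 ratio threshold follows Lowe (2004):

```python
import cv2

# Hypothetical file names for two 8-bit panoramic intensity images.
img1 = cv2.imread("scan1_intensity.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scan2_intensity.png", cv2.IMREAD_GRAYSCALE)

# DoG keypoint detection and 128-dimensional descriptor extraction.
sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(img1, None)
kp2, desc2 = sift.detectAndCompute(img2, None)

# Ratio test: keep a match only if the nearest descriptor is clearly
# closer than the second-nearest one.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = []
for pair in matcher.knnMatch(desc1, desc2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
        matches.append(pair[0])
```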
3.1.2 SURF: Speeded-Up Robust Features (SURF) (Bay et al., 2006; Bay et al., 2008) are based on a scale-space representation of the Hessian matrix which is approximated with box filters, so that the elements of the Hessian matrix may be evaluated at very low computational cost using integral images. Thus, distinctive features in an image correspond to locations in the scale-space where the determinant of the approximated Hessian matrix reaches a maximum in a (3 × 3 × 3) neighborhood. The detected maxima are then interpolated in order to obtain subpixel-accurate locations in image space.

Similar to SIFT, a dominant orientation is calculated for each keypoint. For this purpose, the Haar wavelet responses in x- and y-direction within a circular neighborhood are weighted by a Gaussian centered at the keypoint and represented in a new 2D coordinate frame. Accumulating all responses within a sliding orientation window covering 60° yields a local orientation vector, and the orientation vector of maximum length indicates the dominant orientation. For obtaining a rotation-invariant keypoint descriptor, the local gradient information is rotated according to the dominant orientation. Then, the local neighborhood is divided into 4 × 4 subregions and, for each subregion, the Haar wavelet responses in x- and y-direction are weighted by a Gaussian centered at the keypoint. The concatenation of the sums of Haar wavelet responses in x- and y-direction as well as the sums of absolute values of the Haar wavelet responses in x- and y-direction for all subregions and a subsequent normalization yield the final 64-dimensional SURF descriptor. The comparison of SURF descriptors is the same as for SIFT descriptors.

3.1.3 ORB: The approach presented with the Oriented FAST and Rotated BRIEF (ORB) detector and descriptor (Rublee et al., 2011) represents a combination of a modified FAST detector and a modified BRIEF descriptor.
The Features from Accelerated Segment Test (FAST) detector (Rosten and Drummond, 2005) analyzes each pixel (x, y) of an image I and takes into account those pixels located on a surrounding Bresenham circle. The intensity values corresponding to the pixels on the surrounding Bresenham circle are compared to the intensity value I(x, y). Introducing a threshold t, the investigated pixel (x, y) represents a candidate keypoint if a certain number of contiguous pixels have intensity values above I(x, y) + t or below I(x, y) − t. A subsequent non-maximum suppression avoids keypoints at adjacent pixels. The modification resulting in the ORB detector is based on employing a scale pyramid of the image, producing FAST features at each level in the pyramid and adding an orientation component to the standard FAST detector.
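A minimal sketch of ORB extraction and matching with OpenCV is given below; the increased nfeatures value and the file names are our assumptions, the remaining parameter values mirror OpenCV's defaults, and the Hamming-distance matching of the binary descriptor is explained in the following paragraph:

```python
import cv2

img1 = cv2.imread("scan1_intensity.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scan2_intensity.png", cv2.IMREAD_GRAYSCALE)

# nlevels and scaleFactor define the scale pyramid on which oriented
# FAST keypoints are detected; fastThreshold is the segment-test threshold t.
orb = cv2.ORB_create(nfeatures=5000, scaleFactor=1.2, nlevels=8,
                     fastThreshold=20)
kp1, desc1 = orb.detectAndCompute(img1, None)
kp2, desc2 = orb.detectAndCompute(img2, None)

# Binary descriptors are compared via the Hamming distance;
# crossCheck keeps only mutual nearest neighbors.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(desc1, desc2)
```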
The Binary Robust Independent Elementary Feature (BRIEF) descriptor (Calonder et al., 2010) is derived by computing binary strings from image patches. In this context, the individual bits are obtained from a set of binary tests based on comparing the intensities of pairs of points along specific lines. The modification resulting in the ORB descriptor consists of steering BRIEF according to the orientation of keypoints and thus deriving a rotation-aware version of the standard BRIEF descriptor. The similarity of such binary descriptors can be evaluated by using the Hamming distance, which is very efficient to compute.

3.1.4 A-KAZE and M-SURF: Instead of the Gaussian scale-space of an image, using a non-linear scale-space may be favorable, as Gaussian blurring does not respect the natural boundaries of objects and smoothes details and noise to the same degree, reducing the localization accuracy and distinctiveness of features (Alcantarilla et al., 2012). Such a non-linear scale-space may for instance be derived by using efficient additive operator splitting (AOS) techniques and variable conductance diffusion, which have been employed for detecting KAZE features (Alcantarilla et al., 2012). The non-linear diffusion filtering, in turn, makes blurring locally adaptive to the image data and thus reduces noise while retaining object boundaries. However, AOS schemes require solving a large system of linear equations to obtain a solution. In order to increase computational efficiency, it has been proposed to build a non-linear scale-space with fast explicit diffusion (FED) and thereby embed FED schemes in a pyramidal framework with a fine-to-coarse strategy. Using such a non-linear scale-space, Accelerated KAZE (A-KAZE) features (Alcantarilla et al., 2013) may be extracted by finding maxima of the scale-normalized determinant of the Hessian matrix through the non-linear scale-space, where the first- and second-order derivatives are approximated by means of Scharr filters. After a subsequent non-maximum suppression, the remaining keypoint candidates are further refined to subpixel accuracy by fitting a quadratic function to the determinant of the Hessian response in a (3 × 3) image neighborhood and finding its maximum.
In the next step, scale- and rotation-invariant feature descriptors may be derived by estimating the dominant orientation of the keypoint in analogy to the SURF descriptor and rotating the local image neighborhood accordingly. Based on the rotated neighborhood, using the Modified-SURF (M-SURF) descriptor (Agrawal et al., 2008) adapted to the non-linear scale-space has been proposed which, compared to the original SURF descriptor, introduces further improvements due to a better handling of descriptor boundary effects and due to a more robust and intelligent two-stage Gaussian weighting scheme (Alcantarilla et al., 2012).
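For reference, OpenCV also ships an A-KAZE implementation; note, however, that it uses the binary M-LDB descriptor by default rather than the M-SURF descriptor employed here, which comes with the authors' original code. A brief sketch with illustrative (assumed) parameter values:

```python
import cv2

img = cv2.imread("scan1_intensity.png", cv2.IMREAD_GRAYSCALE)  # placeholder name

# threshold is the detector response threshold; nOctaves and nOctaveLayers
# control the depth of the non-linear (FED-based) scale-space.
akaze = cv2.AKAZE_create(threshold=0.001, nOctaves=4, nOctaveLayers=4)
kp, desc = akaze.detectAndCompute(img, None)
```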
3.1.5 SURF* and BinBoost: Finally, we also involve a keypoint detector-descriptor combination which consists of applying a modified variant of the SURF detector and using the BinBoost descriptor (Trzcinski et al., 2012; Trzcinski et al., 2013). In comparison to the standard SURF detector (Bay et al., 2006; Bay et al., 2008), the modified SURF detector, denoted as SURF* in our paper, iterates the parameters of the SURF detector until a desired number of features is obtained. Once appropriate parameters have been derived, the dominant orientation for each keypoint is calculated and used to rotate the local image neighborhood in order to allow a scale- and rotation-invariant feature description.
As descriptor, the BinBoost descriptor (Trzcinski et al., 2012; Trzcinski et al., 2013) is used, which represents a learned low-dimensional, but highly distinctive binary descriptor, where each dimension (each byte) of the descriptor is computed with a binary hash function that was sequentially learned using Boosting. The weights as well as the spatial pooling configurations of each hash function are learned from training sets consisting of positive and negative gradient maps of image patches. In general, Boosting combines a number of weak learners in order to obtain a single strong classifier. In the context of the BinBoost descriptor, the weak learners are represented by gradient-based image features that are directly applied to intensity image patches. During the learning stage of each hash function, the Hamming distance between image patches is optimized, i.e. it is decreased for positive and increased for negative patches. Since, in our experiments, the BinBoost descriptor with 32 bytes worked best, we only report the results for this descriptor version.

Point Cloud Registration
Introducing a superscript j which indicates the respective scan S_j and a subscript i which indicates the respective feature correspondence, the forward-projection of n corresponding 2D keypoints x_i^1 ↔ x_i^2 between the panoramic intensity images of two scans S_1 and S_2 according to the respective range information yields sparse point sets of corresponding 3D points X_i^1 ↔ X_i^2. Classically, the task of keypoint-based point cloud registration is solved by estimating the rigid Euclidean transformation between the two sets of corresponding 3D points, i.e. a rigid-body transformation of the form

$X_i^2 = R_1^2 \, X_i^1 + t_1^2$ (1)

with a rotation matrix $R_1^2 \in \mathbb{R}^{3 \times 3}$ and a translation vector $t_1^2 \in \mathbb{R}^3$ (where the superscript indicates the target coordinate frame and the subscript indicates the current coordinate frame). Accordingly, the rigid-body transformation is estimated in object space.
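In this classical object-space setting, Equation 1 may be estimated in closed form via the SVD-based least-squares solution (Arun et al., 1987; Umeyama, 1991). A minimal NumPy sketch, where the function name and interface are ours:

```python
import numpy as np

def rigid_transform(X1, X2):
    """Least-squares estimate of R, t with X2 ~ R @ X1 + t, following
    Arun et al. (1987). X1, X2: (n, 3) arrays of corresponding 3D points."""
    c1, c2 = X1.mean(axis=0), X2.mean(axis=0)
    H = (X1 - c1).T @ (X2 - c2)            # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the SVD solution.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = c2 - R @ c1
    return R, t
```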
As omnidirectional representations in the form of panoramic range and intensity images are available, we propose to estimate the transformation in observation space, i.e. we intend to find the relative orientation between consecutive scans directly. For this purpose, we apply a spherical normalization N(·) which normalizes 3D points X_i^j given in the local coordinate frame of scan S_j to unit length and thus yields the so-called bearing vectors v_i^j = N(X_i^j) that simply represent the direction of a 3D point X_i^j w.r.t. the local coordinate frame of the laser scanner. Thus, the task of point cloud registration may be transferred to the task of finding the transformation of one set of bearing vectors to another. In photogrammetry and computer vision, this is known as relative orientation, and the transformation is encoded in the essential matrix E and the fundamental matrix F, respectively. The relationship between both is given by

$E = [t]_\times R = K^T F K$ (2)

where $[t]_\times$ denotes the skew-symmetric matrix of the translation and K represents a calibration matrix. For omnidirectional or panoramic images, the standard fundamental matrix F cannot be estimated, since it encapsulates the calibration matrix K which, in turn, is based on perspective constraints that do not hold in the omnidirectional case. The essential matrix E, however, is independent of the camera type and may hence be used to estimate the transformation between two panoramic images.
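The spherical normalization and the skew-symmetric matrix are straightforward to implement; a small sketch, with helper names of our choosing:

```python
import numpy as np

def bearing_vectors(X):
    """Spherical normalization N(.): scale the 3D points X (an (n, 3)
    array in the scanner's local frame) to unit length."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def skew(t):
    """Skew-symmetric matrix [t]x with skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Epipolar constraint for a correspondence of bearing vectors v1, v2:
# v2 @ (skew(t) @ R) @ v1 == 0 (up to measurement noise).
```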
In general, at least five points are necessary to calculate a finite number of solutions for E (Philip, 1998). Various algorithms for the estimation of the essential matrix E exist, ranging from the minimal five-point algorithms (Nistér, 2004; Stewénius et al., 2006) to the six-point (Pizarro et al., 2003), the seven-point (Hartley and Zisserman, 2008), and the eight-point (Longuet-Higgins, 1987) algorithms. As observed in seminal work (Stewénius et al., 2006; Rodehorst et al., 2008) and verified by our own experiments, the five-point solver of (Stewénius et al., 2006) performs best in terms of numerical precision, robustness w.r.t. special motions or scene characteristics, and resilience to measurement noise. For this reason, we focus on the use of this algorithm in our paper.

Coarse registration:
The input to our registration procedure is represented by putative feature correspondences between 2D points x_i^j that are subsequently transformed to bearing vectors v_i^j by exploiting the respective range information. Since the 2D points x_i^j are localized with subpixel accuracy, a bilinear interpolation is applied on the 2D scan grid in order to obtain the respective 3D coordinates X_i^j. Then, the essential matrix E is estimated using Stewénius' five-point algorithm (Stewénius et al., 2006), thereby involving the RANSAC algorithm (Fischler and Bolles, 1981) for increased robustness. For this, we use the implementation of OpenGV (Kneip and Furgale, 2014). A subsequent decomposition of E yields the rotation matrix $R_1^2$ and the translation vector $\hat{t}_1^2$ (Hartley and Zisserman, 2008). Since the essential matrix E only has five degrees of freedom, the translation vector $\hat{t}_1^2$ is only known up to a scalar factor s, which is indicated by the ˆ symbol. In order to recover the scale factor s, the following is calculated over all inliers:

$s_{median} = \underset{i}{\mathrm{median}} \left\| X_i^2 - R_1^2 \, X_i^1 \right\|$ (3)

The median is used to diminish potential outliers that could still reside in the data. Finally, the direction vector $\hat{t}_1^2$ is scaled by $s_{median}$ to obtain the final translation $t_1^2 = s_{median} \, \hat{t}_1^2$.
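Assuming the essential matrix E has already been estimated from the inlier bearing vectors (e.g. with OpenGV's five-point RANSAC), the decomposition and the scale recovery of Equation 3 may be sketched as follows; the cheirality check selecting among the four valid decompositions is only indicated, and the function name is ours:

```python
import numpy as np
import cv2

def recover_pose_with_scale(E, X1, X2):
    """Decompose E and recover the metric translation via Equation 3.
    X1, X2: (n, 3) arrays of corresponding inlier 3D points of two scans."""
    R1, R2, t_hat = cv2.decomposeEssentialMat(E)
    t_hat = t_hat.ravel()  # unit-length translation direction
    # Of the four combinations (R1, +/-t_hat) and (R2, +/-t_hat), the one
    # passing the cheirality check should be kept; omitted here for brevity.
    R = R1  # placeholder for the rotation selected by the cheirality check
    # Since ||t_hat|| = 1, each inlier yields a scale estimate
    # s_i = ||X2_i - R @ X1_i||; the median suppresses residual outliers.
    s = np.linalg.norm(X2 - X1 @ R.T, axis=1)
    return R, np.median(s) * t_hat
```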

Fine registration:
In order to remove those 3D points indicating potential outlier correspondences from the 3D point sets, we apply a simple heuristic. First, the point set X_i^1 is transformed to X̂_i^2 using Equation 1 and the coarse estimates for R_1^2 and t_1^2. Then, the Euclidean distance between all corresponding 3D points is calculated, and only those points with a Euclidean distance below 1 m are kept. To avoid such heuristics, one could employ iteratively reweighted least squares techniques or a RANSAC-based modification of the ICP algorithm.
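This heuristic amounts to a few lines of NumPy; the function name and the keyword argument are ours, the 1 m threshold is the one stated above:

```python
import numpy as np

def reject_outliers(X1, X2, R, t, max_dist=1.0):
    """Keep only correspondences whose points agree within max_dist metres
    after applying the coarse transformation (Equation 1)."""
    X1_in_2 = X1 @ R.T + t                      # Equation 1, row-wise
    d = np.linalg.norm(X1_in_2 - X2, axis=1)    # residual 3D distances
    keep = d < max_dist
    return X1[keep], X2[keep]
```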
The remaining 3D points of the sparse point sets are provided to a standard ICP algorithm (Besl and McKay, 1992) which generally converges to the nearest local minimum of a mean square distance metric, where the rate of convergence is high for the first few iterations. Given an appropriate coarse registration delivering the required initial values for R_1^2 and t_1^2, even a global minimization may be expected. In our experiments, we apply an ICP-based fine registration and consider the result after 10 iterations.
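A minimal point-to-point ICP over the sparse point sets may be sketched as follows, reusing the rigid_transform helper from the sketch after Equation 1; this is an illustration of the standard algorithm, not the exact implementation used in our experiments:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(X1, X2, R, t, n_iter=10):
    """Point-to-point ICP refining a coarse estimate (R, t), run for a
    fixed number of iterations as described in the text."""
    tree = cKDTree(X2)  # X2 is fixed, so the k-d tree is built once
    for _ in range(n_iter):
        # Pair each transformed point with its nearest neighbor in X2 ...
        _, idx = tree.query(X1 @ R.T + t)
        # ... and re-estimate the transformation in closed form
        # (rigid_transform: SVD-based solver sketched in Section 3.2).
        R, t = rigid_transform(X1, X2[idx])
    return R, t
```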

EXPERIMENTAL RESULTS
In our experiments, we use a standard benchmark TLS dataset (Section 4.1) and focus on the performance of different methods for each component of the framework (Section 4.2). Additionally, we discuss the derived results w.r.t. pros and cons of the involved methods (Section 4.3).

Dataset
The involved TLS dataset has been captured with a Riegl LMS-Z360i laser scanner in an area called "Holzmarkt" which is located in the historic district of Hannover, Germany. According to (Brenner et al., 2008), the Riegl LMS-Z360i has a single-shot measurement accuracy of 12 mm, its field-of-view covers 360° × 90°, and the measurement range reaches up to 200 m. Furthermore, the angular resolution is about 0.12° and, thus, a full scan results in 3000 × 750 = 2.25M scanned 3D points.
In total, the dataset consists of 20 scans, of which 12 were taken with an (approximately) upright scan head and 8 with a tilted scan head. The scan positions for the upright scans have been selected systematically along a trajectory with a spacing of approximately 5 m, whereas the scan positions for the tilted scans almost coincide with those of upright scans. Reference values for both position and orientation have been obtained by placing artificial markers in the form of retro-reflective cylinders in the scene and carrying out a manual alignment based on these artificial targets; thus, errors in the range of a few millimeters may be expected. In our experiments, we consider the similarity between upright and tilted scans acquired at almost the same position as too high to allow a fair statement on the registration accuracy obtained with our framework (since the respective errors w.r.t. the estimated scan position are significantly below the measurement accuracy of 12 mm), and hence we only use the upright scans (Figure 2).
Since both range and intensity information are recorded for each point on the discrete scan raster, we may easily characterize each scan with a respective panoramic range image and a respective panoramic intensity image, where each image has a size of 3000 × 750 pixels. As the captured intensity information depends on the device, we adapt it via histogram normalization to the interval [0, 255] in order to obtain 8-bit gray-valued images.
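The normalization is not specified beyond the mapping to [0, 255]; one plausible reading is a percentile-clipped min-max rescaling, sketched below (the function name and the clipping percentiles are assumptions):

```python
import numpy as np

def to_8bit(intensity):
    """Map raw, device-dependent intensities on the 750 x 3000 scan raster
    to [0, 255]; clipping at the 1st/99th percentiles suppresses extreme
    returns before rescaling."""
    lo, hi = np.percentile(intensity, [1, 99])
    img = np.clip((intensity - lo) / (hi - lo + 1e-12), 0.0, 1.0)
    return (255.0 * img).astype(np.uint8)
```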

Experiments
Our experiments focus on the successive pairwise registration of scan pairs P_j = {S_j, S_{j+1}} with j = 1, ..., 11. For this purpose, we apply the different methods for feature extraction as described in Section 3.1 and the registration scheme as described in Section 3.2. We use the implementations provided in OpenCV 2.4 for SIFT, SURF and ORB, while we use the implementations provided with the respective papers for the other two keypoint detector-descriptor combinations. An example showing feature correspondences derived via the combination of an A-KAZE detector and an M-SURF descriptor is provided in Figure 3.
For evaluating the performance of our framework, the respective position and angle errors after coarse and fine registration are visualized in Figure 4 and Figure 5. Thereby, the position error indicates the deviation of the estimated scan position from the reference values, whereas the angle error has been determined by transforming the estimated rotation matrix and the respective reference to Rodrigues vectors which, in turn, allow deriving the angle error as the length of the difference of these Rodrigues vectors (Kneip and Furgale, 2014). Furthermore, we provide the number of correspondences used for coarse and fine registration in Figure 6 in order to quantify differences between the different methods for feature extraction and matching. For coarse registration, we further provide the ratio of inliers w.r.t. all feature correspondences as well as the number of RANSAC iterations in Figure 7. In order to obtain an impression of the computational effort on a standard notebook (Intel Core i7-3630QM, 2.4 GHz, 16 GB RAM), the average processing times for different subtasks are provided in Table 2 as well as the expected time for the whole process of aligning two scans. Finally, we also provide a visualization of registered TLS scans in Figure 8.
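The angle error metric, as we read it, reduces to a few lines using OpenCV's Rodrigues conversion (the function name is ours):

```python
import cv2
import numpy as np

def angle_error_deg(R_est, R_ref):
    """Angle error as used in Figures 4 and 5: convert both rotation
    matrices to Rodrigues vectors and take the length of their difference."""
    r_est, _ = cv2.Rodrigues(R_est)
    r_ref, _ = cv2.Rodrigues(R_ref)
    return np.degrees(np.linalg.norm(r_est - r_ref))
```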

Discussion
The results provided in Figure 4 and Figure 5 reveal that the position errors after fine registration are less than 0.06 m for almost all keypoint detector-descriptor combinations when considering the scan pairs P_1, ..., P_10, where the distance between the respective scan positions is between 4 m and 6 m. Since the respective angle errors after fine registration are below 0.15° with only a few exceptions, we may conclude that the presented method for coarse registration is competitive for coarsely aligning the given scans, since a respective outlier rejection based on 3D distances is sufficient for an ICP-based fine registration. The applicability of our method for coarse registration is further supported by the fact that the respective processing time is less than 0.085 s for the considered scan pairs (Table 2). Thus, we reach a total time of less than 10 s for the registration of the considered scan pairs for four of the five tested keypoint detector-descriptor combinations; only the involved implementation for SIFT is not that efficient. Note that the time for feature extraction is counted twice, since this task is required for both scans of a scan pair.
Considering the five involved keypoint detector-descriptor combinations, we may state that A-KAZE + M-SURF and SURF* + BinBoost tend to provide the best results after fine registration (Figure 4 and Figure 5). Note that only these combinations are also able to derive a suitable position and angle estimate for the last scan pair P_11 = {S_11, S_12}, where the distance between the respective scan positions is approximately 12 m. The respective position errors for A-KAZE + M-SURF and SURF* + BinBoost after fine registration are 0.054 m and 0.085 m, while the angle errors are 0.056° and 0.083°, respectively. In contrast, SIFT, SURF and ORB provide a position error of more than 0.20 m and an angle error of more than 0.75° for that case.
In Figure 6, it becomes visible that the number of feature correspondences used for coarse registration is similar for SIFT, ORB and SURF* + BinBoost, while it tends to be higher for SURF.
For A-KAZE + M-SURF, even a significant increase of this number may be observed across all 11 scan pairs. The increase in the number of involved feature correspondences for A-KAZE + M-SURF compared to the other keypoint detector-descriptor combinations is even more significant when considering fine registration, where it is in part more than twice as high as for the others. Based on these characteristics, an interesting trend becomes visible when considering the respective ratio of inliers during coarse registration. While the inlier ratio is comparable for SIFT and SURF, it is better for SURF* + BinBoost, and it is considerably better for ORB and A-KAZE + M-SURF (Figure 7, top). A high percentage of inliers, in turn, has a positive impact on coarse registration by significantly reducing the number of RANSAC iterations (Figure 7, bottom). Consequently, the combination A-KAZE + M-SURF not only increases the number of correspondences, but also the inlier ratio and, thus, the position and angle errors across all scan pairs tend to be the lowest for this combination (Figure 4 and Figure 5). For most of the scan pairs, the respective position error is even close to or below the given measurement accuracy of 12 mm.
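The link between the inlier ratio and the number of RANSAC iterations follows the standard termination criterion, which is not stated in the paper but explains the observed behavior: for an inlier ratio w, a minimal sample size of n = 5 (five-point solver) and a desired confidence p, the required number of iterations is

$N = \dfrac{\ln(1 - p)}{\ln(1 - w^{n})}$

For p = 0.99 and n = 5, an inlier ratio of w = 0.5 requires N ≈ 145 iterations, whereas w = 0.8 requires only N ≈ 12, which matches the reduction visible in Figure 7 (bottom).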
Finally, we may state that our framework is suited for both urban environments and scenes containing vegetation, and it depends neither on regular surfaces nor on human interaction. The only limitation may be identified in the fact that feature correspondences have to be derived between the panoramic intensity images of the respective scans. In this regard, we may generally observe that the total number of feature correspondences decreases with an increasing distance between the respective scan positions and, accordingly, the quality of the registration results will decrease. However, this constraint holds for the other image-based approaches as well and is not specific to our framework.

CONCLUSIONS
In this paper, we have presented a novel framework for evaluating the influence of different keypoint detector-descriptor combinations on the results of point cloud registration. While we involve five different approaches for extracting local features from the panoramic intensity images derived for the single scans, the registration process has been transferred from object space to observation space by considering the forward-projection of putative feature correspondences and exploiting bearing vectors instead of the corresponding 3D points themselves. Our results clearly reveal that replacing SIFT and SURF detectors and descriptors by more recent approaches significantly improves point cloud registration in terms of accuracy, efficiency and robustness.
For future work, we plan to integrate more approaches for feature extraction as well as more approaches for keypoint-based point cloud registration in our framework in order to objectively evaluate their performance on publicly available benchmark TLS datasets. In this context, it would also be desirable to point out chances and limitations of the different approaches w.r.t. different criteria specified by potential end-users, e.g. the spacing between adjacent scans or the complexity of the observed scene.

Figure 1.
Figure 1. The proposed framework for keypoint-based point cloud registration and the involved methods for each component.

Figure 2 .
Figure 2. Map of the Hannover "Holzmarkt": the positions of buildings are visualized in dark gray and the scan positions for different scans S_j are indicated with red spots. The scan IDs are adapted according to (Brenner et al., 2008).

Figure 3 .
Figure 3. Feature correspondences between the panoramic intensity images of scans S1 and S2 when using the combination of an A-KAZE detector and an M-SURF descriptor: all correspondences (top) vs. inlier correspondences (bottom).

Figure 8 .
Figure 8. Aligned point clouds when using A-KAZE + M-SURF: the points belonging to different scans S_j are encoded with different colors.

Table 2 .
Average processing times t_FEX for feature extraction, t_FM for feature matching, t_CR for coarse registration, t_FR for fine registration, and the average total time t_Σ required for automatically aligning two scans.