Large-scale DSM registration via motion averaging

Generating wide-area digital surface models (DSMs) requires registering a large number of individual, and partially overlapped DSMs. This presents a challenging problem for a typical registration algorithm, since when a large number of observations from these multiple DSMs are considered, it may easily cause memory overflow. Sequential registration algorithms, although can significantly reduce the computation, are especially vulnerable for small overlapped pairs, leading to a large error accumulation. In this work, we propose a novel solution that builds the DSM registration task as a motion averaging problem: pair-wise DSMs are registered to build a scene graph, with edges representing relative poses between DSMs. Specifically, based on the grid structure of the large DSM, the pair-wise registration is performed using a novel nearest neighbor search method. We show that the scene graph can be optimized via an extremely fast motion average algorithm with O(N) complexity (N refers to the number of images). Evaluation of high-resolution satellite-derived DSM demonstrates significant improvement in computation and accuracy.


Introduction
Generating wide-area digital surface models (DSMs) is pivotal in establishing foundational geospatial datasets.This often involves registering a large number of individual DSMs generated either from the same or different sources, such as from satellite stereo-based reconstruction (e.g.WorldView 2 & 3 1 , PlanetScope (Huang et al., 2022, Qin andGui, 2023)), aerial sensors (Xu et al., 2024, Huang andQin, 2023), lidar, Synthetic Aperture Radar (SAR) (e.g.SRTM (JPL, 2013)).These individual DSMs, however, may have varying degrees of absolute geometric accuracy, as well as partial and even very minimal overlap.Thus, registering DSMs under this context presents as a challenging problem both due to both the scale of the problem, as well as the sub-optimal quality of input.For example, registering these DSMs may require minimizing geometric errors between billions of points, easily causing memory overflow, at the same time, registering partial and low-overlap DSMs easily brings large errors leading to error accumulation when multiple DSMs are registered.To this end, We propose a framework based on motion averaging, which entails enumerating all overlapped DSM pairs to establish a scene graph.Our approach involves the utilization of an efficient pair-wise registration method to eliminate systematic errors at the pairwise level and subsequently redistribute these errors across the graph.This process aims to reduce and evenly distribute systematic errors among multiple DSMs.
As mentioned above, the task of registering multiple DSMs raises several unique and difficult challenges.To be more specific, first of all, the huge volume of data adds to the difficulty of memory and computation.The standard method is the iterative closest point (ICP) (Besl and McKay, 1992), which iteratively minimizes the cloud-to-cloud distance between a pair of 1 https://resources.maxar.com/data-sheets/worldview-23D data.The correspondences are formed by searching for the nearest neighbor (NN) point of reference data for each query point, which usually can be sped up by k-d tree (Besl andMcKay, 1992, Greenspan andYurick, 2003), or octree (Eggert and Dalyot, 2012), but the initialization of a k-d tree necessitates caching the entire reference data into memory, requiring O(N ) space, O(N logN ) time.Such methods become impractical for city or terrain-level data.For instance, caching a single WorldView-2 DSM (usually around 20000 × 20000 equals 400 million points) consumes 22GB for the construction of k-d tree.Another challenge is accurately estimating global poses from pair-wise registration results.The prevailing method adopts a greedy approach, initiating with the initial pair and estimating subsequent DSMs based on maximum overlap.However, as the DSM count increases, this approach accumulates small registration errors, resulting in a significant deviation from the actual scene.
To address the above-mentioned challenges, we propose a motion averaging-based framework for removing the systematic errors among multiple unaligned DSMs, as shown in Figure 1.We first propose an ICP-based pair-wise registration method that can perform fast and exact NN searching by utilizing the grid structure of DSM (termed as DSM-ICP).Its space and computation complexity are independent of the data volume, allowing excellent scalability to large-scale datasets.Based on it, we perform pair-wise registration to all possible pairs of DSM, establishing a reliable scene graph, where each edge represents the relative poses and carries weights based on its overlap and registration error.Finally, a motion averaging approach over the weighted edge graph is performed to redistribute the registration error for the estimation of the global poses.In our experiments, we quantitatively and qualitatively evaluate our proposed method using a large number of individual satellitestereo-based DSMs against the ground-truth lidar point cloud data.The rest of this paper is organized as follows: section 2 Figure 1.Overview of the proposed method.Given N unaligned DSMs, our target is to remove their systematic errors to produce seamless registered DSMs (the unaligned and registered DSMs are color-coded based on their heights).We first construct a scene graph G(V, E) with edges denoting all possible pairs of DSMs.Then, the proposed DSM-ICP (see Section 3.1) is performed for all edges determining their pair-wise transformation {Tij|(i, j) ∈ E}.Finally, a motion averaging approach is performed to estimate the global poses {Ti|i = 1, ..., N } (see Section 3.2) reviews the related works of 3D registration methods.section 3 introduces the proposed motion averaging framework in detail.section 4 presents the experimental results, comparative studies, and our analysis.section 5 concludes this work.

Related Work
Pair-wise 3D registration is the process of estimating the transformation (rigid-body transformation in our case) between a pair of 3D data sets to minimize the systematic error between two 3D data, which is widely used in 3D mapping and change detection (Xu et al., 2021) tasks.It can be categorized into coarse and fine registration according to the registration accuracy.Coarse registration assumes no initial alignment, involving the extraction of key primitives (e.g.key points (Rusu et al., 2009, Ao et al., 2021), lines (Chen and Yu, 2019), planes (Chen et al., 2019), 4-points congruent set (Mellado et al., 2014)), correspondence construction (Bai et al., 2021, Zhang et al., 2023), and transformation estimation (Barath andMatas, 2018, Yang et al., 2020).Such methods often yield suboptimal accuracy due to errors in primitive localization and matching, necessitating subsequent fine registration (Xu et al., 2023).The standard fine registration method is the ICP algorithm, which iteratively refines the transformation by searching for NN points and estimating the transformation.Variants of ICP aim to enhance its robustness (Sparse-ICP (Bouaziz et al., 2013), Robust ICP (Zhang et al., 2021)) and convergence speed (Point-to-Plane ICP (Arun et al., 1987), Symmetric ICP (Rusinkiewicz, 2019), AA-ICP (Pavlov et al., 2018)).The NN search typically employs k-d tree, requiring O(N ) memory for reference data caching and O(logN ) time for querying.Typically, CODEM 2 is also an ICP-based method for DSM registration while they transform DSM into point cloud format and perform standard ICP.In comparison, ours utilizes the grid structure of DSM and aims to optimize the memory and computation requirements of NN search to make it applicable to large-scale datasets.
Multi-view registration considers each DSM as a view, and simultaneously solves the registration of each DSM at the same time.This can be accomplished through growing-based and optimization-based approaches.Growing-based methods initiate registration with the pair possessing the largest overlap and progressively register new data to existing data.The order of data registration is determined using Kruskal's algorithm (Kruskal, 1956).This type of method is efficient while as the number of data sets increases, the accumulated registration error will lead to drift problems.Optimization-based methods involve constructing a scene graph, with each edge denoting 2 https://github.com/NCALM-UH/CODEMrelative pose and global pose estimation achieved by minimizing a predefined objective function -a concept known as motion averaging or pose graph optimization in the robotics community (Carlone et al., 2015).The closed-form solutions estimate the rotation using the singular value decomposition (SVD)-based methods (Arie-Nachimson et al., 2012, Gojcic et al., 2020, Wang et al., 2023) and then solve the translation using the least square methods (Gojcic et al., 2020).The iterative least-squares methods linearize the objective function and use Gauss-Newton or Levenberg-Marquardt methods to update the global poses.Our solution involves the SVD-based method to ensure computation and memory efficiency.

Given a set of unaligned DSMs
In the following, we first introduce a computation and memory-efficient DSM registration method in Section 3.1.
Based on it, we construct a scene graph with reliable weights for each edge and apply a motion-averaging approach to estimate the poses of every DSM in Section 3.2.

DSM-ICP
Aligning a pair of DSMs P and Q involves finding the rigid transformation (R|t) such that applying the transformation to P causes the two DSMs to be as close as possible.The standard method is iterative closet point (ICP).It minimizes the distance between two DSMs by alternating between two steps: • Correspondence step: search pairs of corresponding points (pi, q k i ), where q k i is the closest point to pi given the current transformation (R k |t k ): • Estimation step: estimate the new transformation (R k+1 , t k+1 ) by optimizing the ℓ2 distance given the current correspondences (pi, Common methods solve the correspondence step by applying NN methods, such as k-d tree (Besl and McKay, 1992), approximate k-d tree (Greenspan and Yurick, 2003), or octree (Eggert Efficient and exact NN searching.In a DSM, each pixel denotes a 3D point in world coordinates, where its uv coordinates [ui, vi] indicate the geo-location in the horizontal plane and its value hi = dsm [ui, vi] represents the height in meters.The neighboring pixels within a range of x pixels can be queried in constant time using the operation [ui−x : ui+x, vi−x : vi+x].
Given a query pixel [up, vp] from the moving DSM P , we will introduce an efficient method to find the upper bound of its NN from the reference DSM Q in constant time.
Specifically, at the initial phase shown in Figure 2, we calculate its world coordinate (xp, yp) using Equation 3.1.Then, the pixel coordinate (uq, vq) from the other DSM Q covering the same world coordinates can be calculated using Equation 3.1.The height of that pixel can be looked up and termed as hq = DSMq(uq, vq).
(xp, yp, hp) = uv2world(up, vp, tf wp) Then, the pixel (uq, vq) serves as the initial correspondence to the query pixel (up, vp), which is found in constant time.Moreover, their Euclidean distance also serves as the upper bound that the NN is guaranteed to be located within the 3D sphere ball centered at the query 3D point with a radius of the d, where d = ∥hp − hq∥.
At the bounding phase, the 3D sphere ball is projected into the horizontal plane, formulating the 2D rectangle  while our method establishes edges connecting all possible nodes.In this scenario, the path between any two nodes in our graph is shorter than or equal to those in the MST graph (e.g., node 4&6 in OMA3), significantly mitigating the accumulated error.
The primary impediment to the application of k-d tree to large data volumes lies in the necessity to initialize by caching the entire dataset, requiring O(N logN ) time and O(N ) memory space.A frequently compromised solution involves building a k-d tree for downsampled datasets, albeit at the expense of accuracy reduction.In comparison, our NN search method eliminates the need for initialization.All data are stored on disk, and only local data around the query points are cached.For each NN search, our method demands O(k) time and memory space, where the upper bound d is calculated in constant time and k is the number of pixels within the upper bound.In practice, k ≪ N and as the two DSMs are getting closer, k will decrease dramatically.Our approach achieves both memory and computational efficiency compared to the k-d tree method.This enables ICP to be applied to large-scale data without compromising accuracy.

Motion Averaging-based Framework
Scene graph construction.The scene graph serves as a potent tool for representing pair-wise information among multiple objects and finds extensive application in various engineering problems.In this context, our objective is to construct a scene graph G(V, E), where each vertex vi ∈ V corresponds to a DSM Pi and each edge (i, j) ∈ E encodes the relative pose between DSM Pi and Pj.
Existing growing-based methods apply a greedy approach to search for the minimal number of edges connecting all vertex.As the number of vertex increases, the accumulated error between vertex that have a large path cannot be reduced, as shown in Figure 3.A reliable scene graph is intended to capture extensive relative information for providing redundant constraints to make sure any two vertex have a small path.Furthermore, the reliability of each edge must be taken into account to filter out potential outliers.Therefore, for each pair (Pi, Pj), where sij ∈ [0, 1]; Assuming the provided DSMs are approximately geo-referenced, the identification of overlap pixels depends on the presence of corresponding reference pixels at the same geographical location.Once the sij is larger than a predefined threshold, the edge will be inserted into the scene graph.Their relative pose Tij will further be estimated using our DSM-ICP outlined in Section 3.1.
Weight initialization.The edge reliability can be determined by two factors: the overlap ratio and the quality of pair-wise registration.We associate a weight on each edge to indicate the reliability of the estimated relative pose Tij.It is calculated based on both the overlap score sij and the quality score of pair-wise registration rij: where errij represents the pair-wise registration error representing the estimation error of registering the correspondences at the last iteration.It can be calculated based on Equation 2with Tij.
Motion averaging.Given the edge weights and the relative poses {wij, Tij = (Rij, tij)|(i, j) ∈ E}, we estimate the global poses {Ti = (Ri, ti)}.It is also known as the pose graph optimization problem in the robotics community (Carlone et al., 2015).Each relative pose (Rij|tij) can be represented by global poses (Ri|ti) and (Rj|tj) where the t ϵ ij ∈ R3 , R ϵ ij ∈ SO(3) denote the measurement noise.Therefore, we estimate the global poses {Ri, ti} by solving the optimization problem where ∥•∥F means the Frobenius norm of the matrix.The problem can be addressed by decoupling the optimization of rotation and translation.We first estimate the global rotation Ri using the closed-form solutions (Arie-Nachimson et al., 2012, Gojcic et al., 2020, Wang et al., 2023).Once the rotation is determined, the translation ti can be computed using the standard least square method (Gojcic et al., 2020).

Experimental Results
In this section, two experiments were performed to evaluate the proposed method in terms of: 1) pair-wise registration performance: the memory and computation performance of the proposed DSM-ICP given varying volumes of datasets, introduced in Section 4.2, and 2) multiple DSM registration performance: the average relative error among all DSMs and the overall error with respect to the ground truth data, introduced in Section 4.3.The data and evaluation metrics used in the experiments are introduced in Section 4.1.

Datasets and Metrics
Datasets.We choose the public multi-view satellite datasets: DFC 2019 (tack 3) (Le Saux et al., 2019), encompassing the city of Jacksonville (short for JAX), Florida, USA, and Omaha (short for OMA), Nebraska, USA.These datasets were captured by WorldView-3 3 with ground-sampling distances of 35cm.In track 3, the satellite images were cropped into sub-images, totaling 118 sub-areas.Each subarea covers 700m × 700m and is observed by 30 images.The sub-areas partially overlap, and we manually chose six sets of data as the test data, as illustrated in Figure 4.For each sub-area, we conducted dense matching to generate the DSM using satellite 3D reconstruction software4 .The ground truth data for both regions are airborne lidar scanning data from the USGS 3DEP program5 , with a point density of 24.64 pt/m 2 , 5.82 pt/m 2 in Omaha area.
Metrics.The pair-wise registration performance can be reflected by the distance between two registered DSMs.In practical scenarios, the standard Root Mean Squared Error (RMSE) may not be effective due to the presence of outlier pixels  1. Quantitative results of accumulated error reduction performance for two methods."M EANRMSE τ " represents the average RM SEτ for all possible pairs.The first row represents the alignment of the raw dataset without any registration process.When the number of DSMs is small (e.g.JAX1, JAX3, and OMA1), the two methods achieve similar performance.When the number of DSMs is large (e.g.JAX2, OMA2, and OMA3), ours achieve better error reduction performance.within DSMs, which can significantly influence overall metrics.Therefore, we implement a straightforward outlier rejection mechanism in conjunction with RMSE: where the • is the Iverson bracket, τ is a pre-defined inlier threshold (10m in our case).(pi, qi) is the pair of points corresponding to the same horizontal location.

Performance of Pair-wise Registration
The primary advantage of the proposed DSM-ICP lies in the elimination of the need for initializing spatial data structures such as k-d tree.In each NN search, its cached data is confined to the local area rather than encompassing the entire region.We assess its computational and memory efficiency across diverse volumes of reference data.Specifically, we measure the time cost for every NN search and the overall RAM consumption for both DSM-ICP and the standard ICP utilizing k-d tree.
Computation efficiency performance.We measure the time cost during the correspondence step (refer to Section 3.1) at each iteration for both the proposed DSM-ICP and the standard ICP employing k-d tree.Specifically, we fix the number Figure 6.Memory consumption of the proposed DSM-ICP and the standard ICP using k-d tree with the varying number of reference points.The experiment setting is the same as Figure 5 of query points to 2065 while varying the volume of reference data from 0.5 million to 305 million points.The results are illustrated in Figure 5.It is evident that ICP using k-d tree initially consumed more time as it traversed all reference points on the local disk to construct the k-d tree in RAM.In contrast, our method bypassed this process.The time cost gradually decreased as the two DSMs moved apart initially, resulting in a larger searching range d (refer to details in Section 3.1).As the two DSMs approached each other, the searching range d decreased, leading to a smaller time cost for the NN search.After several iterations, the time cost for both methods stabilized.Our method required more time due to accessing data via disk, while k-d tree accessed data via RAM.
Memory efficiency performance.We measure the total RAM utilization throughout the entire iteration process with varying volumes of reference data.As depicted in Figure 6, the RAM consumption of the k-d tree exhibits linearity with the number of reference points, as the space complexity of constructing a kd tree is O(N ), implying it loads all reference points into RAM.In contrast, our RAM consumption remains independent of the volume of reference data.When the number of reference points reached 305 million, our method required only 133 MB, while the k-d tree demanded 17200 MB-129 times larger than ours, revealing the superior memory efficiency of our approach.

Performance of Multiple DSM Registration
We assess the reduction in accumulated error and the reconstruction performance of the proposed graph-based method.The distinctive advantage of our approach lies in the elimination of accumulated registration errors through redundant pairwise constraints.The conventional approach for registering multiple DSMs employs a greedy method, which initiates registration with the pair exhibiting the largest overlap and sub- Table 2. Quantitative assessment of the quality of the fused DSM, measured by RM SEτ against the lidar ground truth.Our method demonstrates superior reconstruction quality in 4 out of 6 sets.As observed in Figure 9, particularly for scenarios with a large number of DSMs (e.g., JAX2, OMA2, OMA3), our approach outperforms the greedy method significantly.sequently registers the next largest overlap for the remaining DSMs.This can be accomplished using Kruskal's algorithm to find the minimal spanning tree (MST) of an undirected edgeweighted graph.Both the greedy method and our approach utilize the proposed DSM-ICP as the pair-wise registration method.
Accumulated error reduction performance.We apply the proposed method and the greedy method to six test sets (depicted in Figure 4) and gauge the average RM SEτ for all possible pairs as the evaluation metric.As reported in Table 1, our method can reduce initial systematic errors by 0.2 − 1.2m.It achieved the highest registration accuracy in 5 out of 6 sets.The greedy algorithm performs well when the graph size is small; however, as the graph enlarges, relative registration errors will accumulate along the path connecting two distant nodes.The advantage of our method lies in the provision of more pair-wise constraints, offering additional directions to distribute the accumulated error across the graph.
Specifically, we select two test sets and present their pair-wise RM SEτ in Figure 7. Nodes 4 and 6 in OMA3 are four steps away from each other in the MST graph and their RM SEτ is 2.420m.In comparison, ours reduced the registration error to 1.889m.The same holds for the nodes 1&5, 4&5, and 7&8 in JAX2.
Additionally, we showcase the qualitative results of nodes 4 & 5 of JAX2 and nodes 4 & 6 of OMA3 by illustrating the profiles of registered DSMs in Figure 8.In the JAX2 area, DSMs registered by the greedy method exhibit a rough 1m height difference, whereas ours nearly perfectly aligns the two DSMs, considering the quality of the DSM.Similarly, our method achieves nearly perfect registration of DSMs in OMA3, while the greedy method results in a rough 1m height difference.
Reconstruction performance.After eliminating systematic errors among the DSMs, we fuse all registered DSMs into a single DSM and assess its quality against lidar ground truth.Specifically, we align the fused DSM (achieved using RSP) with the ground truth using DSM-ICP and compute RM SEτ .
The quantitative results are presented in Table 2, wherein our approach achieved superior reconstruction accuracy in 4 out of 6 areas.Moreover, we showcase qualitative results illustrating the quality of the fused DSM in two selected areas, JAX2 and OMA3, as depicted in Figure 9. Regions with significant errors are colorized in "Red".In the JAX2 area, large errors produced by the greedy method are mainly concentrated around the middle part, where accumulated errors between nodes 4&5 and 1&5 (see Figure 7) were not effectively reduced by the greedy method.Similar observations apply to the left part of the OMA3 area.

Conclusion
In this paper, we presented a motion averaging-based approach for the registration of multiple large-scale DSMs.Specifically, we proposed a fast and exact NN search method by utilizing the grid structure of DSM.Its computation and memory complexity are independent of the volume of the input data, enabling our pair-wise registration method, DSM-ICP, to be scalable to large-scale data.Based on it, we perform pair-wise registration for all possible pairs of DSMs to construct a reliable scene graph, where the edge is weighted based on overlap and registration error.Then, the global poses for each DSM are estimated by redistributing the pair-wise registration errors over the whole scene graph.We evaluated the proposed method on public satellite datasets with the ground truth airborne lidar data.
The experiment results demonstrate that our DSM-ICP achieves both superior computation and memory efficiency than k-d tree-based ICP.The proposed motion averaging-based method achieves better error reduction performance and reconstruction performance than the greedy method.

Figure 2 .
Figure 2. Illustration of the proposed NN search method.To simplify, DSMs are depicted as profiles in 2D space, with the x-axis representing the horizontal plane and the y-axis denoting height.The "Blue dashed" lines depict DSM data stored on disk, while the "Green solid" line represents cached data in RAM. and Dalyot, 2012).These methods necessitate the initialization of the spatial structures to cache the whole reference data Q.Specifically, k-d tree needs O(N ) memory and O(N logN ) time, and each NN searching requires O(logN ) time.In the estimation step, the closed-form solution involves separating the rotation and translation.Rotation estimation is achieved through SVD-based methods, while the translation can be readily determined.
[xq − d :  xq + d, yq − d : yq + d]  as the final bound.Finally, at the NN phase, the uv coordinates of the bounding rectangle are calculated and any pixels within the bounding rectangle are cached.By traversing all the cached pixels, we can find the NN.

Figure 3 .
Figure3.Comparison of two approaches for scene graph construction."MST" denotes the minimal spanning tree, which identifies the minimum number of edges connecting all nodes, while our method establishes edges connecting all possible nodes.In this scenario, the path between any two nodes in our graph is shorter than or equal to those in the MST graph (e.g., node 4&6 in OMA3), significantly mitigating the accumulated error.

Figure 4 .
Figure 4. Illustration of six test sets of satellite DSMs and the ground truth lidar data, where three test sets in Jacksonville area (JAX) and three test sets in Omaha area (OMA).The detailed description is in Section 4.1 we calculate the overlap score by sij = #overlap pixels #valid pixels (4)

Figure 5 .
Figure 5. Computation efficiency of NN search for DSM-ICP and the standard ICP using k-d tree.The y-axis represents the time cost of the NN search at each iteration.In this experiment, 2065 points in total were performed NN search within the varying number of reference points from 0.5 million (a) to 305 million points (d).For ICP using k-d tree, the "Blue dashed" line represents the initialization of k-d tree, followed by NN search.

Figure 7 .
Figure 7. Registration errors of selected pairs in JAX2 and OMA3 area.The MST and our graphs are shown in Figure 3.We only display the pairs that are likely to accumulate the errors.Two nodes of these pairs are connected by a long path (more than 3 edges) in the MST graph.

Figure 8 .
Figure 8. Profiles of selected buildings in JAX2 and OMA3 area.We show the profiles of registered DSMs by greedy method (red lines) and ours (blue lines).

Figure 9 .
Figure 9. Qualitative results of error map depicting the height difference between the fused DSM and the lidar ground truth.The greedy method produces more error data than ours, particularly around the middle-top part in JAX2 and the left part in OMA3.