On-the-Fly SfM: What you capture is What you get

Over the last decades, ample achievements have been made in Structure from Motion (SfM). However, the vast majority of them work in an offline manner, i.e., images are first captured and then fed together into an SfM pipeline to obtain poses and a sparse point cloud. In this work, on the contrary, we present an on-the-fly SfM: running online SfM while images are being captured, so that the pose and points of each newly taken image are estimated online, i.e., what you capture is what you get. Specifically, our approach first employs a vocabulary tree, trained in an unsupervised manner on learning-based global features, for fast image retrieval of each newly fly-in image. Then, a robust feature matching mechanism with least squares matching (LSM) is presented to improve image registration performance. Finally, by investigating the influence of the newly fly-in image on its connected neighboring images, an efficient hierarchical weighted local bundle adjustment (BA) is used for optimization. Extensive experimental results demonstrate that on-the-fly SfM can robustly register images online while they are being captured.

With respect to real-time performance, a closely related and active research topic is VSLAM (Visual Simultaneous Localization and Mapping), which can deal with video data in real time. Given sequential frames, VSLAM computes the camera trajectory and 3D object points in real time. Depending on the embedded sensors, VSLAM can be mainly categorized into mono-VSLAM, stereo-VSLAM and inertial-VSLAM (Mur-Artal et al., 2015; Kiss-Illés et al., 2019; Qin et al., 2019; Mur-Artal and Tardós, 2017; Campos et al., 2021). They all contain several common modules: tracking, which takes frames as input and outputs the corresponding poses; local mapping, which generates 3D points and optimizes local maps; and loop closure, which detects loops and applies loop correction. The inherent assumption of VSLAM is that the input frames are spatiotemporally continuous (Mur-Artal et al., 2015), which means that two adjacent frames must be contiguous in time and space, or that auxiliary information from GPS/IMU (Qin et al., 2019; Mur-Artal and Tardós, 2017) is available. This consequently restricts the way in which the data can be collected.

In this paper, as Fig. 1 exemplifies, we present a novel on-the-fly SfM: running online SfM while images are being captured. Similar to conventional SfM, on-the-fly SfM yields image poses and sparse 3D points, but it does so during image capture. More specifically, the current image's pose and corresponding 3D points can be estimated before the next image is captured and flies in for processing, i.e., what you capture is what you get. Also, analogous to VSLAM, which ensures real-time performance, on-the-fly SfM is further designed to deal with images captured in an arbitrary way, so that spatiotemporal continuity is no longer necessary. The proposed SfM is mainly composed of three steps: online image collection, fast image matching and efficient geometric processing. The first is built with a camera and a WiFi transmitter, which immediately sends each captured image for processing via WiFi. The second step efficiently and robustly generates the matching results between the already registered images and the new fly-in image, in which fast image retrieval is the most important component for real-time performance. The last step estimates the camera pose and 3D points robustly and quickly; besides the canonical image registration and triangulation, an efficient hierarchical weighted local bundle adjustment is adopted. For each new fly-in image, we simply iterate these three steps.
To approach the goal of what you capture is what you get, along with the presented on-the-fly SfM using a new online working mode, we also make three technical contributions:
- Fast image retrieval based on learning-based global features and a vocabulary tree. In this work, we extract global features using the pre-trained model of Hou et al. (2023) and train a vocabulary tree in an unsupervised manner. For each new fly-in image, the global feature is computed and passed down the vocabulary tree for fast image retrieval.
- Refinement of correspondences using least squares matching (Yue et al., 2023). Based on the geometric and photometric consistency of the local windows around matched points, a least squares system is solved to refine the 2D positions of correspondences according to the grey values within the corresponding local windows on the two images.
- Hierarchical weighted local BA for efficient optimization of poses and 3D points. For each new fly-in image, only its neighboring connected images (already registered) are enrolled in BA. In addition, based on our image retrieval results, the influence between the newly captured image and its various connected images is expressed by hierarchical weights, which are employed as priors for improving BA.

RELATED WORK
In this section, two related topics (SfM and VSLAM) are briefly reviewed, and then some state-of-the-art studies on image retrieval and efficient bundle adjustment are introduced.

SfM & VSLAM
So far, many SfM packages have been made publicly available, e.g., VisualSFM (Wu, 2016), OpenMVG (Moulon et al., 2016), Theia (Sweeney, 2016) and Colmap (Schönberger et al., 2016). However, all these packages essentially concentrate on offline processing. For example, Colmap, one of the most widely used packages, furnishes an end-to-end 3D reconstruction pipeline for large-scale unordered images, structured into three key stages: image matching, pose estimation with sparse reconstruction, and dense reconstruction. To achieve real-time SfM, inspired by monocular VSLAM, Song et al. (2014) presented a monocular SfM that concentrated on eliminating scale drift using ground-plane information, and obtained performance comparable to a stereo setting on long sequences. Zhao et al. (2022) proposed a so-called real-time SfM (RTSfM), in which feature matching was improved by a hierarchical feature matching strategy based on BoW (Bag-of-Words) (Nister et al., 2006) and multi-view homography, and a graph-based optimization was employed for efficiency. However, both of these online SfM methods still rely on spatiotemporal continuity between images or require GPS.
Furthermore, there already exist quite a few mature VSLAM methods that are capable of real-time performance in specific scenarios or tasks. For example, the very popular ORB-SLAM series (Mur-Artal et al., 2015; Mur-Artal and Tardós, 2017; Campos et al., 2021) achieves high-precision localization while capturing frames. VINS-Fusion (Geneva et al., 2020) combines image input with GPS/IMU and is widely adopted in autonomous driving. However, the robustness of these VSLAM methods is limited in certain scenarios, such as weak textures and motion blur. Yue et al. (2023) integrated least squares matching into the feature matching of ORB-SLAM2 and obtained more precise observations. Note that, although many VSLAM methods are worth reviewing, we only list a few popular and related works.
Contrary to conventional SfM, our proposed on-the-fly SfM processes images online in real time while they are being captured in an arbitrary way. Compared to VSLAM, the major advantage of our method is that neither spatiotemporal continuity of the input images nor GPS/IMU information is required.

Image retrieval
Image retrieval techniques have been widely deployed in SfM and VSLAM for accelerating feature matching and loop closure detection. One typical idea is to build an efficient indexing structure using local features (e.g., SIFT, ORB), among which BoW is one of the most representative methods for quickly identifying similar image pairs and loop closures (Snavely et al., 2006). Similar to BoW, Havlena and Schindler (2014) trained a two-layer vocabulary tree for speeding up image matching. Wang et al. (2019) introduced a random KD-forest consisting of several independent KD-trees, so that matchable image pairs can be efficiently determined by traversing the KD-forest.
In the last few years, learning-based methods have greatly improved image retrieval regarding both time efficiency and precision. Hou et al. (2023) proposed a CNN fine-tuning method with multiple NetVLADs to aggregate feature maps of various channels and published a benchmark, LOIP, that consists of both crowdsourced and photogrammetric images.

Efficient optimization of bundle adjustment
Nowadays, bundle adjustment (BA) has become a mature technique for optimizing image poses and 3D point positions. However, as the number of images increases, solving BA quickly and reliably becomes challenging, and many works have addressed this problem. For example, preconditioned conjugate gradients were explored to solve BA in (Shen et al., 2018), while Wu et al. (2011) and Zheng et al. (2017) further improved the efficiency of solving large-scale linear equation systems by means of GPUs. To cope with large-scale problems, distributed approaches that split a large BA problem into several overlapping small BA subproblems have attracted researchers' attention (Zhang et al., 2017; Mayer, 2019). Zhang et al. (2017) solved each subproblem in parallel and proposed a global camera consensus constraint to merge all subsets, whereas Mayer (2019) employed 3D points as global consensus constraints and applied the corresponding covariance information for better convergence behavior. MegBA (Ren et al., 2022) solved the subsets in parallel on multiple GPUs, which provides a more time-efficient solution.
All the above BA methods aim to efficiently optimize all unknowns globally, which is inherently not feasible in a sequential mode (it is not efficient to run global BA each time a new image comes in, see Section IV-C).

On-the-Fly SfM
In this section, we introduce our on-the-fly SfM in more detail. First, we give an overview of the general pipeline of our on-the-fly SfM. Then, the three key methodologies involved are explained: 1) fast image retrieval based on learning-based global features and a vocabulary tree; 2) correspondence refinement using least squares matching; 3) efficient local BA optimization via a weighted hierarchical tree.

Two-view geometry. Similar to (Schönberger et al., 2016), a multi-model two-view geometric verification method is applied. In general, the fundamental matrix is estimated first, and two images are considered geometrically reliable if at least Nf inlier matches exist; then the homography is computed with Nh inliers. In the calibrated case, the essential matrix is estimated as well. The final two-view geometric model is selected according to GRIC (Torr, 1997), and the initial stereo reconstruction is chosen as the verified image pair with the most triangulated 3D points and a median triangulation angle close to 90 degrees (e.g., 60 to 120 degrees).
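For illustration, the GRIC-based choice between the fundamental matrix and the homography can be sketched as follows. This is a minimal sketch assuming the commonly quoted form of Torr's criterion, GRIC = sum of rho(e_i^2) + lambda1 * d * n + lambda2 * k, with rho(e^2) = min(e^2 / sigma^2, lambda3 * (r - d)), lambda1 = ln(r), lambda2 = ln(r * n) and lambda3 = 2. The function names, the noise level sigma and the residual inputs are hypothetical and are not taken from the actual implementation.

```python
import numpy as np

def gric(sq_residuals, sigma, d, k, r=4):
    """GRIC score (lower is better), assuming Torr's commonly quoted form.

    sq_residuals: squared residuals e_i^2 of all n correspondences
    d: dimension of the model manifold (3 for F, 2 for H)
    k: number of model parameters (7 for F, 8 for H)
    r: dimension of one observation (4 for a two-view correspondence)
    """
    e2 = np.asarray(sq_residuals, dtype=float)
    n = e2.size
    lam1, lam2, lam3 = np.log(r), np.log(r * n), 2.0
    # Robust data term: large residuals are capped, so outliers cost a constant.
    rho = np.minimum(e2 / sigma**2, lam3 * (r - d))
    return float(rho.sum() + lam1 * d * n + lam2 * k)

def select_two_view_model(sq_res_F, sq_res_H, sigma=1.0):
    """Return the label ('F' or 'H') of the model with the smaller GRIC."""
    scores = {
        "F": gric(sq_res_F, sigma, d=3, k=7),
        "H": gric(sq_res_H, sigma, d=2, k=8),
    }
    return min(scores, key=scores.get), scores
```

When both models fit the matches equally well (e.g., a planar scene), the smaller manifold dimension makes the homography cheaper; when the homography leaves large residuals, the fundamental matrix is preferred.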

Overview of on-the-fly SfM
LSM correspondence refinement. Despite the robust estimators employed in two-view geometry and online reconstruction, a further improvement can be expected by refining the generated correspondences with least squares matching.
Online reconstruction. This part mainly addresses image pose and 3D point estimation, in which image registration and triangulation are solved by EPnP (Lepetit et al., 2009) and RANSAC-based multi-view triangulation (Schönberger et al., 2016), respectively. To approach online reconstruction, we speed up the most time-consuming step, bundle adjustment, by presenting a hierarchical weighted local bundle adjustment, which is based on the fact that a newly fly-in image only affects its connected overlapping images to some degree (more details can be found in Section III-D).

Fast image retrieval based on learning-based global feature and vocabulary tree
In this part, a fast image retrieval pipeline that integrates learning-based global features with a vocabulary tree is employed to guarantee online image matching for on-the-fly SfM.

2) Vocabulary tree training. To the best of our knowledge, for global features, similar images are typically retrieved by comparing the Euclidean distance between two images' feature vectors, which is not efficient for large-scale problems. Motivated by BoW, a vocabulary tree for global features can be expected to further reduce retrieval time. Given the global features extracted by (Hou et al., 2023), we train a corresponding vocabulary tree in an unsupervised manner, i.e., the canonical K-means algorithm is hierarchically repeated to split the feature space until a certain depth is reached. To ensure generality and an even splitting of the feature space, LOIP, which contains various crowdsourced and photogrammetric images, is used for training. As a consequence, a vocabulary tree storing the center of each cluster is generated for fast image retrieval.
3) Fast image retrieval for a new fly-in image. Based on the pre-trained global feature extractor and vocabulary tree, matchable images of a new fly-in image can be found quickly. Instead of computing the Euclidean distance for all possible image pairs, only the cluster centers need to be compared, under the assumption that similar images fall into the same node, as Fig. 4 shows. Specifically, as a new image flies in, its global feature is extracted and fed into the vocabulary tree, and the already registered images that are matchable candidates are identified quickly by traversing the nodes of the vocabulary tree.
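The training and retrieval steps described above can be sketched as follows. This is a minimal sketch, assuming plain Lloyd's K-means on global feature vectors and a leaf-level inverted index; the names `build_tree`, `leaf_id` and `Retriever` are hypothetical, and random vectors stand in for the CNN global features.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest center (Euclidean distance) for every row of X."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(X, centers)
        for j in range(k):
            if np.any(labels == j):          # an empty cluster keeps its old center
                centers[j] = X[labels == j].mean(axis=0)
    return centers, assign(X, centers)

def build_tree(X, branching, depth):
    """Hierarchically repeat K-means until the given depth; None marks a leaf."""
    if depth == 0 or len(X) < branching:
        return None
    centers, labels = kmeans(X, branching)
    children = [build_tree(X[labels == j], branching, depth - 1)
                for j in range(branching)]
    return {"centers": centers, "children": children}

def leaf_id(tree, f):
    """Descend by nearest cluster center; the path of child indices names the leaf."""
    path, node = [], tree
    while node is not None:
        j = int(np.linalg.norm(node["centers"] - f, axis=1).argmin())
        path.append(j)
        node = node["children"][j]
    return tuple(path)

class Retriever:
    """Registered images are stored per leaf; a query only looks inside its own leaf."""
    def __init__(self, tree):
        self.tree, self.index = tree, {}
    def add(self, image_id, f):
        self.index.setdefault(leaf_id(self.tree, f), []).append((image_id, f))
    def query(self, f, top_n):
        cands = self.index.get(leaf_id(self.tree, f), [])
        cands = sorted(cands, key=lambda c: np.linalg.norm(c[1] - f))
        return [c[0] for c in cands[:top_n]]
```

A query compares the feature with only `branching` centers per level plus the images in one leaf, instead of with every registered image. Images that fall just across a cluster boundary are missed by this simple variant, which is the price of the same-node assumption.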

Correspondence refinement using least squares matching
Based on the original feature matching mechanism (e.g., SIFT), we present a correspondence refinement solution that integrates least squares matching (LSM), which is intended to mitigate error accumulation. In general, as Fig. 5 shows, LSM is first applied to refine the 2D positions of correspondences and to reject outliers, and then to generate new observations for improving PnP estimation.
Equation (3) can be solved using least squares in an iterative way (Yue et al., 2023). If the refinement converges successfully, the refined 2D position can be obtained from Equation (2); otherwise, the correspondence is deleted as an outlier.
2) 2D position refinement and outlier detection. According to the basic principle of LSM, given a pairwise correspondence, i.e., (x1, y1) and (x2, y2), we first try to solve Equation (3) using least squares: if it converges, the corresponding 2D position is refined; if it fails, the correspondence is rejected as an outlier.
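The refine-or-reject rule can be sketched with a deliberately simplified LSM. As an assumption made only to keep the example short, the affine geometric model is reduced to a pure shift of the window in the second image, while the linear radiometric model (offset h0, gain h1) is kept; the function names, the convergence threshold `tol` and the iteration limit are hypothetical.

```python
import numpy as np

def bilinear(img, xs, ys):
    """Sample img at float pixel coordinates (xs, ys) by bilinear interpolation."""
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    ax, ay = xs - x0, ys - y0
    return ((1 - ax) * (1 - ay) * img[y0, x0] + ax * (1 - ay) * img[y0, x0 + 1]
            + (1 - ax) * ay * img[y0 + 1, x0] + ax * ay * img[y0 + 1, x0 + 1])

def lsm_refine(img1, img2, p1, p2, half=7, iters=20, tol=1e-3):
    """Refine p2 so that the window around it matches the window around p1.

    Returns the refined (x2, y2), or None if the iteration leaves the image
    or does not converge, in which case the match is treated as an outlier.
    """
    oy, ox = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    template = bilinear(img1, p1[0] + ox, p1[1] + oy).ravel()
    gy_img, gx_img = np.gradient(img2)      # derivatives along rows (y), columns (x)
    x2, y2, h0, h1 = float(p2[0]), float(p2[1]), 0.0, 1.0
    rows, cols = img2.shape
    for _ in range(iters):
        xs, ys = x2 + ox, y2 + oy
        if xs.min() < 0 or ys.min() < 0 or xs.max() >= cols - 1 or ys.max() >= rows - 1:
            return None
        g2 = bilinear(img2, xs, ys).ravel()
        gx = bilinear(gx_img, xs, ys).ravel()
        gy = bilinear(gy_img, xs, ys).ravel()
        residual = template - (h0 + h1 * g2)
        # Jacobian of h0 + h1 * g2(x2 + ox, y2 + oy) w.r.t. (x2, y2, h0, h1).
        A = np.stack([h1 * gx, h1 * gy, np.ones_like(g2), g2], axis=1)
        delta, *_ = np.linalg.lstsq(A, residual, rcond=None)
        x2, y2, h0, h1 = x2 + delta[0], y2 + delta[1], h0 + delta[2], h1 + delta[3]
        if np.hypot(delta[0], delta[1]) < tol:
            return x2, y2
    return None
```

With the 15 × 15 window used in this work, `half=7`. A full implementation would replace the two shift parameters by the six affine parameters of the geometric model.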
3) Densifying matches. For a new fly-in image, one of the main goals is to compute its pose via EPnP. To ensure robust and reliable pose estimation, extra reliable 2D-3D matches are produced using LSM. For 3D points that are visible in a specific image but have no corresponding 2D observations, LSM is run to generate new 2D-3D matches. More specifically, an initial pose is first estimated, the 3D points are reprojected onto the image to obtain coarse 2D positions, and LSM is then applied to optimize them into more accurate 2D positions, which serve as densified matches. Finally, all the 2D-3D matches, including both original and densified ones, are employed for pose estimation.

To achieve real-time performance for our on-the-fly SfM, an efficient bundle adjustment is essential. Inspired by the natural phenomenon that the closer a ripple is to the center, the larger its amplitude (as the left corner of Fig. 6 shows), the uncertainty of a new fly-in image analogously has a higher influence on closely associated images than on images that are farther away. As Fig. 6 implies, this work presents a new efficient local bundle adjustment with hierarchical weights. Based on the image retrieval results (Section III-B), a hierarchical association tree is built, which indicates the association relationship between the new image and the registered images. Then, a hierarchical weight for every locally associated image is estimated and used for robust bundle adjustment.

Hierarchical weighted local bundle adjustment for efficient optimization
Figure 7. Example of hierarchical association tree building and weighting.
1) Hierarchical association tree building and weighting. With the presented fast image retrieval method, the top-N similar images of every fly-in image can be found efficiently. As images on different ripples (or hierarchical layers) are affected differently by the new image, a hierarchical association tree is built. The first ripple is composed of the top-N similar images of the current fly-in image, the second ripple consists of the top-N similar images of the first-ripple images, and so on until a preset depth L is reached. All the images enrolled in the hierarchical tree form the local block. Fig. 7 illustrates a toy 4-layer hierarchical association tree, in which the images of each lower layer are the retrieved top-N images of the layer above, and the first layer has the highest effect on the new image (indicated by the thick red line). According to the ripple phenomenon, this work introduces a simple yet efficient hierarchical weighting solution for the images on the various ripples, as shown in Equation (4), where i is the index of the layer, * is the current new fly-in image and k is a constant (k > 1) denoting the basic inverse influence between the new fly-in image and the already registered images. The larger i is, the higher the corresponding weight is, which means that images on farther ripples are much more stable and should receive smaller updates.
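The layer-by-layer construction can be sketched as a breadth-first expansion over retrieval results. This is a minimal sketch in which `top_n_similar` is a hypothetical callback returning the retrieved images of a given image, and the weight form `k ** layer` is only an assumed example of a weight that grows with the layer index for k > 1; it should not be read as Equation (4).

```python
def build_association_layers(new_image, top_n_similar, depth):
    """Map every enrolled registered image to the index of its ripple (layer).

    Layer 1 holds the top-N images of the new image, layer i + 1 holds the
    top-N images of the layer-i images. An image reachable through several
    paths keeps the smallest layer index, i.e. its closest ripple.
    """
    layer_of = {}
    frontier = [new_image]
    for layer in range(1, depth + 1):
        next_frontier = []
        for img in frontier:
            for neighbor in top_n_similar(img):
                if neighbor != new_image and neighbor not in layer_of:
                    layer_of[neighbor] = layer
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return layer_of

def hierarchical_weight(layer, k=2.0):
    """ASSUMED weight form: grows with the layer index when k > 1."""
    return k ** layer
```

The keys of the returned dictionary form the local block, and the values give the layer used to look up each image's weight.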
2) Local bundle adjustment with hierarchical weights. Based on the local block formed by the images of the hierarchical association tree and their hierarchical weights, we establish a new efficient and robust local BA with hierarchical weights. Equation (5) denotes the original reduced normal equation with only camera parameters (Wu, 2013). To run bundle adjustment in a fast and robust way for each new fly-in image, this study modifies Equation (5) into Equation (6), in which only the local block containing the images of the hierarchical association tree is refined, and a weight matrix composed of the corresponding hierarchical weights is employed for robust optimization.
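To make the role of the weights concrete, the sketch below shows one plausible way in which hierarchical weights can act as priors on a local block: as per-camera damping added to a reduced camera system. This is an assumption for illustration and is not claimed to be the exact form of Equation (6); the names `H`, `g`, `lam` and `params_per_cam` are hypothetical.

```python
import numpy as np

def weighted_local_step(H, g, cam_weights, params_per_cam=6, lam=1.0):
    """Solve (H + lam * P) delta = -g for the cameras of the local block.

    H: reduced normal matrix of the local block (cameras only)
    g: corresponding gradient vector
    cam_weights: one hierarchical weight per camera, larger on farther layers
    P: diagonal matrix repeating each camera's weight for all its parameters
    """
    p = np.repeat(np.asarray(cam_weights, dtype=float), params_per_cam)
    return np.linalg.solve(H + lam * np.diag(p), -np.asarray(g, dtype=float))
```

Under this assumption, a camera with a larger weight is damped more strongly and therefore receives a smaller update, which matches the intent that images on farther ripples stay more stable.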

EXPERIMENTS
In this section, we report extensive experimental results on various datasets to demonstrate the capability of "what you capture is what you get" for our on-the-fly SfM.

Implementation details
The learning-based global features are extracted by (Hou et al., 2023), and the vocabulary tree is trained with all images of LOIP (Hou et al., 2023). As shown in Fig. 8, our online image transmission relies on CAMFI 3.0 wireless image transmission equipment, whose working range is around 50 meters and whose transmission speed can be up to 10 Mb/s. In our tests, 3-5 s are typically needed to receive an image after it has been captured. All experiments are run on a machine with 16 CPU processors and an RTX3080 GPU. More information and code can be found at http://yifeiyu225.github.io/on-the-flySfMv1.github.io/.
Experimental datasets. As Fig. 8 shows, two self-collected datasets (SX with 221 images, YX with 349 images) are used to evaluate the on-the-fly performance of our SfM; they were taken in an arbitrary way and transferred online to our system. In addition, three visual sequences (fr1_desk, fr3_st_far, fr1_xyz) from the TUM RGB-D datasets (Sturm et al., 2012) are used as simulated online input.

Performance of efficient local bundle adjustment
To demonstrate the efficacy of the presented local bundle adjustment, three bundle adjustment solutions are compared: first, a global bundle adjustment that enrolls all images (Glo.); second, a combined solution that integrates local and global bundle adjustment (Com.), which is the strategy applied in Colmap (Schönberger et al., 2016); third, our local bundle adjustment with hierarchical weights (Ours).
Based on fr3_st_far, these three solutions are tested for BA optimization each time a new image flies in.
Figure 11. Time cost on fr3_st_far with various bundle adjustment methods.
Fig. 11 shows the time cost of the different BA methods, recorded as the optimization time for each new fly-in image. As the number of images grows, the time consumed by the global method increases dramatically, whereas the time cost of Ours increases the slowest and tends to stabilize after some images have been added. This can be explained by the fact that, as more images are involved, more time is needed to refine more unknown parameters: the whole block is considered by the global method, whereas Ours only solves a local bundle adjustment for the images in the built hierarchical association tree. Tab. 1 lists quantitative results, i.e., the average mean reprojection error over all BA runs (AMRE), the mean reprojection error of the final BA (MFRE) and the mean track length (MTL); these results are very similar and of the same order of magnitude. Therefore, the presented BA is a fast yet robust solution and is suitable for our on-the-fly SfM.

Tab. 2 presents the average processing time over all images of each dataset. In particular, several key procedures are reported: image transmission (IT), feature extraction (FE), online image matching (OIM), two-view geometric verification (GV), image registration (IR), triangulation (Tri.) and bundle adjustment (BA). We find that OIM and GV take the most time, while all the others are quite fast. It is worth noting that, for our on-the-fly SfM, the current fly-in image can generally be processed before the next image is received.

Comparison with other state-of-the-art SfM
In addition, to further explore how close our SfM is to state-of-the-art SfM, we carry out a comparison with two popular SfM systems, namely Colmap (Schönberger et al., 2016) and OpenMVG (Moulon et al., 2016). As both Colmap and OpenMVG only work in offline mode, time efficiency is not discussed here. Our SfM results on SX and YX are visualized in Fig. 12. In Tab. 3 and 4, three criteria are studied: mean reprojection error (MRE), mean track length (MTL) and rotation discrepancy (taking Colmap and OpenMVG as reference). In general, the MRE values of our SfM, Colmap and OpenMVG are all less than 1 pixel, which typically denotes convergence of BA. In most cases, we obtain a better MRE, which might result from the correspondences refined by our least squares matching.
The final MTL varies considerably, because the systems use different image matching packages and outlier detection strategies in BA.
The very small rotation discrepancy shows that our SfM is able to yield camera poses comparable to those of these popular SfM packages.

CONCLUSION
In this work, we present a novel on-the-fly SfM: running online SfM while images are being captured, which achieves the goal of what you capture is what you get. Three technical improvements, namely learning-based online image matching, correspondence refinement using least squares matching and efficient local bundle adjustment using hierarchical weights, are employed to guarantee a fast yet robust online SfM. Extensive results on various datasets demonstrate the real-time performance and robustness of our on-the-fly SfM.

Fig. 2 illustrates the general workflow of our on-the-fly SfM, which consists of five parts: image capturing and transmission, online image matching, two-view geometry, LSM correspondence refinement, and online reconstruction.
Image capturing and transmission. To achieve the goal of what you capture is what you get, a consumer digital camera is used to collect images; it is integrated with a wireless WiFi transmitter that transfers images for processing in real time (see Section IV for more details). After a new fly-in image is received, the other four parts start to work.
Online image matching. Quickly identifying matchable images for a new fly-in image is one of the most important procedures, as the first step for a new image is to find its relationship with the already registered images, i.e., to run image matching. In this paper, we apply the learning-based global feature (Hou et al., 2023) and its corresponding vocabulary tree to quickly determine the matchable candidate images of the new image, among which correspondences are estimated.

Figure 2. Workflow of the proposed on-the-fly SfM.
Fig. 3 illustrates the key idea: 1) Pre-trained models. A CNN model is applied as the global feature extractor (Hou et al., 2023; Arandjelović et al., 2016; Radenović et al., 2019), and a vocabulary tree is built using the global features of all training images. 2) Image retrieval for a fly-in image. The global feature of each new image is first extracted using the selected CNN model and then fed into the built vocabulary tree to quickly identify matchable images.

Figure 3. Fast image retrieval workflow based on learning-based global feature and vocabulary tree.
1) Learning-based global feature extractor. CNNs have been successfully applied as feature extractors for retrieving visually similar images (Sturm et al., 2012). In this work, to determine matchable image pairs, which often have only partially overlapping areas, the fine-tuned CNN model of (Hou et al., 2023) is selected as our global feature extractor, since it is tailored for finding overlapping image pairs to speed up offline SfM and is expected to be suitable for our on-the-fly SfM as well. In particular, (Hou et al., 2023) provides a new training dataset (LOIP) with ground-truth matchable pairs, and a novel architecture composed of a CNN and multiple NetVLADs that is fine-tuned with a region triplet loss. Note that their off-the-shelf model is publicly accessible and is employed here.

Figure 4. Toy example of fast image retrieval for a new fly-in image. Similar images are clustered into the same node.

Figure 5. Least squares matching refinement.
1) Basic principle of LSM. The general idea of LSM is to optimize the 2D positions of matches based on the consistency of pixel grey values within corresponding local windows on two images (Yue et al., 2023). Typically, radiometric and geometric inconsistencies are considered in LSM: the former often results from illumination, varying photographic conditions, digitization errors, etc., while the latter is normally due to depth changes, image distortion, etc. The basic assumptions of LSM are that the radiometric inconsistency between matched points is not complicated and can be approximated by a linear transformation (see Equation (1)), and that the geometric inconsistency between two corresponding small local windows can be modelled by an affine transformation (as Fig. 5 (left) shows, see Equation (2)). LSM is formulated by Equation (3), which combines Equations (1) and (2).
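In the notation commonly used for LSM, the three relations described above could be written as follows. This is a sketch of the standard formulation that is consistent with the stated assumptions; the symbols $g_1, g_2$ (grey values of the two windows), $h_0, h_1$ (radiometric offset and gain), $a_0,\dots,a_2, b_0,\dots,b_2$ (affine parameters) and $n_1, n_2$ (noise) are assumptions for illustration and may differ from the original notation.

```latex
\begin{gather}
g_1(x_1, y_1) = h_0 + h_1\, g_2(x_2, y_2) \tag{1}\\
x_2 = a_0 + a_1 x_1 + a_2 y_1, \qquad y_2 = b_0 + b_1 x_1 + b_2 y_1 \tag{2}\\
g_1(x_1, y_1) + n_1 = h_0 + h_1\, g_2\!\left(a_0 + a_1 x_1 + a_2 y_1,\; b_0 + b_1 x_1 + b_2 y_1\right) + n_2 \tag{3}
\end{gather}
```

Linearizing (3) around the current parameter estimate gives one observation equation per window pixel, so a 15 × 15 window yields 225 equations for the eight unknowns.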

Figure 8. Online image transmission (left: hardware, middle: SX, right: YX).
Running parameters. In this work, some free parameters are set empirically. For online image matching, the vocabulary tree has a depth of 5 layers and 5 sub-clusters per node. Each new fly-in image selects its top-30 similar images for subsequent matching. The small local window in LSM is set to 15 × 15 pixels. For efficient BA, as each image in a ripple has top-N candidate images, which might result in a large BA block, only the top-8 similar images are considered. The constant weighting parameter is k = 2 in all experiments.

Performance of fast image retrieval
To validate the real-time performance of our online image matching, based on SX and fr3_st_far, we investigate three different image matching strategies: exhaustive matching using Colmap with default settings (EM), exhaustive Euclidean comparison using the learning-based global feature (Hou et al., 2023) (EE), and the proposed image retrieval (Ours) based on the learning-based global feature and vocabulary tree.

Figure 9. Time consumption of various methods on fr3_st_far.

Fig. 9 compares the time consumption of each strategy, in which the horizontal axis represents the index of the current fly-in image and the vertical axis is the time cost of retrieving the current image against all the already registered images. Using the global feature is significantly faster than the original local feature-based matching mechanism of Colmap, especially for large-scale datasets. In addition, comparing EE and Ours shows that the vocabulary tree further improves time efficiency: the time cost of Ours is linear in the number of images, while that of EE is quadratic.

Fig. 10 qualitatively shows the matching results: both Ours and EE can identify the basic skeleton of EM, which means that the most similar images determined by EM are successfully found by Ours and EE.
Figure 10. Overlapping graph of SX. The darker red a pixel is, the higher the probability that the corresponding image pair overlaps.

Table 2. Time cost of each core stage in our SfM (ms)

Figure 12. Visualization of our SfM results on SX and YX.

Table 3. Comparison between on-the-fly SfM and Colmap

Table 4. Comparison between on-the-fly SfM and OpenMVG