Assessing the Reconstruction Potential of 3D Vision Foundation Models for Oblique Photogrammetry

Wang, Junfan; Liu, Feng; Jia, Zhihao; Hu, Han; Chen, Min; Ge, Xuming; Wen, Ping; Wang, Chong; Zhu, Qing

doi:10.5194/isprs-annals-XI-2-2026-821-2026

Articles | Volume XI-2-2026

https://doi.org/10.5194/isprs-annals-XI-2-2026-821-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

03 Jul 2026

Keywords: 3D Vision Foundation Model, Oblique Imagery, Photogrammetric Reconstruction, Evaluation

Abstract. 3D vision foundation models, which directly regress 3D geometry from 2D images in an end-to-end manner, have recently attracted growing attention in the computer vision community. However, their potential for oblique 3D reconstruction has not been systematically evaluated. To this end, we establish an automated evaluation pipeline to benchmark these models on oblique imagery. Our experiments reveal that, benefiting from powerful zero-shot generalization, 3D vision foundation models can robustly estimate camera parameters and generate dense point clouds under sparse-view and low-overlap conditions, with some rivaling traditional photogrammetry configured with redundant observations. Counterintuitively, when processing more than two views, two-view reasoning foundation models that employ explicit PnP-RANSAC for global alignment consistently outperform multi-view reasoning foundation models that infer multi-view relationships via implicit attention mechanisms. Notably, incorporating known camera parameters as conditioning inputs, which act as weak supervision rather than rigid geometric constraints, yields only marginal accuracy improvements. Because they are built on the ViT architecture, these foundation models face scalability bottlenecks on large-scale, high-resolution oblique imagery, and their prevalent ideal pinhole camera assumption still makes explicit distortion correction an unavoidable preprocessing step.
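The scalability bottleneck mentioned in the abstract can be illustrated with simple token arithmetic. The sketch below is not taken from the paper: it assumes non-overlapping square patches, a hypothetical patch size of 16 pixels, and one global self-attention layer spanning the tokens of all views. The models evaluated in the paper may use different patch sizes or attention layouts.

```python
def vit_token_count(height, width, patch=16):
    """Patch tokens for one image, assuming non-overlapping square patches.

    The default patch size of 16 is an illustrative assumption.
    """
    return (height // patch) * (width // patch)


def attention_pairs(num_views, height, width, patch=16):
    """Token pairs scored by one global self-attention layer over all views.

    Assumes every token attends to every token of every view, so the
    count is the square of the total token number.
    """
    n = num_views * vit_token_count(height, width, patch)
    return n * n


# Doubling the image side length quadruples the tokens per image
# and multiplies the attention pairs by sixteen.
low = attention_pairs(1, 512, 512)
high = attention_pairs(1, 1024, 1024)
ratio = high // low
```

Under these assumptions the attention cost grows with the fourth power of the image side length and with the square of the number of views, which suggests why full-resolution oblique blocks are difficult to process in a single forward pass.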
