ISPRS-Annals

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.

2194-9050

Copernicus Publications

Göttingen, Germany

10.5194/isprs-annals-XI-2-2026-821-2026

Assessing the Reconstruction Potential of 3D Vision Foundation Models for Oblique Photogrammetry

Wang, Junfan ¹

Liu, Feng ²

Jia, Zhihao ¹

Hu, Han ¹ ³

Chen, Min ¹ ³

Ge, Xuming ¹ ³

Wen, Ping ³ ⁴

Wang, Chong ³ ⁴

Zhu, Qing ¹ ³

¹ Faculty of Geosciences and Engineering, Southwest Jiaotong University, 611756 Chengdu, China

² CRSC Communication & Information Group Co., Ltd.

³ Yunnan Engineering Research Center of 3D Real Scene, Kunming 650500, China

⁴ Kunming Engineering Corporation Limited, Kunming 650500, China

03 07 2026

Volume XI-2-2026, pages 821-830

2026

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/

This article is available from https://isprs-annals.copernicus.org/articles/XI-2-2026/821/2026/isprs-annals-XI-2-2026-821-2026.html

The full text article is available as a PDF file from https://isprs-annals.copernicus.org/articles/XI-2-2026/821/2026/isprs-annals-XI-2-2026-821-2026.pdf

3D vision foundation models, which regress 3D geometry directly from 2D images in an end-to-end manner, have recently attracted growing attention in the computer vision community. However, their potential for oblique 3D reconstruction has not been systematically evaluated. To this end, we establish an automated evaluation pipeline to benchmark these models on oblique imagery. Our experiments yield four findings. First, benefiting from powerful zero-shot generalization, 3D vision foundation models can robustly estimate camera parameters and generate dense point clouds under sparse-view and low-overlap conditions, with some rivaling traditional photogrammetry configured with redundant observations. Second, and counterintuitively, two-view reasoning foundation models that employ explicit PnP-RANSAC for global alignment consistently outperform multi-view reasoning foundation models that infer multi-view relationships via implicit attention mechanisms when processing more than two views. Third, incorporating known camera parameters as conditioning inputs, which act as weak supervision rather than rigid geometric constraints, yields only marginal accuracy improvements. Fourth, being based on the ViT architecture, these foundation models face scalability bottlenecks on large-scale, high-resolution oblique imagery, and their prevalent ideal pinhole camera assumption still makes explicit distortion correction an unavoidable preprocessing step.
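To make the distortion-correction step mentioned in the abstract concrete, the following is a minimal sketch of mapping a distorted pixel to ideal pinhole coordinates. The abstract does not state which distortion model or calibration values are used, so everything here is an assumption for illustration: a two-coefficient radial model, the function names `distort`, `undistort`, and `undistort_pixel`, and all numeric values.

```python
def distort(x, y, k1, k2):
    """Apply radial distortion to normalized image coordinates.

    Assumed model (not taken from the paper):
    x_d = x_u * (1 + k1 * r^2 + k2 * r^4), with r^2 = x_u^2 + y_u^2.
    """
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * factor, y * factor


def undistort(xd, yd, k1, k2, iterations=20):
    """Invert the radial model by fixed-point iteration.

    The model has no closed-form inverse, so we start from the distorted
    point and repeatedly divide by the distortion factor evaluated at the
    current estimate. This converges for mild distortion.
    """
    xu, yu = xd, yd
    for _ in range(iterations):
        r2 = xu * xu + yu * yu
        factor = 1.0 + k1 * r2 + k2 * r2 * r2
        xu, yu = xd / factor, yd / factor
    return xu, yu


def undistort_pixel(u, v, fx, fy, cx, cy, k1, k2):
    """Map a distorted pixel to the pixel an ideal pinhole camera would see."""
    # Pixel -> normalized coordinates using the intrinsics.
    xd, yd = (u - cx) / fx, (v - cy) / fy
    xu, yu = undistort(xd, yd, k1, k2)
    # Normalized -> pixel coordinates with the same intrinsics.
    return fx * xu + cx, fy * yu + cy


if __name__ == "__main__":
    # Hypothetical calibration values, chosen only for the demonstration.
    k1, k2 = -0.1, 0.01
    xd, yd = distort(0.3, -0.2, k1, k2)
    print(undistort(xd, yd, k1, k2))
```

In practice the corrected images, not individual points, would be resampled before being passed to a model that assumes an ideal pinhole camera; the per-point inverse above is the core of that resampling.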