Stereo Matching of High-Resolution Satellite Images via Hierarchical ViT and Self-Supervised DINO

He, Xu; Yang, Mengran; Jiang, San; Jiang, Wanshou; Li, Qingquan

doi:https://doi.org/10.5194/isprs-annals-X-G-2025-357-2025

Articles | Volume X-G-2025

© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.

10 Jul 2025

Keywords: Satellite Image, Dense Matching, Deep Learning, Semi-global Matching

Abstract. Dense matching plays an important role in 3D modeling from satellite images. Its purpose is to establish pixel-by-pixel correspondences between the two images of a stereo pair. This study presents a learning-based dense matching approach that integrates self-supervised learning with a multi-head attention mechanism to achieve feature fusion. Stereo matching on satellite datasets is restricted by the disparity range, and the pixel-by-pixel method reduces this limitation. In the feature extraction module, attention-based deep learning is applied to the smallest-scale feature using the self-supervised DINO. In addition, a CEP (Context-Enhanced Path) module is added outside the main matching path, and continuously enhanced position embedding is used to improve relative position encoding. Experiments on the US3D and WHU-Stereo datasets demonstrate the effectiveness of the method.
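
To illustrate what "pixel-by-pixel correspondences" means in practice, the following minimal sketch converts a disparity map into explicit left-to-right pixel matches. It is not the method of the paper. It assumes a rectified stereo pair (matches lie on the same row), two images of equal width, and the sign convention col_right = col_left - d, all of which are assumptions made for illustration. The function name `correspondences` and the sample values are hypothetical.

```python
def correspondences(disparity, invalid=None):
    """Map each left-image pixel (row, col) to its right-image match.

    Assumptions (not taken from the paper): the pair is rectified, both
    images have the same width, and col_right = col_left - d. Some
    datasets use the opposite sign.
    """
    matches = {}
    for r, row in enumerate(disparity):
        width = len(row)
        for c, d in enumerate(row):
            if d is invalid:
                continue  # no disparity available for this pixel
            c_right = c - d
            # Keep only matches that fall inside the right image.
            if 0 <= c_right <= width - 1:
                matches[(r, c)] = (r, c_right)
    return matches


# Hypothetical 2 x 3 disparity map; None marks a pixel without a match.
disp = [
    [0.0, 1.0, -1.0],
    [None, 0.5, 2.0],
]
pairs = correspondences(disp)
```

Negative disparities are allowed in the sketch because disparities in satellite stereo pairs can take either sign. Sub-pixel values such as 0.5 yield fractional column coordinates in the right image.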
