Stereo Matching Network with Transformer-CNN Feature Fusion and ConvGRU Refinement for High-resolution Satellite Stereo Images

Yang, Mengran; Jiang, San; Jiang, Wanshou; Li, Qingquan

doi:https://doi.org/10.5194/isprs-annals-X-G-2025-995-2025

Articles | Volume X-G-2025

© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.

14 Jul 2025

Keywords: Satellite stereo images, Disparity estimation, Transformer, Gated recurrent unit, Convolutional neural network

Abstract. Disparity estimation from satellite stereo images is an important and challenging task in photogrammetry and remote sensing. Stereo matching methods have progressed substantially in recent years, but ill-posed regions such as textureless areas, repetitive patterns, and occlusions remain difficult. Although deep learning-based stereo matching methods outperform traditional approaches in accuracy and speed, their limited receptive field makes it hard for networks to establish long-range dependencies, which hampers matching in these ill-posed regions. This paper proposes an end-to-end stereo matching model for high-resolution satellite remote sensing images. First, in the feature extraction stage, independent Transformer and CNN modules extract global and local features, respectively, from the stereo image pair. A fusion strategy then merges the two feature types into a richer and more accurate representation. Next, multi-scale features are used to construct multi-level cost volumes, and each level is supervised from coarse to fine, so that lower-level cost volumes provide prior knowledge that guides the higher-level ones. Finally, a ConvGRU-based recurrent refinement module operates on geometry-encoded cost volumes, which carry both geometric and contextual information, to iteratively update the disparity map and recover finer details and structures. We validate the approach on publicly available datasets and compare it with traditional methods. The experimental results show significant performance improvements in stereo matching, demonstrating the effectiveness of the proposed method.
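To make the cost-volume step more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that fusion is plain channel concatenation (the abstract does not specify the fusion strategy), that the cost volume is a channel-averaged correlation over candidate disparities, and that disparity is read out by winner-take-all. The function names fuse_features, correlation_cost_volume, and winner_take_all_disparity are hypothetical, and the multi-level supervision and ConvGRU refinement stages are not shown.

```python
import numpy as np


def fuse_features(global_feat, local_feat):
    """Fuse two (C, H, W) feature maps by channel concatenation.

    Assumption: concatenation is only a stand-in for the paper's
    unspecified fusion strategy.
    """
    return np.concatenate([global_feat, local_feat], axis=0)


def correlation_cost_volume(feat_left, feat_right, max_disp):
    """Build a (max_disp, H, W) correlation volume from (C, H, W) features.

    Entry [d, y, x] is the channel-averaged product of the left feature at
    column x and the right feature at column x - d. Columns with x < d have
    no valid match and are left at zero.
    """
    _, height, width = feat_left.shape
    volume = np.zeros((max_disp, height, width), dtype=np.float64)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (feat_left * feat_right).mean(axis=0)
        else:
            # Left columns d..W-1 are compared with right columns 0..W-1-d.
            volume[d, :, d:] = (
                feat_left[:, :, d:] * feat_right[:, :, :-d]
            ).mean(axis=0)
    return volume


def winner_take_all_disparity(volume):
    """Pick, per pixel, the disparity with the highest correlation."""
    return np.argmax(volume, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for Transformer (global) and CNN (local) features.
    left = fuse_features(rng.normal(size=(8, 4, 32)), rng.normal(size=(8, 4, 32)))
    right = fuse_features(rng.normal(size=(8, 4, 32)), rng.normal(size=(8, 4, 32)))
    disparity = winner_take_all_disparity(correlation_cost_volume(left, right, 8))
    print(disparity.shape)
```

A recurrent refinement stage such as the ConvGRU module described in the abstract would take a volume like this, together with context features, and update an initial disparity estimate iteratively rather than relying on a single argmax readout.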
