ISPRS-Annals

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences


ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.

ISSN: 2194-9050

Copernicus Publications

Göttingen, Germany

DOI: 10.5194/isprs-annals-XI-2-2026-455-2026

Zero-shot Vision-Language Reranking for Cross-View Geolocalization

Yunus Talha Erzurumlu¹

John E. Anderson²

William J. Shuart²

Charles Toth³ (ORCID: https://orcid.org/0000-0001-9461-4887)

Alper Yilmaz

¹ Dept. of Electrical and Computer Engineering, The Ohio State University, 281 W Lane Ave, Columbus, Ohio, USA

² US Army Corps of Engineers Geospatial Research Lab, Corbin Field Station, Woodford, Virginia, USA

³ Dept. of Civil Engineering, The Ohio State University, 281 W Lane Ave, Columbus, Ohio, USA

03 07 2026

Volume XI-2-2026, pages 455–461

2026

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/

This article is available from https://isprs-annals.copernicus.org/articles/XI-2-2026/455/2026/isprs-annals-XI-2-2026-455-2026.html

The full text article is available as a PDF file from https://isprs-annals.copernicus.org/articles/XI-2-2026/455/2026/isprs-annals-XI-2-2026-455-2026.pdf

Cross-view geolocalization (CVGL) systems, while effective at retrieving a list of relevant candidates (high Recall@k), often fail to identify the single best match (low Top-1 accuracy). This work investigates the use of zero-shot Vision-Language Models (VLMs) as rerankers to address this gap. We propose a two-stage framework: state-of-the-art (SOTA) retrieval followed by VLM reranking. We systematically compare two strategies: (1) pointwise, which scores each candidate individually, and (2) pairwise, which compares candidates against each other. Experiments on the VIGOR dataset show a clear divergence: every pointwise method either causes a catastrophic drop in performance or leaves it unchanged. In contrast, a pairwise comparison strategy using LLaVA improves Top-1 accuracy over the strong retrieval baseline. Our analysis concludes that these VLMs are poorly calibrated for absolute relevance scoring but are effective at fine-grained relative visual judgment, making pairwise reranking a promising direction for enhancing CVGL precision.
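The two reranking strategies compared above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. `score` and `prefer` are hypothetical stand-ins for VLM calls (for example, a prompt asking a model such as LLaVA to rate one satellite candidate against the ground-level query, or to pick the better of two candidates). The scheme used here to aggregate pairwise judgments, round-robin win counting, is also an assumption, because the abstract does not say how comparisons are combined. `recall_at_k` shows how a ranking can achieve a high Recall@k yet still miss at Top-1.

```python
from itertools import combinations
from typing import Callable, Sequence

# Hypothetical judge signatures (stand-ins for VLM prompts, not a real API):
#   score(query, candidate) -> float, higher = more relevant (pointwise)
#   prefer(query, a, b) -> bool, True if a is the better match (pairwise)
ScoreFn = Callable[[str, str], float]
PreferFn = Callable[[str, str, str], bool]


def recall_at_k(ranked: Sequence[str], true_id: str, k: int) -> float:
    """1.0 if the true match appears among the first k candidates, else 0.0."""
    return 1.0 if true_id in ranked[:k] else 0.0


def pointwise_rerank(query: str, candidates: Sequence[str], score: ScoreFn) -> list[str]:
    """Score each candidate independently and sort by descending score."""
    # sorted() is stable even with reverse=True, so ties keep retrieval order.
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)


def pairwise_rerank(query: str, candidates: Sequence[str], prefer: PreferFn) -> list[str]:
    """Compare every pair once and sort by the number of pairwise wins."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        winner = a if prefer(query, a, b) else b
        wins[winner] += 1
    # Stable sort: candidates with equal win counts keep retrieval order.
    return sorted(candidates, key=lambda c: wins[c], reverse=True)


if __name__ == "__main__":
    retrieved = ["a", "b", "c"]  # toy retrieval order; the true match is "b"
    true_match = "b"

    # An uninformative pointwise judge (constant score) changes nothing.
    flat = pointwise_rerank("q", retrieved, lambda q, c: 0.5)

    # A toy pairwise judge that recognizes the true match and otherwise
    # defers to the retrieval order.
    def toy_prefer(q: str, x: str, y: str) -> bool:
        if x == true_match or y == true_match:
            return x == true_match
        return retrieved.index(x) < retrieved.index(y)

    paired = pairwise_rerank("q", retrieved, toy_prefer)
    print(recall_at_k(retrieved, true_match, 3), recall_at_k(retrieved, true_match, 1))
    print(flat, paired)
```

In this toy run the true match sits inside the top 3 but not at rank 1. The constant pointwise judge leaves that ordering untouched, while the pairwise judge moves "b" to the top. Round-robin comparison needs k(k−1)/2 judge calls for k candidates, so in practice it would only be applied to a short top-k list.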