ISPRS-Annals

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.

2194-9050

Copernicus Publications

Göttingen, Germany

10.5194/isprs-annals-XI-2-2026-857-2026

From Pixels to Semantics: Can a Single Instruction-Tuned VLM Unify Geospatial Building Analysis?

Mutreja

Guneet

https://orcid.org/0000-0002-2070-4860

¹ Harikumar

Harisankar

² Amrullah

Chaikal

¹ Bittner

Ksenia

German Aerospace Center (DLR), Münchener Straße 20, Weßling, Germany

Dept. of Civil Engineering, Karlsruhe Institute of Technology, Kaiserstraße 12, Germany

03 07 2026

Volume XI-2-2026, pages 857–864

2026

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/

This article is available from https://isprs-annals.copernicus.org/articles/XI-2-2026/857/2026/isprs-annals-XI-2-2026-857-2026.html

The full text article is available as a PDF file from https://isprs-annals.copernicus.org/articles/XI-2-2026/857/2026/isprs-annals-XI-2-2026-857-2026.pdf

The analysis of buildings from aerial imagery is fundamental for urban planning and disaster response, yet it traditionally requires separate specialized models for tasks such as segmentation, detection, and semantic querying. Generalist Vision-Language Models (VLMs) offer a promising alternative, but adapting them to high-resolution remote sensing remains challenging. This paper proposes and investigates a data-centric methodology for adapting Google’s PaliGemma2 into a unified geospatial building analyzer. The main contribution is a pipeline that converts single-modality building polygon annotations into a multi-task instruction-tuning dataset of 16,500 samples spanning segmentation, detection, Visual Question Answering (VQA), and captioning. We conduct a study addressing three questions: (1) Can a single instruction-tuned VLM outperform specialized models in a multi-task setting? (2) What are the synergistic benefits of multi-task learning? (3) How data-efficient is this adaptation process? Results show that the unified model substantially outperforms both the zero-shot PaliGemma2 baseline and strong single-task fine-tuned variants on three of four tasks, while remaining competitive on the fourth. We also observe a strong synergistic effect: training jointly on visual localization and semantic tasks improves performance on the individual localization tasks. Furthermore, the adaptation is data-efficient: high performance is achieved with a small instruction dataset. This work provides a complete methodology for efficiently adapting VLMs to multi-task geospatial analysis, suggesting a path toward generalist models in remote sensing. To support further research and fair comparison, the dataset is available at: https://chaikalamrullah.github.io/RoofVIP/
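The polygon-to-instruction conversion described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the prompt strings, the caption template, and the exact coordinate quantization are assumptions chosen for illustration. The target string loosely follows the PaliGemma-style detection convention of four location tokens (y_min, x_min, y_max, x_max over 1024 bins) followed by a label. Segmentation targets are omitted because they require mask tokens that this sketch does not model.

```python
from typing import Dict, List, Tuple

# A building footprint as a list of (x, y) pixel vertices (assumed input format).
Polygon = List[Tuple[float, float]]


def bbox(poly: Polygon) -> Tuple[float, float, float, float]:
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a polygon."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def loc_token(value: float, size: int) -> str:
    """Quantize a pixel coordinate into one of 1024 bins (assumed scheme)."""
    bin_index = min(1023, max(0, int(value / size * 1024)))
    return f"<loc{bin_index:04d}>"


def build_samples(polys: List[Polygon], width: int, height: int) -> List[Dict[str, str]]:
    """Derive detection, VQA, and captioning samples from one image's polygons."""
    boxes = []
    for poly in polys:
        x0, y0, x1, y1 = bbox(poly)
        # Token order y_min, x_min, y_max, x_max, then the class label.
        boxes.append(
            loc_token(y0, height) + loc_token(x0, width)
            + loc_token(y1, height) + loc_token(x1, width) + " building"
        )
    n = len(polys)
    plural = "" if n == 1 else "s"
    # Prompts and the caption template below are hypothetical placeholders.
    return [
        {"task": "detection", "prompt": "detect building", "target": " ; ".join(boxes)},
        {"task": "vqa", "prompt": "answer en How many buildings are visible?", "target": str(n)},
        {"task": "captioning", "prompt": "caption en",
         "target": f"An aerial image showing {n} building{plural}."},
    ]
```

Calling `build_samples` with one rectangular footprint on a 1024 × 1024 tile yields three text-only targets from a single annotation source. This is the sense in which one polygon layer can supervise several tasks at once.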