<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="3.0" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher">ISPRS-Annals</journal-id>
<journal-title-group>
<journal-title>ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences</journal-title>
<abbrev-journal-title abbrev-type="publisher">ISPRS-Annals</abbrev-journal-title>
<abbrev-journal-title abbrev-type="nlm-ta">ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">2194-9050</issn>
<publisher><publisher-name>Copernicus Publications</publisher-name>
<publisher-loc>Göttingen, Germany</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.5194/isprs-annals-XI-2-2026-857-2026</article-id>
<title-group>
<article-title>From Pixels to Semantics: Can a Single Instruction-Tuned VLM Unify Geospatial Building Analysis?</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Mutreja</surname>
<given-names>Guneet</given-names>
<ext-link>https://orcid.org/0000-0002-2070-4860</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Harikumar</surname>
<given-names>Harisankar</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Amrullah</surname>
<given-names>Chaikal</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Bittner</surname>
<given-names>Ksenia</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group><aff id="aff1">
<label>1</label>
<addr-line>German Aerospace Center (DLR), Münchener Straße 20, Weßling, Germany</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>Dept. of Civil Engineering, Karlsruhe Institute of Technology, Kaiserstraße 12, Germany</addr-line>
</aff>
<pub-date pub-type="epub">
<day>03</day>
<month>07</month>
<year>2026</year>
</pub-date>
<volume>XI-2-2026</volume>
<fpage>857</fpage>
<lpage>864</lpage>
<permissions>
<copyright-statement>Copyright: &#x000a9; 2026 Guneet Mutreja et al.</copyright-statement>
<copyright-year>2026</copyright-year>
<license license-type="open-access">
<license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri"  xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p>
</license>
</permissions>
<self-uri xlink:href="https://isprs-annals.copernicus.org/articles/XI-2-2026/857/2026/isprs-annals-XI-2-2026-857-2026.html">This article is available from https://isprs-annals.copernicus.org/articles/XI-2-2026/857/2026/isprs-annals-XI-2-2026-857-2026.html</self-uri>
<self-uri xlink:href="https://isprs-annals.copernicus.org/articles/XI-2-2026/857/2026/isprs-annals-XI-2-2026-857-2026.pdf">The full text article is available as a PDF file from https://isprs-annals.copernicus.org/articles/XI-2-2026/857/2026/isprs-annals-XI-2-2026-857-2026.pdf</self-uri>
<abstract>
<p>The analysis of buildings from aerial imagery is fundamental for urban planning and disaster response, yet it traditionally requires separate specialized models for tasks such as segmentation, detection, and semantic querying. Generalist Vision-Language Models (VLMs) offer a promising alternative, but adapting them to high-resolution remote sensing remains challenging. This paper proposes and investigates a data-centric methodology for adapting Google&#x2019;s PaliGemma2 into a unified geospatial building analyzer. The main contribution is a pipeline that converts single-modality building polygon annotations into a multi-task instruction-tuning dataset of 16,500 samples spanning segmentation, detection, Visual Question Answering (VQA), and captioning. We conduct a rigorous study addressing three questions: (1) Can a single instruction-tuned VLM outperform specialized models in a multi-task setting? (2) What are the synergistic benefits of multi-task learning? (3) How data-efficient is this adaptation process? Results show that the unified model substantially outperforms the zero-shot PaliGemma2 baseline and strong single-task fine-tuned variants on three of four tasks, while remaining competitive on the fourth. We also observe a strong synergistic effect: multi-task training on visual localization and semantic tasks improves performance on individual localization tasks. Furthermore, high performance is achieved with a surprisingly small instruction dataset. This work provides a complete methodology for efficiently adapting VLMs to multi-task geospatial analysis, suggesting a path toward generalist models in remote sensing. To support further research and fair comparison, the dataset is available at: <ext-link ext-link-type="uri" xlink:href="https://chaikalamrullah.github.io/RoofVIP/">https://chaikalamrullah.github.io/RoofVIP/</ext-link>.</p>
</abstract>
<counts><page-count count="8"/></counts>
</article-meta>
</front>
<body/>
<back>
</back>
</article>