From Pixels to Semantics: Can a Single Instruction-Tuned VLM Unify Geospatial Building Analysis?
Keywords: Vision-Language Models, Instruction Tuning, Remote Sensing, Building Analysis, Multi-Task Learning
Abstract. The analysis of buildings from aerial imagery is fundamental for urban planning and disaster response, yet it traditionally requires separate specialized models for tasks such as segmentation, detection, and semantic querying. Generalist Vision-Language Models (VLMs) offer a promising alternative, but adapting them to high-resolution remote sensing remains challenging. This paper proposes and investigates a data-centric methodology for adapting Google’s PaliGemma2 into a unified geospatial building analyzer. The main contribution is a pipeline that converts single-modality building polygon annotations into a multi-task instruction-tuning dataset of 16,500 samples spanning segmentation, detection, Visual Question Answering (VQA), and captioning. We conduct a rigorous study addressing three questions: (1) Can a single instruction-tuned VLM outperform specialized models in a multi-task setting? (2) What are the synergistic benefits of multi-task learning? (3) How data-efficient is this adaptation process? Results show that the unified model substantially outperforms the zero-shot PaliGemma2 baseline and strong single-task fine-tuned variants on three of four tasks, while remaining competitive on the fourth. We also observe a strong synergistic effect: multi-task training on visual localization and semantic tasks improves performance on individual localization tasks. Furthermore, high performance is achieved with a surprisingly small instruction dataset. This work provides a complete methodology for efficiently adapting VLMs to multi-task geospatial analysis, suggesting a path toward generalist models in remote sensing. To support further research and fair comparison, the dataset is available at: https://chaikalamrullah.github.io/RoofVIP/
