Advancing Mixed Land Use Detection by Embedding Spatial Intelligence into Vision-Language Models
Keywords: Mixed Land Use, Urban Analytics, Vision-Language Models, Spatial Intelligence, Spatially Explicit AI, GeoAI
Abstract. Embedding spatial intelligence into vision-language models (VLMs) offers a promising avenue for improving geospatial decision-making in complex urban environments. In this work, we propose a novel framework that augments the Contrastive Language-Image Pretraining (CLIP) architecture with spatial-context-aware prompt engineering and spatially explicit contrastive learning. By pairing diverse geospatial imagery (e.g., street view, satellite, and map tile images) with contextual geospatial text generated and curated via GPT-4, our approach learns robust multimodal representations that capture visual, textual, and spatial cues. The proposed model, termed GeospatialCLIP, is evaluated on urban mixed land use detection, a critical task for sustainable urban planning and smart city development. Results demonstrate that GeospatialCLIP consistently outperforms traditional vision-based few-shot models (e.g., ResNet-152, Vision Transformers) and achieves performance competitive with state-of-the-art models such as GPT-4. Notably, incorporating spatial prompts, especially those providing city-specific cues, significantly boosts detection accuracy. Our findings highlight the pivotal role of spatial intelligence in refining VLM performance and provide novel insights into integrating geospatial reasoning into multimodal learning. Overall, this work establishes a foundation for future spatially explicit AI development and applications, paving the way for more comprehensive and interpretable models in urban analytics and beyond.
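
To make the two techniques named in the abstract more concrete, the sketch below illustrates one plausible reading of them: a spatial-context-aware prompt template that injects city-specific cues into the CLIP text input, and a spatially explicit contrastive loss that down-weights negative pairs drawn from nearby locations. The template wording, the distance-based weighting scheme, and all function and parameter names (e.g., build_spatial_prompt, sigma_km) are assumptions made for illustration, not the paper's released implementation.

```python
# Illustrative sketch only: the prompt template, the distance-based weighting,
# and all names below are assumptions, not the authors' released code.
import torch
import torch.nn.functional as F

def build_spatial_prompt(land_use_label, city, neighborhood=None):
    """Compose a spatial-context-aware text prompt for the CLIP text encoder."""
    prompt = f"a street-level photo of a {land_use_label} area in {city}"
    if neighborhood:
        prompt += f", near {neighborhood}"
    return prompt

def spatially_explicit_contrastive_loss(img_emb, txt_emb, coords,
                                        temperature=0.07, sigma_km=1.0):
    """
    InfoNCE-style loss in which negatives from nearby locations are
    down-weighted, so the model is not penalized as heavily for confusing
    places that are spatially (and often functionally) similar.

    img_emb, txt_emb: (B, D) L2-normalized image/text embeddings
    coords:           (B, 2) approximate sample locations (km-scale units)
    """
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    dist = torch.cdist(coords, coords)                    # pairwise geographic distance
    spatial_weight = 1.0 - torch.exp(-dist ** 2 / (2 * sigma_km ** 2))
    spatial_weight.fill_diagonal_(1.0)                    # keep positive pairs at full weight
    weighted_logits = logits + torch.log(spatial_weight + 1e-6)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(weighted_logits, targets) +
                  F.cross_entropy(weighted_logits.t(), targets))

# Example usage with random embeddings standing in for CLIP encoder outputs.
B, D = 8, 512
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)
xy = torch.rand(B, 2) * 5.0                               # locations within a ~5 km extent
print(build_spatial_prompt("mixed land use", "Singapore", "Chinatown"))
print(spatially_explicit_contrastive_loss(img, txt, xy).item())
```

In this reading, the geographic kernel width (sigma_km) controls how aggressively nearby negatives are discounted; the actual GeospatialCLIP loss and prompt design are described in the body of the paper.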