TextSCD: Leveraging Text-based Semantic Guidance for Remote Sensing Image Semantic Change Detection
Keywords: Semantic change detection, Vision-language representation learning, Multi-task learning, Remote sensing
Abstract. Semantic change detection (SCD) in remote sensing images aims to identify semantic alterations between bi-temporal images captured at the same geographic location, and it is widely applied in fields such as environmental monitoring and disaster assessment. Although advances in deep learning have produced numerous successful approaches, most existing methods rely primarily on visual representation learning and thus overlook the potential benefits of multimodal data. Recently, vision-language models have demonstrated outstanding performance across various downstream tasks. In this paper, we propose a novel framework named TextSCD that leverages text-based semantic information to guide the generation of semantic change maps. Our approach employs Gemini to generate change descriptions between bi-temporal images and uses a multi-level semantic extraction method to capture features from both the images and their corresponding captions. Furthermore, we introduce a semantic text-guided interaction module that integrates visual and textual features, enhancing multimodal knowledge transfer and the extraction of discriminative features; this design effectively reduces false detections and omissions. We validate the effectiveness of our model on the SECOND dataset, achieving notable improvements in the overall accuracy of semantic change detection.
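The abstract does not specify how the semantic text-guided interaction module is implemented. As a minimal illustrative sketch only, one common way to realize such visual-textual fusion is cross-attention from flattened image tokens to caption embeddings; the module name, dimensions, and tensor shapes below are hypothetical assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class TextGuidedInteraction(nn.Module):
    """Hypothetical sketch of a text-guided interaction module:
    visual tokens attend to caption embeddings via cross-attention.
    Name and hyperparameters are illustrative, not from TextSCD."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, H*W, C) flattened image feature tokens
        # text_feats:   (B, T, C) caption token embeddings from a text encoder
        attended, _ = self.cross_attn(query=visual_feats, key=text_feats, value=text_feats)
        # residual connection preserves the original visual signal
        return self.norm(visual_feats + attended)

# usage sketch with random tensors standing in for real features
if __name__ == "__main__":
    fuse = TextGuidedInteraction(dim=256)
    vis = torch.randn(2, 64 * 64, 256)  # tokens from one temporal image branch
    txt = torch.randn(2, 32, 256)       # embeddings of a generated change caption
    out = fuse(vis, txt)                # (2, 4096, 256) text-guided visual features
```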