Elevation Guided Global and Local Smoothness for Unsupervised Semantic Segmentation in Remote Sensing Imagery
Keywords: Multimodal Training, Self-Supervision, nDSM, Energy Minimization, Conditional Random Fields
Abstract. Unsupervised and self-supervised deep learning networks for semantic segmentation have made impressive progress in recent years. They can be trained without any labelled data and still segment RGB images into meaningful semantic groups. In remote sensing, supplementary information such as elevation improves class separation by differentiating classes based on their height above ground. We take SmooSeg, a recently developed, state-of-the-art unsupervised network for semantic segmentation, and guide its training by infusing elevation information into its projector and smoothness prior. Since patches of the same semantic group often exhibit similar elevation characteristics, this ensures global label consistency across the entire dataset and improves segmentation performance. We also extend the Conditional Random Field (CRF) used to refine the low-resolution segmentation results in a post-processing step with elevation information: we introduce a second pairwise potential that encourages neighboring pixels with similar elevation to take the same label, ensuring local label consistency. Our multi-modal training strategy remains unsupervised and improves segmentation performance on the ISPRS Potsdam-3 dataset by +4.0% mIoU over the RGB-only SmooSeg baseline, and by +4.4% when the multi-modal CRF post-processing is also applied. Collectively, our approach surpasses all state-of-the-art unsupervised segmentation networks that rely solely on RGB data on the Potsdam-3 dataset, highlighting the important role of elevation data in label-free segmentation for remote sensing.
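As a concrete illustration (a sketch, not the paper's exact formulation), the elevation-aware post-processing can be written as a fully connected CRF in the style of Krähenbühl and Koltun, with an elevation-driven Gaussian kernel added alongside the usual appearance kernel; the weights $w_1, w_2$ and bandwidths $\theta_\alpha, \theta_\beta, \theta_\gamma, \theta_\varepsilon$ are placeholder names introduced here for illustration:

\[
E(x) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j),
\]
\[
\psi_p(x_i, x_j) = \mu(x_i, x_j)\left[\, w_1 \exp\!\left(-\frac{\lVert p_i - p_j\rVert^2}{2\theta_\alpha^2} - \frac{\lVert I_i - I_j\rVert^2}{2\theta_\beta^2}\right) + w_2 \exp\!\left(-\frac{\lVert p_i - p_j\rVert^2}{2\theta_\gamma^2} - \frac{(e_i - e_j)^2}{2\theta_\varepsilon^2}\right) \right],
\]

where $p_i$, $I_i$, and $e_i$ denote the position, RGB value, and elevation (e.g., from an nDSM) of pixel $i$, and $\mu$ is a label-compatibility function such as the Potts model $\mu(x_i, x_j) = [x_i \neq x_j]$. The second kernel is the added pairwise potential: it penalizes label disagreement between nearby pixels with similar elevation, which enforces the local label consistency described above.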