ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Publications Copernicus
Articles | Volume V-2-2022
17 May 2022


D. Laupheimer and N. Haala

Keywords: Urban Scene Understanding, Semantic Segmentation, Multi-Modality, Textured Mesh, Point Cloud

Abstract. The semantic segmentation of the huge amount of acquired 3D data has become an important task in recent years. Meshes have evolved into a standard representation alongside Point Clouds (PCs), not least because of their great visualization possibilities. Compared to PCs, meshes commonly have smaller memory footprints while jointly providing geometrical and high-resolution textural information. For this reason, we opt for semantic mesh segmentation, which is still a widely overlooked topic in photogrammetry and remote sensing. In this work, we perform an extensive ablation study on multi-modal handcrafted features, adapting the Point Cloud Mesh Association (PCMA) (Laupheimer et al., 2020), which establishes explicit connections between faces and points. The multi-modal connections are used in two ways: (i) to extend per-face descriptors with features engineered on the PC and (ii) to annotate meshes semi-automatically by propagating the manually assigned labels from the PCs. In this way, we derive annotated meshes from the ISPRS benchmark data sets Vaihingen 3D (V3D) and Hessigheim 3D (H3D). To demonstrate the effectiveness of the multi-modal approach, we train well-established and fast Random Forest (RF) models on various feature vector compositions and analyze their performance for semantic mesh segmentation. The feature vector compositions consider features derived from the mesh, the PC, or both. The results indicate that the combination of radiometric and geometric features outperforms feature sets of a single feature type. In addition, we observe that relative height is the most crucial feature. The main finding is that the multi-modal feature vector integrates the complementary strengths of the underlying modalities: whereas the mesh provides outstanding textural information, the dense PCs are superior in geometry. The multi-modal feature descriptor achieves the best performance on both data sets. On V3D, it significantly outperforms feature sets that incorporate only mesh-derived features, by +7.37 percentage points (pp) in mean F1 score (mF1) and +2.38 pp in Overall Accuracy (OA). On H3D, the corresponding improvements are +9.23 pp in mF1 and +4.33 pp in OA.
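The label propagation from points to faces and the two reported metrics can be illustrated with a minimal Python sketch. It is not the PCMA implementation. It assumes the face-to-point associations are already available as a list of point indices per face, and it uses a simple majority vote, which is an assumption rather than the rule stated in the abstract. The names `propagate_labels`, `overall_accuracy`, and `mean_f1` are hypothetical, and mF1 is taken here to be the unweighted mean of the per-class F1 scores.

```python
from collections import Counter


def propagate_labels(face_to_points, point_labels, unlabeled=-1):
    """Give each face the most frequent label among its associated points.

    Majority voting is an assumption of this sketch, not necessarily
    the propagation rule used by PCMA.
    """
    face_labels = []
    for point_ids in face_to_points:
        votes = Counter(point_labels[i] for i in point_ids)
        # Faces without any associated point remain unlabeled.
        face_labels.append(votes.most_common(1)[0][0] if votes else unlabeled)
    return face_labels


def overall_accuracy(y_true, y_pred):
    """Fraction of faces whose predicted label equals the reference label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1_scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data: three faces, five labeled points (all values are made up).
face_to_points = [[0, 1, 2], [3, 4], []]
point_labels = [0, 0, 1, 2, 2]
face_labels = propagate_labels(face_to_points, point_labels)  # [0, 2, -1]

# Toy evaluation of hypothetical per-face predictions.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
oa = overall_accuracy(y_true, y_pred)
mf1 = mean_f1(y_true, y_pred)
```

A difference in percentage points between two such scores is simply `(score_a - score_b) * 100`, which is how gains such as +7.37 pp are expressed.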