ISPRS-Annals

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.

2194-9050

Copernicus Publications

Göttingen, Germany

10.5194/isprs-annals-XI-1-2026-255-2026

A Category-Specific Prompt Strategy for Semantic 3D Indoor Mapping Using RGB-D Camera

Hou, Jiwei ¹

Volland, Vivien ²

Karam, Samer ¹

Iwaszczuk, Dorota

Remote Sensing and Image Analysis, Department of Civil and Environmental Engineering, Technical University of Darmstadt, 64287 Darmstadt, Germany

Geodetic Measurement Systems and Sensor Technology, Department of Civil and Environmental Engineering, Technical University of Darmstadt, 64287 Darmstadt, Germany

03 07 2026

Volume XI-1-2026, pages 255-262

2026

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/

This article is available from https://isprs-annals.copernicus.org/articles/XI-1-2026/255/2026/isprs-annals-XI-1-2026-255-2026.html

The full text article is available as a PDF file from https://isprs-annals.copernicus.org/articles/XI-1-2026/255/2026/isprs-annals-XI-1-2026-255-2026.pdf

Semantic 3D indoor mapping often depends on supervised learning and large annotated datasets, limiting scalability across diverse environments. This work introduces a category-specific prompt strategy for semantic 3D mapping using RGB-D cameras, integrating RGB-D SLAM with the Segment Anything Model 2 (SAM2) to enable annotation-efficient reconstruction. Keyframes and trajectories extracted from SLAM provide spatial references, while SAM2 performs zero-shot segmentation guided by a Category-Wise Prompt Segmentation Strategy (CPSS), which segments structural and functional elements (e.g., floors, doors, staircases) by category to reduce prompt interference and manual effort. The segmented keyframes are then fused with depth and pose data to produce instance-level semantic point clouds. Experiments on custom RGB-D sequences and selected ScanNet scenes demonstrate centimeter-scale geometric consistency and strong semantic consistency, with mIoU values up to 0.89 on the custom dataset and 0.98 on ScanNet. The resulting semantic point clouds are clean, structured, and require minimal post-processing, showing that the proposed strategy provides an efficient and scalable solution for semantic 3D indoor mapping without retraining or environment-specific supervision.
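
The fusion step described above (segmented keyframes combined with depth and pose) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera model, a 4x4 camera-to-world pose per keyframe, depth in metres, and a per-pixel instance-id mask in which 0 means unlabeled. The function name `backproject_labeled_keyframe` and all parameter names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_labeled_keyframe(
    depth: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    intrinsics: Tuple[float, float, float, float],
    pose: Sequence[Sequence[float]],
) -> Dict[int, List[Point]]:
    """Lift the labeled pixels of one keyframe into world-frame 3D points.

    depth      -- per-pixel depth in metres (values <= 0 are treated as invalid)
    labels     -- per-pixel instance id from the segmentation (0 = unlabeled)
    intrinsics -- (fx, fy, cx, cy) of an assumed pinhole camera
    pose       -- assumed 4x4 camera-to-world matrix from the SLAM trajectory
    """
    fx, fy, cx, cy = intrinsics
    cloud: Dict[int, List[Point]] = {}
    for v, (depth_row, label_row) in enumerate(zip(depth, labels)):
        for u, (z, label) in enumerate(zip(depth_row, label_row)):
            if z <= 0 or label == 0:
                continue  # skip invalid depth and unlabeled pixels
            # Pinhole back-projection into the camera frame.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            # Rigid transform of (x, y, z) into the world frame.
            world = tuple(
                pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
                for r in range(3)
            )
            cloud.setdefault(label, []).append(world)
    return cloud


# Toy 2x2 keyframe: one invalid depth reading and one unlabeled pixel.
depth = [[2.0, 0.0], [2.0, 2.0]]
labels = [[1, 1], [0, 2]]
intrinsics = (1.0, 1.0, 0.0, 0.0)
pose = [
    [1.0, 0.0, 0.0, 1.0],  # identity rotation, 1 m shift along world x
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
cloud = backproject_labeled_keyframe(depth, labels, intrinsics, pose)
```

Running the same function over every keyframe and merging the per-instance point lists would give a labeled point cloud in a common frame. How instance ids are kept consistent across keyframes is not specified in the text above, and this sketch leaves it out.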