CityLangSplat: Integrating CityGML Semantics into 3D Language Gaussian Splatting for Urban Scene Understanding
Keywords: Gaussian Splatting, Open-vocabulary 3D Understanding, CityGML, Semantic Fusion, Urban Scene Understanding
Abstract. Combining visual semantics with language representations has made 3D interpretation more flexible and intuitive. Recent advances in Gaussian Splatting extend this to efficient 3D language fields supporting open-vocabulary queries. However, existing approaches show limited generalization in large urban scenes, especially for detailed building segmentation. Semantic 3D city models such as CityGML, by contrast, provide hierarchical and geometry-aligned structural semantics that complement appearance-driven visual cues. We introduce CityLangSplat, which integrates CityGML semantics into 3D Language Gaussian Splatting for urban environments. CityLangSplat rasterizes CityGML into pixel-aligned semantic maps, extracts vision-language features from SAM-derived segments and CityGML regions, and compresses both sources into a shared latent space via a lightweight autoencoder. 3D Gaussians are then optimized with a coverage-aware loss that balances accurate, building-focused CityGML supervision with broader SAM supervision, enabling geometry-aligned open-vocabulary reasoning in urban scenes. Experiments on the TUM2TWIN and ZAHA datasets show consistent gains over LangSplat, with relative improvements of 22.9% in 2D and 15.1% in 3D evaluation, while preserving real-time rendering. CityLangSplat provides a practical framework for combining semantic city models with language-embedded 3D Gaussian Splatting for geometry-aligned urban scene interpretation. Code will be released at https://github.com/zqlin0521/CityLangSplat.
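The abstract does not spell out the form of the coverage-aware loss, so the following is only an illustrative sketch of one way such a balance could be expressed, not the formulation used in CityLangSplat. It assumes per-pixel latent feature maps, an L1 error, and a weighting by the fraction of pixels covered by rasterized CityGML regions. The function name `coverage_aware_loss`, its arguments, and the weighting rule are all hypothetical.

```python
import numpy as np


def coverage_aware_loss(pred, sam_target, city_target, sam_mask, city_mask):
    """Hypothetical blend of CityGML and SAM supervision (illustration only).

    pred, sam_target, city_target: (H, W, D) latent feature maps.
    sam_mask, city_mask: (H, W) boolean maps of pixels with valid targets.
    Returns (loss, coverage), where coverage is the CityGML pixel fraction.
    """
    city_mask = np.asarray(city_mask, dtype=bool)
    # Assumption: where CityGML covers a pixel, it takes precedence over SAM.
    sam_only = np.asarray(sam_mask, dtype=bool) & ~city_mask

    # Per-pixel L1 error against each supervision source.
    err_city = np.abs(pred - city_target).mean(axis=-1)
    err_sam = np.abs(pred - sam_target).mean(axis=-1)

    # Normalize each term by its own pixel count so neither source dominates
    # purely because it covers more of the image.
    n_city = int(city_mask.sum())
    n_sam = int(sam_only.sum())
    loss_city = err_city[city_mask].sum() / max(n_city, 1)
    loss_sam = err_sam[sam_only].sum() / max(n_sam, 1)

    # Assumed weighting: the CityGML term scales with its image coverage.
    coverage = n_city / city_mask.size
    loss = coverage * loss_city + (1.0 - coverage) * loss_sam
    return float(loss), float(coverage)


if __name__ == "__main__":
    pred = np.zeros((2, 2, 1))
    city_target = np.ones((2, 2, 1))
    sam_target = np.full((2, 2, 1), 2.0)
    city_mask = np.array([[True, False], [False, False]])
    sam_mask = np.ones((2, 2), dtype=bool)
    print(coverage_aware_loss(pred, sam_target, city_target, sam_mask, city_mask))
```

Normalizing each term by its own pixel count is a design choice made for this sketch: it keeps the building-focused CityGML term from being swamped by the much larger SAM-supervised area in views where buildings occupy few pixels.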
