Automatic Detection Models for Building Exterior Wall Cracks in Drone Imagery Based on CNN And Transformer

Shang, Yaoling; Ge, Ying; Ma, Yuqing; Zhang, Yingying; Lv, Shilin

doi:10.5194/isprs-annals-XI-2-2026-729-2026

Articles | Volume XI-2-2026

https://doi.org/10.5194/isprs-annals-XI-2-2026-729-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

03 Jul 2026

Keywords: Building Facades, Drone Imagery, Crack Detection, U-Net Model, Transformer Architecture, Image Segmentation

Abstract. This study constructs a comprehensive evaluation framework comprising six representative models: standard U-Net, ResNet34-U-Net, UNet-Attention, UNet-Residual, HybridUNet, and TransUNet. We performed systematic ablation experiments to analyse the contributions of different architectural components, including residual connections, attention mechanisms, and Transformer modules. The models were trained and validated on a dedicated dataset of building exterior crack images captured by drones, with careful consideration of the challenges posed by complex backgrounds, varying lighting conditions, and fine crack features. Three loss functions (F1 Loss, Focal-Dice-Loss, and BCE-Dice-Loss) were evaluated to determine their impact on model performance. Performance was assessed with Accuracy, F1 Score, IoU, Precision, Recall, and Loss values.
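The metrics named above have standard pixel-wise definitions for binary segmentation. The following is a minimal sketch of those definitions, not the authors' evaluation code: it assumes flattened binary masks (1 = crack, 0 = background), and the function names `confusion_counts` and `segmentation_metrics` as well as the `eps` stabiliser are illustrative choices. How the paper aggregates scores across images is not stated in the abstract.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN over two equally sized flat binary masks."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, target):
        if p and t:
            tp += 1          # crack pixel correctly detected
        elif p and not t:
            fp += 1          # background pixel flagged as crack
        elif not p and t:
            fn += 1          # crack pixel missed
        else:
            tn += 1          # background pixel correctly ignored
    return tp, fp, fn, tn


def segmentation_metrics(pred, target, eps=1e-9):
    """Return the standard pixel-wise metrics as a dict of floats in [0, 1]."""
    tp, fp, fn, tn = confusion_counts(pred, target)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn + eps),
        "precision": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall
        "f1": 2 * precision * recall / (precision + recall + eps),
        # IoU ignores true negatives, so it is not inflated by background
        "iou": tp / (tp + fp + fn + eps),
    }


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1, 0]
    target = [1, 0, 0, 0, 1, 1]
    print(segmentation_metrics(pred, target))
```

Because crack pixels usually occupy a small fraction of a facade image, Accuracy is dominated by true negatives, which is one reason F1 Score and IoU are commonly reported alongside it.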

Experimental results demonstrate that TransUNet achieved the best overall performance, with an F1 Score of 87.66%, Precision of 90.43%, and Recall of 89.99%, leveraging the global context modelling capability of its Transformer module. In the loss function comparison, F1 Loss yielded the most balanced performance on TransUNet, with an F1 Score of 87.50%, while Focal-Dice-Loss showed exceptional optimization stability, with the lowest loss value (0.1008) and high Recall (96.05%). Interestingly, the performance gap among the six models was relatively small: the difference in F1 Score between the best-performing TransUNet and the baseline standard U-Net was less than 0.5%. Qualitative analysis revealed that while complex models like TransUNet excel in overall metrics, simpler architectures like UNet-Attention and UNet-Residual demonstrate better robustness in challenging scenarios with complex textures, highlighting the importance of context-specific model selection.
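For readers unfamiliar with the three loss functions being compared, the sketch below shows one common way each is formulated. It is an illustration under stated assumptions, not the paper's implementation: the abstract gives no formulas, so the soft (differentiable) F1 form, the focal parameters `gamma` and `alpha`, the smoothing constant, and the equal weighting `w=0.5` of the combined terms are all assumed here, and every function name is hypothetical. Inputs are per-pixel crack probabilities and binary ground-truth labels.

```python
import math


def _clamp(p, eps=1e-7):
    """Keep probabilities away from 0 and 1 so log() stays finite."""
    return min(max(p, eps), 1.0 - eps)


def bce_loss(probs, targets):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = _clamp(p)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


def focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Mean focal loss: down-weights pixels that are already easy (assumed gamma, alpha)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = _clamp(p)
        pt = p if t else 1 - p              # probability given to the true class
        at = alpha if t else 1 - alpha      # class-balancing weight
        total += -at * (1 - pt) ** gamma * math.log(pt)
    return total / len(probs)


def soft_f1_loss(probs, targets, smooth=1.0):
    """1 - soft F1. With soft counts, F1 = 2TP / (2TP + FP + FN), the same form as Dice."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)


def bce_dice_loss(probs, targets, w=0.5):
    """Weighted sum of BCE and the soft Dice term (assumed equal weighting)."""
    return w * bce_loss(probs, targets) + (1 - w) * soft_f1_loss(probs, targets)


def focal_dice_loss(probs, targets, w=0.5):
    """Weighted sum of focal loss and the soft Dice term (assumed equal weighting)."""
    return w * focal_loss(probs, targets) + (1 - w) * soft_f1_loss(probs, targets)


if __name__ == "__main__":
    targets = [1, 0, 1, 0]
    print(focal_dice_loss([0.9, 0.1, 0.8, 0.2], targets))
```

Note that loss values from different formulations are on different scales, so a lower final loss for one function does not by itself indicate better segmentation quality than another.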

This research provides comprehensive insights into deep learning approaches for building exterior crack detection. TransUNet with F1 Loss emerges as the optimal solution for high-accuracy requirements, while standard U-Net and its attention-enhanced variants offer cost-effective alternatives for large-scale applications. The minimal performance gap among the architectures suggests that model complexity alone does not guarantee superior performance for this specific task. The study emphasizes the importance of balancing accuracy needs with computational efficiency in practical engineering applications. These findings offer guidance for model selection in real-world building maintenance scenarios and contribute to the advancement of intelligent detection technologies in structural health monitoring. Future work should focus on enhancing model robustness across diverse environmental conditions and optimizing computational efficiency for broader implementation.
