ISPRS-Annals

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.

2194-9050

Copernicus Publications

Göttingen, Germany

10.5194/isprs-annals-X-4-W8-2025-721-2026

The CNN-Transformers Crossroads, Comparing RT-DETR and YOLOv12 for Small object detection in remote sensing images

Behnam Solatinia ¹

Saeid Niazmardi ¹

Tayeb Alipour Fard

https://orcid.org/0000-0003-4777-0128

Department of Surveying Engineering, Faculty of Civil and Surveying Engineering, Graduate University of Advanced Technology, Kerman, Iran

29 05 2026

X-4/W8-2025 721 727

2026

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/

This article is available from https://isprs-annals.copernicus.org/articles/X-4-W8-2025/721/2026/isprs-annals-X-4-W8-2025-721-2026.html

The full text article is available as a PDF file from https://isprs-annals.copernicus.org/articles/X-4-W8-2025/721/2026/isprs-annals-X-4-W8-2025-721-2026.pdf

Detecting small objects in remote sensing images remains a persistent challenge. Convolutional Neural Networks (CNNs) and Transformer-based networks are two prominent categories of deep learning models used to address it. Recently, hybrid designs that combine both architectures have emerged to improve detection performance. However, a direct comparison between the leading standard models representing these architectures has yet to be conducted. In this study, we compare the performance of two state-of-the-art detectors: YOLOv12, a CNN-based model with an attention mechanism, and RT-DETR, a Transformer-based model built on a CNN backbone. We fine-tuned both models on a custom remote sensing dataset containing small objects (airplanes and cars) and evaluated them in terms of precision, recall, F1-score, and training time. The results show that YOLOv12 was significantly faster to train and achieved higher precision, which makes it the better choice for applications where minimizing false positives is critical. RT-DETR, with its high recall and F1-score, was more effective at detecting a larger number of small objects. This analysis clarifies the trade-offs between the two architectures and serves as a guideline for selecting the appropriate model for a specific remote sensing task.
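For readers less familiar with the evaluation metrics named above, the following minimal sketch shows how precision, recall, and F1-score follow from counts of true positives, false positives, and false negatives. It is an illustration under stated assumptions, not the evaluation code used in the study: the names `DetectionCounts` and `precision_recall_f1` are hypothetical, the counts are invented for the example and are not results from the paper, and the rule that matches predictions to ground-truth boxes (for example an IoU threshold) is assumed to have been applied already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    """Hypothetical container for matched detection outcomes."""
    true_positives: int   # predictions matched to a ground-truth object
    false_positives: int  # predictions with no matching ground-truth object
    false_negatives: int  # ground-truth objects that were missed


def precision_recall_f1(counts: DetectionCounts) -> tuple[float, float, float]:
    """Return (precision, recall, F1); each value is 0.0 when undefined."""
    tp = counts.true_positives
    fp = counts.false_positives
    fn = counts.false_negatives

    # Precision: fraction of predicted objects that are correct.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: fraction of ground-truth objects that were found.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Invented counts for 120 ground-truth objects (NOT results from the paper).
    conservative = DetectionCounts(true_positives=80, false_positives=10, false_negatives=40)
    liberal = DetectionCounts(true_positives=105, false_positives=30, false_negatives=15)

    for name, counts in [("conservative", conservative), ("liberal", liberal)]:
        p, r, f1 = precision_recall_f1(counts)
        print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

With these invented counts, the conservative detector has the higher precision while the liberal detector has the higher recall and F1-score, which is the kind of trade-off the abstract describes between minimizing false positives and finding more of the small objects.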