ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Publications Copernicus
Articles | Volume X-4/W8-2025
https://doi.org/10.5194/isprs-annals-X-4-W8-2025-389-2026
29 May 2026

Deep Learning for Autonomous UAV Navigation: Multi-Modal Visual-Inertial Pose Estimation in GPS-Denied Environments

Shadi Izadbakhsh and Maryam Khademi

Keywords: visual-inertial odometry, deep learning, UAV navigation, GPS-denied environments, sensor fusion

Abstract. This paper presents DeepVIONet, a novel deep learning framework for autonomous UAV navigation in GPS-denied environments using multi-modal visual-inertial pose estimation. The system addresses the critical challenge of accurate localization and navigation when traditional GPS signals are unavailable or unreliable, such as in indoor spaces, urban canyons, or adversarial conditions.

Our approach integrates synchronized camera imagery and inertial measurement unit (IMU) data through a dual-encoder architecture. The visual encoder employs a convolutional neural network followed by long short-term memory (LSTM) layers to extract spatio-temporal features from image sequences, while the IMU encoder processes accelerometer and gyroscope data using recurrent networks to capture motion dynamics. A fusion module combines these complementary modalities to predict relative pose transformations consisting of 3-DOF translation and 3-DOF rotation parameters.
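To illustrate how predicted relative poses could be used downstream, the following minimal sketch chains 6-DOF relative transformations into a global trajectory. It is not taken from the paper: the function names are hypothetical, and it assumes the rotation is given as roll, pitch, and yaw angles in radians (ZYX convention) with each translation expressed in the previous body frame. The abstract does not state which rotation parameterization DeepVIONet uses.

```python
import math


def euler_to_matrix(roll, pitch, yaw):
    """Return R = Rz(yaw) * Ry(pitch) * Rx(roll) as a 3x3 nested list (assumed convention)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def rotate(r, v):
    """Apply the 3x3 rotation r to the 3-vector v."""
    return [sum(r[i][k] * v[k] for k in range(3)) for i in range(3)]


def integrate_relative_poses(relative_poses):
    """Chain relative poses (tx, ty, tz, roll, pitch, yaw) into world-frame positions."""
    rotation = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    position = [0.0, 0.0, 0.0]
    trajectory = [tuple(position)]
    for tx, ty, tz, roll, pitch, yaw in relative_poses:
        # The translation is assumed to be in the previous body frame,
        # so it is rotated into the world frame before being accumulated.
        step = rotate(rotation, [tx, ty, tz])
        position = [p + s for p, s in zip(position, step)]
        # The relative rotation is composed on the right (body-frame update).
        rotation = matmul3(rotation, euler_to_matrix(roll, pitch, yaw))
        trajectory.append(tuple(position))
    return trajectory
```

Because each step is added to the previous estimate, errors in the per-step predictions accumulate along the trajectory, which is why per-step accuracy matters for navigation without GPS.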

The network is trained and evaluated on the TUM Visual-Inertial dataset, where it demonstrates robust performance in challenging scenarios. Our multi-modal fusion strategy exploits the complementary strengths of visual and inertial sensing: cameras provide rich environmental context and feature-based localization, while IMU data offers high-frequency motion estimates and robustness to visual degradation. Temporal modeling with LSTM networks captures motion dynamics across sequential frames.
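Since the IMU runs at a higher rate than the camera, a fusion pipeline of this kind typically has to associate a window of inertial samples with each pair of consecutive frames. The sketch below shows one simple way to do that. The function name and the timestamp values in the usage note are illustrative assumptions, not details from the paper.

```python
from bisect import bisect_left


def imu_windows(frame_times, imu_times):
    """For each consecutive frame pair, list the indices of IMU samples
    whose timestamp t satisfies t_prev <= t < t_next.

    Both inputs are assumed to be sorted and to share one clock and unit.
    """
    windows = []
    for t_prev, t_next in zip(frame_times, frame_times[1:]):
        start = bisect_left(imu_times, t_prev)  # first sample at or after t_prev
        end = bisect_left(imu_times, t_next)    # first sample at or after t_next
        windows.append(list(range(start, end)))
    return windows
```

For example, with hypothetical frame timestamps of 0, 50, and 100 ms and IMU samples every 5 ms, each frame interval receives ten IMU samples, which would form one input sequence for the IMU encoder.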

Experimental results show significant improvements in pose estimation accuracy compared to single-modality approaches, with a translation mean absolute error (MAE) of 0.089 m and a rotation MAE of 0.043 rad. The system runs in real time, making it suitable for autonomous navigation applications, while remaining computationally efficient. This work advances the state of the art in vision-based UAV navigation and demonstrates the effectiveness of deep learning for robust localization in GPS-denied environments.
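For readers unfamiliar with the metric, the following sketch shows one common way to compute translation and rotation MAE over a set of predicted relative poses. It is an assumption for illustration only: the abstract does not say whether the errors are averaged per component or per pose, and the function names are hypothetical. Angle differences are wrapped to [-pi, pi] so that predictions near the wrap-around point are not penalized unfairly.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def translation_mae(predicted, target):
    """Mean absolute per-component error between translation vectors (same unit as input)."""
    errors = [
        abs(p - t)
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    ]
    return sum(errors) / len(errors)


def rotation_mae(predicted, target):
    """Mean absolute per-component angular error in radians, with wrap-around handling."""
    errors = [
        abs(wrap_angle(p - t))
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    ]
    return sum(errors) / len(errors)
```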
