Deep Learning for Autonomous UAV Navigation: Multi-Modal Visual-Inertial Pose Estimation in GPS-Denied Environments

Izadbakhsh, Shadi; Khademi, Maryam

doi:10.5194/isprs-annals-X-4-W8-2025-389-2026

Articles | Volume X-4/W8-2025

https://doi.org/10.5194/isprs-annals-X-4-W8-2025-389-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

29 May 2026

Shadi Izadbakhsh and Maryam Khademi

Keywords: visual-inertial odometry, deep learning, UAV navigation, GPS-denied environments, sensor fusion

Abstract. This paper presents DeepVIONet, a deep learning framework for autonomous UAV navigation in GPS-denied environments based on multi-modal visual-inertial pose estimation. The system addresses the challenge of accurate localization and navigation when GPS signals are unavailable or unreliable, such as in indoor spaces, urban canyons, or adversarial conditions.

Our approach integrates synchronized camera imagery and inertial measurement unit (IMU) data through a dual-encoder architecture. The visual encoder applies a convolutional neural network followed by long short-term memory (LSTM) layers to extract spatio-temporal features from image sequences, while the IMU encoder processes accelerometer and gyroscope data with recurrent networks to capture motion dynamics. A fusion module combines these complementary modalities to predict relative pose transformations, each consisting of a 3-DOF translation and a 3-DOF rotation.
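To make the output format concrete, the sketch below shows one way a sequence of predicted relative poses could be chained into a global trajectory. It is only an illustration: the abstract does not specify the rotation parameterization, so the axis-angle (rotation vector) form, the body-frame convention for translations, and the function names `rotvec_to_matrix` and `integrate_relative_poses` are all assumptions made here.

```python
import math

def rotvec_to_matrix(rv):
    """Rodrigues' formula: axis-angle vector (radians) -> 3x3 rotation matrix."""
    rx, ry, rz = rv
    theta = math.sqrt(rx * rx + ry * ry + rz * rz)
    if theta < 1e-12:
        # Negligible rotation: return the identity.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = rx / theta, ry / theta, rz / theta
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]

def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(a, v):
    """Product of a 3x3 matrix and a length-3 vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def integrate_relative_poses(relative_poses):
    """Chain relative poses (tx, ty, tz, rx, ry, rz) into global positions.

    Assumption: each translation is expressed in the previous body frame,
    and each rotation is an axis-angle vector applied after the translation.
    """
    R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    p = [0.0, 0.0, 0.0]
    trajectory = [list(p)]
    for tx, ty, tz, rx, ry, rz in relative_poses:
        step = matvec3(R, [tx, ty, tz])          # rotate step into the world frame
        p = [p[i] + step[i] for i in range(3)]
        R = matmul3(R, rotvec_to_matrix([rx, ry, rz]))
        trajectory.append(list(p))
    return trajectory, R

# Four 1 m forward steps, each followed by a 90 degree yaw, trace a square.
square = [(1.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2)] * 4
trajectory, final_R = integrate_relative_poses(square)
```

Because every step is composed with the previous estimate, per-step errors accumulate along the trajectory, which is why the accuracy of each relative pose matters for navigation.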

The network is trained and evaluated on the TUM Visual-Inertial dataset, where it performs robustly in challenging scenarios. The multi-modal fusion strategy exploits the complementary strengths of visual and inertial sensing: cameras provide rich environmental context and feature-based localization, while the IMU supplies high-frequency motion estimates and remains reliable when visual input degrades. Temporal modeling with LSTM networks captures motion dynamics across sequential frames.
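Since the IMU runs at a higher rate than the camera, a practical preprocessing step is to group the IMU samples that fall between each pair of consecutive frames. The following is a minimal sketch of that grouping, not the authors' pipeline: the function name `imu_windows`, the integer millisecond timestamps, and the sample rates in the usage lines are illustrative assumptions.

```python
from bisect import bisect_left

def imu_windows(frame_times, imu_times, imu_samples):
    """Group IMU samples by camera frame interval.

    For each consecutive frame pair (t0, t1), collect the samples whose
    timestamp t satisfies t0 <= t < t1. Both timestamp lists must be
    sorted in ascending order and share the same clock and unit.
    """
    windows = []
    for t0, t1 in zip(frame_times, frame_times[1:]):
        lo = bisect_left(imu_times, t0)   # first sample at or after t0
        hi = bisect_left(imu_times, t1)   # first sample at or after t1
        windows.append(imu_samples[lo:hi])
    return windows

# Illustrative rates: frames every 50 ms, IMU samples every 5 ms.
frame_times = [0, 50, 100]
imu_times = list(range(0, 100, 5))
# Hypothetical samples: (ax, ay, az, gx, gy, gz) with a hovering UAV.
imu_samples = [(0.0, 0.0, 9.81, 0.0, 0.0, 0.0) for _ in imu_times]
windows = imu_windows(frame_times, imu_times, imu_samples)
```

Each window can then be passed to a recurrent IMU encoder alongside the corresponding image pair.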

Experimental results show significantly higher pose estimation accuracy than single-modality approaches, with a translation mean absolute error (MAE) of 0.089 m and a rotation MAE of 0.043 rad. The system achieves real-time performance suitable for autonomous navigation applications while remaining computationally efficient. This work advances the state of the art in vision-based UAV navigation and demonstrates the effectiveness of deep learning for robust localization in GPS-denied environments.
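For readers who want to reproduce this kind of metric, the sketch below shows one plausible way to compute translation and rotation MAE over a set of pose predictions. The exact averaging used in the paper is not stated in the abstract, so the component-wise averaging, the treatment of the rotation parameters as three angles in radians, the angle wrapping, and the name `pose_mae` are assumptions.

```python
import math

def wrap_angle(a):
    """Wrap an angle difference into the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi

def pose_mae(predicted, target):
    """Return (translation MAE in metres, rotation MAE in radians).

    Each pose is (tx, ty, tz, rx, ry, rz). Errors are averaged over all
    samples and over the three components of each part (an assumption).
    """
    n = len(predicted)
    if n == 0 or n != len(target):
        raise ValueError("predicted and target must be non-empty and equal in length")
    t_err = 0.0
    r_err = 0.0
    for p, t in zip(predicted, target):
        t_err += sum(abs(p[i] - t[i]) for i in range(3))
        # Wrap so that angles near +pi and -pi count as close together.
        r_err += sum(abs(wrap_angle(p[i] - t[i])) for i in range(3, 6))
    return t_err / (3 * n), r_err / (3 * n)

# Hypothetical values, not results from the paper.
predicted = [(0.1, 0.0, 0.0, 3.1, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0, -3.1, 0.0, 0.0)]
translation_mae, rotation_mae = pose_mae(predicted, target)
```

Reporting the two errors separately keeps the units distinct, since metres and radians cannot be meaningfully averaged together.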
