Classification of Satellite Image Time Series and Aerial Images Based on Multiscale Fusion and Multilevel Supervision
Keywords: Land Cover Classification, Transformers, FCN, Multi-stage Supervision, Multiscale Data Fusion
Abstract. A large variety of sensors can be used for monitoring processes on the Earth’s surface. Different sensors can capture complementary information about the same observed region. For instance, aerial images offer a high spatial resolution but a low temporal resolution, whereas satellite image time series (SITS) capture temporal variations, e.g. seasonal changes, at a high repetition rate but with limited spatial resolution. This paper presents a method that jointly exploits the strengths of SITS and aerial images for land cover classification. In this context, it is a challenge to train a classifier given the large difference in resolutions. We utilise convolutions to extract spatial information and apply self-attention in the temporal dimension of the SITS. Additionally, a multi-resolution supervision strategy is proposed, applying auxiliary losses at different stages of the SITS decoder to enhance feature learning. Features extracted from the SITS are fused via a cross-attention module with features determined at the same spatial resolution from aerial images by a SegFormer network, before land cover is predicted at the geometrical resolution of the aerial image. We perform comparative experiments on an existing benchmark dataset, showing that the convolution- and attention-based fusion of a Sentinel-2 SITS with an aerial image improves the classification results by +1.9% in the mean IoU and +2% in the overall accuracy (OA) compared to a method based on aerial images only.
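The cross-attention fusion described above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the module name, the choice of aerial features as queries and SITS features as keys/values, and the residual connection with layer normalisation are assumptions made for illustration, using only the standard `nn.MultiheadAttention` layer.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch of cross-attention fusion between aerial and SITS
    feature maps that share the same spatial resolution and channel depth."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, aerial_feat: torch.Tensor, sits_feat: torch.Tensor) -> torch.Tensor:
        # aerial_feat, sits_feat: (B, C, H, W), upsampled to a common resolution
        B, C, H, W = aerial_feat.shape
        # Flatten spatial dimensions into a token sequence: (B, H*W, C)
        q = aerial_feat.flatten(2).transpose(1, 2)
        kv = sits_feat.flatten(2).transpose(1, 2)
        # Aerial tokens attend to SITS tokens (queries vs. keys/values)
        fused, _ = self.attn(q, kv, kv)
        # Residual connection and normalisation, then restore the map layout
        fused = self.norm(q + fused)
        return fused.transpose(1, 2).view(B, C, H, W)


fusion = CrossAttentionFusion(dim=32, num_heads=4)
aerial = torch.randn(2, 32, 16, 16)   # features from the aerial-image branch
sits = torch.randn(2, 32, 16, 16)     # features from the SITS branch
out = fusion(aerial, sits)            # same shape as the aerial feature map
```

In a full model, `out` would be passed to a decoder head that predicts land cover at the resolution of the aerial image; the multi-resolution supervision mentioned in the abstract would add auxiliary losses on intermediate SITS decoder outputs.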