TRANSFORMER MODELS FOR MULTI-TEMPORAL LAND COVER CLASSIFICATION USING REMOTE SENSING IMAGES
Keywords: land cover classification, remote sensing, Swin Transformer, FCN, multi-temporal images
Abstract. The pixel-wise classification of land cover, i.e. the task of identifying the physical material of the Earth’s surface in an image, is one of the basic applications of satellite image time series (SITS) processing. With large amounts of SITS available, supervised deep learning techniques such as Transformer models can be used to analyse the Earth’s surface at global scale and with high spatial and temporal resolution. While most approaches for land cover classification focus on generating a mono-temporal output map, we extend established deep learning models to multi-temporal input and output: using images acquired at different epochs, we generate one output map for each input timestep. This has the advantage that temporal changes of land cover can be monitored. In addition, features that conflict over time are not averaged. We extend the Swin Transformer to SITS and introduce a new spatio-temporal transformer block (ST-TB) that extracts spatial and temporal features. We combine the ST-TB with the Swin Transformer block (STB), which is applied in parallel to the individual input timesteps to extract spatial features. Furthermore, we investigate the use of a temporal position encoding and of different patch sizes, where the patch size determines how neighbouring pixels are merged in the input embedding. Using SITS from Sentinel-2, the mean F1-score of the land cover classification improves by +1.8% when the ST-TB is used in the first stage of the Swin Transformer compared to a Swin Transformer without the ST-TB layer, and by +1.6% compared to fully convolutional approaches. This demonstrates the advantage of the introduced ST-TB layer for the classification of SITS.
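
To illustrate the role of the patch size and the temporal position encoding in the input embedding, the following is a minimal sketch in plain Python, not the implementation used in this work. It groups each patch of neighbouring pixels into one token per timestep and adds an encoding of the acquisition time to every token. The function names (`patchify`, `temporal_encoding`, `embed_series`), the sinusoidal form of the encoding, and the use of the day of year as temporal position are assumptions made for illustration only. The learned linear projection that usually follows the patch grouping is omitted.

```python
import math


def patchify(image, patch):
    """Group each patch x patch window of pixels into one token.

    image: H rows of W pixels, each pixel a list of C band values.
    Returns a grid of (H // patch) x (W // patch) tokens,
    each token holding patch * patch * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            token = []
            for dy in range(patch):
                for dx in range(patch):
                    token.extend(image[top + dy][left + dx])
            row.append(token)
        tokens.append(row)
    return tokens


def temporal_encoding(position, dim):
    """Sinusoidal encoding of a scalar temporal position.

    Assumption: the position is the acquisition day of year and the
    encoding is sinusoidal; the abstract does not specify either.
    """
    encoding = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def embed_series(series, days, patch):
    """Tokenise every timestep and add its temporal encoding to each token."""
    embedded = []
    for image, day in zip(series, days):
        grid = patchify(image, patch)
        encoding = temporal_encoding(day, len(grid[0][0]))
        embedded.append(
            [[[v + e for v, e in zip(token, encoding)] for token in row]
             for row in grid]
        )
    return embedded
```

With a patch size of 1 every pixel remains its own token, whereas larger patch sizes reduce the number of tokens per timestep at the cost of merging neighbouring pixels before any feature extraction. Because the time series keeps one token grid per timestep, one output map per input timestep can still be produced.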