Real-GDSR: Real-World Guided DSM Super-Resolution via Edge-Enhancing Residual Network

A low-resolution digital surface model (DSM) has distinctive attributes shaped by noise, sensor limitations and data acquisition conditions, which cannot be replicated by simple interpolation methods such as bicubic. As a result, super-resolution models trained on synthetic data do not perform effectively on real data. Training a model on pairs of real low- and high-resolution DSMs is also a challenge because of the lack of information. On the other hand, other imaging modalities of the same scene can be used to enrich the information needed for large-scale super-resolution. In this work, we introduce a novel methodology for real-world DSM super-resolution, named Real-GDSR, which breaks this ill-posed problem down into two steps. The first step uses a residual local refinement network. This approach departs from conventional methods, which are trained to predict height values directly rather than the differences (residuals) and which rely on large receptive fields in their networks. The second step introduces a diffusion-based technique that enhances the results on a global scale, with a primary focus on smoothing and edge preservation. We conduct a comprehensive evaluation, comparing our method to recent state-of-the-art techniques in the domain of real-world DSM super-resolution (SR). Our approach consistently outperforms these existing methods, as evidenced by qualitative and quantitative assessments.


INTRODUCTION
Elevation data are essential for a wide range of applications across multiple sectors. They also contribute significantly to our understanding and management of the Earth's resources and environment. These data are collected through various techniques, typically aerial data capture and light detection and ranging (LiDAR) sensors. A digital elevation model (DEM) is a digital representation of the Earth's topography, specifically focusing on the bare ground surface. DEMs play a crucial role in applications like terrain analysis, hydrological modeling, geological studies, precision agriculture, and infrastructure planning. A digital surface model (DSM), on the other hand, is a representation of the surface that includes all objects on it, such as vegetation and buildings. DSMs are widely applied in topographic mapping, environmental simulations (Aktaruzzaman and Schmitt, 2009), 3D city modeling and planning. Recent remote sensing technology provides several ways to measure the 3D urban morphology. Conventional ground surveying, stereo airborne or satellite photogrammetry, interferometric synthetic aperture radar (InSAR), and LiDAR are the main data sources used to obtain high-resolution DSMs. However, each of these sources has its own set of advantages and disadvantages.
Data derived from terrestrial and airborne systems offer high spatial resolution but have limited coverage and can encounter precision-related issues. Spaceborne missions, for example Cartosat-1, provide global coverage but may lack the finest level of resolution achieved by terrestrial and airborne methods due to challenges in capturing high-resolution data from space. The resolution of such data has a substantial impact in different fields of operation. Improving the precision of measurement equipment is the most straightforward way to acquire high-resolution elevation data, but it is a difficult, costly, and time-consuming procedure. Therefore, generating high-resolution data without extra cost has become a key concern for researchers from various fields. A practical way to solve this problem is to enhance the resolution of easily obtained low-resolution data (Xu et al., 2015a).
Figure 1: Guided DSM super-resolution: given a low-resolution DSM and a high-resolution guide image, our method predicts a high-resolution DSM. The figure shows an example output of the proposed method on a low-resolution DSM with a factor of 10.
In computer vision, various algorithms have been developed for the single image super-resolution task. Some of these algorithms have been explored for DEM super-resolution. Depth estimation (Godard et al., 2017, 2019; Bhat et al., 2021) and depth super-resolution (Voynov et al., 2018; He et al., 2021) are two intriguing tasks whose methods can be applied to DEMs and DSMs. Recently, a method termed guided depth map super-resolution (GDSR) has been gaining popularity. The idea is that a different imaging modality of the same object can be used as a guide for super-resolving the low-resolution image by injecting the missing high-frequency content. Research into GDSR has a long history (Patterson et al., 1992; Izraelevitz, 1994). The proposed solutions range from classical, entirely hand-crafted schemes (Ham et al., 2017) to fully learning-based methods (Hui et al., 2016a), while some recent works have combined the two with promising results (Lutio et al., 2022; Metzger et al., 2023). However, both single image super-resolution (SISR) and GDSR methods generally restrict themselves to super-resolving images downsampled by simple and uniform degradations (i.e., bicubic downsampling). Real low-resolution DSMs, on the other hand, do not preserve as much information (see Fig. 2), and applying these methods to real-world DSMs remains a challenge (Cai et al., 2019; Wang et al., 2021a). In this paper, we focus on super-resolving DSMs guided by their corresponding optical images, as depicted in Fig. 1. We focus our research on urban DSMs because they contain richer information that is hard to restore. This focus also aligns with the evolving interest in leveraging DSMs for building reconstruction (Bittner et al., 2018; Partovi et al., 2019; Bittner et al., 2020; Wang et al., 2021b; Gui and Qin, 2021; Stucker and Schindler, 2022; Stucker et al., 2022).
In summary, our contributions are as follows:
1. We propose a novel super-resolution framework guided by optical images on real-world low-resolution DSMs. To the best of our knowledge, we are the first to develop a guided super-resolution network trained on real-world DSMs.
2. We achieve this by utilizing both local and global context. We improve on conventional methods in two ways: local refinement and edge-enhancing diffusion. A local refinement network with a small receptive field is used to improve the low-resolution DSM. After a coarse bicubic-interpolation stage, this shallow network can repair some missing regions and structures based on the surrounding local regions. An edge-enhancing diffusion network is then used to further smooth the super-resolved result. This network improves the visual quality using global edge information, especially by removing outliers and preserving height discontinuities.
3. We demonstrate the effectiveness of our approach by comparing it to other state-of-the-art networks trained in the same setting. Furthermore, we show that our approach is capable of achieving 10x super-resolution for low-resolution DSMs from satellite data. We also find that our approach outperforms the other works in terms of both qualitative and quantitative results on unseen data.

Single Image Super-Resolution
Single image super-resolution refers to the generation of a high-resolution image from a low-resolution image. Recently, SISR methods have shifted towards example-based approaches. In earlier years, however, interpolation-based techniques such as linear, bicubic or Lanczos were widely used. The idea is that new pixels are estimated by interpolating the given pixels, but these methods suffer from blurry results in high-frequency regions. Learning-based or example-based methods aim to learn from paired low- and high-resolution images in order to recover the details missing in low-resolution images. SRCNN (Dong et al., 2015) is one of the first approaches to demonstrate the use of neural networks to learn the nonlinear mapping between images in the image space.
The method applies bicubic interpolation to the low-resolution image, maps the result to a high-dimensional vector representation, and finally reconstructs the vectors back into pixel space. VDSR (Kim et al., 2016a) was designed to increase the efficiency of SRCNN by predicting the residuals rather than the actual pixel values, and to boost the overall performance by adding more layers. Wang et al. (2015) introduced sparse coding to the training, which enables the model to enlarge images to the desired scale factor progressively. Deep recursive layers were introduced in DRCN (Kim et al., 2016b) to reduce the number of parameters.
Furthermore, the introduction of generative adversarial networks (GANs) inspired their use for super-resolution.
In a GAN, the generator produces a high-resolution image and the discriminator is trained to distinguish between patches of the original image and patches produced by the generator. The two components are designed to defeat each other in a zero-sum game (Goodfellow et al., 2014). SRGAN (Ledig et al., 2017) was the first to place more emphasis on visual quality, introducing adversarial and perceptual losses. ESRGAN (Wang et al., 2019) goes a step further by introducing a relativistic discriminator and removing batch normalization layers.
Unlike images, however, depth maps contain piecewise affine regions that have sharp depth discontinuities and no textures (Riegler et al., 2016b). Besides, they are sensitive to artifacts (Xie et al., 2015). In this paper, we focus on the training strategy and framework of super-resolution networks when dealing with real low-resolution depth maps.

Guided Depth Map Super-Resolution
GDSR has become an essential topic in multi-modal image processing and super-resolution. The idea behind it is that there are statistical co-occurrences between the texture edges of RGB images and the discontinuities of depth maps (Xie et al., 2015). Hence, information in RGB images can be utilized to restore low-resolution depth maps. Three categories of GDSR methods have been identified: learning-free, learning-based and hybrid approaches. Early work on GDSR consisted mostly of filter-based and optimization methods, which require no training. Filter-based methods focus on preserving sharp depth edges under the guidance of the intensity image. Deep learning-based methods have also been derived from advances in the SISR domain. D-SRCNN (Chen et al., 2016) was proposed based on SRCNN (Dong et al., 2015). Later, the same author implemented EDSR (Lim et al., 2017) for the same purpose (Xu et al., 2019). So far, none of these methods has specifically targeted urban DSMs. Because of the characteristics of such data, performing large-scale super-resolution on them is still a challenge. First, the models mentioned so far have been trained only on data generated by bicubic interpolation, and they work well only on clean low-resolution data with simple degradations. This is inconsistent with real-world needs, where low-resolution data have more complex degradations. Second, urban DSMs contain more details, which are even harder to reconstruct. To address these issues, we propose a practical solution in which we collect pairs of real low- and high-resolution DSMs and include guiding information in both training and inference. Optical images of the same scenes are obtained to act as guidance. In addition, we utilize both local and global context in our approach to handle the characteristics of DSMs. In the following, we explain in more detail how we achieve this.

METHODOLOGY
Real-GDSR consists mainly of two steps. As a pre-step, a pre-trained model is used to extract features from both the optical image and the low-resolution DSM. First, we use a residual convolutional neural network (CNN), namely the local refinement network, whose objective is to take the high-dimensional features as input and transform them into a high-resolution DSM. Second, we use an anisotropic diffusion network to further enhance the refined DSM, focusing on the edge features from the high-resolution optical image. Figure 3 shows an overview of our proposed pipeline. In the following subsections, we describe each component of our pipeline in more detail.

Local Refinement
In prior work on DEM super-resolution, we observe that many methods follow a common design concept for image super-resolution, in which the networks are built from upsampling layers and have a very large receptive field, for example a U-Net-like architecture or multiple dilated convolutions or attention layers. In this work, taking inspiration from the image inpainting task, we treat super-resolution as an image-to-image translation task: we assume that an initial coarse high-resolution DSM of the observed scene has already been generated with an existing traditional method such as bicubic interpolation, and that it is then refined during training. We highlight that for DSMs a network with a small receptive field is enough to fulfill the task. For the local refinement, we create a shallow network with four residual blocks and two upsampling procedures (see middle part of Fig. 3). As a result, this network has a small receptive field. This architecture eliminates the influence of distant and unsuccessfully filled content and allows missing sections or buildings to be restored using the local information around them. Additionally, we incorporate a long skip connection that adds the initial DSM straight to the final downsampling layer's output, allowing the network to regress residuals rather than absolute heights.
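To make the receptive-field argument concrete, the following sketch computes the theoretical receptive field of a stack of convolution and pooling layers. The paper does not state kernel sizes or strides, so both layer configurations below (two 3x3 stride-1 convolutions per residual block, and a four-stage U-Net-like encoder) are purely hypothetical and serve only to illustrate the difference in scale; upsampling layers are ignored.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of a stack of layers.

    `layers` is a list of (kernel_size, stride) tuples ordered from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical shallow refiner: 4 residual blocks, each with two 3x3 stride-1 convs.
shallow_refiner = [(3, 1)] * 8

# Hypothetical U-Net-like encoder: 4 stages of two 3x3 convs followed by 2x2 pooling.
unet_like_encoder = [(3, 1), (3, 1), (2, 2)] * 4

print(receptive_field(shallow_refiner))    # 17
print(receptive_field(unet_like_encoder))  # 76
```

Under these assumptions, each output pixel of the shallow stack depends only on a 17 x 17 neighbourhood, which matches the intent of restoring structures from nearby context alone.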

Edge-Enhancing Diffusion
After the local refinement process, structural features are restored with the guidance of the surrounding local regions. Unlike images, however, DSMs are sensitive to outliers. For this reason, we introduce a diffusion-based edge enhancement, which helps broaden the scope of the information captured. Anisotropic diffusion can be understood as an adaptive filtering technique aimed at smoothing while preserving inter-region content such as edges or boundaries, which are crucial for image interpretation. This is achieved by applying inhomogeneous diffusion, where the diffusivity is guided by a scalar function or diffusion tensor derived from the gradients of the evolving image. The concept of calculating diffusivity from a guide image has been investigated in the areas of edge enhancement and semantic segmentation.
Inspired by prior work in guided depth super-resolution (Metzger et al., 2023), we implement a similar network, where the diffusion component mirrors traditional optimization approaches solved via an iterative diffusion loop. In each iteration, multiple steps of anisotropic diffusion are conducted. Here, the diffusion weights are influenced by the guide so as to minimize diffusion across high-contrast boundaries and enhance diffusion within homogeneous regions. The diffusion weights are set by passing the guide through a convolutional feature extractor. In this way, the process can transfer edge information from the guide to preserve depth discontinuities in the target image.
Given a source image $S$ and a guide $G \in \mathbb{R}^{H \times W \times C}$, where $C = 3$ for RGB images or a larger number for deep features, the first step is to initialize $X_0 \in \mathbb{R}^{H \times W}$ with an upsampled version of $S$. The diffusion step is defined as
$$x^p_{t+1} = x^p_t + \lambda \sum_{n \in N_4(p)} c(g^p, g^n)\,(x^n_t - x^p_t),$$
where $x^p_t$ denotes the pixel value of $X_t$ at location $p$ (and similarly $g^p$ for $G$), $N_4(p)$ denotes the four-neighbours of pixel $p$, and $\lambda$ controls the rate of diffusion. For four-neighbours, $\lambda$ should be set to $< 0.25$ to ensure numerical stability. Diffusion coefficients for the neighbouring pairs of pixels are calculated by the function $c$, which is formulated based on their values in the guide. A higher diffusion coefficient means that information spreads more freely across neighbouring pixels, resulting in stronger smoothing effects. Conversely, a lower diffusion coefficient restricts the spread of information, preserving edges and fine details in the image. This method follows prior work (Perona and Malik, 1990), which defines
$$c(g^p, g^n) = \frac{1}{1 + \left(\lVert g^p - g^n \rVert / K\right)^2},$$
where $K$ controls the sensitivity to the gradients in $G$.
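As a rough illustration of the diffusion step described above, the following NumPy sketch runs guided anisotropic diffusion on a single-channel image. It assumes the rational Perona-Malik coefficient c = 1 / (1 + (||g_p - g_n|| / K)^2) and zero-flux image borders. The function names and the toy values of K and the step count are illustrative and not taken from the paper, and the raw guide stands in for the learned deep features.

```python
import numpy as np


def diffusion_coefficients(guide, K):
    """Coefficients c(g_p, g_n) for vertical and horizontal neighbour pairs.

    guide: array of shape (H, W, C). Returns arrays of shape (H-1, W) and (H, W-1).
    The rational Perona-Malik form used here is an assumption.
    """
    dv = np.linalg.norm(guide[1:, :] - guide[:-1, :], axis=-1)
    dh = np.linalg.norm(guide[:, 1:] - guide[:, :-1], axis=-1)
    return 1.0 / (1.0 + (dv / K) ** 2), 1.0 / (1.0 + (dh / K) ** 2)


def guided_diffusion(x0, guide, K=0.01, lam=0.24, steps=200):
    """Run `steps` explicit diffusion updates on x0 (H, W), guided by `guide` (H, W, C)."""
    x = np.asarray(x0, dtype=np.float64).copy()
    cv, ch = diffusion_coefficients(np.asarray(guide, dtype=np.float64), K)
    for _ in range(steps):
        flow_v = cv * (x[1:, :] - x[:-1, :])  # weighted difference between rows i+1 and i
        flow_h = ch * (x[:, 1:] - x[:, :-1])  # weighted difference between columns j+1 and j
        update = np.zeros_like(x)
        update[:-1, :] += flow_v  # contribution of the lower neighbour
        update[1:, :] -= flow_v   # contribution of the upper neighbour
        update[:, :-1] += flow_h  # contribution of the right neighbour
        update[:, 1:] -= flow_h   # contribution of the left neighbour
        x += lam * update         # lam < 0.25 keeps the four-neighbour scheme stable
    return x


# Toy example: a noisy 10 m height step whose edge is visible in the guide.
rng = np.random.default_rng(0)
guide = np.zeros((8, 8, 1))
guide[:, 4:] = 1.0
noisy = 10.0 * guide[..., 0] + rng.normal(0.0, 0.5, size=(8, 8))
smooth = guided_diffusion(noisy, guide)
```

In this toy case the noise inside each half is smoothed away, while the height step survives because the coefficient across the guide edge is close to zero.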
We adapt the work of Metzger et al. (2023) by removing the adjustment step.The adjustment step was utilized by the authors to constrain the output of the diffusion to always match the source image when downsampled.This is done to preserve adherence between the input and output.However, in our approach, we opt to forego the adjustment step.This decision is motivated by the recognition that low-resolution DSMs often contain minimal information, and their use in the adjustment step may hinder the diffusion process.
Our proposed network is trained in an end-to-end manner, and the final training loss is the sum of the losses of the two sub-networks.

Datasets and Implementation
Imagery and Study Area
In this study, we generate a dataset based on real-world DSMs. We evaluate our method on DSMs acquired over two main cantons of Switzerland: Zurich and Bern. As input, we use low-resolution DSMs with a ground sampling distance (GSD) of 5 m generated from Cartosat-1 stereoscopic satellite imagery instead of conventional bicubic-downsampled data. We take high-resolution DSMs and their corresponding RGB orthoimages (GSD = 0.5 m and 0.1 m, respectively) provided by the Federal Office of Topography on the Swisstopo portal. We downsample the orthoimages to match the resolution of the DSMs using bicubic interpolation. The dataset consists of 2200 patches of size (256, 256). We use 2000 samples for training and 200 for testing. The study area includes widely spaced, detached residential buildings, allotments, and high commercial buildings. We select these areas by using the Esri global land cover map (Karra et al., 2021) to filter the dataset, focusing specifically on the Built Area class.

Implementation Details
We randomly load training patches during training. At inference time, we reconstruct large-scale scenes by applying the learned model in a sliding window. We follow best practices to normalize the data. Every DSM is normalized to zero mean by subtracting its local mean and is scaled by the global standard deviation computed from all training samples. Optical images are normalized with the mean and standard deviation of the intensity values of all training pixels. In all experiments, we use a hidden feature dimension of 64 for the feature extractor and the refinement decoder. A ResNet-50 (He et al., 2016) backbone pretrained on ImageNet (Deng et al., 2009) is used as the feature extractor. Before extracting the features, low-resolution DSMs are upsampled to the resolution of the ground-truth DSMs using bicubic interpolation and concatenated with the guide. We train all methods, including our own, with the L1 loss. For the diffusion network, we adopt the same setup and strategy outlined in Metzger et al. (2023), with K and λ set to 0.001 and 0.24, respectively. The numbers of diffusion steps with and without gradients in the training phase are set to 8000 and 1024, respectively. Additionally, a perceptual loss is added in our refinement network and in D-SRGAN. We employ the Adam optimizer with a base learning rate of 5 × 10⁻⁵ and no weight decay. We set the batch size to 2 for training and 1 for testing. We stop training once the RMSE on the test set has converged. We implemented our model in PyTorch and ran it on an NVIDIA Titan RTX GPU.
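The DSM normalization described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the helper names are hypothetical, and the global standard deviation is assumed to be computed over mean-centred heights of all training patches, which the text does not specify.

```python
import numpy as np


def global_std(train_dsms):
    """Standard deviation of mean-centred heights over all training patches (assumed definition)."""
    centred = np.concatenate([(d - d.mean()).ravel() for d in train_dsms])
    return float(np.sqrt(np.mean(centred ** 2)))


def normalize_dsm(dsm, std):
    """Shift a DSM patch to zero mean and scale it by the global standard deviation."""
    local_mean = float(dsm.mean())
    return (dsm - local_mean) / std, local_mean


def denormalize_dsm(normalized, local_mean, std):
    """Map a normalized prediction back to metric heights."""
    return normalized * std + local_mean


# Toy patches at very different absolute elevations.
patches = [np.array([[400.0, 402.0], [404.0, 406.0]]),
           np.array([[10.0, 10.0], [14.0, 14.0]])]
std = global_std(patches)
normalized, local_mean = normalize_dsm(patches[0], std)
restored = denormalize_dsm(normalized, local_mean, std)
```

Subtracting the per-patch mean removes the absolute terrain elevation, so the network only has to model relative heights, while the shared scale keeps height differences comparable across patches.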

Baselines
We compare Real-GDSR against the following baselines:
• Bicubic: The low-resolution DSMs upsampled using bicubic interpolation.
• DADA: The Deep Anisotropic Diffusion-Adjustment network is a hybrid framework for guided super-resolution that combines deep feature learning and anisotropic diffusion. The approach obtains its edge-enhancing properties from the diffusion, boosted by the contextual reasoning capabilities of large pre-trained models, while a strict adjustment step guarantees perfect adherence to the source image. Our diffusion network is similar to this approach but omits the adjustment step.
• D-SRGAN: DEM Super-Resolution with Generative Adversarial Networks implements ESRGAN for single DSM super-resolution. To compare it with our models, we modify the input of the model by concatenating the low-resolution DSM with features extracted from the optical images using the same feature extractor as our model. We make no changes to the GAN.

Evaluation Metrics
We evaluate the models' performance by examining the root mean square error (RMSE), the normalized median absolute deviation (NMAD), and the median absolute error (MedAE), which are all derived from per-pixel differences between prediction and ground truth.
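For reference, the three metrics can be computed from the per-pixel errors as in the sketch below. The paper does not spell out the formulas, so the NMAD is assumed here to follow the common definition 1.4826 · median(|e − median(e)|); the function names are illustrative.

```python
import numpy as np


def rmse(pred, gt):
    """Root mean square error of the per-pixel height differences."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))


def medae(pred, gt):
    """Median absolute error."""
    return float(np.median(np.abs(pred - gt)))


def nmad(pred, gt):
    """Normalized median absolute deviation (assumed standard definition)."""
    err = pred - gt
    return float(1.4826 * np.median(np.abs(err - np.median(err))))


# Toy example with a single 100 m outlier among otherwise small errors.
gt = np.zeros(5)
pred = np.array([0.0, 1.0, 2.0, 3.0, 100.0])
print(rmse(pred, gt), medae(pred, gt), nmad(pred, gt))
```

In the toy example the outlier dominates the RMSE, while MedAE and NMAD stay near the typical error, which is why the median-based metrics complement RMSE when assessing DSMs with outliers.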

Comparisons with Prior Works
We start by assessing the performance of Real-GDSR, so as to quantify the impact of our framework. reduced to 2.6 m and 1.5 m. Besides the quantitative improvement, we can see visually that the 3D geometry is clearly recovered. Buildings have sharp lines, and there are fewer visible artifacts and bumps on the terrain. The most notable finding is the recovery of detailed building structures, such as the cluster of buildings in Fig. 4, 4th row, 1st column. Furthermore, Real-GDSR reconstructs realistic scenes even in the presence of previously unseen building shapes and arrangements of buildings. The reconstruction of the height and arrangement of buildings in urban areas is more accurate, whereas most of the baselines fail (see Fig. 5). In such cases, the model predicts the height using information extracted from the optical image.

Refinement and Diffusion Networks
We evaluate the contribution of each component of Real-GDSR. Our refinement network outperforms the other baselines in NMAD and MedAE, with 1.8 m and 1.4 m, respectively (Table 1, 1st row).
Our diffusion network performs better than DADA because of the removal of the adjustment step. Adding the diffusion component lowers the RMSE but increases both NMAD and MedAE (Table 1, 3rd row). This indicates that the diffusion network, while effectively regularizing the model's structure and filtering out outliers, may inadvertently remove some of the intricate details captured by the refinement network.

CONCLUSION
We have presented Real-GDSR, a practical yet effective approach to guided super-resolution of DSMs. The approach solves the super-resolution problem in two steps: local refinement and edge-enhancing diffusion. We highlight that the combination of a CNN-based network and a diffusion process brings the best of both worlds. Moreover, the local refinement network follows a residual learning strategy, i.e., it is trained to refine an imperfect bicubic-upsampled digital surface model (DSM) by predicting corrections to the height, using both the DSMs and optical images as input. Together with the diffusion step, it can leverage information from optical images not only for the local distribution of the heights but also for preserving edges globally.
Our approach learns to restore substantial geometry such as sharp building lines and smooth height discontinuities. It also successfully restores shape details missing in the low-resolution DSM using information from the optical images.
In our experiments, Real-GDSR reaches top performance and outperforms state-of-the-art networks, including GANs and hybrid methods. We also found that Real-GDSR generalizes fairly robustly, in that it can generate accurate high-resolution DSMs of unseen test samples. However, we acknowledge the trade-off between bias and variance in our approach and believe that more diverse training data should be gathered for a more extensive and representative analysis of different sections of cities, so that our approach can be applied across various regions. On a conceptual level, we hope that our work motivates further research on DSM super-resolution.

Figure 2: Examples of real low-resolution DSMs compared to the bicubic-downsampled and high-resolution DSMs. Note that real low-resolution DSMs preserve less information than their bicubic-downsampled counterparts.

Figure 3: Summary of the proposed architecture. Real-GDSR comprises mainly a two-step process: Initially, high-dimensional features are extracted from both the bicubic-upsampled low-resolution DSM and the high-resolution optical image by a pre-trained model. Subsequently, a local refinement network refines the upsampled DSM by incorporating residual blocks and upsampling operations, followed by a diffusion network, which iteratively enhances the refined upsampled DSM, emphasizing edge features from the high-resolution optical image.

In filter-based methods, for example, RGB images guide the acquisition of bilateral weights (Kopf et al., 2007). Optimization-based approaches use diverse data priors to construct energy functions, with data-fidelity regularization limiting the solution space. Several other methods (Xie et al., 2015; Diebel and Thrun, 2005) employ random field models. In addition, Liu and Gong (2013) demonstrated an early learning-free application of anisotropic diffusion. Initial learning-based techniques, such as bimodal co-sparse analysis (Kiechle et al., 2013) and joint dictionary learning (Tosic and Drewes, 2014), learn the relationship between RGB and depth information. Deep learning models use neural networks to learn the mapping from low-resolution to high-resolution images. MSG-Net (Hui et al., 2016b) builds on a U-Net architecture and learns the residual errors of bicubic interpolation by embedding the source at the smallest scale. Kim et al. (2021) propose deformable kernel networks (DKN) and their fast implementation (FDKN), which learn sparse and spatially-variant filter kernels. He et al. (2021) employ a high-frequency guidance module to embed the guide details into the depth map. Recently, researchers have used deep learning within formal frameworks to improve the prediction of outputs for inputs that have not been encountered before. Riegler et al. (2016a) train a neural network by unrolling the optimization phases of a first-order primal-dual algorithm, enabling them to train their deep feature extractor throughout the training process. Using a graph-based, MRF-style optimizer, Lutio et al. (2022) apply the implicit function theorem. DADA (Metzger et al., 2023) achieved state-of-the-art performance by adapting the concept of guided anisotropic diffusion with deep convolutional networks.

DSM Super-Resolution
Research on DSM super-resolution is limited, despite its significance in remote sensing. The majority of research focuses on DEM super-resolution. This includes interpolation-based methods such as bicubic and bilinear, which are used for DEM enhancement and result in smooth terrain models. Xu et al. (2015b) proposed a super-resolution algorithm based on non-local means. It operates by using a predetermined equation to search for similar patches across the training set. Weights determined in the search phase are then used to upscale the resolution of the target DEM. Demiray et al. proposed a DEM super-resolution model, namely D-SRGAN (Demiray et al., 2021a), based on SRGAN (Ledig et al., 2017), and later applied EfficientNetV2 (Tan and Le, 2021) to DEM SR (Demiray et al., 2021b).

Figure 4: Visual comparison of Real-GDSR with selected baselines. Our approach demonstrates its accuracy while producing regularized and smooth DSMs. All examples are taken from the test set.

Figure 5: Line profile analysis of Real-GDSR and other baselines.