Full-scale semantic segmentation of hyperspectral imaging based on spatial-spectral joint network

Abstract: Hyperspectral images contain dozens or even hundreds of spectral bands, whose rich spectral information helps distinguish different ground objects. They are widely applied in urban planning, environmental monitoring, and other fields, and their semantic segmentation is a current research hotspot. The difficulty lies in the rich spectral information and strong inter-band correlation of hyperspectral images: traditional semantic segmentation methods cannot fully extract this information, which limits classification accuracy. This article uses an encoder-decoder structure to extract deep and shallow image features simultaneously, and constructs a REGCS convolution module based on the idea of group convolution to extract the spectral and spatial features of the images. We compared various classification algorithms on the Salinas Valley and MUUFL datasets. The experimental results show that, compared with other classification models, the proposed RESSU model achieves stable and excellent results in hyperspectral image classification; on the Salinas Valley dataset, every single-class accuracy exceeds 92%. In the effectiveness analysis, we compared the parameter quantities of different models to verify the efficiency of our method, with good results.


Introduction
Hyperspectral imaging (HSI) consists of tens to hundreds of spectral bands (Du et al., 2016). The rich spectral information endows hyperspectral images with strong ground-feature differentiation ability, and they are widely applied in environmental monitoring, urban planning, military reconnaissance, crop yield estimation, and other fields. In recent years, many methods have been applied to hyperspectral image classification, including threshold-based segmentation, support vector machines (SVM) (Bazi et al., 2006), random forests (RF) (Yu et al., 2019), and multinomial logistic regression (Li et al., 2010). These methods have limitations: they extract only shallow features and lack the means to extract deep feature information. More and more scholars therefore study hyperspectral image classification with deep learning, in which convolutional neural networks play an important role in image processing. The U-Net model (Ronneberger et al., 2015) has also achieved excellent results in image classification. U-Net adopts an encoder-decoder architecture. The encoding part is divided into multiple layers, and each layer applies convolution, normalization, activation, and pooling to the image before downsampling. The decoding part improves on the FCN skip-connection idea: the feature maps obtained from each downsampling layer in the encoder are merged with the corresponding upsampled feature maps in the decoder, which addresses the problems traditional semantic segmentation networks have with details and edges. This process of upsampling and fusion is repeated until a segmentation image with the same size as the input image is obtained.
The U-Net network's ability to capture image features quickly has made it widely used and improved for image segmentation. However, for hyperspectral image classification, the U-Net model still needs better information processing for the tens or even hundreds of bands. Meanwhile, there is no clear and effective method for combining the deep and shallow semantics of hyperspectral images, resulting in unsatisfactory classification results in practical applications. This article absorbs the ideas of full-scale skip connections and full-scale deep supervision from earlier work and uses a U-shaped network structure to construct a model suitable for classifying hyperspectral images, RESSU. Its advantages are:
(1) Full-scale skip connections effectively fuse the semantic information of different layers of the image.
(2) During downsampling, residual networks first perform preliminary feature enhancement on the image and accurately extract feature information.
(3) A new convolutional module, REGCS, processes hyperspectral images in the spectral and spatial domains simultaneously.
(4) A spectral attention mechanism is closely combined with the newly designed convolutional module to improve the accuracy of feature information extraction.

RESSU
RESSU adopts a U-shaped network structure, incorporating the ideas of full-scale skip connections and deep supervision to reduce network depth. The overall network structure is shown in Figure 1.

Full-scale skip connections
Traditional U-Net models lack the ability to exploit information at full scale, making it difficult to determine the position and boundaries of each class clearly. Each decoder layer in RESSU integrates the feature maps produced by the same-level and shallower encoder layers with the feature maps of deeper decoder layers; these are the full-scale skip connections. They perform well in locating classification objects and their boundaries in the image. Res blocks are added during encoding, using residual learning to alleviate feature degradation and to expand the number of bands. In the first convolution, the number of input bands is m and the number of output bands is n. After ReLU activation, the features pass through 3×3 convolutional layers; finally, the input image is dimension-matched by a 1×1 convolution and added to the feature information of the residual branch.
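The residual block described above can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions: the exact layer arrangement and names such as `ResBlock` are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block sketch: expands m input bands to n output bands.
    A 1x1 convolution on the shortcut matches the input's channel count
    so it can be added to the 3x3-convolved features."""
    def __init__(self, m, n):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(m, n, kernel_size=3, padding=1),
            nn.BatchNorm2d(n),
            nn.ReLU(inplace=True),
            nn.Conv2d(n, n, kernel_size=3, padding=1),
            nn.BatchNorm2d(n),
        )
        self.shortcut = nn.Conv2d(m, n, kernel_size=1)  # dimension matching
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # residual addition: convolved features + dimension-matched input
        return self.relu(self.body(x) + self.shortcut(x))
```

For example, a block with m = 8 input bands and n = 32 output bands preserves spatial size while expanding the band dimension.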
Image resolution is reduced by max pooling. During decoding, feature fusion is achieved with full-scale skip connections, which combine feature maps from the encoding and decoding parts and fully utilize multi-scale features to improve the extraction accuracy and efficiency of the network. A full-scale deep supervision design is adopted so that the network learns from comprehensively aggregated feature maps. The RESSU full-scale skip connections realize connections between different parts of the encoder and decoder, as well as the inherent connections among the decoder sub-networks. Taking node X_De^3 in Figure 2 as an example, its information comes from two sources: the encoder layers shallower than it (including the same level) and the decoder layers deeper than it. N denotes the number of encoder-decoder levels. Each decoder layer in RESSU thus contains feature maps from different scales of the encoder and decoder, capturing fine-grained details and coarse-grained semantics at full scale. The formula is as follows:

X_De^i = R([ {C(D(X_En^k))}_{k=1..i-1}, C(X_En^i), {C(U(X_De^k))}_{k=i+1..N} ])

where the function C represents a convolution operation, R the REGCS module convolution, D and U the downsampling and upsampling operations, respectively, and [ ] channel-dimension concatenation fusion.
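A full-scale skip-connection node of this kind can be sketched in PyTorch as below. This is a hedged illustration: `FullScaleFusion` and its parameters are hypothetical, and bilinear resizing stands in for the paper's specific pooling and upsampling choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleFusion(nn.Module):
    """Full-scale skip connection sketch: every incoming feature map
    (shallower/same-level encoder maps and deeper decoder maps) is resized
    to the node's resolution, projected to a common channel width by a 3x3
    convolution, concatenated along channels, and fused."""
    def __init__(self, in_channels_list, width=32):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Conv2d(c, width, kernel_size=3, padding=1)
            for c in in_channels_list)
        fused = width * len(in_channels_list)
        self.fuse = nn.Sequential(
            nn.Conv2d(fused, fused, kernel_size=3, padding=1),
            nn.BatchNorm2d(fused),
            nn.ReLU(inplace=True))

    def forward(self, feats, target_hw):
        # resize each source to the decoder node's resolution, then project
        resized = [proj(F.interpolate(f, size=target_hw, mode='bilinear',
                                      align_corners=False))
                   for proj, f in zip(self.projs, feats)]
        return self.fuse(torch.cat(resized, dim=1))  # channel concatenation
```

With three sources of 16, 32, and 64 channels and width 8, the fused node outputs 24 channels at the target resolution.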
At the same time, RESSU adopts deep supervision, as shown in Figure 3: the feature map of each decoding layer is processed by the REGCS convolution module for band fusion. We perform a 3×3 convolution on the generated feature maps, and bilinear upsampling restores the resolution of each feature map to that of the input image. After sigmoid processing, the result enters the loss function, and the resulting loss value is backpropagated to optimize the model parameters.
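The deep-supervision scheme can be sketched as follows. This is a simplified PyTorch illustration with hypothetical names; it folds the final activation into softmax cross-entropy per head rather than the sigmoid-plus-loss arrangement described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervision(nn.Module):
    """Deep-supervision sketch: every decoder level gets a 3x3 prediction
    head; its output is bilinearly upsampled to the label resolution and
    contributes a cross-entropy term to the total loss."""
    def __init__(self, decoder_channels, num_classes):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Conv2d(c, num_classes, kernel_size=3, padding=1)
            for c in decoder_channels)

    def forward(self, decoder_feats, target):  # target: (batch, H, W) class ids
        total = 0.0
        for head, feat in zip(self.heads, decoder_feats):
            logits = F.interpolate(head(feat), size=target.shape[-2:],
                                   mode='bilinear', align_corners=False)
            total = total + F.cross_entropy(logits, target)
        return total
```

Summing one loss term per decoder level gives every scale a direct gradient signal, which is the point of deep supervision.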

REGCS convolutional module
The spectral range of hyperspectral images is wide, extending from visible light to shortwave infrared and even mid-infrared, with hundreds of bands forming approximately continuous spectral curves. However, the bandwidth of a single band is relatively narrow, usually around 5–20 nm. Therefore, all bands must be considered when extracting features from the images. This article designs a convolutional module, REGCS, which adopts a twice-grouped convolution form to divide the large number of bands into different groups for analysis and feature extraction. Compared with 3D convolution (Yu et al., 2020), it reduces computational complexity while still achieving ideal results.
The specific process of the REGCS module is shown in Figure 4.
As shown in the figure, the REGCS module located in the decoding part fuses deep and shallow image feature information and adds spectral and spatial attention mechanisms to increase the weight of important features, thereby obtaining new feature maps. First, the encoded hyperspectral image of size h × w × c (c is the number of bands) is input into the REGCS convolution module and divided into k groups according to the number of bands. Different datasets have different optimal k values, which are determined experimentally. Each group contains four different feature-information levels, so r = 4. Each group Sp contains m bands, representing the number of bands at the same feature level, so m × k equals the total number of bands c at a single level. Within each group, the deep and shallow feature information of the same bands is convolved separately and cascaded to obtain a complete feature description of the image for that band group. Spectral attention is then computed on the bands within each group to extract the spectral weight information of similar images, and spatial attention is computed on all grouped bands after concatenation to obtain the spatial weight information. Finally, the weight parameters containing all this information are multiplied with the input image to obtain the final feature map. The spectral and spatial attention mechanisms are introduced in the next section.
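The core grouped-convolution idea of REGCS can be illustrated with a minimal PyTorch sketch. `GroupedBandConv` is a hypothetical simplification that keeps only the twice-grouped convolution and omits the cascading and attention steps.

```python
import torch
import torch.nn as nn

class GroupedBandConv(nn.Module):
    """Twice-grouped band convolution sketch: the c input bands are split
    into k groups and each group is convolved independently, which needs
    roughly 1/k of the weights of an equivalent dense convolution."""
    def __init__(self, c, k):
        super().__init__()
        assert c % k == 0, "band count must be divisible by the group number"
        self.conv1 = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=k)
        self.conv2 = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=k)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, c, h, w)
        return self.relu(self.conv2(self.relu(self.conv1(x))))
```

For example, with c = 24 and k = 3, each 3×3 layer holds 24·(24/3)·9 = 1 728 weights instead of the 5 184 of a dense layer, which is the complexity saving claimed over full convolution.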

Spectral and spatial attention mechanisms
The attention mechanism enables the model to focus on the information most critical to the current task among numerous inputs, reduces attention to other information, alleviates information overload, and improves the efficiency and accuracy of task processing (Woo et al., 2018) as well as classification accuracy. This article proposes a new spectral attention mechanism (SA) suitable for hyperspectral image classification and uses the existing spatial attention mechanism (SPA) (Fang et al., 2021), linking both closely with the REGCS convolution module. This allocates computing resources to the more important parts of feature extraction on hyperspectral image datasets.
First, spectral attention is computed on the feature maps; the resulting new feature maps are then input into the spatial attention model. The overall calculation is:

F' = M_c(F) ⊗ F,  F'' = M_s(F') ⊗ F'

where F is the input feature map, M_c and M_s are the spectral and spatial attention weights, and ⊗ denotes element-wise multiplication. Spectral attention is calculated as:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))

where σ represents the sigmoid activation function. The spatial attention submodule focuses on the spatial position relationship of adjacent pixels in hyperspectral images, generating different weights related to spatial features, as shown in Figure 6. A channel-based global max pooling operation and a global average pooling operation are performed on the feature map, and the two results are concatenated along the channel dimension. A convolution operation then compresses the result to one channel. Afterwards, the weight M_s(F') (where s indicates attention processing on the spatial dimension of the hyperspectral image) is redistributed through convolution operations and activation functions. Spatial attention is calculated as:

M_s(F') = σ(f([AvgPool(F'); MaxPool(F')]))

where σ indicates the sigmoid activation function and f a convolution. Backpropagation is used to optimize the model parameters and improve the accuracy of hyperspectral image classification.
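The two attention submodules follow the CBAM-style pattern cited above (Woo et al., 2018). The following is a minimal PyTorch sketch; the class names and the reduction ratio are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Spectral (channel) attention sketch: global average- and max-pooled
    band descriptors pass through a shared two-layer MLP; their sum is
    squashed by a sigmoid and used to reweight the bands."""
    def __init__(self, c, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(c, c // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, kernel_size=1),
        )

    def forward(self, x):  # x: (batch, c, h, w)
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial attention sketch: channel-wise average and max maps are
    concatenated and compressed to a one-channel weight map by convolution."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))
```

Applying `SpectralAttention` first and `SpatialAttention` second reproduces the F → F' → F'' order of the formulas above.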

Dataset introduction
The experiment used two hyperspectral datasets to test the classification performance of our model; the specific information is shown in Table 1.

Experimental environment
To improve the generalization ability of classification, each dataset is randomly divided proportionally, and the partitioned data is randomly selected as the network input. The epoch count is set to 100, the model uses the Adam optimizer, and the initial learning rate is set to 0.001. To ensure experimental fairness, no method preprocesses the training set, and all methods are implemented on the same graphics workstation. The detailed configuration parameters are shown in Table 2.

The impact of the number of groups on the network
Compared to conventional convolution, group convolution can improve the learning efficiency and performance of the model, which affects the classification performance of the RESSU network. This article therefore conducted experiments on the impact of the number of groups on RESSU's classification accuracy on the two datasets, using the optimal input size found in section 2.3. In addition, when setting the network parameters, the number of channels in a group convolution must be divisible by the number of groups. Tables 3-4 therefore report the impact of the group number on model performance using g = [2, 3, 4, 6, 12] and g = [2, 4, 8, 16, 32], respectively.
According to the experimental data in Tables 3 and 4, the number of groups in the network's convolutional layers was set to 4 for the MUUFL classification experiments and to 3 for the Salinas Valley classification experiments.

Comparative experimental analysis
To verify the classification performance of the proposed method, RESSU was compared with various hyperspectral image classification methods, including 3D-Unet (Çiçek et al., 2016), pResNet (Paoletti et al., 2019), VIT (Dong et al., 2022), SSGCN (Qin et al., 2017), WFCG (Dong et al., 2022), and CBAM-Res-HybridSN (Yang et al., 2023). Accuracy was compared using indicators commonly used in hyperspectral analysis: overall accuracy (OA), average accuracy (AA), the Kappa coefficient, and per-class accuracy, i.e., the number of correctly judged samples in a class divided by the number of samples in that class.
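These indicators can all be computed from a confusion matrix. A small NumPy sketch (the function name is hypothetical, and it assumes every class appears in the ground truth):

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Compute OA, AA, and the Kappa coefficient from label vectors.
    Per-class accuracy = correctly judged samples / samples of that class."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)           # confusion matrix
    per_class = np.diag(cm) / cm.sum(axis=1)     # accuracy of each class
    oa = np.diag(cm).sum() / cm.sum()            # overall accuracy
    aa = per_class.mean()                        # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2
    kappa = (oa - pe) / (1 - pe)                 # chance-corrected agreement
    return oa, aa, kappa
```

Kappa corrects OA for the agreement expected by chance (pe), so it penalizes classifiers that merely track class frequencies.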
The test results on the Salinas Valley dataset are shown in Figure 8 and Table 5. The Salinas dataset itself has a relatively balanced sample distribution, and there are significant differences in spectral and spatial information between categories, which is beneficial for classification. Most methods, such as WFCG, pResNet, SSGCN, SSRN, and CBAM-Res-HybridSN, achieved good results on this dataset. Compared with the other algorithms, our method achieved the highest OA, AA, and Kappa during testing. Compared with 3D-Unet, pResNet, VIT, SSGCN, WFCG, SSRN, MS3D-CNN-A, and CBAM-Res-HybridSN, the OA of RESSU on the Salinas Valley dataset was higher by 2.66%, 3.93%, 8.91%, 4.17%, 2.51%, 1.18%, 6.17%, and 1.13%, respectively. The VIT method performed poorly in classification experiments with limited sample data. Although the MS3D-CNN-A method can extract deep and multi-scale features, it did not achieve ideal results for the same reason.
The overall classification accuracy of the pResNet, SSRN, and CBAM-Res-HybridSN methods, which have residual structures, exceeds 91%, a good result on a small-sample dataset; however, their accuracy on specific classes is not stable. For example, pResNet reaches only 69.39% accuracy on Vinyard_vertical_trellis, and SSRN only 83.57% on Grapes_untrained. Our method adopts spectral band grouping together with a res-block residual structure in the encoding part, effectively alleviating gradient vanishing, while the full-scale skip connections used for feature extraction significantly improve classification accuracy. Stable and effective results were achieved in all categories, each with a classification accuracy above 92%.
Finally, experiments were conducted on the MUUFL dataset, which has higher spatial resolution and lower spectral dimensionality. The results are shown in Figure 9 and Table 6. Except for the 3D-Unet model, the OA values of all models are above 90%, an ideal result; however, accuracy on the water and building-shadow classes is often poor. In terms of OA, RESSU leads CBAM-Res-HybridSN, the second-best method, by 1.13%, and is 7.96% higher than 3D-Unet, which shares the U-Net framework.

Validity Analysis
To evaluate the performance of the proposed network model, the number of parameters and FLOPs were computed on the MUUFL dataset, mainly in comparison with 3D-Unet.

Parameter quantity and FLOPs analysis
The number of parameters (Params) and the number of floating-point operations (FLOPs) are two crucial indicators for assessing the complexity of deep learning networks. In this study, we selected the MUUFL dataset for classification experiments and presented the results in Table 7, covering the model without attention (RESTU), without spectral attention (UnSA-RESS), without spatial attention (UnSPA-RESS), our proposed RESSU model, and 3D-Unet. The first four models used their best-performing group numbers, and the 3D-Unet settings were consistent with those used in our experiments. Our proposed model and its variants have significantly fewer parameters and FLOPs than 3D-Unet while maintaining a clear advantage in overall classification accuracy.
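Params and per-layer FLOPs for convolutional networks can be counted as follows. This is a hedged PyTorch sketch: the helper names are hypothetical, and the FLOPs helper uses one common convention (multiply-accumulates per Conv2d, ignoring bias and activations).

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Trainable parameter count of a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def conv2d_flops(conv: nn.Conv2d, out_h: int, out_w: int) -> int:
    """Multiply-accumulate count of one Conv2d at a given output size.
    Grouping divides the per-output-channel input fan-in by `groups`."""
    kh, kw = conv.kernel_size
    return (conv.in_channels // conv.groups) * kh * kw * conv.out_channels * out_h * out_w
```

The `groups` term makes explicit why grouped convolutions such as REGCS cost roughly 1/k of a dense layer's operations.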

Conclusion
We propose a new semantic segmentation model, RESSU, which makes full use of the spatial and spectral features of hyperspectral images to address the poor performance of classification models with limited training samples. We conducted experiments on two hyperspectral datasets with different characteristics, and the classification results were superior to those of the other methods compared.
At the same time, in the validity analysis, the parameters, FLOPs, and OA of different model combinations were compared with the 3D-Unet model; the parameters and FLOPs of our model and its variants are much smaller than those of 3D-Unet.
Compared with other classification methods, the proposed RESSU model has significant advantages in learning from small hyperspectral datasets, improving classifier performance and classification accuracy. However, shortcomings remain, and future studies will continue to improve the network structure so that the model achieves more outstanding classification performance on larger hyperspectral datasets.

Figure 2. RESSU full-scale skip connection structure

Figure 3. RESSU deep supervision structure

For the loss function, this article uses the softmax (cross-entropy) loss based on the classification features of hyperspectral images, which drives the predicted values toward the true values and thus achieves the purpose of learning. The formula is as follows:

L = -(1/N) Σ_i Σ_c y_{i,c} log(ŷ_{i,c}),  ŷ_{i,c} = exp(z_{i,c}) / Σ_{c'} exp(z_{i,c'})

where N is the number of labeled pixels, y_{i,c} is the one-hot true label, and z_{i,c} is the network output (logit) for pixel i and class c.

Figure 5. Spectral attention mechanism structure

To improve the computing efficiency of spectral attention, the feature maps were respectively subjected to global maximum pooling operations based on width and height, global standard

Table 1. Sample types and quantities of the hyperspectral datasets

Table 2. Experimental environment configuration parameters

Table 5. Classification accuracy evaluation of 5% training samples on Salinas Valley images using different methods (%)

Table 6. Classification accuracy evaluation of 5% training samples on MUUFL images using different methods (%)

Table 7. Parameter quantities, FLOPs, and OA of the proposed model with different combinations and of the 3D-Unet model