MULTI-SOURCE MULTI-SCALE HIERARCHICAL CONDITIONAL RANDOM FIELD MODEL FOR REMOTE SENSING IMAGE CLASSIFICATION

Fusion of remote sensing images and LiDAR data provides complementary information for remote sensing applications such as object classification and recognition. In this paper, we propose a novel multi-source multi-scale hierarchical conditional random field (MSMSH-CRF) model to integrate features extracted from remote sensing images and LiDAR point cloud data for image classification. The MSMSH-CRF model is constructed to exploit the features, the category compatibility of multi-scale images and the category consistency of multi-source data based on regions. The output of the model represents the optimal results of the image classification. We have evaluated the precision and robustness of the proposed method on airborne data, and the results show that the proposed method outperforms the standard CRF method.


INTRODUCTION
In the fields of photogrammetry and remote sensing, there exist many sources of earth observation data that capture different characteristics of targets on the ground. For a long time, integrating such multi-source data reasonably and effectively has been an active topic. Fusion of remote sensing images and LiDAR data provides complementary information for remote sensing applications such as object classification and recognition.
Many methods have been developed for the fusion of remote sensing images and LiDAR data. In general, those methods are classified into three categories, namely image fusion (Parmehr et al., 2012), feature fusion (Dalponte et al., 2012, Deng and Su, 2012), and decision fusion (Huang et al., 2011, Shimoni et al., 2011). The methods for image fusion involve resampling and registration of data with different resolutions, so the processing is time-consuming and the result depends on the accuracy of registration, which reduces the performance of the subsequent image classification. In the feature fusion methods, the features are usually extracted independently from the different data sources, and the fusion does not consider the correspondence of location and contextual information, by which the classification could be improved.
In order to overcome the limitations of the aforementioned methods, we present a novel multi-source multi-scale hierarchical conditional random field (MSMSH-CRF) model to fuse features extracted from remote sensing images and LiDAR point cloud data for image classification. The major contribution of this paper is that both the category compatibility of the multi-scale image in a hierarchical structure and the category consistency of multi-source data are considered in the MSMSH-CRF model. The remainder of the paper is organized as follows. The related work is discussed in Section 2. In Section 3, the MSMSH-CRF model is presented in detail. In Section 4, experimental results are presented. Finally, Section 5 concludes the paper and discusses future work.

RELATED WORK
In order to make full use of multi-source data for image classification and object recognition, many feature-based fusion methods have been proposed. One of the classic tools is the graphical model (Bishop, 2006), i.e. a probabilistic model defined on a graph describing the conditional dependence structure between random variables. As one branch of graphical models, Markov Random Fields (MRFs) have been used for image interpretation since 1986 (Besag, 1986), and their limiting factor, that they only allow for local image features, has been overcome by Conditional Random Fields (CRFs) (Lafferty et al., 2001), where arbitrary features can be used for classification. CRFs have the ability to discriminatively model contextual dependencies, conditioned on observations, for capturing global as well as local image context, which makes them suitable for accurate labeling (Perez et al., 2012). Therefore, they have been receiving more and more attention in recent years (Yang and Förstner, 2011b, Zhang et al., 2012, Niemeyer et al., 2014).
Schindler (2012) gives a systematic overview of image classification methods that impose a smoothness prior on the labels. Both local filtering-type approaches and global random field models developed in other fields of image processing are reviewed, together with a detailed experimental comparison and analysis of the methods on two different aerial data sets from urban areas with known ground truth. Based on the standard CRF model (Shotton et al., 2009), Yang and Förstner (2011a) introduce a hierarchical conditional random field to deal with the problem of image classification by modeling spatial and hierarchical structures. Perez et al. (2012) formulate a multi-scale CRF model to deal with the problem of region labeling in multispectral remote sensing images. Zhang et al. (2013) propose the multi-source hierarchical conditional random field (MSHCRF) model to fuse features extracted from remote sensing images and LiDAR point cloud data for image classification. Hierarchical pairwise potentials are introduced to consider the category consistency of multi-source data based on regions. Niemeyer et al. (2014) integrate a random forest classifier into a CRF framework, which is a flexible method for obtaining a reliable 3D classification in complex urban scenes. These methods exploit both spatial and hierarchical structures of objects in images. Considering the limitation of visual feature information from the images, the classification results could potentially be improved by incorporating information from different data sources, such as the elevation information in LiDAR data and the spectral information in hyperspectral images.

MSMSH-CRF MODEL FOR AUTOMATIC CLASSIFICATION
In this section, we start by presenting the graphical model that integrates an image and LiDAR data, the so-called MSMSH-CRF model, with its corresponding energy function. Then, we describe the model construction process. Afterward, we derive the features from each region obtained from the unsupervised segmentation algorithm. Then, we give particular formulations for the unary, pairwise and hierarchical potentials respectively. Finally, we discuss the learning and inference of this graphical model.

MSMSH-CRF model
In the field of image analysis, the regions of interest are usually detected independently, but considering the relative position between regions in single-source data and the correspondence between regions from multi-source data, the labeling of every region should not be independent. The CRF model is an effective way to solve the problem of predicting non-independent labels for multiple outputs, and in this model, all the features can be normalized globally to obtain the global optimal solution.
Based on the standard CRF model, we propose the MSMSH-CRF model to learn the conditional distribution over the class labeling given an image and the corresponding LiDAR data. The model allows us to incorporate different features and correspondence information in a single unified model, as illustrated in Figure 3. The conditional probability of the class labels c given an image X and LiDAR data L, which has a distribution of the Gibbs form, is defined as

$$P(c \mid X, L, \theta) = \frac{1}{Z(\theta, X, L)} \exp\big(-E(c, X, L, \theta)\big) \qquad (1)$$

with the energy function

$$E(c, X, L, \theta) = \sum_{i \in S} E_1(c_i, x_i, \theta_1) + \sum_{(i,j) \in N} E_2(c_i, c_j, x_i, x_j, \theta_2) + \sum_{(i,k) \in M} E_3(c_i, c_k, x_i, x_k, \theta_3) + \sum_{(i,t) \in H} E_4(c_i, c_t, x_i, l_t, \theta_4) \qquad (2)$$

where θ = {θ1, θ2, θ3, θ4} is the vector of model parameters, Z(θ, X, L) is the partition function, i, j and k index the regions xi, xj and xk in the image, which correspond to nodes in the graph, and t indexes the regions lt in the LiDAR data, which also correspond to nodes in the graph. S is the set of all nodes in the image level of the graph, N is the set of pairs of neighboring regions within each scale, M is the set of pairs collecting parent-child relations between regions at neighboring scales, and H is the set of corresponding pairs of regions between the images and the LiDAR data. E1 is the unary potential, which represents relationships between class labels and the observed data. E2 is the pairwise potential, representing relationships between class labels of neighboring regions within each scale. E3 is the multi-scale hierarchical pairwise potential, which represents corresponding relationships between regions at neighboring scales of the images. E4 is the multi-source hierarchical pairwise potential, representing corresponding relationships between the images and the LiDAR data.
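The structure of the energy function and the Gibbs distribution can be illustrated on a toy graph. The following is only a minimal sketch: the node names, the unary costs and the plain Potts stand-in for the pairwise terms are hypothetical values chosen for illustration, and the partition function is computed by brute-force enumeration, which is feasible only for very small graphs.

```python
import itertools
import math

LABELS = ("Building", "Road", "Vegetation")

def potts(weight):
    """Toy pairwise term: pay `weight` when the two labels differ."""
    return lambda a, b: weight if a != b else 0.0

def energy(labels, unary, edges):
    """Sum of unary terms over image nodes plus pairwise terms over all edges.

    labels: dict node -> class label
    unary:  dict image node -> {label: cost}       (E1, over S)
    edges:  list of (node_u, node_v, pairwise_fn)  (E2, E3, E4 over N, M, H)
    """
    total = sum(unary[n][labels[n]] for n in unary)
    total += sum(fn(labels[u], labels[v]) for u, v, fn in edges)
    return total

def gibbs(nodes, unary, edges):
    """Brute-force P(c) = exp(-E(c)) / Z over every possible labeling."""
    configs = [dict(zip(nodes, combo))
               for combo in itertools.product(LABELS, repeat=len(nodes))]
    weights = [math.exp(-energy(c, unary, edges)) for c in configs]
    z = sum(weights)  # partition function
    return [(c, w / z) for c, w in zip(configs, weights)]

# Hypothetical toy graph: two fine-scale image regions (x1, x2),
# their coarser-scale parent (x3), and one LiDAR region (l1).
nodes = ["x1", "x2", "x3", "l1"]
unary = {
    "x1": {"Building": 0.2, "Road": 1.5, "Vegetation": 2.0},
    "x2": {"Building": 0.9, "Road": 1.0, "Vegetation": 1.8},
    "x3": {"Building": 0.4, "Road": 1.2, "Vegetation": 1.9},
}
edges = [
    ("x1", "x2", potts(0.5)),  # N: neighbors within one scale (E2)
    ("x1", "x3", potts(0.3)),  # M: parent-child across scales (E3)
    ("x2", "x3", potts(0.3)),
    ("x1", "l1", potts(0.4)),  # H: image region vs. LiDAR region (E4)
]
dist = gibbs(nodes, unary, edges)
best_labels, best_prob = max(dist, key=lambda item: item[1])
```

In this toy setting the labeling with the lowest energy is also the most probable one, which is why inference can be posed as energy minimization.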

Model construction
In order to integrate features extracted from multi-source data for image classification, the MSMSH-CRF graphical model consists of two levels: the Image level and the LiDAR level. In the Image level, Texton is utilized to distinguish between different regions effectively and to obtain the segmented regions, which form all the nodes in the Image level of the graph. Meanwhile, we can change the number of channels of the Texton filter (Shotton et al., 2009) to get different results which are similar to a multi-scale segmentation; Figure 1 shows example results of our algorithm. The neighborhood in the Image level is defined as the relationship of two regions which share a common edge. In the LiDAR level, the mean shift algorithm is used to get the flat regions corresponding to continuous planes of different targets in the LiDAR data, which form all the nodes in the LiDAR level of the graph. To describe the consistency of the multi-source data, we first choose the optimal scale of images to match with the LiDAR data.
Assuming that the multi-source data acquired on the same airborne platform have been registered, for example with the algorithm introduced in (Mastin et al., 2009), we calculate the center of each region (or line) RLi in the depth image converted from the LiDAR data; the center should lie inside the region (or line) and on its symmetric axis. Then, based on the relative position of the centers, the corresponding regions (or lines) RLia in the multi-scale images can be selected. The procedure of choosing the optimal scale of images is illustrated in Figure 2. Therefore, for each pixel s in the region (or line) RLi, we obtain the optimal scale a* of the images as the arg min over the scale index a, where i indexes all regions (or lines) in the depth image converted from the Mean Shift Feature (MSF) or Alpha Shape Feature (ASF) of the LiDAR data.
Therefore, the MSMSH-CRF graphical model is constructed as follows, as illustrated in Figure 3. Firstly, typical features are derived from the regions of interest in the multi-source data, where the regions are generated by an unsupervised segmentation algorithm.
In the graphical model, the nodes correspond to regions. The blue edges represent the dependencies between neighboring regions, and the orange edges indicate the hierarchical relations between regions at different scales in a multi-scale segmentation.
Purple edges indicate the hierarchical relations between regions from multi-source data, where the optimal scale of images is selected to match the LiDAR data. The MSMSH-CRF model is constructed to exploit the features and category compatibility of multi-scale images as well as the category consistency of multi-source data based on regions. The output of the model represents the optimal results of the image classification.
Figure 2: Example image illustrating the procedure of choosing the optimal scale image to match the LiDAR data.

Features
Four types of features are extracted, namely the line features (LF), the texture features (TF), the mean shift features (MSF), and alpha shape features (ASF).The line features (LF) and the texture features (TF) are extracted from remote sensing images, whereas the mean shift features (MSF) and alpha shape features (ASF) are from LiDAR data.
Line Features (LF)
Shape features, in particular line features, not only describe the structures of targets directly, but are also stable under changes of illumination, color, etc. As a recent and effective line feature extractor, the LSD (Line Segment Detector) (Grompone and Randall, 2010) gives accurate results with a controlled number of false detections, and requires no parameter tuning. In the method, the level-line orientation is defined and calculated from the image gradient, and then the pixels with the same level-line orientation are merged to cover the so-called line support regions, in which all the pixels are regarded as a long

Mean Shift Features (MSF)
The mean shift method (Comaniciu and Meer, 2002) is a robust clustering technique which does not require prior knowledge of the number of clusters and does not constrain the shape of the clusters. The number of clusters is obtained automatically by finding the centers of the densest regions in the space, so this method is widely used for clustering of discrete points. In our model, the MSF is obtained following the process introduced in (Georgescu et al., 2003): all the LiDAR points are clustered into different regions, and all points in one region are assigned the same elevation, namely the mean elevation of the points in that region.
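The idea of clustering points by mean shift and then assigning each cluster its mean elevation can be sketched as follows. This is a simplified illustration, not the procedure of (Georgescu et al., 2003): it assumes a flat kernel, clusters scalar elevations only instead of full 3D points, and the function names, the bandwidth and the mode-merging threshold are hypothetical choices.

```python
def mean_shift_1d(values, bandwidth, iters=50, tol=1e-6):
    """Flat-kernel mean shift on scalar values; returns one mode per input."""
    modes = list(values)
    for _ in range(iters):
        moved = 0.0
        for idx, m in enumerate(modes):
            # Shift each mode to the mean of the data inside its window.
            window = [v for v in values if abs(v - m) <= bandwidth]
            new_m = sum(window) / len(window)
            moved = max(moved, abs(new_m - m))
            modes[idx] = new_m
        if moved < tol:
            break
    return modes

def cluster_elevations(elevations, bandwidth):
    """Group points whose modes coincide and assign each group its mean elevation."""
    modes = mean_shift_1d(elevations, bandwidth)
    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if abs(m - c) < bandwidth / 2:  # hypothetical merge threshold
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    sums, counts = {}, {}
    for z, k in zip(elevations, labels):
        sums[k] = sums.get(k, 0.0) + z
        counts[k] = counts.get(k, 0) + 1
    # MSF-style value: every point takes the mean elevation of its region.
    return labels, [sums[k] / counts[k] for k in labels]

elev = [2.0, 2.1, 1.9, 10.0, 10.2, 9.8, 2.05]  # made-up elevations in metres
labels, msf = cluster_elevations(elev, bandwidth=1.0)
```

The number of clusters is not passed in anywhere; it emerges from the bandwidth, which matches the property described above.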

Alpha Shape Features (ASF)
There are many methods for extracting the boundary of LiDAR data. Compared with other algorithms (Berger, 2012, Kong et al., 2012), the Alpha Shapes algorithm works effectively in extracting inner and outer boundaries from LiDAR data with convex and concave polygon shapes. Moreover, it can keep fine features of buildings adaptively and filter out the footprints of non-building objects. Based on the MSF regions obtained, the alpha shape algorithm is used to extract the boundary contour of each region, and then the Delaunay triangulation is used to get the line feature. The extraction of the ASF follows (Shen et al., 2011). Similar to the MSF, all the points in one line have the same elevation, which is the mean elevation of the points on that line.

Unary potentials
The unary potentials consist of two elements, the LF and TF potentials, and predict the label ci of the region xi based on the image X:

$$E_1(c_i, x_i, \theta_1) = LF(c_i, x_i, \theta_{LF}) + TF(c_i, x_i, \theta_{TF})$$

where LF(ci, xi, θLF) is the LF potential, TF(ci, xi, θTF) is the TF potential, and θ1 = {θLF, θTF} is the vector of model parameters.

LF Potentials
The LF potentials capture the (relatively weak) dependence of the class label and the boundaries of targets on the response value of LSD and the absolute location of the pixel in the image. We obtain the line segment image LFI(s) by calculating the LSD response value LFs of each pixel s in the region xi.
The LF potentials take the form of a look-up table with an entry for each class ci, LSD value LFs and pixel location s. The parameter θLF represents the relationship among the value of each pixel LFs, namely LFI(s), the pixel location s and the label ci.

TF Potentials
Based on the Joint Boost algorithm, an adapted version of the boosting learning algorithm, we obtain the Texton classifier, whose responses are used directly as a potential in the MSMSH-CRF model. The TF potential is thus defined in terms of P(ci|TFs), where TFs corresponds to the response of the classifier at each pixel s, and P(ci|TFs) is the normalized distribution given by the classifier using the learned parameters θTF.

Pairwise potentials
The pairwise potentials E2 describe the category compatibility between neighboring regions xi and xj, obtained from the line segment image LFI(s) and from the responses of the Texton classifier on the image X.

Pairwise Potentials of TF
Similar to the pairwise potentials of LF, the pairwise potentials of TF, PTF(ci, cj, xi, xj, θPTF), take the form of the contrast-sensitive Potts model, where θPTF is the weight factor, t(xi, xj) is the Euclidean metric of the Texton classifier values at each pixel between regions xi and xj in the marked images, and the number 4 in Eq. (10) is set empirically. The pairwise potentials PTF are scaled by Ni and Nj to compensate for the irregularity of the graph.
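To make the role of the weight factor, the contrast term and the scaling by Ni and Nj concrete, the sketch below implements one commonly used shape of a contrast-sensitive Potts term. It is an assumption and not the paper's Eq. (10): the functional form weight * (1 + 4 * exp(-2 d)) / (0.5 * (Ni + Nj)), the factor -2 and the function name are hypothetical; only the constant 4, the weight factor and the scaling by Ni and Nj are taken from the text.

```python
import math

def contrast_sensitive_potts(label_i, label_j, distance, n_i, n_j,
                             weight, contrast_gain=4.0):
    """Assumed form: weight * (1 + gain * exp(-2 d)) / (0.5 * (Ni + Nj)) if labels differ.

    distance: feature distance between the two regions, e.g. t(xi, xj)
    n_i, n_j: number of neighbors of each region (compensates graph irregularity)
    """
    if label_i == label_j:
        return 0.0  # Potts: no penalty when the labels agree
    contrast = 1.0 + contrast_gain * math.exp(-2.0 * distance)
    return weight * contrast / (0.5 * (n_i + n_j))

# Similar regions (small distance) are penalized more for taking different labels.
near = contrast_sensitive_potts("Road", "Building", 0.1, 4, 4, weight=0.18)
far = contrast_sensitive_potts("Road", "Building", 2.0, 4, 4, weight=0.18)
```

Under this assumed form, a label change across a strong feature contrast is cheap, while a label change between similar regions is expensive, which is the smoothing behavior the contrast-sensitive Potts model is meant to provide.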

Multi-scale hierarchical pairwise potentials
The pairwise potentials in Section 3.5 do not capture longer-range contextual relationships in the graphical model.
To overcome those local restrictions, we analyze the image at multiple scales to enhance the model by evidence aggregation from the local to the global level. Furthermore, we integrate multi-scale pairwise potentials to take the hierarchical structure of the regions into account.
Based on the results of the multi-scale segmentation, the multi-scale hierarchical pairwise potentials E3(ci, ck, xi, xk, θ3) describe the category compatibility between hierarchically neighboring labels ci and ck given the image X. They take the form of the contrast-sensitive Potts model, where θ3 is the weight factor, m(xi, xk) is the Euclidean metric of the Texton classifier values between regions xi and xk in the marked images, and the number 4 in Eq. (11) is set empirically. Multi-scale hierarchical pairwise potentials act as a link across scales, facilitating the propagation of information in the model.

Multi-source hierarchical pairwise potentials
Compared to the remote sensing images, LiDAR data is sparse.
The features extracted from the multi-source data are different. In order to enhance the fusion performance, we introduce the hierarchical pairwise potentials, which represent correspondences between the data from different sources in our MSMSH-CRF model. The hierarchical pairwise potentials describe the category consistency between the corresponding regions in the multi-source data, from which we obtain the TF and MSF, which are named planar features, and the LF and ASF, which are named linear features.
We treat the category consistency of the planar and linear features separately, denoted as HPP(ci, ct, xi, lt, θp) and HPL(ci, ct, xi, lt, θl) respectively, so that

$$E_4(c_i, c_t, x_i, l_t, \theta_4) = HPP(c_i, c_t, x_i, l_t, \theta_p) + HPL(c_i, c_t, x_i, l_t, \theta_l)$$

where θ4 = {θp, θl} is the vector of model parameters.
Hierarchical pairwise potentials of planar features
Based on the TF results of the optimal scale image, we first normalize the Texton classifier value TFs(xi) of each pixel s in the region xi to get NTFs(xi):

$$NTF_s(x_i) = \frac{TF_s(x_i)}{TF_{max}}$$

where TFmax is the maximum Texton classifier value over the pixels in the image.
In the MSF results of the LiDAR data, elevations of different regions are obtained, and the normalized elevation NMSF(lt) of all points in the extracted region lt is calculated:

$$NMSF(l_t) = \frac{MSF(l_t)}{MSF_{max}}$$

where MSF(lt) is the elevation of all points in the region lt, and MSFmax is the maximum elevation of all flat regions in the LiDAR data.
Based on the normalized values NTFs(xi) and NMSF(lt), the hierarchical pairwise potentials of planar features HPP(ci, ct, xi, lt, θp) are defined in terms of the comparative item

$$p = \left(2 \left\langle \left| NTF_s(x_i) - NMSF(l_t) \right|^2 \right\rangle\right)^{-1}$$

where ⟨·⟩ is the averaging operator and θp is the weight.
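The comparative item p can be computed directly from the normalized values. The sketch below assumes that normalization means division by the respective maximum and that the average runs over the pixels s of one image region; the function names, the sample values and the small guard `eps` against division by zero are hypothetical additions.

```python
def normalize(values, max_value):
    """Scale raw values by the given maximum (assumed normalization)."""
    return [v / max_value for v in values]

def comparative_item(ntf_values, nmsf_value, eps=1e-12):
    """p = (2 * mean(|NTF_s - NMSF|^2))^-1 over the pixels s of one region.

    eps is a hypothetical guard for the case of a perfect match.
    """
    mean_sq = sum((v - nmsf_value) ** 2 for v in ntf_values) / len(ntf_values)
    return 1.0 / (2.0 * mean_sq + eps)

# Made-up example: three pixel responses in one image region and one LiDAR region.
ntf = normalize([3.0, 4.0, 3.5], max_value=5.0)   # -> about [0.6, 0.8, 0.7]
nmsf = 12.0 / 20.0                                # elevation 12 m, maximum 20 m
p = comparative_item(ntf, nmsf)
```

The closer the normalized image response is to the normalized elevation, the larger p becomes, so p acts as a similarity measure between the two corresponding regions.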

Hierarchical pairwise potentials of linear features
The hierarchical pairwise potentials of linear features HPL(ci, ct, xi, lt, θl) are defined analogously in terms of the comparative item

$$t = \left(\left\langle 2 \left| NLF_s(x_i) - NASF(l_t) \right|^2 \right\rangle\right)^{-1}$$

where θl is the weight, NLFs(xi) is the normalized value from the LF results of the optimal scale image, and NASF(lt) is the normalized value from the ASF of the LiDAR data.

Parameter Learning
In this paper, the piecewise training method (Sutton and McCallum, 2005) is adopted for learning the parameters of the MSMSH-CRF model. This method divides the MSMSH-CRF model into pieces corresponding to the different terms in Eq. (2). Each of these pieces is then trained independently, as if it were the only term in the model.

Parameters of LF Potentials
The parameters of the LF potentials are calculated for each image according to Eq. (17), where the small positive constant wLF is set to 0.1 in practice.

Parameters of TF Potentials
The learning of the parameters of the TF potentials is based on the Joint Boost algorithm. An excellent detailed treatment of the learning process is given in (Shotton et al., 2009), but we briefly describe it here for completeness.

Parameters of other potentials
The parameters of the other potentials of the MSMSH-CRF model, θPLF, θPTF, θ3, θp and θl, are selected manually such that the classification error is minimized on the training set.

Model Inference
Given a set of parameters learned for the MSMSH-CRF model, the optimal labeling c*, which minimizes the energy function in Eq. (2), is found by applying the alpha-expansion graph-cut algorithm (Boykov et al., 2001, Boykov and Jolly, 2001).
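A full alpha-expansion implementation requires a max-flow solver and is beyond a short example. To illustrate what minimizing the energy over labelings means, the sketch below swaps in iterated conditional modes (ICM), a much simpler greedy method that only reaches a local minimum. It is not the algorithm used in this paper, and the graph, the costs and the plain Potts pairwise term are hypothetical.

```python
def icm(nodes, labels, unary, edges, max_sweeps=20):
    """Iterated conditional modes: greedy per-node descent on a Potts energy.

    unary: dict node -> {label: cost}
    edges: list of (node_u, node_v, weight); weight is paid if labels differ
    """
    neighbors = {n: [] for n in nodes}
    for u, v, w in edges:
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))
    # Initialize every node with its cheapest unary label.
    current = {n: min(labels, key=lambda c: unary[n][c]) for n in nodes}

    def local_cost(n, c):
        return unary[n][c] + sum(w for m, w in neighbors[n] if current[m] != c)

    for _ in range(max_sweeps):
        changed = False
        for n in nodes:
            best = min(labels, key=lambda c: local_cost(n, c))
            if local_cost(n, best) < local_cost(n, current[n]):
                current[n] = best
                changed = True
        if not changed:  # no single-node change lowers the energy
            break
    return current

# Made-up chain of three regions; the middle one weakly prefers "Road".
nodes = ["a", "b", "c"]
unary = {
    "a": {"Building": 0.1, "Road": 1.0},
    "b": {"Building": 0.6, "Road": 0.5},
    "c": {"Building": 0.1, "Road": 1.0},
}
edges = [("a", "b", 0.4), ("b", "c", 0.4)]
result = icm(nodes, ["Building", "Road"], unary, edges)
```

In this example the pairwise terms overrule the weak unary preference of the middle region, which shows how context changes the labeling compared with classifying each region independently.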

EXPERIMENTS
In this section, experiments are performed on the Beijing Airborne Data (Zhang et al., 2013), to evaluate the performance of the proposed method.

Dataset
We conduct experiments to evaluate the performance of the MSMSH-CRF model on the Beijing Airborne Data (Zhang et al., 2013), which include remote sensing images with a resolution of 0.12 m and LiDAR data with a point density of 4 points/m², as illustrated in Figure 4. The objects in all images correspond to one of three classes: Building, Road and Vegetation. These classes are typical objects appearing in airborne images. In the experiments, we take the ground-truth label of a region to be the majority vote of the ground-truth pixel labels, and randomly divide the images into a training set with 50 images and a testing set with 50 images.

Method                    Accuracy (%)
(Shotton et al., 2009)    64.2
(Zhang et al., 2013)      73.6
Ours                      83.7
Table 1: Average pixelwise accuracy of three methods on the Beijing Airborne Data.
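The majority-vote rule for region ground truth described in this section can be sketched as follows. The function name and the sample data are hypothetical, and ties are resolved in favor of the label encountered first, which is an arbitrary choice.

```python
from collections import Counter

def region_ground_truth(pixel_labels, region_ids):
    """Assign each region the most frequent ground-truth label of its pixels."""
    votes = {}
    for label, region in zip(pixel_labels, region_ids):
        votes.setdefault(region, Counter())[label] += 1
    # most_common(1) returns the label with the highest pixel count.
    return {region: counts.most_common(1)[0][0]
            for region, counts in votes.items()}

# Made-up example: six pixels belonging to two segmented regions.
pixel_labels = ["Building", "Building", "Road", "Road", "Road", "Vegetation"]
region_ids = [0, 0, 0, 1, 1, 1]
region_labels = region_ground_truth(pixel_labels, region_ids)
```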

Results
Figure 5 shows example results of the MSMSH-CRF classification method. The average pixelwise accuracy on the testing set is given in Table 1. The average classification accuracy of our method is 83.7%, which is a gain of 10.1 percentage points over the MSHCRF model (Zhang et al., 2013) and of 19.5 percentage points over the standard CRF model (Shotton et al., 2009). The parameters, learned by cross validation on the training set, are θPLF = 0.22, θPTF = 0.18, θ3 = 0.15, θp = 0.2, and θl = 0.25. For a fair comparison, the same training set and testing set are used for MSMSH-CRF, MSHCRF and the standard CRF. Figure 6 shows the classification accuracy with different parameters, where only one parameter is changed while the others are fixed.

Figure 1: Example region images of the Texton segmentation results at scales 1, 2 and 3. The color of each region is assigned randomly so that neighboring regions are likely to have different colors. Top row left: original image. Top row right: segmentation result at scale 3. Bottom row left: segmentation result at scale 1. Bottom row right: segmentation result at scale 2.

Figure 3: Illustration of the MSMSH-CRF model architecture. In the Image level, red nodes (# 1) correspond to image regions, blue edges (# 2) linking red nodes represent the dependency between neighboring regions, and orange edges (# 3) linking red nodes across scales indicate the hierarchical relation between regions at different scales corresponding to the multi-scale segmentation. In the LiDAR level, green nodes represent the extracted regions. Purple edges (# 4) linking red and green nodes indicate the hierarchical relation between regions from multi-source data, where the optimal scale of images is selected to match the LiDAR data.
Each training example s (a pixel in a training image) is paired with a target value z_s^c ∈ {−1, +1} (+1 if the example s has ground truth class c, −1 otherwise) and assigned a weight ω_s^c specifying its classification accuracy for class c after each round of boosting. Each round chooses a new weak learner by minimizing an error function incorporating the weights. The weights ω_s^c of the training examples are then updated to reflect the new classification accuracy. This procedure emphasizes poorly classified examples in subsequent rounds, and ensures that over many rounds the classification for each training example approaches the target value and the parameters are optimal.
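The re-weighting loop described above can be sketched with a simplified binary boosting procedure. This is an illustration under several assumptions: it uses a GentleBoost-style update with regression stumps on a single scalar feature for one class, and it does not implement the sharing of weak learners across classes that characterizes Joint Boost. All names and data are hypothetical.

```python
import math

def fit_stump(xs, zs, ws):
    """Choose threshold th and outputs (a if x > th, else b) minimizing weighted squared error."""
    def wmean(pairs):
        total = sum(w for _, w in pairs)
        return sum(z * w for z, w in pairs) / total if total > 0 else 0.0

    best = None
    for th in sorted(set(xs)):
        above = [(z, w) for x, z, w in zip(xs, zs, ws) if x > th]
        below = [(z, w) for x, z, w in zip(xs, zs, ws) if x <= th]
        a, b = wmean(above), wmean(below)
        err = sum(w * (z - (a if x > th else b)) ** 2
                  for x, z, w in zip(xs, zs, ws))
        if best is None or err < best[0]:
            best = (err, th, a, b)
    return best[1:]

def boost(xs, zs, rounds=5):
    """Each round fits a weak learner on the weighted data, then re-weights the examples."""
    ws = [1.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        th, a, b = fit_stump(xs, zs, ws)
        stumps.append((th, a, b))
        # Examples that the new weak learner gets wrong gain weight.
        ws = [w * math.exp(-z * (a if x > th else b))
              for x, z, w in zip(xs, zs, ws)]
    return stumps

def predict(stumps, x):
    """Strong classifier: sum of weak learner outputs; the sign gives the class."""
    return sum(a if x > th else b for th, a, b in stumps)

# Made-up scalar feature responses with targets in {-1, +1}.
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
zs = [-1, -1, -1, 1, 1, 1]
stumps = boost(xs, zs)
```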

Figure 4: Example images of the Beijing Airborne Data. Left: LiDAR data. Right: remote sensing images of the surveying area.

ISPRS
Figure 6: The classification accuracy with different parameters, where only one parameter is changed while the others are fixed.
Pairwise Potentials of LF
Based on the line segment image LFI(s), we calculate the pairwise potentials of LF, PLF(ci, cj, xi, xj, θPLF), in the form of the contrast-sensitive Potts model (Boykov and Jolly, 2001), where θPLF is the weight factor, l(xi, xj) is the Euclidean metric of the pixel values between regions xi and xj in the LF images, Ni is the number of regions neighboring region i, Nj is the number of regions neighboring region j, σ(•) is a 0-1 indicator function, and the number 6 in Eq. (9) is set empirically. The pairwise potentials PLF(ci, cj, xi, xj, θPLF) are scaled by Ni and Nj to compensate for the irregularity of the graph.

Table 2 shows the confusion matrix obtained by applying the MSMSH-CRF model to the whole test dataset. Accuracy values in the table are computed as the percentage of image pixels assigned to the correct class label, ignoring pixels labeled as void in the ground truth. Compared to the confusion matrices of the standard CRF model and the MSHCRF model in Table 3 and Table 4 respectively, the MSMSH-CRF model yields a significant improvement on all three classes by integrating multi-scale hierarchical information of the regions in the images. Table 5 shows the performance comparison when dropping one type of potential in the MSMSH-CRF model.
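The evaluation protocol described above, a confusion matrix whose rows are normalized to 100% with void ground-truth pixels ignored, can be sketched as follows. The function name, the `void` marker string and the sample labels are hypothetical.

```python
def confusion_matrix(true_labels, pred_labels, classes, void="void"):
    """Return a row-normalized confusion matrix (in %) and the overall pixelwise accuracy (in %)."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, pred_labels):
        if t == void:
            continue  # pixels without ground truth are ignored
        counts[t][p] += 1
    percent = {}
    for t in classes:
        row_total = sum(counts[t].values())
        # Rows index the true class, columns the predicted class.
        percent[t] = {p: (100.0 * counts[t][p] / row_total if row_total else 0.0)
                      for p in classes}
    correct = sum(counts[c][c] for c in classes)
    total = sum(sum(row.values()) for row in counts.values())
    return percent, 100.0 * correct / total

classes = ["Building", "Road", "Vegetation"]
truth = ["Building", "Building", "Road", "Road", "Vegetation", "void"]
pred = ["Building", "Road", "Road", "Road", "Vegetation", "Building"]
matrix, accuracy = confusion_matrix(truth, pred, classes)
```

Row normalization makes the diagonal entries per-class recall values, so classes with few pixels are not hidden by the dominant classes as they can be in the overall accuracy.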
CONCLUSIONS
In conclusion, this paper presents a novel multi-source multi-scale hierarchical conditional random field model for automatic classification of remote sensing images. The main contributions of this work are summarized as follows: a novel CRF-based modeling scheme exploiting the complementarity of multi-source data

Table 2: Pixelwise accuracy of the MSMSH-CRF classification on the Beijing Airborne Data. The confusion matrix shows classification accuracy for each class (rows) and is row-normalized to sum to 100%. Row labels indicate the true class, and column labels indicate the predicted class.

Table 3: The confusion matrix: pixelwise accuracy of the standard CRF classification on the Beijing Airborne Data.

Table 4: The confusion matrix: pixelwise accuracy of the MSHCRF classification on the Beijing Airborne Data.

Table 5: The performance comparison when dropping one type of potential in the MSMSH-CRF model.