THE APPLICATION OF A CAR CONFIDENCE FEATURE FOR THE CLASSIFICATION OF CROSS-ROADS USING CONDITIONAL RANDOM FIELDS

The precise classification and reconstruction of crossroads from multiple aerial images is a challenging problem in remote sensing. We apply Conditional Random Fields (CRF), a probabilistic model that can be used to consider context in classification, to this problem. A simple appearance-based model is combined with a probabilistic model of the co-occurrence of class labels at neighbouring image sites to distinguish classes that are relevant for scenes containing crossroads. The parameters of these models are learnt from training data. We use multiple overlapping aerial images to derive a digital surface model (DSM) and a true orthophoto without moving cars. From the DSM and the orthophoto we derive feature vectors that are used in the classification. Within our framework we make use of a car detector based on support vector machines (SVM), which delivers car probability values. These values are used as an additional feature to support the classification when the road surface is occluded by static cars. Our approach is evaluated on a dataset of airborne photos of an urban area by comparing the results to reference data. The evaluation is performed for images of different resolutions. The method is shown to produce promising results when using the car probability values and a higher image resolution.


INTRODUCTION
The automatic detection and reconstruction of roads has been an important topic of research in Photogrammetry and Remote Sensing for several decades. Considerable progress has been made, but the problem has not been finally solved. The EuroSDR test on road extraction has shown that road extraction methods are mature and reliable under favourable conditions, in particular in rural areas, but they are far from being practically relevant in more challenging environments such as urban or suburban areas (Mayer et al., 2006). One of the main reasons for the failure of road extraction algorithms in that test was the existence of crossroads, because model assumptions about roads (e.g., the existence of parallel edges delineating a road) are violated there. For this reason, specific models for the extraction of crossroads from images have been developed. (Barsi and Heipke, 2003) used neural networks for a supervised per-pixel classification of greyscale orthophotos in order to detect areas corresponding to crossroads, combining radiometric and geometric features. However, only examples for rural areas were shown. (Ravanbakhsh et al., 2008b, Ravanbakhsh et al., 2008a) used a model based on snakes to delineate the outlines of road surfaces at crossroads, including the delineation of traffic islands. The main reasons for failure of that method were occlusion of the road surface by cars and a complex 3D geometry, e.g. at motorway interchanges. Occlusions were also a major problem in (Grote et al., 2012), which also gives an overview of other current road detection techniques. The problem of occlusion by cars could be overcome if the positions of the cars in the images were known.
Conditional Random Fields (CRF) can be used for a raster-based classification of images (Kumar and Hebert, 2006). CRF offer probabilistic models for including context in the classification process by considering the statistical dependencies between the class labels at neighbouring image sites. Nevertheless, their application is restricted because of oversmoothing (Schindler, 2012), which is most likely to occur with small classes such as cars. In our previous work (Kosov et al., 2012) we tried to overcome this problem by integrating a car confidence feature into a CRF-based classification of image data together with a digital surface model (DSM). This feature was based on a probabilistic car detector, but it contributed little to the classification of cars because there were too many false positive car detections. It is one of the goals of this paper to overcome these problems by applying a more advanced car detector.
Most recent approaches for car detection from aerial imagery use implicit models. In (Grabner et al., 2008), rotation-invariant Histogram of Oriented Gradients (HOG), local binary pattern and Haar-like features are utilized. The authors apply an online boosting procedure for efficient training data collection. Another interesting approach is shown in (Kembhavi et al., 2011), where new types of image features for vehicle detection are introduced. These features include colour probability maps and pairs of pixels. The latter are used to extract symmetric properties of image objects. In this paper we propose a method to predict probabilities for vehicles based on rotation-invariant features and Support Vector Machines. Thus, the number of false positives can be reduced.
The second problem to be tackled in this paper is occlusion. We will address this problem by building a twin CRF, introducing two layers of class labels for each pixel. Partially occluded objects were also detected in (Leibe et al., 2008). The objects in the scene are represented as an assembly of parts. The method is robust to cases where some parts are occluded and, thus, can predict labels for occluded parts from neighbouring unoccluded sites. However, it can only handle small occlusions, and it does not consider the relations between the occluded and the occluding objects. Methods including multiple layers of class labels in a CRF mostly use part-based models, where the additional layer does not explicitly refer to occlusions, but encodes another label structure. In (Kumar and Hebert, 2005) and (Schnitzspan et al., 2009), multiple layers represent a hierarchical object structure, i.e. each object on a higher level interacts with its smaller parts on a lower level. In (Winn and Shotton, 2006), the part-based model is motivated by the method's potential to incorporate information about the relative alignment of object parts and to model long-range interactions. However, occluded objects are not explicitly reconstructed. The spatial structure of such part-based models is not rotation-invariant and, thus, requires the availability of a reference direction (the vertical in images with a horizontal viewing direction), which is not available in aerial imagery. In (Wojek and Schiele, 2008), a CRF having several layers is used, but the additional layer is related to a label for object identity, used to track an object detected by a specific object detector over several images.
In (Kosov et al., 2013) we already proposed a two-layer CRF to deal with occlusions, but the classifier used for the association potentials was based on Gaussian mixture models and no car confidence feature was applied. The method presented in this paper applies a better base classifier for the association potentials, namely Random Forests (RF), and again includes the car confidence feature. The main advantage of separating two class labels is a better potential for correctly classifying partly occluded areas while maintaining the occluding objects such as cars or trees. Our method is evaluated using 90 crossroads of the Vaihingen data set of the German Society of Photogrammetry, Remote Sensing and Geoinformation (DGPF). We use image and DSM data having a ground sampling distance (GSD) of 8 cm. The focus of the evaluation is on the impact of the car confidence feature, the context model, and the image resolution on the results.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume II-3/W3, 2013. CMRT13 - City Models, Roads and Traffic 2013, 12-13 November 2013, Antalya, Turkey. This contribution has been peer-reviewed. The double-blind peer-review was conducted on the basis of the full paper.

CONDITIONAL RANDOM FIELDS (CRF)
We assume an image y to consist of M image sites (pixels or segments) i ∈ S with observed data y_i, i.e., y = (y_1, y_2, ..., y_M). Each site i is assigned a class label x_i, and the CRF models the posterior probability of all labels given the data (Eq. 1). In Eq. 1, ϕ_i(x_i, y) are the association potentials linking the observations to the class label at site i, ψ_ij(x_i, x_j, y) are the interaction potentials modelling the dependencies between the class labels at two neighbouring sites i and j and the data y, N_i is the set of neighbours of site i (thus, j is a neighbour of i), and Z is a normalizing constant. Applications of the CRF model differ in the way they define the graph structure, in the observed features, and in the models used for the potentials. Our adaptations of the framework will be explained in Section 3.
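The equation referred to as Eq. 1 is not reproduced in this text. As a hedged sketch only, and assuming a product form (which would be consistent with association potentials that are defined directly as probabilities), the posterior described by the symbols above could be written as follows. The exact notation of the original equation is an assumption.

```latex
p(\mathbf{x} \mid \mathbf{y}) \;=\; \frac{1}{Z}
  \prod_{i \in S} \Bigl( \varphi_i(x_i, \mathbf{y})
  \prod_{j \in N_i} \psi_{ij}(x_i, x_j, \mathbf{y}) \Bigr)
```

Taking the logarithm of the potentials turns the products into sums, which gives the equivalent exponential form that is also common in the CRF literature.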

METHOD
The goal of our method is the pixel-based classification of urban scenes containing crossroads. The primary input consists of multiple aerial images and their orientation data. We require at least fourfold overlap of each crossroads from two different image strips in order to avoid occlusions as far as possible. In a preprocessing stage, these multiple images are used to derive a DSM by dense matching. The DSM is used to generate a true orthophoto from all input images, taking advantage of the multiple views to eliminate moving cars. More details about the preprocessing stage can be found in (Kosov et al., 2012). The DSM and the combined orthophoto are the input for extracting the features, which in turn provide the input to the CRF-based classifier.

Twin CRF
In this paper we distinguish objects corresponding to the base level, i.e. the most distant objects that cannot occlude other objects but could be occluded, from objects corresponding to the occlusion level, i.e. all other objects. This implies that two class labels, x_i^b for the base level and x_i^o for the occlusion level, are determined for each image site i. In Eq. 2, the association potentials ϕ_i^l, l ∈ {o, b}, link the data y with the class labels x_i^l of image site i at level l. The interaction potentials ψ_ij^l, l ∈ {o, b}, model the dependencies between the data y and the labels at two neighbouring sites i and j at each level. This model implies that the two levels do not interact. Training the parameters of the potentials in Eq. 2 requires fully labelled training images. The classification of new images is carried out by maximizing the probability in Eq. 2.
The association potentials follow (Kumar and Hebert, 2006), where the image data are represented by site-wise feature vectors f_i(y) that may depend on all the observations y. Note that the definition of these feature vectors may vary with the dataset. We use a Random Forest (RF) (Breiman, 2001) in the implementation of (OpenCV, 2012) for the association potentials of both the base and the occlusion levels, i.e. ϕ_i^b(x_i^b, y) and ϕ_i^o(x_i^o, y). An RF consists of N_T decision trees that are generated in the training phase. In the classification, each tree casts a vote for the most likely class. If the number of votes cast for a class c is N_c, the probability underlying our definition of the association potentials is p(x_i = c | f_i(y)) = N_c / N_T.
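The vote-based probability N_c / N_T can be illustrated with a minimal sketch. The function name `vote_probabilities` and the example votes are hypothetical; in the paper the votes come from the OpenCV Random Forest, which is not reproduced here.

```python
from collections import Counter


def vote_probabilities(tree_votes, classes):
    """Turn per-tree class votes into p(x_i = c | f_i(y)) = N_c / N_T."""
    n_trees = len(tree_votes)
    if n_trees == 0:
        raise ValueError("at least one tree vote is required")
    counts = Counter(tree_votes)
    # Classes that received no vote get probability zero.
    return {c: counts.get(c, 0) / n_trees for c in classes}


# Hypothetical votes of N_T = 10 trees for a single image site.
votes = ["asp."] * 6 + ["bld."] * 3 + ["gr."]
probs = vote_probabilities(votes, ["asp.", "bld.", "gr.", "agr."])
```

With these votes, asphalt receives 6 of 10 votes, and a class without any vote receives a probability of exactly zero, which is worth keeping in mind when such values are multiplied with other potentials.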

Interaction Potential:
This potential describes how likely a pair of neighbouring sites i and j is to take the labels (x_i, x_j) = (c, c′) given the data (Kumar and Hebert, 2006). We generate a 2D histogram h_ψ(x_i, x_j) of the co-occurrence of labels at neighbouring sites from the training data; h_ψ(x_i = c, x_j = c′) is the number of occurrences of the classes (c, c′) at neighbouring sites i and j. To avoid a bias towards classes covering a large area in the training data, we scale the rows of h_ψ(x_i, x_j) so that the largest value in each row is one, which results in a matrix h′_ψ(x_i, x_j). We obtain ψ_ij(x_i, x_j, y) ≡ ψ_ij(x_i, x_j, d_ij) by applying a penalization depending on the Euclidean distance d_ij = ||f_i(y) − f_j(y)|| of the feature vectors f_i and f_j to the diagonal of h′_ψ(x_i, x_j) (Eq. 3). In Eq. 3, λ1 and λ2 determine the relative weight of the interaction potential compared to the association potential. As the largest entries of h′_ψ(x_i, x_j) are usually found on the diagonal, a model without the data-dependent term in Eq. 3 would favour identical class labels at neighbouring image sites and, thus, result in a smoothed label image. This will still be the case if the feature vectors f_i and f_j are identical. However, large differences between the features will reduce the impact of this smoothness assumption and make a class change between neighbouring image sites more likely. This model differs from the contrast-sensitive Potts model (Boykov and Jolly, 2001) by the use of the normalised histograms h′_ψ(x_i, x_j) in Eq. 3.
It is also different from methods such as the one described in (Rabinovich et al., 2007), who use the co-occurrence of objects in a scene to define a global prior that makes the detection of small objects more likely if related larger objects are found. We use the co-occurrence of neighbouring objects to favour local label transitions that occur more frequently in the training data. Again, the training of the models for the base and the occlusion levels, ψ_ij^b and ψ_ij^o, respectively, is carried out independently from each other using fully labelled training data.
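The construction of the interaction potential can be sketched as follows. The histogram counting and the row scaling follow the description above. The penalty applied to the diagonal is an assumption: Eq. 3 is not reproduced in this text, so the form λ1 / sqrt(1 + λ2 · d_ij²) used in `interaction_potential` is only an illustrative choice, and all function names are hypothetical.

```python
import math


def cooccurrence_histogram(label_image, n_classes):
    """Count label pairs at 4-connected neighbouring sites (both directions)."""
    h = [[0] * n_classes for _ in range(n_classes)]
    rows, cols = len(label_image), len(label_image[0])
    for r in range(rows):
        for c in range(cols):
            a = label_image[r][c]
            for dr, dc in ((0, 1), (1, 0)):  # right and lower neighbour
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    b = label_image[rr][cc]
                    h[a][b] += 1
                    h[b][a] += 1
    return h


def scale_rows(h):
    """Scale each row so that its largest entry becomes one."""
    scaled = []
    for row in h:
        m = max(row)
        scaled.append([v / m if m > 0 else 0.0 for v in row])
    return scaled


def interaction_potential(h_scaled, xi, xj, d_ij, lam1=2.0, lam2=0.01):
    """ASSUMED penalty form (not the paper's Eq. 3): damp the diagonal
    entries as the feature distance d_ij grows."""
    value = h_scaled[xi][xj]
    if xi == xj:
        value *= lam1 / math.sqrt(1.0 + lam2 * d_ij ** 2)
    return value


# Hypothetical 2 x 3 training label image with two classes.
labels = [[0, 0, 1],
          [0, 0, 1]]
h = cooccurrence_histogram(labels, 2)
h_scaled = scale_rows(h)
```

Under this assumed form, identical labels are favoured most strongly when the feature vectors are identical (d_ij = 0), and the preference weakens as d_ij grows, which is the qualitative behaviour described in the text.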

Car Detection
The presence of vehicles in optical images is a strong indicator for roads. Thus, a separate classification of cars seems to be very useful for the reconstruction of crossroads. A very similar idea was already shown in (Hinz, 2004). There, hierarchical wire-frame models were used for the verification of already detected roads. In general, vehicle detection is performed using either implicit or explicit models. Extensive overviews of previous work can be found in (Stilla et al., 2004) and (Hinz et al., 2006).
The directions of the roads are unknown in advance. Thus, we also use HOG features. These image features can be calculated very efficiently by integral histograms (Porikli, 2005) for the sliding classification windows. The window size is 80 × 80 pixels. We calculate histograms with 9 bins for 100 non-overlapping blocks of 8 × 8 pixels each. Training and classification are performed using nonlinear Support Vector Machines (SVM) with soft margins and radial basis functions as kernel. The kernel parameter and the error weight of the slack variables are determined by cross-validation on the training data. The membership of each pixel i in the class car given its feature vector y_i is calculated by the SVM decision function (Eq. 4), where w is the normal vector of the separating hyperplane in the transformed feature space and b determines the distance of this hyperplane from the origin of that space. The transformation of the feature vectors is given by the transform ϕ(y_i). This function only gives a binary decision, which is not suitable as an input for the CRF. Thus, posterior probabilities P(x_i | y_i) are estimated for each pixel i. For that purpose, the posterior is approximated by a sigmoid function as proposed by (Platt, 2000) (Eq. 5). The parameters A and B of the sigmoid are estimated by the algorithm given in (Lin et al., 2007), which is more robust than the original algorithm of (Platt, 2000).
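The step from a binary SVM decision to a car probability can be sketched as below. The sigmoid 1 / (1 + exp(A·f + B)) is the standard form of Platt scaling and is assumed here to correspond to Eq. 5, which is not reproduced in this text; f denotes the signed, unthresholded decision value. The values of A and B are hypothetical placeholders, whereas in the paper they are fitted with the algorithm of (Lin et al., 2007).

```python
import math


def binary_decision(f):
    """Hard SVM decision: +1 (car) or -1 (no car) from the signed value f."""
    return 1 if f >= 0 else -1


def platt_probability(f, A, B):
    """Sigmoid posterior 1 / (1 + exp(A * f + B)), evaluated so that
    math.exp never receives a large positive argument."""
    z = A * f + B
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


# Hypothetical sigmoid parameters; A is negative so that larger decision
# values map to larger car probabilities.
A, B = -2.0, 0.0
p_far_positive = platt_probability(1.5, A, B)
p_on_boundary = platt_probability(0.0, A, B)
p_far_negative = platt_probability(-1.5, A, B)
```

Each 80 × 80 window with 100 blocks of 9 bins yields a 900-dimensional HOG vector; with an RBF kernel the decision value f is evaluated through the kernel expansion over the support vectors rather than through an explicit transform.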

Definition of the Features
As stated in Section 3.1.1, we derive a feature vector f_i(y) for each image site i that consists of six features derived from the orthophoto (image features) collected in a vector f_img, a feature derived from the DSM (f_DSM) and, optionally, the car confidence feature (f_car), defined as the posterior in Eq. 5. We also make use of multi-scale features, collected in a vector f_MS. The site-wise feature vectors are f_i(y)^T = (f_img^T, f_DSM, f_MS^T) or f_i(y)^T = (f_img^T, f_DSM, f_MS^T, f_car), depending on whether the car confidence feature is used or not. For numerical reasons, all features are scaled linearly into the range between 0 and 255 and then quantized to 8 bits.
We do not use the colour vectors of the images directly to define the site-wise image feature vectors f_img. The first three features are the normalized difference vegetation index (NDVI), derived from the near infrared and the red band of the CIR orthophoto, the saturation (sat) component after transforming the image to the LHS colour space, and the image intensity (int), calculated as the average of the two non-infrared channels. We also make use of the variance of intensity (varint) and the variance of saturation (varsat), determined from a local neighbourhood of each pixel (7 × 7 pixels for varint, 13 × 13 pixels for varsat). The sixth image feature (dist) represents the relation between an image site and its nearest edge pixel; this feature should model the fact that road pixels are usually found at a certain distance from either road edges or road markings. We generate an edge image by thresholding the intensity gradient of the input image. Then, we determine a distance map from this edge image. The feature used in classification is the distance of an image site to its nearest edge pixel, taken from the distance map. Thus, the image feature vector for each pixel is f_img = (NDVI, sat, int, varsat, varint, dist)^T.
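Two of the simpler image features and the 8-bit scaling can be sketched per pixel as follows. The NDVI formula is the standard one; the function names are hypothetical, and scaling by the minimum and maximum of the observed values is an assumption, since the text only states that the scaling is linear.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from the NIR and red bands."""
    denom = nir + red
    return (nir - red) / denom if denom != 0 else 0.0


def intensity(red, green):
    """Average of the two non-infrared channels of a CIR image."""
    return (red + green) / 2.0


def scale_to_8bit(values):
    """Linearly map values to 0..255 and quantize to integers.
    ASSUMPTION: the range is taken from the min and max of the input."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    return [int(round(255.0 * (v - lo) / (hi - lo))) for v in values]


# Hypothetical grey values of one vegetated pixel (NIR, red, green).
pixel_ndvi = ndvi(200.0, 50.0)
pixel_int = intensity(50.0, 70.0)
quantized = scale_to_8bit([0.0, 0.2, 1.0])
```

A high NDVI such as the one of this pixel indicates vegetation, which is why the feature helps to separate grass and trees from asphalt and buildings.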
A coarse Digital Terrain Model (DTM) is generated from the DSM by applying a morphological opening filter with a structural element whose size corresponds to the size of the largest off-terrain structure in the scene, followed by a median filter with the same kernel size. The DSM feature is the difference between the DSM and the DTM, i.e., f_DSM = DSM − DTM. This feature describes the relative elevation of objects above ground such as buildings, trees, or bridges. The multi-scale features f_MS comprise the NDVI, f_DSM and sat features, calculated at two coarser scales as average values in squares of 21 × 21 and 49 × 49 pixels, respectively.
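The derivation of f_DSM can be sketched on a small height grid. This is a simplified illustration under stated assumptions: a square structural element is used, windows are truncated at the image border, and the subsequent median filter is omitted. All function names are hypothetical.

```python
def _window_filter(grid, half, reduce_fn):
    """Apply reduce_fn (min or max) over a (2*half+1)^2 window, truncated at borders."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out[r][c] = reduce_fn(window)
    return out


def opening(grid, half):
    """Morphological opening: erosion (min filter) followed by dilation (max filter)."""
    return _window_filter(_window_filter(grid, half, min), half, max)


def relative_height(dsm, half):
    """f_DSM = DSM - DTM, with the DTM approximated by an opening of the DSM."""
    dtm = opening(dsm, half)
    return [[z - g for z, g in zip(z_row, g_row)] for z_row, g_row in zip(dsm, dtm)]


# Hypothetical 6 x 6 DSM: flat terrain at 100 m with a 2 x 2 building at 110 m.
dsm = [[100.0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        dsm[r][c] = 110.0
f_dsm = relative_height(dsm, half=1)  # 3 x 3 element, larger than the building
```

Because the structural element is larger than the building, the opening removes it from the terrain estimate, and the difference recovers its height of 10 m while the terrain stays at zero.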

Training and Inference
Training of a CRF is computationally intractable if it is to be carried out in a probabilistic framework (Kumar and Hebert, 2006). Thus, approximate solutions have to be used for training. In our application, we determine the parameters of the association and interaction potentials separately based on fully labelled training images. The RF classifiers used in the association potentials are trained using the site-wise feature vectors of the training images. The interaction potentials are derived from scaled versions of the 2D histograms of the co-occurrence of class labels at neighbouring image sites in the way described in Sec. 3.1.2, taking into account all image sites in the training data. The parameters λ1 and λ2 in Eq. 3 are set manually to the values 2.0 and 0.01, respectively. Exact inference is also computationally intractable for CRFs. We use Loopy Belief Propagation (LBP), a standard technique for probability propagation in graphs with cycles that has been shown to give good results in the comparison reported in (Vishwanathan et al., 2006).
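The inference step can be illustrated with a minimal sum-product variant of loopy belief propagation on a pairwise graph. This is a sketch under assumptions, not the implementation used in the paper: the message schedule, the number of iterations, the use of a single symmetric interaction table `psi` for all edges, and the function name `loopy_bp` are all hypothetical choices, and strictly positive potentials are assumed.

```python
def loopy_bp(unary, edges, psi, n_iter=10):
    """Sum-product LBP.
    unary[i][a]: association potential of node i for label a.
    edges: list of undirected pairs (i, j).
    psi[a][b]: symmetric interaction potential shared by all edges.
    Returns normalised beliefs per node."""
    n_nodes, n_labels = len(unary), len(unary[0])
    nbrs = {i: [] for i in range(n_nodes)}
    msgs = {}  # msgs[(i, j)][b]: message from i to j about label b of j
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
        msgs[(i, j)] = [1.0] * n_labels
        msgs[(j, i)] = [1.0] * n_labels
    for _ in range(n_iter):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for b in range(n_labels):
                total = 0.0
                for a in range(n_labels):
                    prod = unary[i][a] * psi[a][b]
                    for k in nbrs[i]:
                        if k != j:  # exclude the message coming from the target
                            prod *= msgs[(k, i)][a]
                    total += prod
                out.append(total)
            s = sum(out)
            new_msgs[(i, j)] = [v / s for v in out]
        msgs = new_msgs
    beliefs = []
    for i in range(n_nodes):
        bel = list(unary[i])
        for k in nbrs[i]:
            bel = [bv * mv for bv, mv in zip(bel, msgs[(k, i)])]
        s = sum(bel)
        beliefs.append([v / s for v in bel])
    return beliefs


# Hypothetical two-node example: node 0 is confident, node 1 is undecided.
unary = [[0.9, 0.1], [0.5, 0.5]]
psi = [[2.0, 1.0], [1.0, 2.0]]  # favours identical labels
beliefs = loopy_bp(unary, [(0, 1)], psi)
labels = [b.index(max(b)) for b in beliefs]
```

On this cycle-free example the beliefs are exact marginals, and the undecided node is pulled towards the label of its confident neighbour; on image grids with cycles the result is only an approximation.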

Experimental Setup
To evaluate our model we used a part of the aerial images of the Vaihingen data set (Cramer, 2010). We selected 90 crossroads for our experiments. For each crossroads, a true orthophoto and a DSM were available, each covering an area of 80 × 80 m² with a GSD of 8 cm. The DSM and the orthophoto were generated from multiple aerial CIR images in the way described in (Kosov et al., 2012). They provide the original input to our CRF-based classifier. We defined each image site to correspond to one image pixel; thus, at the full resolution each graphical model consisted of 1000 × 1000 nodes. The neighbourhood N_i of an image site i in Eq. 1 is chosen to consist of the direct neighbours of i in the data grid.
We defined six classes that are characteristic for scenes containing crossroads, namely asphalt (asp.), building (bld.), tree, grass (gr.), agricultural (agr.) and car, so that C^b = {asp., bld., gr., agr.} and C^o = {tree, car, void}. The two-level reference was generated by manually labelling the orthophotos using these six classes; assumptions about the continuity of objects such as road edges in occluded areas were used to define the reference of the base level.
For the evaluation we used cross validation. In each test run, 45 images were used for training, and the remaining 45 for testing. This was carried out twice, with the roles of the two sets swapped, so that each image was used once for training and once for testing. The results were compared with the reference; we report the completeness and the correctness of the results per class as well as the overall accuracy (Rutzinger et al., 2009).
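The per-class measures can be sketched from a confusion matrix. The definitions below are the usual ones in terms of true positives (TP), false positives (FP) and false negatives (FN), and are assumed to match those of (Rutzinger et al., 2009); the quality measure TP / (TP + FP + FN) combines completeness and correctness. The function names and the example counts are hypothetical.

```python
def class_metrics(confusion, c):
    """confusion[r][p]: number of sites with reference class r and predicted class p.
    Returns (completeness, correctness, quality) for class c."""
    n = len(confusion)
    tp = confusion[c][c]
    fn = sum(confusion[c][p] for p in range(n)) - tp  # reference c, predicted other
    fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, reference other
    completeness = tp / (tp + fn) if tp + fn else 0.0
    correctness = tp / (tp + fp) if tp + fp else 0.0
    quality = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return completeness, correctness, quality


def overall_accuracy(confusion):
    """Share of sites whose predicted class equals the reference class."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


# Hypothetical two-class confusion matrix (rows: reference, columns: prediction).
confusion = [[80, 20],
             [10, 90]]
comp, corr, qual = class_metrics(confusion, 0)
oa = overall_accuracy(confusion)
```

For a small class such as car, the overall accuracy is dominated by the other classes, which is why the per-class values are the more informative numbers.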

Car Detection
The classification gives the probability for vehicles for each pixel. In the case of clearly separated cars, the approach delivers results as illustrated in Fig. 1. Moving vehicles should be eliminated during the generation of the orthophoto. Nevertheless, several "blurred" vehicles remain visible. These vehicles also produce a response in the classification, although their probabilities are smaller than 1 due to low contrast. An example is given in Fig. 2. Furthermore, objects of similar dimensions receive high probabilities, as can be seen in Fig. 3.
Fig. 4 shows the completeness versus the correctness for different thresholds on the estimated vehicle probabilities. For this evaluation, the centre point of each connected region of pixels having a larger value than the threshold is compared to the regions of the reference (e.g. first row of Fig. 5). Thus, connected regions which cover multiple vehicles (e.g. last row of Fig. 5) are only counted once and lead to a significant reduction of completeness. Therefore, the values given for completeness in Fig. 4 are quite pessimistic. Nevertheless, the overall correctness still needs further improvement, which could be achieved by additional features and an additional classification of the connected regions. This is planned for future work.

Results and Discussion
We carried out eight experiments. In the first four experiments (RF^5_car, RF^5, CRF^5_car, CRF^5) we used a version of the Vaihingen dataset with a coarser GSD of 40 cm (corresponding to 5 × 5 pixels of the original images), so that the CRF only consisted of 200 × 200 nodes. In the second set of experiments (RF^1_car, RF^1, CRF^1_car, CRF^1) we used the images at their full resolution of 8 cm. In the RF experiments, with and without the car confidence feature, we only used the Random Forest classifier for a local classification of each node, neglecting the interaction potentials. In the CRF experiments, the twin CRF model in Eq. 2 was used, including the interactions. The experiments RF^5_car, CRF^5_car, RF^1_car and CRF^1_car were performed using the car confidence feature, while for the experiments RF^5, CRF^5, RF^1 and CRF^1 the car confidence feature was not applied. The completeness and the correctness of the results achieved in these experiments are shown in Tab. 1 and 2. For the occlusion layer we also report the quality (Rutzinger et al., 2009), which is a measure for the trade-off between completeness and correctness.
In Tab. 1 the overall accuracy for the base layer does not differ much between the experiments. Considering the interactions increases the overall accuracy by slightly more than 1% in the full resolution and slightly less in the lower resolution experiments. Partly this may be explained by a good performance of the RF classifier and the inclusion of multi-scale features, but a stronger setting of the weights for the interaction potentials might have led to larger differences. Using the car feature leads to an even smaller increase in the overall accuracy in all experiments, which is, however, to be expected because only a very small area is covered by cars, and the car confidence is low in most of the areas where cars occur. Tab. 2 shows that the occlusion layer, containing the class car, shows a larger variation of the quality metrics between the different experiments. The most obvious improvement is achieved by considering local context: the overall accuracy achieved in the experiments based on CRF is 5%-6% better than the one achieved in the RF experiments. This is mainly due to an improvement of the completeness of the class void, an indicator that in the RF scenario there are more false positive car and, in the lower resolution, tree objects, which is confirmed by the correctness numbers of these objects in the RF setting. Whereas the overall accuracy is similar between the experiments at full resolution and those at a reduced resolution, it becomes evident that the oversmoothing in the latter leads to a particularly poor performance for the smallest objects in our classification scheme, i.e. cars. For these objects, a classification at full resolution seems to be required. Looking at the results achieved for the images at full resolution, in the CRF setting a better trade-off between completeness and correctness is achieved for the class car, indicated by the higher quality scores (Q. in Tab. 2) compared to the RF experiments. Tab. 2 also shows that the car feature indeed helps in the classification of cars. Experiment CRF^1_car achieves the highest quality score for car, though there is still considerable room for improvement. Fig. 5 illustrates two scenes with a high number of cars. Its third row presents the results of CRF^1, while the fourth row shows the results of the CRF^1_car experiment. In these scenes, using the car confidence feature improves the classification rate for cars considerably. In comparison to the reference (second row of Fig. 5), cars are oversmoothed and hardly recognizable in the results of CRF^1. CRF^1_car delivers results with the car regions in the correct positions and nearly without false positives.

CONCLUSION
In this paper, a method for the classification of crossroads using CRF was proposed. It considered occlusions explicitly by determining two class labels per pixel. A car confidence feature was used to avoid problems with occlusions of the road surface by cars. Distinguishing 7 classes relevant in the context of crossroads, an overall accuracy of 79-85% could be achieved. The car confidence feature, which is based on the output of our car detector, is shown to increase the accuracy of the classification, especially for the class car. In the future we want to improve our method by integrating more expressive features, e.g. features related to car trajectories. Furthermore, the interactions between the two levels need to be modelled in a way similar to (Kosov et al., 2013).
Two class labels x_i^b ∈ C^b and x_i^o ∈ C^o are determined for each image site i. They correspond to the base and occlusion levels, respectively; C^b and C^o are the corresponding sets of class labels with C^b ∩ C^o = ∅. In our application, C^b consists of classes such as road or building, whereas C^o includes classes such as car and tree. C^o includes a special class void ∈ C^o to model situations where the base level is not occluded. We model the posterior probabilities p(x^b | y) and p(x^o | y) directly, expanding the model in Eq. 1 (Eq. 2).
3.1.1 Association Potential:
Omitting the superscript indicating the level of the model, the association potentials ϕ_i(x_i, y) are related to the probability of a label x_i taking a value c given the data y by ϕ_i(x_i, y) = p(x_i = c | f_i(y)).