AN INCREMENTAL MAP BUILDING APPROACH VIA STATIC STIXEL INTEGRATION

This paper presents a stereo-vision based incremental mapping approach for urban regions. As input, we use the 3D representation called multi-layered Stixel World, which is computed from dense disparity images. Researchers working on driver assistance systems increasingly rely on efficient and compact 3D representations like the Stixel World. The developed mapping approach takes into account the motion state of obstacles as well as free space information obtained from the Stixel World. The presented work is based on the well-known occupancy grid mapping technique and is formulated with evidential theory. A detailed sensor model is described which is used to determine whether a grid cell is occupied, free or has an unknown state. The map update is solved in a time-recursive manner using Dempster's Rule of Combination. 3D results of complex inner city regions are shown and compared with Google Earth images.


INTRODUCTION
Precise ego vehicle localization in urban regions is one of the major challenges in autonomous driving applications (Kammel et al., 2008), (Thrun, 2010). For this task, it is not sufficient to use GPS information alone, because well-known limitations like multi-path effects or signal shadowing occur in urban regions. Thus, precise self-localization approaches are often based on prior map information. Next to the (re)detection of 3D landmarks (Lategahn and Stiller, 2012), another possibility is the use of geometrical information of the static environment, like building facades or traffic infrastructure (Larnaout et al., 2012). With the help of this information, the accuracy of self-localization increases. Furthermore, modern driver assistance systems like lane change systems or collision avoidance systems (Muffert et al., 2013) would also benefit from detailed prior map information to support trajectory planning for the ego vehicle.
In this paper, the focus is on the incremental map building process in urban regions given the absolute pose and orientation of the ego vehicle. The goal is to update the map sequentially to achieve real-time capability in the future. For the mapping approach, the results should contain static environment information (e.g. building facades), since only this information is useful for further tasks such as ego vehicle localization. In contrast to the use of ultrasonic sensors (Pagac et al., 1998) or laser scanners (Thrun, 2010), a stereo camera system is applied which went into series production in the Mercedes-Benz S- and E-Class in summer 2013. It is assumed that the approximate absolute pose of the experimental vehicle is known, e.g. by using an inertial measurement unit in combination with a GPS sensor.
For the first time, we use the 3D scene representation called Stixel World (Pfeiffer and Franke, 2011) as input data for the mapping approach. It is computed from dense disparity images (Fig. 2(a)) at each acquisition time step. A Stixel is defined by its 2D position and 2D velocity, its width, and its height, all referring to the coordinate system of the ego vehicle (Fig. 2(b)). In this way, the Stixel World efficiently describes dynamic and static objects as well as free space information. In contrast to the raw input data (up to 500,000 disparity values), the 3D scene is represented by only a few hundred Stixels. As a result, this step significantly reduces the computational burden for the following steps. Due to these properties, research groups working on driver assistance systems increasingly rely on 3D representations like the Stixel World. Subsequently, the multi-layered Stixel World is segmented into static and dynamic object classes using the approach by Erbs et al. (Erbs et al., 2012), as seen in Fig. 2(c). This makes it possible to exclude dynamic objects like driving vehicles or bicyclists, so that only static environment information is considered for the mapping approach. In Fig. 2(d), the final set of Stixels used for the mapping approach is shown.
Because the Stixels include height information, a height layer of the complete 2D occupancy grid map is also computed. The combination of the 2D map with the height layer makes it possible to estimate a 3D environment model, which is visualized using the OctoMap framework (Wurm et al., 2010). The remainder of this paper is organized as follows: Section 2 gives a brief overview of related work. Then, Section 3 describes the mapping process, which is based on the idea of Cartesian occupancy grid maps (Moravec and Elfes, 1985) in combination with evidential theory (Shafer, 1976). Finally, results are shown in Section 4 and Section 5 summarizes the paper.

RELATED WORK
Elfes and Moravec (Elfes, 1987), (Moravec and Elfes, 1985) first introduced occupancy grids using wide-angle sonar sensors. Thrun et al. (Thrun et al., 2005) give a detailed overview of probabilistic methods for 2D occupancy grid mapping given the robot's pose. First, Thrun et al. describe an incremental ad hoc approach which is based on a binary Bayes filter. A drawback is the assumption that all grid cells are independent and mostly initialized with a probability of 0.5. Furthermore, the described sensor model is considered unsuitable for stereo vision sensors in the authors' view: It returns fixed probability values for occupied (e.g. 0.8) or free (e.g. 0.2), which does not capture the measurement characteristics of a stereo camera system. In addition, Thrun et al. formulate a maximum a posteriori approach with a descriptive sensor model which considers the dependence of all grid cells. This is a gold standard model, but its disadvantage is the non-incremental map update, whereas an incremental update is a requirement for the developed mapping approach. A detailed discussion of both methods can also be found in (Merali and Barfoot, 2012).
To reconstruct the complete 3D environment of urban scenes, Gallup et al. (Gallup et al., 2010) and Zheng et al. (Zheng et al., 2012) present probabilistic methods using street-level videos or photo collections. The required depth maps are estimated using either structure from motion or dense stereo techniques.
Gallup et al. (Gallup et al., 2010) use an n-layer height map which allows the representation of overhanging structures like balconies or bridges. Zheng et al. (Zheng et al., 2012) take up this idea and present an efficient incremental depth map fusion framework using wavelet-based compression techniques.
The developed mapping process is formulated with evidential theory (Shafer, 1976). We take up the idea from (Moras et al., 2011) and (Yang and Aitken, 2006), which use ultrasonic and range sensors in contrast to a stereo camera system.
Yang et al. (Yang and Aitken, 2006) give a detailed overview of evidential map building techniques: Each grid cell state is described by its power set, which is defined by the subsets free, occupied and unknown. Furthermore, the conflict in a cell is described. For each subset a probability assignment function is defined which formulates a detailed characterization of the sensor model. Dempster's Rule of Combination (Shafer, 1976) makes an incremental cell update possible.

MAPPING PROCESS
Assuming a planar surface of the environment, a 2D Cartesian reference grid map M = {m_j}, j ∈ {1, ..., J} with the cells m_j is given. For each cell, the hypotheses free (F) and occupied (O) are made. As described in (Moras et al., 2011) and (Yang and Aitken, 2006), in evidential theory the power set is defined by

$$2^{\Omega} = \{\emptyset, \{O\}, \{F\}, \{O, F\}\}. \qquad (1)$$

The subset Ω = {O, F} describes the ignorance of a cell and is denoted as the unknown state U. The subset ∅ is the empty set. Following the definition of Eq. (1), for each element A of the power set a belief mass function m*_j(A) is specified with the property

$$m^{*}_{j}: 2^{\Omega} \rightarrow [0, 1], \qquad \sum_{A \in 2^{\Omega}} m^{*}_{j}(A) = 1. \qquad (2)$$

Input data
With the help of the rectified image sequences of the stereo camera, dense disparity images are estimated using semi-global matching (SGM) (Gehrig et al., 2009), as seen in Fig. 2(a). Then, the multi-layered Stixel World (Pfeiffer and Franke, 2011) is computed. The Stixels represent the relevant information of the current 3D scene, such as free space and dynamic and static obstacles (see Fig. 2(b)). Referred to the 3D Cartesian space, each Stixel S_i = [s, ṡ, H, W]^T_i with i ∈ {1, ..., I} is parametrized by its position vector s = [X, Z]^T with the lateral and longitudinal components X and Z, its height H and its width W. Consequently, obstacles are described by planar, vertically oriented surfaces, which is a common assumption in urban regions. As a result, facades behind parked vehicles can be mapped, as shown in Fig. 5. A well-known challenge in mapping approaches is that dynamic objects have to be detected so that they can be excluded from the fusion step. To overcome this problem, the Stixels are tracked over time, which is achieved by the 6D-Vision principle (Franke et al., 2005). This scheme uses a Kalman filter to estimate both the position and velocity of each Stixel. Performing this step makes it possible to obtain a motion state ṡ = [Ẋ, Ż]^T.
This contribution has been peer-reviewed. The double-blind peer-review was conducted on the basis of the full paper.
The separation of the scene into moving and stationary obstacles is achieved by a multi-class traffic scene segmentation (Erbs et al., 2012) which is based on a conditional random field framework. With the help of this segmentation, from now on S_i describes only the static Stixels which are used as input data. An example is shown in Fig. 2(d).
With the help of the pinhole camera projection equation (projection from the disparity (d) and column (u) space to the 3D space with given variances σ²_d and σ²_u) and using Gaussian error propagation, it can be shown that the covariance matrix Σ^(s)_i of the position s_i of a Stixel can be estimated as a function of the baseline b and the focal length f of the stereo rig.
Because the ego vehicle's position and orientation are assumed to be known, the Stixels are transformed into the reference coordinate system of the map M at each acquisition time step.
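The error propagation step can be illustrated with a minimal sketch. It assumes the common rectified stereo model X = (u − u0)·b/d and Z = f·b/d with uncorrelated noise in u and d; the principal point column `u0`, the function name and the numeric values in the usage note are illustrative assumptions, not taken from the paper.

```python
def stixel_position_covariance(u, d, u0, f, b, sigma_u, sigma_d):
    """Propagate image-space noise in (u, d) to the Stixel position (X, Z).

    Assumed model (not spelled out in the text):
        X = (u - u0) * b / d,   Z = f * b / d
    Returns the position (X, Z) and the 2x2 covariance as nested lists.
    """
    X = (u - u0) * b / d
    Z = f * b / d

    # Jacobian of (X, Z) with respect to (u, d)
    dX_du = b / d
    dX_dd = -(u - u0) * b / d**2
    dZ_du = 0.0
    dZ_dd = -f * b / d**2

    var_u, var_d = sigma_u**2, sigma_d**2

    # Sigma = J * diag(var_u, var_d) * J^T, written out element by element
    s_xx = dX_du**2 * var_u + dX_dd**2 * var_d
    s_xz = dX_du * dZ_du * var_u + dX_dd * dZ_dd * var_d
    s_zz = dZ_du**2 * var_u + dZ_dd**2 * var_d
    return (X, Z), [[s_xx, s_xz], [s_xz, s_zz]]
```

For example, with hypothetical values f = 1000 px and b = 0.2 m, a Stixel in the principal column with d = 10 px lies at Z = 20 m, and σ_d = 0.5 px yields a longitudinal standard deviation of 1 m. The quadratic growth of the longitudinal variance with distance is the reason the occupied region around a Stixel widens for far measurements.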
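The transformation into the map frame can be sketched as a planar rigid-body transform. The axis convention (yaw as a counter-clockwise rotation in the X-Z plane) and the function name are assumptions made for illustration; the paper does not state its convention.

```python
import math

def stixel_to_map(x_ego, z_ego, pose_x, pose_z, yaw):
    """Transform a Stixel position from the ego frame into the map frame.

    Assumed convention: a standard 2D rotation by `yaw` (radians) in the
    X-Z plane followed by a translation by the ego position (pose_x, pose_z).
    """
    c, s = math.cos(yaw), math.sin(yaw)
    x_map = pose_x + c * x_ego - s * z_ego
    z_map = pose_z + s * x_ego + c * z_ego
    return x_map, z_map
```

If the position covariance is needed in the map frame as well, it would be rotated with the same rotation matrix R as R Σ Rᵀ.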

Sensor Model
The sensor model describes how the Stixel position s_i influences a set of cells M_{s_i} ⊆ M. Because each Stixel carries information about free space and obstacles (see Section 3.1), the cells M_{s_i} are partitioned into a set of occupied cells M^(O)_{s_i} and a set of free cells M^(F)_{s_i}.
As shown in Fig. 3, the space between the ego vehicle and the Stixels S_i which represent the first obstacles in the image rows is free space. A set of K rays R_{s_i} = {r_k}_{s_i} with k ∈ {1, ..., K} from the ego vehicle to the Stixel is defined, where K depends on the width W_i of the Stixel. All cells intersecting with those rays are defined as M^(F)_{s_i}. Subsequently, the rays are partitioned into ray segments as illustrated in Fig. 3.
More formally, the intersection m_j ∩ r_k of an arbitrary cell m_j with a ray r_k yields the corresponding ray segment with length |m_j ∩ r_k|. The assumption is made that the more ray segments a cell contains, the greater the evidence of free.
With the summed length $L = \sum_{k=1}^{K} |r_k|_{s_i}$ of all rays, the belief mass function m*_j(F) for the subset F is defined in terms of the ray segment lengths |m_j ∩ r_k| and L (Eq. 4). Due to the properties of the belief mass function, m*_j(U) takes the remaining mass, m*_j(U) = 1 − m*_j(O) − m*_j(F) (Eq. 5), and m*_j(∅) = 0.
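As an illustration of the free-space part of the sensor model, the following sketch accumulates the ray length inside each cell and divides it by the total ray length L. This normalization is an assumption consistent with the description above; the exact form of Eq. (4) may differ. Ray-cell intersections are approximated by dense sampling along each ray, and all names and default values are hypothetical.

```python
import math
from collections import defaultdict

def free_space_evidence(ego, ray_ends, cell_size=0.2, step=0.01):
    """Approximate the evidence of free per cell for one Stixel.

    ego       -- (x, z) origin of all rays
    ray_ends  -- list of (x, z) ray end points on the Stixel
    Returns a dict {(ix, iz): evidence}, where the evidence is the summed
    ray length inside the cell divided by the total ray length L
    (assumed normalization).
    """
    seg_len = defaultdict(float)
    total = 0.0
    for ex, ez in ray_ends:
        dx, dz = ex - ego[0], ez - ego[1]
        length = math.hypot(dx, dz)
        if length == 0.0:
            continue  # degenerate ray, contributes nothing
        total += length
        n = max(1, int(round(length / step)))
        ds = length / n
        for i in range(n):
            t = (i + 0.5) / n  # midpoint of the i-th sub-segment
            ix = math.floor((ego[0] + t * dx) / cell_size)
            iz = math.floor((ego[1] + t * dz) / cell_size)
            seg_len[(ix, iz)] += ds
    if total == 0.0:
        return {}
    return {cell: l / total for cell, l in seg_len.items()}
```

An exact alternative would be a grid traversal that computes the true entry and exit points of each ray per cell; sampling is used here only to keep the sketch short. Note that the cell containing the ray end point also receives free evidence in this sketch, whereas a full implementation would treat it as part of the occupied set.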

Incremental Map-Updating via Dempster's Rule of Combination
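Up to this point, the sensor model and the definition of the belief mass function have been described. Now, an incremental map update between consecutive time steps t − 1 and t is realized to recursively merge new Stixel measurements into the existing map. Dempster's Rule of Combination is used, which combines two independent states m*_{j,t}(A) and m*_{j,t−1}(A) at consecutive time steps. The general formulation is defined by

$$(m^{*}_{j,t-1} \oplus m^{*}_{j,t})(A) = \frac{1}{1-K} \sum_{B \cap C = A} m^{*}_{j,t-1}(B)\, m^{*}_{j,t}(C), \quad A \neq \emptyset, \qquad (6)$$

with

$$K = \sum_{B \cap C = \emptyset} m^{*}_{j,t-1}(B)\, m^{*}_{j,t}(C).$$

In this contribution, the fusion step is formulated as

$$\begin{aligned}
(m^{*}_{j,t-1} \oplus m^{*}_{j,t})(O) &= \frac{m^{*}_{j,t-1}(O)\, m^{*}_{j,t}(O) + m^{*}_{j,t-1}(O)\, m^{*}_{j,t}(U) + m^{*}_{j,t-1}(U)\, m^{*}_{j,t}(O)}{1-K}, \\
(m^{*}_{j,t-1} \oplus m^{*}_{j,t})(F) &= \frac{m^{*}_{j,t-1}(F)\, m^{*}_{j,t}(F) + m^{*}_{j,t-1}(F)\, m^{*}_{j,t}(U) + m^{*}_{j,t-1}(U)\, m^{*}_{j,t}(F)}{1-K}, \\
(m^{*}_{j,t-1} \oplus m^{*}_{j,t})(U) &= \frac{m^{*}_{j,t-1}(U)\, m^{*}_{j,t}(U)}{1-K},
\end{aligned} \qquad (7)$$

where the conflict reduces to $K = m^{*}_{j,t-1}(O)\, m^{*}_{j,t}(F) + m^{*}_{j,t-1}(F)\, m^{*}_{j,t}(O)$.

For a better understanding of the update step from Equation 7, results of a simulation are shown in Fig. 4: At the beginning (t = 0), all cells are initialized with an unknown state, which is visualized by a dark blue map. Accordingly, the belief masses of a cell are initialized as m*_{j,0}(O) = m*_{j,0}(F) = m*_{j,0}(∅) = 0 and m*_{j,0}(U) = 1. At the next time step (t = 1), the figure shows the integration of one Stixel measurement into the map. The evidence for occupied grows in those cells which are directly hit by the Stixel. In the cells between the ego position (black cross) and the Stixel, the evidence of free grows slowly, visualized by the color green.
The influence of an additional measurement and the recursive filter characteristic can be seen at the following time step t = 2: In overlapping areas the evidence of free and occupied grows, which is particularly visible in the free space area.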
During the update step, K is nonzero whenever a conflict occurs in a cell. This happens, for example, if a current Stixel measurement overlaps a free space area from the last time step. In particular, the conflict is significant if Stixel outliers occur. This contributes to the robustness of our algorithm.
As seen in Eq. 7, the masses of the subsets O, F and U are normalized by the term (1 − K) at each time step to satisfy the properties of the belief mass function.
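The per-cell update can be sketched by evaluating Dempster's rule on the frame {O, F} with U = {O, F}. The dictionary layout and the function name are illustrative assumptions; the combination itself follows the standard rule with the conflict K and the normalization by (1 − K).

```python
def dempster_update(prev, meas):
    """Combine two belief mass assignments of one grid cell.

    prev, meas -- dicts with keys 'O', 'F', 'U' whose values sum to 1
                  (the mass of the empty set is assumed to be zero).
    Returns the fused masses and the conflict K.
    """
    # Conflict: mass assigned to contradicting hypotheses (O vs. F)
    K = prev['O'] * meas['F'] + prev['F'] * meas['O']
    if K >= 1.0:
        raise ValueError("total conflict, combination is undefined")
    norm = 1.0 - K
    fused = {
        'O': (prev['O'] * meas['O'] + prev['O'] * meas['U']
              + prev['U'] * meas['O']) / norm,
        'F': (prev['F'] * meas['F'] + prev['F'] * meas['U']
              + prev['U'] * meas['F']) / norm,
        'U': (prev['U'] * meas['U']) / norm,
    }
    return fused, K
```

Starting from an unknown cell (U = 1), a first measurement with 0.6 mass on occupied is adopted unchanged, and a second identical measurement raises the occupied mass to 0.84, which mirrors the growth of evidence in overlapping areas described above.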
To obtain the mentioned height layer, the Stixel height H_{i,t} is taken into account. For each cell, a height H_{j,t} is estimated recursively over time, with the height H_{j,0} initialized to zero. For the visualization of the 3D environment model, the open source framework OctoMap (Wurm et al., 2010) is used, which is based on a 3D octree representation (see Fig. 1).
Figure 4: Example of the time-recursive update step: At t = 0, the map is initialized as unknown (dark blue). At time step t = 1, the integration of one Stixel into the map is shown. The cell which is directly hit is red, the immediate cell neighbors are purple. The field of view between the ego position (black cross) and the Stixel position is free space, which is encoded with the color green. With an additional Stixel measurement at time step t = 2, history and current measurements are merged. The stronger the color saturation, the stronger the evidence of free, occupied or unknown.

RESULTS
In this work, mapping results of a 3,000-image sequence of an urban drive (see Fig. 1) are presented. The stereo camera system is mounted behind the windshield of the experimental vehicle. It has 1024 × 440 px image sensors and records at 25 Hz. With the help of an inertial measurement unit from iMar in combination with a GPS sensor, the global ego motion of the vehicle was determined. An image Stixel width of 5 px and a grid cell size of 0.04 m² are chosen. We assume a disparity uncertainty of σ_d = 0.5 px and a column uncertainty of σ_u = 0.25 px. To reduce the number of outliers, only Stixels with a distance of less than 25 m to the ego vehicle's position are used.
Fig. 5 shows the occupancy grid results of a 20 m × 30 m example map which reveals occupied (red), free (green) and unknown (blue) grid cells. The stronger the color saturation, the stronger the evidence of free, occupied or unknown. On the right side of the map an entrance gate, two parked vehicles and the building facades are visualized. The area between the vehicles up to the building facade is unknown because this area is not visible. On the left side the poles of the parking area are mapped correctly, which underlines the high level of detail of the Stixel World. In addition, Fig. 5 shows the height layer for the map. The building facades are about 4 m high; vehicles, the entrance gate and the poles have heights between 0.5 m and 1.5 m, which are plausible results.
In Fig. 6 and Fig. 7, detailed results from the complete image sequence are shown. For a consistency check of the mapping pipeline, the results are overlaid on Google Earth images. Both figures show that mapped corners of buildings, facades and parking areas are consistent with the Google Earth images. Note that we have not taken into account uncertainties in the ego vehicle's position.
As shown in Fig. 6, the oncoming vehicle is ignored by the mapping approach because the Stixel segmentation step (Erbs et al., 2012) makes it possible to exclude moving objects.
Fig. 7 shows that the poles next to the right driving corridor are mapped correctly. Because the velocity of the waiting vehicle at the intersection is zero, the vehicle is classified as a static object.
In addition, the comparison with the Google Earth images reveals a drawback of using a stereo camera system: Due to the limited field of view, the complete right building in the scene of Fig. 6 was not recorded and, as a consequence, cannot be mapped. To overcome this drawback, wide-angle lenses should be used in future work.
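The grid parameters above can be made concrete with a short sketch. It assumes square cells, so that a cell area of 0.04 m² corresponds to a side length of 0.2 m, and an arbitrary grid origin at (0, 0); the helper names are hypothetical.

```python
import math

CELL_AREA_M2 = 0.04                     # cell size stated in the text
CELL_SIDE_M = math.sqrt(CELL_AREA_M2)   # 0.2 m, assuming square cells
MAX_RANGE_M = 25.0                      # range limit stated in the text

def within_range(stixel_xz, ego_xz, max_range=MAX_RANGE_M):
    """Keep only Stixels closer than max_range to the ego position."""
    dist = math.hypot(stixel_xz[0] - ego_xz[0], stixel_xz[1] - ego_xz[1])
    return dist < max_range

def cell_index(x, z, cell_side=CELL_SIDE_M):
    """Map-frame coordinates to integer grid indices (origin at (0, 0))."""
    return math.floor(x / cell_side), math.floor(z / cell_side)
```

Under these assumptions, the 20 m × 30 m example map of Fig. 5 would span roughly 100 × 150 cells.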

SUMMARY
In this work, a stereo vision based incremental mapping approach for urban regions was presented. For the first time, the 3D environment representation multi-layered Stixel World is used as input data. For reliable mapping results, the Stixel World was segmented into static and dynamic objects. Thus, only static environment information is used for our technique.
The mapping approach was formulated with evidential theory, which allows the explicit representation of free, occupied and unknown regions. A detailed sensor model describes how Stixel measurements influence the state of the cells. To fuse new Stixels into the map incrementally, Dempster's Rule of Combination was used. In a further step, a recursive height layer for the map was estimated. Detailed mapping results of a 3,000-image sequence were shown. Furthermore, the mapping results were compared with Google Earth images for a consistency check of the mapping pipeline.

Figure 1: The result of the mapping approach for a 3,000-image sequence. For the visualization and the data structure the OctoMap representation (Wurm et al., 2010) is used. The purple line represents the driven path of the ego vehicle. The height of the static environment is color encoded.

Figure 2: Pipeline of the generation of the input data.

The state of a cell is given by the Cartesian coordinates of its center point (X_j and Z_j) and the belief mass function: m_j = [X_j, Z_j, m*_j(A)]^T.

A single Stixel is a vertically oriented rectangle with a fixed width in the image (e.g. 5 px) and a variable height.

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume II-3/W3, 2013. CMRT13 - City Models, Roads and Traffic 2013, 12-13 November 2013, Antalya, Turkey.

The sets of occupied cells M^(O)_{s_i} and free cells M^(F)_{s_i} are illustrated in Fig. 3. Additionally, cells which are not influenced by a Stixel (i.e. M \ M_{s_i}) are assigned the unknown state U. To formulate the mapping process with evidential theory, the belief mass function m*_j(O) of the subset O has to be defined first. Next to the cell which is directly hit by a Stixel, a region of neighboring cells is considered in M^(O)_{s_i}. The size of the region depends on the estimated uncertainties of s_i. With the help of Σ^(s)_i, the belief mass function for the subset O is defined by

$$m^{*}_{j}(O) = f_O\big(m_j \mid \mathcal{N}_2(s_i, \Sigma^{(s)}_i)\big), \quad m_j \in M^{(O)}_{s_i}. \qquad (3)$$

The function f_O(m_j | N_2(s_i, Σ^(s)_i)) returns an occupancy value of evidence given the 2D normal distribution N_2(s_i, Σ^(s)_i) as long as the current cell m_j is an element of M^(O)_{s_i}.
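The occupied part of the sensor model can be illustrated with a minimal sketch. The exact form of f_O is not given in the text; here it is assumed to be an unnormalized 2D Gaussian evaluated at the cell center and scaled by a hypothetical peak value `peak`, so that the directly hit cell receives the strongest evidence and neighboring cells fall off with their Mahalanobis distance.

```python
import math

def occupied_evidence(cell_center, stixel_pos, cov, peak=0.9):
    """Evidence of occupied for one cell near a Stixel (assumed form of f_O).

    cell_center, stixel_pos -- (x, z) tuples in the map frame
    cov                     -- 2x2 position covariance as nested lists
    peak                    -- hypothetical evidence at the Stixel position
    """
    dx = cell_center[0] - stixel_pos[0]
    dz = cell_center[1] - stixel_pos[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    # Squared Mahalanobis distance v^T Sigma^{-1} v for a 2x2 matrix
    maha2 = (d * dx * dx - (b + c) * dx * dz + a * dz * dz) / det
    return peak * math.exp(-0.5 * maha2)
```

Keeping the peak below 1 leaves some mass for the unknown state even in the directly hit cell, which is one way to avoid total conflict in later updates; whether the paper does this is not stated.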

Figure 3: The sensor model of the mapping approach. The cell which is directly hit by a Stixel is dark red, the immediate cell neighbors are bright red. For these cells an evidential value of occupied is estimated which is based on a two-dimensional Gaussian distribution. The distribution is defined by the position of a Stixel s_i = [X, Z]_i and the covariance matrix Σ^(s)_i, which includes the variances σ²^(X)_i and σ²^(Z)_i. The cells which support free regions are green. The more rays pass through a cell, the greater the evidence of free. Cells which are not influenced have an unknown state (blue).

Figure 5: An exemplary mapping result is shown. On the left side the original images with the input data of three time steps are presented. The map in the center shows the results using the described approach. The color reveals occupied (red), free (green) and unknown (blue) areas of the environment. The stronger the color saturation, the stronger the evidence of free, occupied or unknown. On the right side of the map an entrance gate, two parked vehicles and the building facades are mapped correctly. Furthermore, poles on the left side are represented, too. The cells "behind" the vehicle have an unknown state. On the right side of the figure the height layer is shown. There, the color encodes the height from zero (dark blue) to about 4 m (dark red). The facades are about 4 m high; vehicles, the entrance gate and the poles have heights between 0.5 m and 1.5 m.

Figure 6: The first detailed example of the 3,000-image sequence is shown. The original images with the input Stixels (left), the 2D mapping results overlaid on Google Earth images (center) and the 3D representation (right) using the OctoMap representation are shown. Due to the segmentation of the multi-layered Stixel World into stationary and moving objects, the oncoming vehicle is not taken into account for the mapping approach. Furthermore, walls behind walls are represented. Because of the limited field of view, the right building in this scene could not be mapped.

Figure 7: The second detailed example of the 3,000-image sequence: Traffic infrastructure like poles is mapped precisely, as shown on the right side of the driving corridor. Because the vehicle at the intersection is still waiting, it is classified as a stationary object. As a result, the vehicle is represented in the map. The mapped corners of buildings, the facades and the parking areas are consistent with the Google Earth images. Note that uncertainties in the ego vehicle's position have not been taken into account.