PROBABILISTIC MULTI-PERSON TRACKING USING DYNAMIC BAYES NETWORKS

Tracking-by-detection is a widely used practice in recent tracking systems. These usually rely on independent single-frame detections that are handled as observations in a recursive estimation framework. If these observations are imprecise, the generated trajectory is prone to be updated towards a wrong position. In contrast to existing methods, our novel approach uses a Dynamic Bayes Network in which the state vector of a recursive Bayes filter as well as the location of the tracked object in the image are modelled as unknowns. These unknowns are estimated in a probabilistic framework taking into account a dynamic model and a state-of-the-art pedestrian detector and classifier. The classifier is based on the Random Forest algorithm and is capable of being trained incrementally, so that new training samples can be incorporated at runtime. This allows the classifier to adapt to the changing appearance of a target and to unlearn outdated features. The approach is evaluated on a publicly available benchmark. The results confirm that our approach is well suited for tracking pedestrians over long distances while at the same time achieving comparatively good geometric accuracy.


INTRODUCTION
Pedestrian detection and tracking is one of the most active research topics in the fields of image sequence analysis and computer vision. The aim of tracking is to establish correspondences between target locations over time, and hence it is widely used for the semantic interpretation of an image sequence. Many available systems apply object detection in single frames, an association step (linking detections to trajectories) and recursive filtering to find a compromise between image-based measurements (i.e., automatic pedestrian detections) and a motion model. If the association step is solved, the position of an object detected in the image is integrated into the recursive filter as a measurement. If a measurement is imprecise, the generated trajectory is prone to be updated towards a wrong position. While most methods for tracking are concerned with a correct assignment of objects, where an assignment counts as correct if an intersection-over-union score (Everingham et al., 2010) threshold of 50% is exceeded, only a few papers address the geometric accuracy of a detection. However, geometric accuracy is essential for many realistic applications such as motion analysis in sports sciences, the analysis of interactions of humans in video surveillance, and driver assistance systems, where one has to decide whether a pedestrian actually enters a vehicle path or not.
Detection-based approaches to tracking typically use classifiers to discriminate the considered object classes. Existing approaches differ by the number of classes (binary versus multi-class) and in the way the training is carried out (online versus offline). Binary classifiers trained offline typically deliver positive detections represented by surrounding rectangles in several nearby positions and scales in the vicinity of the true position of an object. Usually, adjacent rectangles are grouped and non-maximum suppression is applied after the classification step (Dalal and Triggs, 2005), (Felzenszwalb et al., 2010), (Dollár et al., 2010). The actual task of tracking is then to associate the single-frame detections between consecutive time steps, for which a data association problem must be solved. In contrast to these methods, classifiers that are trained online specialise in the appearance of individual targets at runtime. For this purpose, classifiers based on variants of Random Forests (Breiman, 2001), (Saffari et al., 2009), (Kalal et al., 2010), Hough Forests (Gall and Lempitsky, 2013) and boosting (Breitenstein et al., 2011), (Godec et al., 2011) are used. These approaches adapt well to gradual changes in a target's appearance, but depend on additional information about novel pedestrians entering a scene, and they are quickly distracted from the actual target if the training data was derived from misaligned samples. Also, the bounding rectangles used in detection-based approaches may easily be misaligned due to partial occlusions, non-rigid body motion, illumination effects and other disturbing effects. In a comprehensive study, Dollár et al. (2011) show that the recall rates of 16 different pedestrian detectors decrease rapidly if the intersection-over-union score threshold is increased. A better alignment of the detection result to the real object boundaries is for instance achieved by finer segmentation, based on pixels (Dai and Hoiem, 2012), superpixels (Shu et al., 2013), interest points (Ommer et al., 2009), (Gall and Lempitsky, 2013), object parts (Felzenszwalb et al., 2010), (Benfold and Reid, 2011) or contour models (Leibe et al., 2005), (Gavrila and Munder, 2007). Such models have the advantage of being more robust against partial occlusions compared to a holistic model. If the relative position of an object part from the reference point of the object is known, a correct localisation of the object is possible, even if only a subset of the parts is visible.
Although the geometric accuracy of single-frame detections is rather low, these methods enable high recall rates, at least if some false positive detections are also taken into account. In this way, object detection is widely used in state-of-the-art papers using the results as evidence for the presence of pedestrians (Schindler et al., 2010), (Milan et al., 2014). The integration of several different observations including single-frame detections is used in Andriluka et al. (2008), Hoiem et al. (2008), Schindler et al. (2010) and Ess et al. (2010). In these papers, the integration of the different observations is carried out using the framework of probabilistic graphical models (Bishop, 2006), (Förstner, 2013). More specifically, the papers mentioned above make use of directed graphical models, i.e. Bayes networks, for the joint inference of unknown parameters that are related, e.g., to the object identity and pose, to the parameters of the camera orientation and the scene. The benefit of using these methods is that different sources of input jointly contribute to the determination of the unknown parameters while taking uncertainties into account.
Most trackers use variants of the recursive Bayes Filter such as the Kalman Filter or the Particle Filter to find a compromise between image-based measurements and a motion model. Generally, the motion model is a realisation of a first-order Markov chain which considers the expected dynamical behaviour (e.g. constant velocity and smooth motion). In case of an occlusion, i.e. if no measurement can be obtained, the trajectory is continued only by the motion model, and spatio-temporal consistency of the generated trajectory can be preserved. For longer intervals of occlusion, however, a first-order Markov chain is prone to drift away from the actual target position. In this context, Pellegrini et al. (2009) involve higher-order motion models for each object to keep track of the intended destination of the target. To account for physical exclusions of the 3D position of two or more objects, the prediction is based on the current position and velocity estimates of all targets. Leal-Taixé et al. (2011) also consider groups of people walking together and try to model the social avoidance and attraction forces between the involved objects. Their approach applies global optimisation of the trajectories, which makes it unsuitable for real-time applications.
Our main insight is that state-of-the-art results can be obtained by methods that use variants of Bayes networks in one of two possible approaches: either they apply single-frame inference of several variables, or they use multi-temporal models, i.e. a recursive Bayes Filter, over different time steps with single state variables. Our contribution is the proposal and investigation of an approach based on Dynamic Bayes Networks (DBN), see Russell et al. (1995), which unifies the abilities of modelling sequences of variables and state variables in a factorised form. Our method is dedicated to online multi-person tracking in monoscopic image sequences. The DBN combines results of classifiers trained online, category-specific object detectors and recursive filtering. We show that the geometric accuracy can be improved by treating both the state variables in object space and the position of pedestrians in the images as unknown variables. The method is evaluated on a Multiple Object Tracking benchmark dataset, which allows us to compare it to other state-of-the-art methods.

METHOD
The proposed method consists of a Dynamic Bayes Network which combines the results of a pedestrian detector, recursive filtering and an instance-specific classifier with online training capability in a single probabilistic tracking-by-detection framework. The hidden variables of the system are the state parameters related to the position and velocity of each pedestrian in world coordinates as well as the pedestrian's position in the image. By modelling the parameters related to the pedestrian's position in the image as hidden variables, our method allows the detection to be corrected before it is incorporated into the recursive filter. In this way, the proposed method carries out the update step of the recursive filter with an improved detection result, leading to a more precise posterior position, which in turn leads to a more precise prediction in the next iteration and decreases the search radius for new trajectory associations and training samples for the online classifier. One such graphical model is constructed for each pedestrian independently of other pedestrians. The approach is made applicable to multi-object tracking by solving an association problem prior to the actual trajectory continuation. To account for static scene elements and to achieve viewpoint-independent results, the image-based observations are transferred to a common 3D coordinate system, where the actual filter is applied. The coordinate system is centred at the projection centre of the camera (at time $k_0$ in the case of a moving platform) with the X and Z axes pointing in horizontal directions and Y in the vertical direction (right-handed system). To enable monocular tracking in 3D, we presume a ground plane at a constant height below the camera and expect that pedestrians only move along that plane.

Dynamic Bayes Network
Following the standard notation for graphical models (Bishop, 2006), the network structure of the proposed DBN is depicted in Figure 1. The DBN represents a first-order Markov process, so that each variable has parents only in the same or in the preceding time step. The small solid circles represent deterministic parameters and the larger circles random variables, where the grey nodes correspond to observed and the white ones to unknown parameters. One such graphical model is constructed for each tracked pedestrian. The system state $w_{k,i}$, the unknown image position $z_{k,i,F}$ of the feet, the image position of the feet $c^{det}_{k,i,F}$ (observed by the person detector) and $c^{RF}_{k,i}$ (observed by the classifier) and the image position of the head $c^{det}_{k,i,H}$ are modelled individually for each person $i$. All other variables are either defined for an entire image frame (if denoted by a subscript $k$ indicating the time step), or for the entire sequence. The joint probability density function (pdf) of the variables involved can be factorised in accordance with the network structure (Eq. 1). In the following, the variables considered in our approach are explained in detail. The subscript $k$ is omitted in the remainder of the paper where it is obvious.
Fixed variables. For tracking in 3D world coordinates, a ground plane $\pi$ is defined as the (X, Z) plane at a known distance $Y_\pi$ below the camera. The pedestrian positions are restricted to the ground plane, which enables monocular tracking in 3D (i.e., the unique conversion from 2D image coordinates to 3D world coordinates using the inverse collinearity equations with constant Y). Moreover, the parameters $C_k$ of the interior and exterior camera orientation are considered to be given for every time step $k$.
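The conversion between image and ground-plane coordinates described above can be illustrated with a minimal sketch. It assumes the common photogrammetric form of the collinearity equations (image plane at $z = -c$ in the camera frame, rotation matrix `R` mapping camera to world axes); the sign and index conventions of the paper's own Eqs. 2-5 may differ, and all function names and numbers are hypothetical.

```python
import numpy as np

def project(P, R, P0, c, x0=0.0, y0=0.0):
    """Collinearity equations (assumed convention): world point P -> image (x, y)."""
    p = R.T @ (np.asarray(P, dtype=float) - P0)  # point expressed in the camera frame
    return np.array([x0 - c * p[0] / p[2], y0 - c * p[1] / p[2]])

def backproject_to_ground(xy, R, P0, c, Y_plane, x0=0.0, y0=0.0):
    """Inverse collinearity with constant Y: intersect the image ray with Y = Y_plane."""
    d = R @ np.array([xy[0] - x0, xy[1] - y0, -c])  # ray direction in the world frame
    if abs(d[1]) < 1e-12:
        raise ValueError("image ray is parallel to the ground plane")
    s = (Y_plane - P0[1]) / d[1]  # scale along the ray that reaches the plane
    return P0 + s * d

# Illustrative values only: camera at the origin, axes aligned with the world frame,
# ground plane 1.6 m below the camera, focal length given in pixels.
R = np.eye(3)
P0 = np.zeros(3)
c = 1000.0
Y_plane = -1.6
P = np.array([0.5, Y_plane, -8.0])      # feet of a person on the ground plane
xy = project(P, R, P0, c)               # image position of the feet
P_back = backproject_to_ground(xy, R, P0, c, Y_plane)
```

The round trip recovers the original ground point, which is the property that makes monocular tracking in 3D possible once the plane height is fixed.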
Unknown variables. The state vector $w_i = [X, Y, Z, H, \dot{X}, \dot{Z}]^T$ consists of the three-dimensional coordinates X, Y and Z, the height H of the pedestrian and the velocities $\dot{X}$ and $\dot{Z}$ of the position coordinates X and Z on the ground plane. As the position of a pedestrian in world coordinates cannot be observed directly, the state vector is linked by conditional probability density functions to the position of the feet $z_{i,F} = [x_F, y_F]$ in the image, which is also modelled as a hidden variable, and to the position of the head $c^{det}_{i,H} = [x_H, y_H]$, which is observed. For the state vector and for the position of the feet we assume multi-variate normal distributions for the initial step in time, where $\mu_{z,F}$ is a mean vector, $\Sigma_{zz,F}$ is a covariance matrix, and $C$ are the parameters of the interior and exterior orientation of the camera. The functional relationship between the image and world coordinates is described by the collinearity equations (Eqs. 2-5), and an additional fictitious observation $m^F_\pi$ (Eq. 6) is introduced to model the assumption that pedestrians stand on the ground plane.
In Eqs. 2-5, $x_0$ and $y_0$ are the coordinates of the principal point and $c$ is the focal length of the camera. $r_{ij}$ are the elements of the rotation matrix between image and reference frame, and $X_0$, $Y_0$, $Z_0$ denote the perspective centre of the camera. $m^F_x$ and $m^F_y$ denote the measurement functions for the image coordinates of the feet and $m^H_x$ and $m^H_y$ those for the image coordinates of the head. The position of the feet is further related to observed variables in the image, see below. $c^{det}_{i,H}$ and $z_{i,F}$ are the top centre and bottom centre position of the rectangle surrounding a person, respectively. The width-to-height ratio of this rectangle is the ratio of the initial detection. We refer to $z_{i,F}$ as the reference point of a person $i$ in the image in the remainder of the paper. Furthermore, the state vector is related to the posterior state vector $w_{k-1,i}$ of the previous time step. For each object a velocity is estimated using the temporal model of a recursive filter that enables a prediction of the future state to narrow down the search space for new detections and to keep the state vector consistent over time. $P(w_{k,i} | w_{k-1,i}, \pi)$ is given by the temporal model based on a first-order Markov chain (Eq. 7). Since the state vector is modelled to follow a multi-variate normal distribution, the same holds true for the predicted state, where $T$ is the transition matrix and $\Sigma_p = G \Sigma_{uu} G^T$ accounts for changes in $\dot{X}$, $\dot{Z}$, Y and H due to unforeseen accelerations ($a_X$, $a_Z$) and velocities ($v_Y$, $v_H$). These effects are captured by a zero-mean multi-variate normal distribution over $u = [a_X, v_Y, a_Z, v_H]^T$ with expectation $E(u) = 0$ and $\Sigma_{uu} = \mathrm{diag}(\sigma^2_{a_X}, \sigma^2_{v_Y}, \sigma^2_{a_Z}, \sigma^2_{v_H})$ (Welch and Bishop, 1995). These uncertainties are related to the covariance of the predicted state by the matrix $G$.
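The prediction step can be sketched as follows. The entries of $T$ and $G$ are not spelled out in the text, so the matrices below are an assumption: a standard constant-velocity model for X and Z, with Y and H perturbed by random velocities. The time step `dt` is hypothetical; the standard deviations are those reported in the experiments section.

```python
import numpy as np

def cv_matrices(dt):
    """Assumed transition matrix T and noise gain G for w = [X, Y, Z, H, Xdot, Zdot]."""
    T = np.eye(6)
    T[0, 4] = dt          # X advances by Xdot * dt
    T[2, 5] = dt          # Z advances by Zdot * dt
    G = np.zeros((6, 4))  # columns correspond to u = [aX, vY, aZ, vH]
    G[0, 0] = 0.5 * dt**2 # acceleration aX acts on X ...
    G[4, 0] = dt          # ... and on Xdot
    G[1, 1] = dt          # random velocity vY acts on Y
    G[2, 2] = 0.5 * dt**2 # acceleration aZ acts on Z ...
    G[5, 2] = dt          # ... and on Zdot
    G[3, 3] = dt          # random velocity vH acts on H
    return T, G

def predict(w, S_ww, dt, sigma_u):
    """Propagate mean and covariance; S_p = G S_uu G^T is the process noise."""
    T, G = cv_matrices(dt)
    S_uu = np.diag(np.square(sigma_u))
    S_p = G @ S_uu @ G.T
    return T @ w, T @ S_ww @ T.T + S_p

dt = 0.1                                             # hypothetical frame interval in seconds
w = np.array([0.0, -1.6, -8.0, 1.75, 1.0, 0.5])      # illustrative state
S_ww = np.diag(np.square([0.3, 0.01, 0.3, 0.03, 0.3, 0.3]))
sigma_u = [0.5, 0.1, 0.5, 0.2]                       # [aX, vY, aZ, vH]
w_pred, S_pred = predict(w, S_ww, dt, sigma_u)
```

Only the ground-plane coordinates move in the predicted mean; Y and H stay fixed while their variances grow through the noise term.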
Observed variables. Three different observations are incorporated in the model: the accumulated votes of a category-specific classifier trained on persons, voting for the image position of the head and for that of the feet, and the result of an instance-specific classifier trained on individual persons at runtime.
Note that any person detector usually delivers several adjacent positions around a true position of a person in scale-space. Given a set of rectangles as the result of the classifier, we associate these rectangles either to an existing trajectory or to a hypothesis about a new trajectory. A hypothesis is each detection that does not overlap with an intersection-over-union score larger than 0.5 with any predicted rectangle of a pedestrian that is already tracked, and that has a height of at least 48 pixels. For the association of the (ungrouped) positive classification results, a simple nearest neighbour association in scale-space is applied. The confidence about the position of the head, $P(c^{det}_{i,H} | w_i, C)$, and the feet, $P(c^{det}_{i,F} | z_{i,F})$, both initially set to zero for all pixels, is computed by means of a Kernel Density Estimation (KDE) with a constant Gaussian kernel ($\sigma_x = \sigma_y = 10$ pixels) centred at every top centre position (to vote for the head) and bottom centre position (to vote for the feet) of all rectangles associated to person $i$, respectively. $P(c^{det}_{k,i,H} | w_{k,i}, C_k)$ denotes the conditional probability density function of $c^{det}_{k,i,H}$ given that person $i$ attains the state $w_{k,i}$ at time $k$. We determine the Gaussian parameters of the head position $\mu_{c,H} = [x_H, y_H]$ as the weighted sample mean of the density estimate given by the KDE with covariance $\Sigma_{cc,H}$.
For the estimation of the reference point of the feet we introduce an additional observation based on an instance-specific classifier, which considers one class for each person and an additional class for the background. By $c^{RF}_i$ we denote the position of the feet of person $i$ observed by an instance-specific classifier. We apply an online Random Forest (cf. Saffari et al., 2009). The Random Forest is trained with samples from an elliptic region with the target position as its reference point. The regions are normalised to a constant height of 48 pixels and a width-to-height ratio of 0.5. Because training samples are initially rare, further positive training samples are taken from positions shifted by one pixel up, down, left and right from the reference point. Negative samples (for the background class) are taken from positions translated by half of the size of the ellipse in the same directions. The feature vector is composed of the RGB values inside the ellipse. Each time a trajectory is updated, we take positive training samples from the elliptic region with the new target position as its reference point. To guarantee that the number of training samples is equal for every class, the classifier is trained anew with samples stored in a queue each time a new trajectory is initialised or terminated (see Sec. 2.3). Classification delivers $P(c^{RF}_i | z_{i,F}) \propto n_i / n_0$, where $n_i$ and $n_0$ are the relative frequencies of class $i$ and the background class, respectively, assigned to the leaf nodes of all decision trees in the Random Forest to which the sample $z_{i,F}$ propagates. $P(c^{RF}_i | z_{i,F})$ is evaluated for every reference point $z_{i,F}$ located within a search region (we take the 99% confidence ellipse of the predicted state) around the predicted position of the $i$th person. $P(c^{RF}_i | z_{i,F})$ and $P(c^{det}_{i,F} | z_{i,F})$ are the probabilities to observe $c^{RF}_i$ and $c^{det}_{i,F}$, respectively, if $z_{i,F}$ is the reference point of the $i$th person in the image.
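The kernel voting and the weighted sample moments can be sketched as below. This is a minimal illustration on a small synthetic image, not the authors' implementation; the function names, the image size and the vote positions are hypothetical, and the classifier map is replaced by a stand-in Gaussian.

```python
import numpy as np

def vote_map(points, shape, sigma=10.0):
    """Accumulate isotropic Gaussian kernels centred at (x, y) vote positions."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    density = np.zeros(shape, dtype=float)   # confidence starts at zero for all pixels
    for x, y in points:
        density += np.exp(-((cols - x) ** 2 + (rows - y) ** 2) / (2.0 * sigma**2))
    return density

def weighted_mean_cov(density):
    """Weighted sample mean [x, y] and covariance of a non-negative density map."""
    rows, cols = np.mgrid[0:density.shape[0], 0:density.shape[1]]
    coords = np.stack([cols.ravel(), rows.ravel()], axis=1).astype(float)
    weights = density.ravel() / density.sum()
    mu = weights @ coords
    diff = coords - mu
    cov = (diff * weights[:, None]).T @ diff
    return mu, cov

shape = (200, 120)                              # (rows, cols), illustrative
feet_votes = [(60, 100), (62, 101), (58, 99)]   # bottom centres of associated rectangles
det_map = vote_map(feet_votes, shape, sigma=10.0)
mu, cov = weighted_mean_cov(det_map)

# Stand-in for the classifier confidence; the product of both maps is summarised
# by the same moment computation.
rf_map = vote_map([(60, 100)], shape, sigma=5.0)
mu_fused, cov_fused = weighted_mean_cov(det_map * rf_map)
```

Multiplying the two maps before taking moments narrows the resulting distribution, since the product is concentrated where both observations agree.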

Inference
Given the observed and fixed variables and having defined all probabilities in Sec. 2.1, the aim in this paper is to find the unknown parameters that maximise the joint pdf (Eq. 1). We apply an inference scheme similar to that of a recursive Bayes filter, with the difference that the state vector is linked to another yet unknown variable of the system. Therefore, we transform the Bayes Network into a factor graph representation (Kschischang et al., 2001), see Fig. 2, and apply message passing according to Pearl (1988). The most probable state configuration is found in three steps, which are highlighted in colour in Fig. 2. First, the position of the feet is estimated from the observed variables (Eq. 8). Second, the state vector is propagated in time using the temporal model (Eq. 7) and corrected using the update equation of an Extended Kalman Filter (EKF) model (Eq. 9), in which the predicted state is transformed to the observation space by the (non-linear) measurement equations (Eqs. 2-6) and $K$ is the Kalman gain matrix (Eq. 10) with $M$ the Jacobian (Eq. 11) of the measurement equations.
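The correction step follows the textbook EKF update, which can be sketched generically as below. The measurement function used here is a toy perspective mapping chosen for illustration, not the paper's Eqs. 2-6, and the Jacobian is obtained numerically instead of analytically; all names and values are hypothetical.

```python
import numpy as np

def numerical_jacobian(h, w, eps=1e-6):
    """Central-difference Jacobian of h at w (stand-in for an analytic Jacobian M)."""
    w = np.asarray(w, dtype=float)
    columns = []
    for j in range(w.size):
        dw = np.zeros_like(w)
        dw[j] = eps
        columns.append((h(w + dw) - h(w - dw)) / (2.0 * eps))
    return np.stack(columns, axis=1)

def ekf_update(w_pred, S_pred, z, S_zz, h):
    """Standard EKF update: gain K, corrected mean and corrected covariance."""
    M = numerical_jacobian(h, w_pred)
    innovation = z - h(w_pred)                  # measurement minus predicted measurement
    S = M @ S_pred @ M.T + S_zz                 # innovation covariance
    K = S_pred @ M.T @ np.linalg.inv(S)         # Kalman gain
    w_post = w_pred + K @ innovation
    S_post = (np.eye(w_pred.size) - K @ M) @ S_pred
    return w_post, S_post

# Toy example: state [X, Z], one image coordinate x = c * X / Z with c = 1000 px.
h = lambda w: np.array([1000.0 * w[0] / w[1]])
w_pred = np.array([0.5, 8.0])
S_pred = np.diag([0.09, 0.09])
z = np.array([65.0])
S_zz = np.array([[4.0]])
w_post, S_post = ekf_update(w_pred, S_pred, z, S_zz, h)
```

With a precise measurement, the corrected state reproduces the observation almost exactly while the state covariance shrinks.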

Third, the mean vector and covariance matrix of the corrected state are transformed back to the image domain using the measurement equations and the corresponding Jacobian, where they define the posterior image position of the feet and the head. Finally, the online Random Forest is updated using new training samples taken from the ellipse with $\hat{z}_F = [m^F_x(\hat{w}_i), m^F_y(\hat{w}_i)]$ as reference point and a height of $m^F_y(\hat{w}_i) - m^H_y(\hat{w}_i)$. The EKF update step is executed only if the person is not occluded (see Sec. 2.3).

Initialisation and termination
At each time step $k$ there exists a set of persons which have already been tracked and a set of hypotheses about new candidates for tracking. The criterion for the generation of hypotheses is explained in Sec. 2.1. In order to validate a hypothesis, we evaluate the confidence of a pedestrian detector about the presence of a person, as well as prior knowledge about the scene. We take a set of over-complete object detections (given by an arbitrary object detector), many of which usually yield clusters of positive results around each true position of a person in scale-space. We assume that each detection is generated either by a person which is already tracked, by a new hypothesis or by a false positive detection. In Sec. 2.1 we introduced $P(c^{det}_{k,i,F} | z_{k,i,F})$ as the likelihood of the observation $c^{det}_{k,i,F}$ given that $z_{k,i,F}$ is the unknown position in the image. We further define $P(c^{hyp}_k | z^{hyp}_k)$ as the likelihood of a pedestrian detection $c^{hyp}_k$ given that a new tracking candidate is present with $z^{hyp}_k$ as its reference point. Every detection is associated either to person $i$ or to a hypothesis, if a nearest neighbour criterion in scale-space is fulfilled and if the detection lies within the search space of the person (given by the confidence of the predicted state) or that of a hypothesis (given by the confidence of an initial state). If the detection is not assigned to any person or to a hypothesis, it is considered a false positive detection and is discarded. If an assignment is made, either $P(c^{det}_{k,i,F} | z_{k,i,F})$ or $P(c^{hyp}_k | z^{hyp}_k)$ (both initially set to zero for all pixels) is increased by adding a Gaussian kernel with $\sigma_x = \sigma_y = 10$ pixels centred at the reference point of the detection. After all detections are either assigned or discarded, we validate each hypothesis $h$ by assigning it a probability $P(h | c^{hyp}, z^{hyp})$ for being correct: $P(h | c^{hyp}, z^{hyp}) \propto P(c^{hyp}, h, z^{hyp}) = P(c^{hyp} | z^{hyp}) P(z^{hyp} | h) P(h) \propto P(c^{hyp} | z^{hyp}) P(h | z^{hyp})$.
The probability density $P(h = \mathrm{true} | z^{hyp})$ is given by prior knowledge about the scene, which is learned from training sequences. We train a binary Random Forest classifier with the image coordinates of the reference point as features and class assignments according to true and false positive detections obtained by a HOG/SVM detector (Dalal and Triggs, 2005) in a training phase.
The training samples are split into positive and negative samples by validation with reference data, using an intersection-over-union score threshold of 50%. Classification then delivers the probability of a hypothesis to be correct given the position in the image. The distributions learned from the training sequences used in the experiments are visualised in Fig. 3. A hypothesis is accepted if the posterior $P(h = \mathrm{true} | c^{hyp}, z^{hyp})$ is greater than 0.5. If this is the case, a new trajectory is initialised with the hypothesis parameters used as starting values. The state parameters are computed from the reference points of the feet and the head in the image using the inverse collinearity equations with the height of the ground plane assigned to Y. The initial height H is computed from the height ($y_F - y_H$) in the image with a scale estimate derived from the focal length of the camera and the 3D distance to the person. If no training data are available for a scene, $P(z^{hyp} | h = \mathrm{true})$ is set to a uniform distribution.
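The split of detections into true and false positives by the intersection-over-union criterion can be sketched as follows. The box format (x_min, y_min, x_max, y_max) and the function names are assumptions made for illustration; the coordinates are made up.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0          # boxes do not intersect
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def label_detections(detections, references, threshold=0.5):
    """True for detections whose IoU with some reference box exceeds the threshold."""
    return [any(iou(d, r) > threshold for r in references) for d in detections]

def reference_point(box):
    """Bottom centre of a box, used here as the (x, y) feature of a training sample."""
    return ((box[0] + box[2]) / 2.0, box[3])

references = [(10, 10, 34, 58)]                       # one annotated pedestrian
detections = [(12, 12, 36, 60), (100, 10, 124, 58)]   # one hit, one false positive
labels = label_detections(detections, references)
features = [reference_point(d) for d in detections]
```

The resulting pairs of reference point and label are the kind of samples a binary classifier could be trained on to learn where in the image detections tend to be correct.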
To account for mutual occlusions, we evaluate the predicted states of all pedestrians and decide not to update the filter and the classifier if the predicted bounding rectangle of a person overlaps by more than 50% with that of any other person and if the image-row coordinate of the person is lower than that of the occluder (i.e., further behind in the scene). If a person leaves the image frame or if the trajectory is not updated for more than 5 frames in sequence, tracking of that person is stopped.
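The occlusion test and the termination rule can be sketched as below. Two points are assumptions, since the text does not fix them: the overlap is measured as the fraction of the person's own rectangle covered by the other rectangle, and the image-row coordinate is taken at the feet (bottom edge) with rows increasing downwards. Function names and boxes are hypothetical.

```python
def overlap_fraction(box, other):
    """Fraction of the area of `box` covered by `other`; boxes are (x_min, y_min, x_max, y_max)."""
    iw = min(box[2], other[2]) - max(box[0], other[0])
    ih = min(box[3], other[3]) - max(box[1], other[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    return (iw * ih) / ((box[2] - box[0]) * (box[3] - box[1]))

def is_occluded(i, boxes, min_overlap=0.5):
    """Person i is occluded if covered by more than min_overlap by someone closer to the camera."""
    own = boxes[i]
    for j, other in enumerate(boxes):
        if j == i:
            continue
        behind = own[3] < other[3]   # smaller feet row: further back in the scene
        if behind and overlap_fraction(own, other) > min_overlap:
            return True
    return False

def should_terminate(frames_without_update, max_missed=5):
    """Stop tracking after more than max_missed consecutive frames without an update."""
    return frames_without_update > max_missed

predicted_boxes = [(10, 10, 34, 58), (14, 16, 38, 64)]
occluded = [is_occluded(i, predicted_boxes) for i in range(len(predicted_boxes))]
```

In this example only the first person is flagged, because the second rectangle reaches further down in the image and is therefore treated as the occluder.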

EXPERIMENTS
This section reports results of the proposed method on the 3D MOT 2015 Benchmark (Leal-Taixé et al., 2015), which includes the PETS09-S2L2 and the AVG-TownCentre sequences.
The sensitivity of the method to the omission of single variables is evaluated on the PETS09-S2L1 dataset (available for training in the 3D MOT 2015 Benchmark). The corresponding results of an evaluation in 2D image space (correct detection requires at least 50% intersection-over-union score with the reference) and in 3D world coordinates (correct detection requires at most 1 m offset in position) are reported in Tables 1 and 2, respectively. Furthermore, the average tracking results achieved on the test sequences are given in Table 3, where they are compared with related work. The reported metrics include the recall and precision scores, false alarms per frame (FAF), the ratio of mostly tracked (MT, a person is MT if tracked at least 80% of the time being present in consecutive images) and mostly lost (ML, if tracked at most 20%) tracking objects, the numbers of false positive (FP) and false negative (FN) detections, the number of identity switches (IDs), the number of interruptions during the tracking of a person (Frag.) as well as the Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP) of the CLEAR metrics defined by Bernardin and Stiefelhagen (2008). The MOTA metric takes into account FP and FN assignments as well as ID switches. The MOTP metric reflects the geometric accuracy of the tracking results. The initial covariance of the filter state, $\Sigma_{ww,k=0}$, is assigned with $\sigma_X = \sigma_Z = 0.3$ m, $\sigma_Y = 0.01$ m, $\sigma_H = 0.03$ m and $\sigma_{\dot{X}} = \sigma_{\dot{Z}} = 0.3$ m s$^{-1}$. To account for the process noise, we set $\sigma_{a_X} = \sigma_{a_Z} = 0.5$ m s$^{-2}$, $\sigma_{v_Y} = 0.1$ m s$^{-1}$ and $\sigma_{v_H} = 0.2$ m s$^{-1}$. $\sigma^2_\pi$ is assigned a comparatively small value of 1 mm.
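For orientation, the detection-level scores and the MOTA value can be computed from accumulated counts as sketched below, following the CLEAR definition in which MOTA is one minus the ratio of all errors (FN, FP and identity switches) to the number of reference objects. The counts in the example are invented for illustration and are not results from the tables.

```python
def detection_scores(num_gt, fn, fp):
    """Recall and precision from accumulated counts over all frames."""
    tp = num_gt - fn                 # reference objects that were matched
    recall = tp / num_gt
    precision = tp / (tp + fp)
    return recall, precision

def mota(num_gt, fn, fp, id_switches):
    """Multiple Object Tracking Accuracy of the CLEAR metrics."""
    return 1.0 - (fn + fp + id_switches) / num_gt

# Invented counts, for illustration only.
recall, precision = detection_scores(num_gt=1000, fn=300, fp=100)
accuracy = mota(num_gt=1000, fn=300, fp=100, id_switches=20)
```

Because all three error types enter the same sum, MOTA can become negative when the number of errors exceeds the number of reference objects.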
In Figure 4 the different probability densities that are part of the model are visualised. Each column in the figure depicts results from a different time step (from frame 1 to 4). Figures 4(a) depict the confidence of the person detector about new hypotheses that are not yet assigned to any trajectory. These confidences are used to validate new trajectories along with the prior scene information (see Sec. 2.3). Note that the confidence at the location of the right-most pedestrian in the image is lower than the confidence assigned to the others and does not exceed a threshold, so there is no trajectory initialised in the first frame.
For the sensitivity study about the omission of single variables, the full model (a) is compared with modified versions of the model, in which the observations given by the online Random Forest (ORF) classifier (b) and those given by the person detector (c) are omitted. In settings (a)-(c) the initial bounding boxes are given by manually annotated data. The observed variables of the model are computed from the outcomes of a HOG/SVM detector (Dalal and Triggs, 2005) and from an online Random Forest (Saffari et al., 2009) as described in Sec. 2.1. Furthermore, the full model is initialised with automatically generated detections (also given by a HOG/SVM detector) without (d) and with (e) the usage of prior scene information. In case of the sensitivity study the prior information is learned from the PETS09-S1L2 sequence (cf. Fig. 3(a)). The results reflect the benefit of using the full model as proposed in this paper. The outcomes from the evaluation in 2D and in 3D both lead to the same insights. If the initial position is given, the evaluation in the object space shows that 96.5% of the pedestrians that are present in all frames are detected with at most 1 m offset from the true position, while 98.1% of all automatic detections are correct. Furthermore, 94.7% of the pedestrians are tracked in at least 80% of the images in which they are present, and none in less than 20%. When not using the ORF to derive an additional observation of the target's feet position (variant (b)), the number of identity switches (IDs) and false assignments (FAF) are higher than those achieved on the basis of the full model, while the other metrics do not change significantly. In variant (c), when the result of the person detector is omitted, the performance becomes worse in terms of all metrics. In variants (d) and (e) the hypotheses about new tracking candidates are derived from automatic pedestrian detections. If all hypotheses are accepted without applying the validation step described in Sec. 2.3 (variant (d)), recall rates similar to those of variant (a) are achieved only at the cost of a strong decrease (of about 50%) in the precision. If the validation step is carried out (variant (e)), the precision is superior to that of variant (d), while both recall and precision are only about 10% worse than those achieved on the basis of variant (a). Thus we apply the full model with detections generated automatically for the comparative study.
In the comparative study, the proposed method is evaluated against other results reported on the website of the MOT benchmark (http://motchallenge.net). In favour of comparability, detections which are publicly available along with the data set are used to generate object hypotheses. The observed image positions of the feet are still computed from the outcomes of a HOG/SVM detector. The related work includes that of Leal-Taixé et al. (2011), referred to as LPSFM, another yet unreferenced approach by the same principal author based on network flow linear programming, referred to as LP3D, and Pellegrini et al. (2009), referred to as KalmanSFM. The results (Table 3) show that our method yields, with an average ranking of 1.7, the best results in 6 of 10 evaluation metrics. We achieve the best results in the MOTA metric, which takes into account the number of FP detections (and equivalently the rate of false assignments per frame), where our method yields the second best score, and the number of FN detections and identity switches, where our method performs best. Our method also yields comparatively good results w.r.t. the persistence of tracking, which is reflected in the percentage of mostly tracked objects (28.7%) and in the number of fragmentations of the trajectories (418). On the downside, 17.9% of the pedestrians are not tracked for more than 20% of the time being visible in the test sequences. As measured by the MOTP score of 61.0, our method also yields the highest geometric accuracy among the compared methods. In Fig. 5(a) and 5(b) qualitative results are shown for exemplary images of both test sequences. Note that the rectangles mostly align well with the contours of the pedestrians.
Tracking is performed on a 3.3 GHz PC with 8 cores. With a processing rate of 0.1 Hz, our method is the slowest of the compared methods. This is mainly due to the repetitive training of the online Random Forest classifier every time a person enters or leaves the scene, to the pixel-wise classification in the vicinity of potential target positions, and to non-optimised code.

CONCLUSIONS
This paper proposes a probabilistic model designed for visual pedestrian tracking. The pedestrian state (position, height and velocity) in world coordinates and the position of the feet in the image are modelled as hidden variables in a Dynamic Bayes Network. Quantitative results show that the tracking performance w.r.t. the re-identification of a pedestrian as well as the geometric accuracy are superior to those achieved by competing methods. The focus of this work is on the trajectory continuation and correct alignment of single pedestrians. The applicability to multiple object tracking is realised by an association step which is executed prior to processing on the basis of the proposed Bayes network. In crowded scenes, where interactions between pedestrians and mutual occlusions are inherent, the strategy is currently often not capable of resolving ambiguities in the detection-to-track assignment, which is reflected in the MOTA values and in the number of identity switches. As emphasised by many of the related papers, better results can be achieved if the trajectory continuation is applied jointly for all pedestrians. We will extend our model to jointly reason about the states of interacting pedestrians in future work.

Figure 1. Dynamic Bayes Network for pedestrian tracking. The nodes represent random variables, the edges model dependencies between them. The meaning of the variables is briefly explained on the right and in detail in the text.
Each factor node (square) in Fig. 2 corresponds to a function of the subset of variables that are connected to it. The arrows indicate forward (red) and backward (green) messages sent through the graph. First, we compute the position of the feet $\mu_{z,F} = [x_F, y_F]$ given the observed variables $c^{RF}_i$ and $c^{det}_{i,F}$ as the weighted sample mean of the product of the observed pdfs $P(c^{RF}_i | z_{i,F})$ and $P(c^{det}_{i,F} | z_{i,F})$ with the corresponding sample covariance $\Sigma_{zz,F}$:
$$P(z_{i,F} | c^{RF}_i, c^{det}_{i,F}) = N(\mu_{z,F}, \Sigma_{zz,F}) \sim P(c^{RF}_i | z_{i,F}) \, P(c^{det}_{i,F} | z_{i,F}) \qquad (8)$$
Second, the state vector is propagated in time using the temporal model (Eq. 7) and corrected by incorporating the estimated position of the feet, the measured position of the head and the fictitious observation $Y_\pi$. These quantities form the observation vector $z_i = [x_F, y_F, x_H, y_H, Y_\pi]^T$, which enters the update together with its covariance.

Figure 2. Factor graph representation of the DBN.

Figure 3. Prior knowledge about the scene.
Figures 4(b) depict the confidences of the person detector about head positions of tracked persons in the image and Figures 4(c) those of the feet. Figures 4(d) show the confidence of the online classifier about the feet position of tracked pedestrians. Note that the distribution is becoming narrower over time (i.e., from the left subfigure to the right), because further training samples arrive during runtime. Figures 4(e) show the combined confidence about the feet positions given the detection and classification result, and the predicted state, which is used for gating the search area. Figures 4(f) depict the 2.5σ ellipses of the predicted state projected into the image (red), the measurement derived from the densities shown in Figures 4(e) (yellow) and the posterior state (blue).

Figure 5. Qualitative results shown for example images from the two test sequences.