Beyond Centers: Bounding-Box Voxel Projection for Multi-View 3D Detection and Tracking
Keywords: Image Sequence Analysis, Multi-View Tracking, Detection and Localization in 3D
Abstract. 3D multi-view multi-object tracking (3D MV-MOT) uses multiple cameras to reduce missed detections and mitigate occlusions. Most current 3D MV-MOT methods lose information when associating 3D locations with 2D image features via a 3D-to-2D projection: they discretize 3D space into a grid and sample image features only at the projected center of each grid cell, so all other feature information is discarded. A second information loss commonly arises during cross-view aggregation with max or average pooling: these methods either overemphasize a single view (max pooling) or give equal weight to conflicting views that depict different entities, e.g., due to occlusions (average pooling). In this work, we introduce two novel modules for 3D MV-MOT, applied to pedestrian tracking, that target these limitations: (i) VoxROI aggregates all image features that fall within the bounding box around a voxel’s projection into each image, instead of sampling features only at the projected voxel center. (ii) SimFuse aggregates per-view voxel features into one coherent feature representation per voxel, using similarity weights computed from re-identification (Re-ID) features, which measure cross-view identity similarity. Views with higher Re-ID feature similarity receive larger weights, while inconsistent views are suppressed. Experimental results on the WildTrack dataset confirm our method’s effectiveness for multi-view pedestrian detection and tracking: it reaches the general state of the art and, in cross-view scenarios in particular, improves on it. The approach also maintains strong performance when trained and tested on different camera setups, demonstrating its generalization capability across camera configurations.
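To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the paper's implementation. The function names `box_pool` and `similarity_fuse`, the use of mean pooling inside the box, the cosine-similarity scoring, and the softmax with a `temperature` parameter are all illustrative assumptions; the abstract only states that features inside the projected bounding box are aggregated and that per-view features are fused with Re-ID similarity weights.

```python
import numpy as np


def box_pool(feature_map, corners_px):
    """Pool all features inside the box around a voxel's projected corners.

    feature_map: (H, W, C) image feature map of one view.
    corners_px:  (K, 2) projected voxel corners as (x, y) pixel coordinates.
    Returns a (C,) vector, or None if the box lies entirely outside the image.
    Mean pooling is an assumption made for this sketch.
    """
    h, w, _ = feature_map.shape
    x0, y0 = np.floor(corners_px.min(axis=0)).astype(int)
    x1, y1 = np.ceil(corners_px.max(axis=0)).astype(int)
    # Clip the axis-aligned bounding box to the image bounds.
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, w - 1), min(y1, h - 1)
    if x0 > x1 or y0 > y1:
        return None
    return feature_map[y0:y1 + 1, x0:x1 + 1].mean(axis=(0, 1))


def similarity_fuse(view_feats, reid_feats, temperature=1.0):
    """Fuse per-view voxel features with Re-ID similarity weights.

    view_feats: (V, C) one feature vector per view for a single voxel.
    reid_feats: (V, D) one Re-ID embedding per view for the same voxel.
    Each view is scored by its mean cosine similarity to the other views;
    a softmax turns the scores into weights (both are assumptions).
    Returns the fused (C,) feature and the (V,) weights.
    """
    num_views = view_feats.shape[0]
    if num_views == 1:
        return view_feats[0], np.ones(1)
    norms = np.linalg.norm(reid_feats, axis=1, keepdims=True)
    normed = reid_feats / (norms + 1e-8)
    sim = normed @ normed.T
    np.fill_diagonal(sim, 0.0)  # ignore each view's similarity to itself
    scores = sim.sum(axis=1) / (num_views - 1)
    logits = scores / temperature
    logits = logits - logits.max()  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ view_feats, weights
```

In this sketch, a view whose Re-ID embedding disagrees with the others (for example, because an occluder is visible instead of the target) receives a lower weight than the views that agree with each other, which mirrors the suppression of inconsistent views described above.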
