ISPRS-Annals

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.

2194-9050

Copernicus Publications

Göttingen, Germany

10.5194/isprs-annals-XI-2-2026-197-2026

Beyond Centers: Bounding-Box Voxel Projection for Multi-View 3D Detection and Tracking

Rasho Ali ¹

Max Mehltretter ¹

Christian Heipke (https://orcid.org/0000-0002-7007-9549)

Institute of Photogrammetry and GeoInformation, Leibniz University Hannover, Germany

03 07 2026

XI-2-2026, pp. 197-206

2026

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/

This article is available from https://isprs-annals.copernicus.org/articles/XI-2-2026/197/2026/isprs-annals-XI-2-2026-197-2026.html

The full text article is available as a PDF file from https://isprs-annals.copernicus.org/articles/XI-2-2026/197/2026/isprs-annals-XI-2-2026-197-2026.pdf

3D multi-view, multi-object tracking (3D MV-MOT) uses multiple cameras to reduce the number of missed detections and to mitigate occlusions. Most current 3D MV-MOT methods lose information when associating 3D locations with 2D image features via a 3D-to-2D projection: they use a discrete grid in 3D and sample image features only at the projected center of each grid cell, so all other feature information is discarded. Further information is commonly lost during cross-view aggregation with max or average pooling: these operations either overemphasize a single view or give equal weight to conflicting views that depict different entities, e.g., due to occlusions. In this work, we introduce two novel modules for 3D MV-MOT, applied to pedestrian tracking, that target these limitations: (i) <em>VoxROI</em> aggregates all image features that fall within the bounding box around a voxel’s projection into each respective image, instead of sampling features only at the projected voxel center. (ii) <em>SimFuse</em> aggregates per-view voxel features into one coherent feature representation per voxel, using similarity weights computed from re-identification (Re-ID) features, which measure cross-view identity similarity. Views with higher Re-ID feature similarity receive larger weights, while inconsistent views are suppressed. Experimental results on the WildTrack dataset confirm the effectiveness of our method for multi-view pedestrian detection and tracking: it reaches the general state of the art and, in cross-view scenarios in particular, improves on it. The approach maintains strong performance across different camera configurations, demonstrating its ability to generalize when trained and tested on different camera setups.
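
The two ideas described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `voxel_bbox_pool` and `similarity_fuse`, the use of mean pooling inside the bounding box, the cosine similarity between Re-ID features, and the softmax with a `temperature` parameter are all assumptions made for illustration only. The abstract does not specify the aggregation operator or the exact weighting scheme.

```python
import numpy as np


def voxel_bbox_pool(feat, corners_px):
    """Pool all features inside the 2D bounding box of a projected voxel.

    feat:       (H, W, C) feature map of one view.
    corners_px: (8, 2) projected voxel corners as (x, y) pixel coordinates.
    Returns a (C,) vector, or None if the box lies fully outside the image.
    Mean pooling is an assumption; the paper may use a different operator.
    """
    h, w, _ = feat.shape
    x0, y0 = np.floor(corners_px.min(axis=0)).astype(int)
    x1, y1 = np.ceil(corners_px.max(axis=0)).astype(int)
    # Clip the bounding box to the image extent.
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, w - 1), min(y1, h - 1)
    if x0 > x1 or y0 > y1:
        return None
    return feat[y0:y1 + 1, x0:x1 + 1].mean(axis=(0, 1))


def similarity_fuse(view_feats, reid_feats, temperature=1.0):
    """Fuse per-view voxel features with Re-ID similarity weights.

    view_feats: (V, C) one feature vector per view for a single voxel.
    reid_feats: (V, D) one Re-ID embedding per view for the same voxel.
    Returns (fused, weights) with shapes (C,) and (V,).
    Cosine similarity and softmax weighting are illustrative assumptions.
    """
    norms = np.linalg.norm(reid_feats, axis=1, keepdims=True)
    reid = reid_feats / (norms + 1e-8)
    sim = reid @ reid.T  # (V, V) pairwise cosine similarity
    n_views = sim.shape[0]
    if n_views > 1:
        # Score each view by its mean similarity to the other views.
        score = (sim.sum(axis=1) - np.diag(sim)) / (n_views - 1)
    else:
        score = np.zeros(1)
    logits = score / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ view_feats, weights
```

In this sketch, a view whose Re-ID embedding disagrees with the other views receives a lower mean similarity score and therefore a smaller fusion weight, which mirrors the suppression of inconsistent views described above. Lowering the hypothetical `temperature` parameter would sharpen that suppression.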