<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="3.0" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher">ISPRS-Annals</journal-id>
<journal-title-group>
<journal-title>ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences</journal-title>
<abbrev-journal-title abbrev-type="publisher">ISPRS-Annals</abbrev-journal-title>
<abbrev-journal-title abbrev-type="nlm-ta">ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">2194-9050</issn>
<publisher><publisher-name>Copernicus Publications</publisher-name>
<publisher-loc>Göttingen, Germany</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.5194/isprs-annals-XI-2-2026-197-2026</article-id>
<title-group>
<article-title>Beyond Centers: Bounding-Box Voxel Projection for Multi-View 3D Detection and Tracking</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Ali</surname>
<given-names>Rasho</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Mehltretter</surname>
<given-names>Max</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Heipke</surname>
<given-names>Christian</given-names>
<ext-link>https://orcid.org/0000-0002-7007-9549</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group><aff id="aff1">
<label>1</label>
<addr-line>Institute of Photogrammetry and GeoInformation, Leibniz University Hannover, Germany</addr-line>
</aff>
<pub-date pub-type="epub">
<day>03</day>
<month>07</month>
<year>2026</year>
</pub-date>
<volume>XI-2-2026</volume>
<fpage>197</fpage>
<lpage>206</lpage>
<permissions>
<copyright-statement>Copyright: &#x000a9; 2026 Rasho Ali et al.</copyright-statement>
<copyright-year>2026</copyright-year>
<license license-type="open-access">
<license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri"  xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p>
</license>
</permissions>
<self-uri xlink:href="https://isprs-annals.copernicus.org/articles/XI-2-2026/197/2026/isprs-annals-XI-2-2026-197-2026.html">This article is available from https://isprs-annals.copernicus.org/articles/XI-2-2026/197/2026/isprs-annals-XI-2-2026-197-2026.html</self-uri>
<self-uri xlink:href="https://isprs-annals.copernicus.org/articles/XI-2-2026/197/2026/isprs-annals-XI-2-2026-197-2026.pdf">The full text article is available as a PDF file from https://isprs-annals.copernicus.org/articles/XI-2-2026/197/2026/isprs-annals-XI-2-2026-197-2026.pdf</self-uri>
<abstract>
<p>3D multi-view, multi-object tracking (3D MV-MOT) makes use of multiple cameras to reduce the number of missed detections and to mitigate occlusions. Most current 3D MV-MOT methods suffer from information loss when associating 3D locations with 2D image features via a 3D-to-2D projection, as they use a discrete grid in 3D and sample image features only at the projected centers of each grid cell. Thus, all other feature information is lost. An additional information loss commonly arises during crossview aggregation when applying max or average pooling: these methods either overemphasize a single view or treat conflicting views, that depict different entities, e.g., due to occlusions, equally. In this work, we introduce two novel modules for 3D MV-MOT, employed to pedestrian tracking, that target these limitations: (i) &lt;em&gt;VoxROI&lt;/em&gt; aggregates all image features that fall within the bounding box around a voxel&amp;rsquo;s projection into each respective image, instead of only sampling features at the projected voxel center. (ii) &lt;em&gt;SimFuse&lt;/em&gt; aggregates per-view voxel features into one coherent feature representation per voxel, using similarity weights computed from re-identification (Re-ID) features. Subsequently, they are used to measure cross-view identity similarity. Views with higher Re-ID feature similarity receive larger weights, while inconsistent views are suppressed. Experimental results on the WildTrack dataset confirm our method&amp;rsquo;s effectiveness for multi-view pedestrian detection and tracking, reaching, and in particular in cross-view scenarios improving, the general state-of-the-art. The approach maintains strong performance across different camera configurations, demonstrating its generalization capability when training and testing on different camera setups.</p>
</abstract>
<counts><page-count count="10"/></counts>
</article-meta>
</front>
<body/>
<back>
</back>
</article>