<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="3.0" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher">ISPRS-Annals</journal-id>
<journal-title-group>
<journal-title>ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences</journal-title>
<abbrev-journal-title abbrev-type="publisher">ISPRS-Annals</abbrev-journal-title>
<abbrev-journal-title abbrev-type="nlm-ta">ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">2194-9050</issn>
<publisher><publisher-name>Copernicus Publications</publisher-name>
<publisher-loc>Göttingen, Germany</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.5194/isprs-annals-XI-2-2026-321-2026</article-id>
<title-group>
<article-title>Geometry-aided Video Panoptic Segmentation</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Nguyen</surname>
<given-names>Tuan</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Mehltretter</surname>
<given-names>Max</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Rottensteiner</surname>
<given-names>Franz</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group><aff id="aff1">
<label>1</label>
<addr-line>Institute of Photogrammetry and GeoInformation, Leibniz University Hannover, Germany</addr-line>
</aff>
<pub-date pub-type="epub">
<day>03</day>
<month>07</month>
<year>2026</year>
</pub-date>
<volume>XI-2-2026</volume>
<fpage>321</fpage>
<lpage>330</lpage>
<permissions>
<copyright-statement>Copyright: &#x000a9; 2026 Tuan Nguyen et al.</copyright-statement>
<copyright-year>2026</copyright-year>
<license license-type="open-access">
<license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p>
</license>
</permissions>
<self-uri xlink:href="https://isprs-annals.copernicus.org/articles/XI-2-2026/321/2026/isprs-annals-XI-2-2026-321-2026.html">This article is available from https://isprs-annals.copernicus.org/articles/XI-2-2026/321/2026/isprs-annals-XI-2-2026-321-2026.html</self-uri>
<self-uri xlink:href="https://isprs-annals.copernicus.org/articles/XI-2-2026/321/2026/isprs-annals-XI-2-2026-321-2026.pdf">The full text article is available as a PDF file from https://isprs-annals.copernicus.org/articles/XI-2-2026/321/2026/isprs-annals-XI-2-2026-321-2026.pdf</self-uri>
<abstract>
<p>Video panoptic segmentation (VPS) unifies panoptic segmentation and object tracking by assigning each pixel a semantic class label and, for <italic>thing</italic> classes, an instance identifier that is consistent across frames. To address this task, we propose a novel online VPS method for stereoscopic image sequences, built on depth-aware kernel-based panoptic segmentation. Kernel-based panoptic segmentation has a fundamental limitation: it considers only appearance information when segmenting thing instances, which regularly causes distinct instances to be erroneously merged into one mask. To overcome this limitation, we introduce a geometrical constraint based on predicted bounding boxes into the instance segmentation step. To link detected instances across frames, we extend the commonly employed appearance-based association with a motion-related constraint based on optical flow; this resolves ambiguities between instances of similar appearance and thus reduces the number of incorrect associations. We evaluate our method experimentally on the publicly available Cityscapes-VPS dataset and compare our results to those of several related methods from the literature. The results demonstrate that our method improves the panoptic quality for a single frame and enhances the instance association across frames, leading to an overall improvement of 3.5% in Video Panoptic Quality on <italic>thing</italic> classes compared to the employed baseline.</p>
</abstract>
<counts><page-count count="10"/></counts>
</article-meta>
</front>
<body/>
<back>
</back>
</article>