ISPRS-Annals

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

ISPRS-Annals

ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.

2194-9050

Copernicus Publications

Göttingen, Germany

10.5194/isprs-annals-XI-2-2026-321-2026

Geometry-aided Video Panoptic Segmentation

Nguyen

Tuan

¹ Mehltretter

Max

¹ Rottensteiner

Franz

Institute of Photogrammetry and GeoInformation, Leibniz University Hannover, Germany

03 07 2026

XI-2-2026 321 330

2026

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/

This article is available from https://isprs-annals.copernicus.org/articles/XI-2-2026/321/2026/isprs-annals-XI-2-2026-321-2026.html

The full text article is available as a PDF file from https://isprs-annals.copernicus.org/articles/XI-2-2026/321/2026/isprs-annals-XI-2-2026-321-2026.pdf

Video panoptic segmentation (VPS) unifies panoptic segmentation and object tracking by assigning each pixel a semantic class label, or for <em>thing</em> classes, an instance identifier that is consistent across frames. Addressing this task, we propose a novel online VPS method for processing stereoscopic image sequences, which is based on depth-aware kernel-based panoptic segmentation. Specifically, we introduce a geometrical constraint based on predicted bounding boxes into the segmentation of thing instances to overcome the fundamental limitation of kernel-based panoptic segmentation that only appearance information is considered in this step; this regularly leads to panoptic segmentation results in which distinct instances are erroneously merged into one mask. To link detected instances across frames, we propose to extend the commonly employed appearance-based association with a motion-related constraint based on optical flow; this resolves ambiguities in case of instances of similar appearance and, thus, reduces the number of incorrect associations. We experimentally evaluate our method on the publicly available Cityscapes-VPS dataset and compare our results to those of several related methods from the literature. The results demonstrate that our method improves the panoptic quality for a single frame and enhances the instance association across frames, leading to an overall improvement of 3.5% in Video Panoptic Quality on <em>thing</em> classes compared to the employed baseline.