Uncertainty Quantification with Deep Ensembles for 6D Object Pose Estimation

The estimation of 6D object poses is a fundamental task in many computer vision applications. Particularly in high-risk scenarios such as human-robot interaction, industrial inspection, and automation, reliable pose estimates are crucial. In recent years, increasingly accurate and robust deep-learning-based approaches for 6D object pose estimation have been proposed. Many top-performing methods are not end-to-end trainable but consist of multiple stages. In the context of deep uncertainty quantification, deep ensembles are considered state of the art since they have been shown to produce well-calibrated and robust uncertainty estimates. However, deep ensembles can only be applied to methods that can be trained end-to-end. In this work, we propose a method to quantify the uncertainty of multi-stage 6D object pose estimation approaches with deep ensembles. For the implementation, we choose SurfEmb as a representative, since it is one of the top-performing 6D object pose estimation approaches in the BOP Challenge 2022. We apply established metrics and concepts for deep uncertainty quantification to evaluate the results. Furthermore, we propose a novel uncertainty calibration score for regression tasks to quantify the quality of the estimated uncertainty.


INTRODUCTION
Determining the 6D pose of an object in its 3D environment, i.e., its 3D orientation and 3D position, from a camera image or from depth sensor data is a fundamental task in computer vision. Applications like augmented reality (Su et al., 2019), vision-assisted robot manipulation (Steger et al., 2018; Ulrich and Hillemann, 2024), bin picking (Drost et al., 2017), and autonomous systems such as self-driving cars (Yurtsever et al., 2020) rely on accurate object poses estimated from RGB(-D) images. In the real world, complex scenes arise that lead to typical challenges for 6D object pose estimation: The scene might be cluttered with multiple instances of varying object categories, sometimes with more than one instance of a given object category and occluded instances. Objects might be symmetric or exhibit inter-object similarity, where one object is built from parts of other objects. In addition, object surfaces can be challenging, for example if they are textureless or reflective. The Benchmark for 6D Object Pose Estimation (BOP) and the associated BOP Challenge 2020 (Hodaň et al., 2020) and 2022 (Sundermeyer et al., 2023) cover these challenges and allow a robust evaluation of state-of-the-art methods. These methods largely rely on deep learning, taking advantage of the ability of deep neural networks to learn complex patterns from sufficient amounts of data, and use RGB images as well as depth information to compute the 6D object pose. Many of the top-performing approaches (Park et al., 2019; Labbé et al., 2020; Haugaard and Buch, 2022; Wang et al., 2021) are composed of three major stages: First, an off-the-shelf object detector locates the target object instance in the image of the scene. Second, a deep neural network predicts the 2D-3D correspondences, and third, a variant of the Perspective-n-Point (PnP) algorithm, often combined with RANSAC, provides the 6D pose of the detected object instances. Optionally, depth information is used for pose refinement.
In safety-critical applications like autonomous driving (McAllister et al., 2017) and demanding industrial applications like quality inspection and automation (Heizmann et al., 2022), the prediction of an object pose is often not sufficient to make informed decisions. Instead, the associated object pose uncertainty must also be taken into account. For example, consider the task of a robot grasping a cup whose pose is estimated based on an RGB(-D) input image that does not show the handle of the cup. This leads to an ambiguous pose estimate. If the robot grasps the cup based on that pose estimate, the object or the robot might be damaged. In combination with a measure of pose uncertainty, this scenario can be identified and prevented by choosing a different camera angle or, in a bin picking application, choosing another object to grasp that has a lower uncertainty. In deep learning, popular uncertainty quantification (UQ) methods include softmax probability (Hendrycks and Gimpel, 2018), Monte-Carlo Dropout (Gal and Ghahramani, 2016), and Deep Ensembles (Lakshminarayanan et al., 2017). While softmax predictions are only used for classification and segmentation tasks, Monte-Carlo Dropout and Deep Ensembles can be applied to regression tasks as well and are therefore suited for object pose estimation. Deep ensembles of randomly initialized networks perform best and are more robust under dataset shift compared to dropout methods, post-hoc calibration by temperature scaling, and methods motivated by Bayesian inference (Ovadia et al., 2019).
The application of UQ methods to multi-stage approaches for 6D object pose estimation is not straightforward. These methods are usually designed for segmentation and classification tasks, which often are single-stage approaches in the sense that they are end-to-end trainable. Since 6D object poses have one orientation component in SO(3) and one position component in $\mathbb{R}^3$, the object pose is often considered in a decoupled fashion, handling orientation and position separately. While it can generally be assumed that the object position in $\mathbb{R}^3$ is normally distributed, modelling the orientation distribution is more complex. Considering this, Deep Ensembles and UQ methods that draw samples from the posterior predictive distribution have the advantage that no assumptions concerning the underlying distributions have to be made. Surprisingly, up to now, there is no work that uses a deep ensemble approach for UQ in object pose estimation. The most closely related method by Shi et al. (2021) uses two to three heterogeneous pretrained pose estimation models to estimate the pose disagreement. However, while this approach reduces the computational cost of training and inferring a large ensemble of models, it diverges from the deep ensemble methodology and does not produce uncertainty estimates.
In this work, we propose a method to quantify the uncertainty of multi-stage 6D object pose estimation approaches with the current state-of-the-art deep learning UQ method, namely deep ensembles. For the implementation, we choose SurfEmb (Haugaard and Buch, 2022), a top-performing 6D object pose estimation method. We evaluate the estimated pose results and their uncertainties using reliability diagrams and BOP metrics on the T-LESS (Hodaň et al., 2017) and YCB-V (Xiang et al., 2018) benchmark datasets for object pose estimation. Furthermore, we introduce a novel metric for the evaluation of uncertainty estimates in regression tasks in general.

RELATED WORK
As UQ in deep learning and explainable AI gain more and more interest, works on the reliability of both network predictions and estimated uncertainties have increased in recent years. In Section 2.1, an overview of popular UQ approaches in deep learning in general is given. In Section 2.2, works on object pose uncertainties and pose distributions are described.

UQ in Deep Learning
UQ in deep learning often distinguishes between different types of uncertainties depending on their source. The predictive uncertainty is often split into aleatoric and epistemic uncertainty. Aleatoric uncertainty captures the uncertainty that is inherent in the input data, e.g., noise in an image, while epistemic uncertainty refers to a lack of knowledge, i.e., the uncertainty of the network parameters (Kendall and Gal, 2017).
Deep-learning-based approaches for object pose estimation integrate large deep neural networks in their pipelines in most cases. These networks consist of a large number of parameters and non-linearities that make the computation of the exact posterior probability distributions of the network's predicted outputs generally intractable (Blundell et al., 2015; Loquercio et al., 2020). As a consequence, approximation approaches are used for UQ. Approaches based on Bayesian inference transform common deterministic networks into stochastic ones by placing probability distributions over the activations and/or the weight parameters (Jospin et al., 2022), leading to Bayesian neural networks (BNNs). While BNNs have a mathematically sound foundation, the high parameter counts of deep neural networks make a direct solution impossible.
Bayes by Backprop (Blundell et al., 2015) is one work proposing variational inference to learn the parameters of approximate distributions over the weights. At inference time, weights are sampled from the learned distributions, resulting in an ensemble of networks that is used to sample the posterior distribution of the predictions. Because BNNs come with a high computational cost, Monte-Carlo Dropout (Gal and Ghahramani, 2016) was proposed, where dropout regularization (Srivastava et al., 2014) at inference time approximates a stochastic Gaussian process. Then, the posterior predictive distribution is sampled from multiple forward passes through networks with varying dropout masks, requiring additional runtime at inference time. Furthermore, Monte-Carlo Dropout captures only the epistemic uncertainty and, therefore, has been combined with probabilistic networks (Kendall and Gal, 2017) and assumed density filtering (Gast and Roth, 2018; Loquercio et al., 2020) for predictive UQ. Another drawback of Monte-Carlo Dropout is that the estimated uncertainties need to be calibrated (Gal and Ghahramani, 2016). In contrast, Deep Ensembles do not require a calibration (Lakshminarayanan et al., 2017). They present a robust way to estimate predictive uncertainty in computer vision tasks such as classification, semantic segmentation, and depth estimation, and are considered state of the art in deep learning UQ (Ovadia et al., 2019; Gustafsson et al., 2020; Wursthorn et al., 2022). Ensemble distillation can be used to overcome the high memory costs of ensembles while achieving comparable uncertainty results (Landgraf et al., 2023). Recently, Mukhoti et al. (2023) proposed a deterministic UQ approach that provides similar results to deep ensembles, even on out-of-distribution examples.

UQ for Object Pose Estimation
Despite the importance of reliable pose estimates, there are few works that focus on UQ in the context of 6D object pose estimation (Thalhammer et al., 2023a). Often, UQ for object pose estimation is referred to as the estimation of a pose distribution. Many works use Bingham distributions to model the orientation distribution (Gilitschenski et al., 2020; Okorn et al., 2020; Deng et al., 2022; Sato et al., 2022). Gilitschenski et al. (2020) present a new Bingham loss function for orientation distribution learning, and Okorn et al. (2020) propose two methods to quantify the uncertainty of orientations for non-symmetric and symmetric objects, respectively. The first method uses an isotropic Bingham distribution to model the orientation distribution, while the latter learns a multi-modal non-parametric distribution. Deng et al. (2022) propose Deep Bingham Networks as a UQ framework by considering a family of pose hypotheses. Sato et al. (2022) present a simple way to incorporate a prediction head that estimates the parameters of a Bingham distribution into PoseCNN (Xiang et al., 2018). Manhardt et al. (2019) use the distribution of pose hypotheses to handle object ambiguities, a goal that was also of interest in Deng et al. (2021), where the orientation distribution is considered while tracking object poses in video frames. In turn, Jeon et al. (2023) use the object ambiguities to estimate confidences for keypoint selection. Further approaches for object pose estimation leverage keypoint confidences to improve the performance and to provide a measure of reliability of the pose estimates (Peng et al., 2019; Huang et al., 2022; Yang and Pavone, 2023). Others estimate confidences for the pose hypotheses to increase the accuracy of the final average pose result (Hu et al., 2019; Thalhammer et al., 2023b). Recent works use non-parametric distributions to implicitly model the pose distribution in SE(3). Haugaard et al. (2023) present an efficient way to learn pose distributions at different resolution levels. Recently, Zhou et al. (2023) combined SurfEmb with inverse graphics and provided a log-likelihood scoring for the estimated poses. In the context of the UQ methods mentioned in Section 2.1, these approaches for object pose distribution estimation can be considered single deterministic approaches to pose UQ. In contrast to sampling-based UQ methods, single deterministic approaches do not require multiple forward passes at inference time. However, they are sensitive to the underlying network architecture, training procedure, and training data (Gawlikowski et al., 2023). Ensemble methods like deep ensembles have been shown to be more robust under dataset shift and to outperform other methods like Monte-Carlo Dropout (Ovadia et al., 2019). Furthermore, not all mentioned object pose estimation approaches that leverage uncertainties in their methodology offer uncertainties of the final pose results (Brachmann et al., 2016). Also, the quality of the uncertainty estimates is often not explicitly evaluated. In addition, the incorporation of uncertainties mostly comes with complex changes in established pose estimation methodologies. In contrast, deep ensembles offer a simple approach to UQ. In this paper, we show how a deep-learning-based object pose estimation approach is extended to additionally quantify uncertainties with deep ensembles.

BACKGROUND OF SURFEMB AND DEEP ENSEMBLES
Given a multi-stage 6D object pose estimation approach, we evaluate the applicability of the UQ method of deep ensembles to the task of deep 6D object pose estimation. First, in Section 3.1, an introduction to the components of the SurfEmb (Haugaard and Buch, 2022) method is given, which is chosen as the exemplary method for the task of 6D object pose estimation. Section 3.2 explains the prerequisites that have to be fulfilled by the ensemble baseline models in order to ensure the creation of a well-calibrated deep ensemble.

SurfEmb
We conduct our experiments using SurfEmb, which is among the ten best-performing methods in the BOP Challenge 2022 (Sundermeyer et al., 2023). Like many 6D object pose estimation methods, SurfEmb is a multi-stage approach, which is why the insights gained in this work can be transferred to similar multi-stage approaches as well. It uses the 2D object instance detections produced by CosyPose (Labbé et al., 2020) with Mask R-CNN (He et al., 2017) and trains a deep neural network to predict 2D-3D correspondences that are forwarded to a PnP algorithm that estimates the object poses. More specifically, SurfEmb learns dense and continuous 2D-3D correspondence distributions by using high-dimensional embeddings of the object surface coordinates. The correspondence network is trained in a self-supervised fashion using a contrastive loss. The positive and negative training examples are provided by the so-called key model, a sinusoidal representation network (SIREN) MLP (Sitzmann et al., 2020) that transforms a 3D object surface coordinate into a 12D embedding space. The correspondence network, the so-called query model with a U-Net (Ronneberger et al., 2015) architecture and a ResNet18 (He et al., 2016) backbone, is then trained to predict pixel-wise 12D surface embeddings from an input image crop of an object instance. The 2D-3D correspondences are then used in AP3P (Ke and Roumeliotis, 2017), an algebraic P3P algorithm, to obtain object pose hypotheses, followed by a pose hypothesis scoring.
Pose hypotheses that exceed a score threshold are locally refined and can be further refined with depth data.

Deep Ensembles
Following the deep ensemble recipe of Lakshminarayanan et al. (2017), three prerequisites apply to the ensemble members: a random initialization of the model weights, training with a proper scoring rule, and, optionally, adversarial training to smooth the predictive distribution. Each individual model in the ensemble must fulfill the first two prerequisites. Let N be the ensemble size, i.e., the number of ensemble members that are trained in accordance with the above prerequisites and used to produce the ensemble results at inference time. In Ovadia et al. (2019), it was shown that an ensemble size of N = 5 produces good results. Nevertheless, N is an empirical value and should be determined for each application specifically. The higher the ensemble size, the better the underlying posterior predictive distribution of the ensemble outputs can be approximated.

METHODOLOGY
In this section, we describe how we apply deep ensembles to SurfEmb and how we evaluate the resulting ensemble. In Section 4.1, the prerequisites described in Section 3.2 are checked and applied to SurfEmb. The evaluation methodology is described in Section 4.2.

SurfEmb Deep Ensemble
In the following, we explain how the three prerequisites described in Section 3.2 are taken into account when using an ensemble of SurfEmb. In addition to these prerequisites, the ensemble size N must be defined empirically for our application, which is discussed in the experiments in Section 5.
Model Weights Initialization Scheme. In the original publication of SurfEmb, the ResNet18 backbone of the U-Net architecture of the query model is initialized with weights pre-trained on ImageNet (Deng et al., 2009). Instead, according to the deep ensemble recipe of Section 3.2, we randomly initialize each ensemble query model with different weights drawn from a normal distribution according to the scheme of He et al. (2015). As mentioned in Section 3.1, the key model that provides the targets during training is based on a SIREN MLP. Due to the sensitivity of the sine non-linearities, the SIREN MLP requires the specification of lower and upper bounds of a uniform distribution from which the weights are randomly drawn. By randomly initializing the key models in the ensemble, the models learn different realizations of the 12D embedding space of correspondence distributions.
Scoring Rule. The query model of SurfEmb is trained with a combined loss for the predicted visible object mask and the surface embedding. The object mask is scored by the binary cross entropy, while multi-class cross entropy is used as a scoring rule for the surface embeddings. As cross entropy can be considered a proper scoring rule, no modifications are required.
Adversarial Training. SurfEmb does not incorporate adversarial training. However, it is trained in a self-supervised manner with a contrastive loss taking both negative and positive examples into account. This training regime has a similar effect on the ensemble results as adversarial training. Therefore, and because this prerequisite is optional, we do not perform any further predictive distribution smoothing.
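The two random initialization schemes named under Model Weights Initialization Scheme can be illustrated with a minimal NumPy sketch. It is not taken from the SurfEmb code base: the layer sizes, the function names, and the frequency factor `omega_0 = 30` are assumptions for illustration. The He normal scheme uses a standard deviation of sqrt(2 / fan_in) (He et al., 2015), and the uniform bounds follow the initialization proposed for SIREN by Sitzmann et al. (2020).

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    """He et al. (2015) normal init: zero mean, std = sqrt(2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def siren_uniform(fan_in, fan_out, rng, first_layer=False, omega_0=30.0):
    """SIREN init (Sitzmann et al., 2020): weights uniform in [-bound, bound]."""
    if first_layer:
        bound = 1.0 / fan_in
    else:
        bound = np.sqrt(6.0 / fan_in) / omega_0  # omega_0 = 30 is an assumption
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

# One differently seeded generator per ensemble member (N = 10 as in the text).
members = []
for seed in range(10):
    rng = np.random.default_rng(seed)
    members.append({
        "query_layer": he_normal(64, 64, rng),                      # hypothetical size
        "key_first": siren_uniform(3, 256, rng, first_layer=True),  # 3D surface coordinate in
        "key_hidden": siren_uniform(256, 256, rng),                 # hypothetical width
    })
```

Seeding each member separately is one simple way to guarantee that no two members start from the same weights.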
The resulting SurfEmb ensemble consists of N = 10 independent query models, each of which generates a query and, based on that, produces an object pose estimate at inference time, forming the ensemble pose estimates. One major advantage of an ensemble approach for UQ is that no assumptions are made about the underlying distribution of predictions. In the case of object pose estimation, this especially presents an advantage over UQ methods that predict the parameters of distributions explicitly. As a drawback, one has to overcome the challenge of extracting meaningful, application-specific pose uncertainties. In the case of object symmetries, it is not guaranteed that all pose estimates of the ensemble members refer to the same object symmetry axis. To handle this, we select the pose in the ensemble prediction with the highest score as reference and align all N − 1 other poses to that reference based on the known symmetric transformations of the object model.
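The symmetry alignment described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes that the symmetries are given as rotation matrices in the object frame (identity included), that a symmetry-equivalent pose is obtained by right-multiplication, and that "closest to the reference" is measured by the geodesic rotation angle. Only the orientation is handled here.

```python
import numpy as np

def geodesic_angle(R_a, R_b):
    """Rotation angle in radians between two rotation matrices."""
    cos = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def align_to_reference(R_ref, R_est, symmetries):
    """Return the symmetry-equivalent version of R_est that is closest to R_ref."""
    candidates = [R_est @ S for S in symmetries]
    angles = [geodesic_angle(R_ref, R_c) for R_c in candidates]
    return candidates[int(np.argmin(angles))]

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees."""
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical object with a two-fold symmetry about its z-axis.
symmetries = [np.eye(3), rot_z(180.0)]
R_ref = rot_z(10.0)    # highest-scoring ensemble member, used as reference
R_est = rot_z(192.0)   # looks like rot_z(12) but refers to the other symmetry
R_aligned = align_to_reference(R_ref, R_est, symmetries)
```

Without this step, the two estimates would differ by about 180 degrees and inflate the ensemble standard deviation, although both describe nearly the same visible pose.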

Ensemble Evaluation
The pose estimator ensemble is evaluated on the test set of the corresponding training dataset. Let the test dataset $D$ be defined as $D = \{(x_t, y_t)\}_{t=1}^{T}$, where $x_t$ is the t-th input data point, i.e., the image crop of an object instance, and $y_t$ the corresponding annotated ground truth pose of the depicted object instance in the camera coordinate frame, composed of a 3D rotation matrix $\mathbf{R}$ and a translation vector $\mathbf{t}$. For the t-th entry in the test dataset, the n-th ensemble member $H_n$, with $n \in \{1, 2, \dots, N\}$, outputs a prediction, resulting in a sample of N predictions that are drawn from the posterior predictive distribution. An approximate distribution can be fit to this predictive distribution to get ensemble results that consist of the parameters of the approximate distribution. In the simplest case, a Gaussian distribution is defined by the mean $\mu_t$ and the standard deviation $\sigma_t$ of the ensemble predictions on the t-th dataset entry. Despite the assumption made about the underlying posterior predictive distribution, the standard deviation offers the advantage of being easy to interpret.
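As a small illustration of the Gaussian summary described above, the following sketch computes the per-component mean and standard deviation of the position estimates of an ensemble for a single test entry. The numbers are made up, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not state which estimator is used.

```python
import numpy as np

# Hypothetical translations (in mm) predicted by N = 5 ensemble members
# for one test entry t; each row is one member's (x, y, z) estimate.
preds = np.array([[10.2, -3.1, 500.4],
                  [10.0, -2.9, 498.7],
                  [ 9.8, -3.0, 501.9],
                  [10.1, -3.2, 499.5],
                  [ 9.9, -2.8, 500.0]])

mu_t = preds.mean(axis=0)            # per-component ensemble mean
sigma_t = preds.std(axis=0, ddof=1)  # per-component sample standard deviation (assumed)
```

In this toy example the spread along z is the largest of the three components.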

Uncertainty Evaluation
For a consistent evaluation of the ensemble's means and standard deviations, we compute reliability diagrams or calibration plots (Lakshminarayanan et al., 2017; Kuleshov et al., 2018). The reliability diagram for an ensemble or any forecaster $H$ that predicts a cumulative distribution function (CDF) $F_t$ of the t-th dataset entry is computed based on the following assumption: $H$ is well calibrated on dataset $D$ if the predicted CDFs match the empirical CDFs when the dataset size $T$ tends towards infinity (Kuleshov et al., 2018). Given $M$ chosen expected confidence levels $0 \le p_1 < p_2 < \dots < p_M \le 1$, the observed confidence level $\hat{p}_j$ corresponding to each threshold $p_j$ is calculated by computing the empirical frequency (Kuleshov et al., 2018):
$$\hat{p}_j = \frac{\left|\{\, y_t \mid F_t(y_t) \le p_j,\ t = 1, \dots, T \,\}\right|}{T} \qquad (1)$$
If $H$ is well calibrated, the values $\{(p_j, \hat{p}_j)\}_{j=1}^{M}$ form a straight line which passes through the origin and has slope 1 (Kuleshov et al., 2018). In our case, where we estimate the probability density function in terms of the mean and standard deviation of the ensemble predictions on the t-th dataset entry, we first compute the corresponding predicted CDF $F_t$ and obtain the empirical CDF by evaluating $F_t$ at the t-th target value $y_t$.
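The computation of the observed confidence levels can be sketched in a few lines of plain Python. The sketch assumes Gaussian predictive distributions given by a mean and a standard deviation per entry, and it uses the empirical frequency of Kuleshov et al. (2018), i.e., the fraction of entries whose predicted CDF value at the target does not exceed the expected level. The toy data and function names are hypothetical.

```python
import math

def gaussian_cdf(y, mu, sigma):
    """CDF of a normal distribution N(mu, sigma^2) evaluated at y."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def observed_confidences(targets, means, stds, levels):
    """Fraction of entries with F_t(y_t) <= p_j, for each expected level p_j."""
    cdf_values = [gaussian_cdf(y, m, s) for y, m, s in zip(targets, means, stds)]
    T = len(cdf_values)
    return [sum(v <= p for v in cdf_values) / T for p in levels]

levels = [j / 10 for j in range(11)]  # expected levels 0.0, 0.1, ..., 1.0

# Hypothetical toy data: four targets with ensemble means and standard deviations.
targets = [0.0, 1.0, 2.0, 3.0]
means = [0.5, 1.0, 1.0, 3.2]
stds = [1.0, 0.5, 1.0, 0.4]
p_hat = observed_confidences(targets, means, stds, levels)
```

Plotting `p_hat` against `levels` gives the reliability diagram; a well-calibrated forecaster yields points close to the diagonal.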
Based on this reliability diagram, we propose an uncertainty evaluation metric that takes into account the area between the perfect calibration, where the target and predictive distributions match, and the actually observed confidence levels. We call the metric the uncertainty calibration score (UCS). Whereas in the case of a perfect uncertainty calibration the area is zero, the worst-case uncertainty calibration corresponds to the maximum possible value of this area, which is $A_{\max} = 0.25$. Therefore, based on these lower and upper bounds, we propose to quantify the calibration quality by
$$\mathrm{UCS} = 1 - \frac{A}{A_{\max}}, \qquad (2)$$
where $A$ is the area estimated from the reliability diagram and calculated by using the composite trapezoidal rule:
$$A = \int_0^1 f(p)\,\mathrm{d}p \approx \sum_{j=1}^{M-1} \frac{f(p_j) + f(p_{j+1})}{2}\,\Delta p,$$
where $f(p_j) = |\hat{p}_j - p_j|$ is the absolute difference between the observed confidence level $\hat{p}_j$ estimated using Equation (1) and the expected confidence level $p_j$. The computation increment is set to $\Delta p = 0.1$. UCS is bounded to the interval [0, 1], where a higher value indicates a better calibration. Consequently, the metric is easy to interpret and facilitates a comparison or even a ranking of the different methods. The estimated area $A$ can be interpreted as the calibration error and is similar to the expected calibration error (ECE) (Guo et al., 2017), a popular network calibration error metric for semantic segmentation and classification tasks. Both the ECE and $A$ in Equation (2) are computed based on the differences between the observed accuracy or confidence and the predicted or expected confidence. Just as the ECE depends on the number of bins, UCS depends on the chosen computation increment $\Delta p$ in the calculation of the observed confidence levels of the reliability diagram in Equation (1). On the other hand, this dependency implies flexibility with respect to the choice of the assumed underlying distribution.
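A minimal sketch of the score follows, assuming the form UCS = 1 − A / A_max that follows from the stated bounds (area zero for perfect calibration, A_max = 0.25 for the worst case) and a composite trapezoidal rule over the expected confidence levels. The function name and the second, badly calibrated curve are hypothetical.

```python
def ucs(p_expected, p_observed, a_max=0.25):
    """Uncertainty calibration score, assumed as 1 - A / A_max."""
    f = [abs(po - pe) for pe, po in zip(p_expected, p_observed)]
    area = 0.0
    for j in range(len(f) - 1):
        dp = p_expected[j + 1] - p_expected[j]  # 0.1 for the levels used here
        area += 0.5 * (f[j] + f[j + 1]) * dp    # trapezoid between two levels
    return 1.0 - area / a_max

levels = [j / 10 for j in range(11)]  # increment of 0.1 as in the text

perfect = ucs(levels, levels)                   # observed equals expected
flat = ucs(levels, [0.0] + [0.5] * 9 + [1.0])   # hypothetical poorly calibrated curve
```

For the perfectly calibrated curve the score is 1, while the flat curve, which stays at 0.5 for all interior levels, yields a score of about 0.2.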
In combination with sampling-based UQ methods, UCS can be used to find the optimal sample size. Note that in contrast to the ECE, UCS can be applied to any calibration plot, regardless of whether the underlying task is regression, classification, or segmentation. The parameterization of the cumulative distribution, which is assumed for the calculation of the observed confidence levels, can be freely chosen, since in the case of deep ensembles the choice is not restricted. Therefore, varying distributions for the orientation and position component can be considered in the computation of UCS for a deep ensemble of an object pose estimator. Here, we use Gaussians to parameterize the distributions of both the orientation and position.
Since the ground truth posterior uncertainty of the ensemble outputs is unknown, we use synthetic data to evaluate the performance of our proposed quality score UCS. Given a dataset size of $T = 10000$, we sample each t-th target from a uniform distribution. The corresponding output distributions are represented by a mean and the standard deviation $\sigma_{\mathrm{pred}}$. For each target, the ensemble output mean is sampled from a Gaussian distribution centered on the target and with a standard deviation of $\sigma_{\mathrm{true}} = 0.3$. Based on the reliability diagram, we expect that UCS = 100.0 % if $\sigma_{\mathrm{true}} = \sigma_{\mathrm{pred}}$ for all $T$ dataset entries. The more $\sigma_{\mathrm{pred}}$ differs from the expected value of $\sigma_{\mathrm{true}}$, the more UCS should decrease. The results of the simulation are shown in Figure 1 and confirm our expectations. In case the predicted uncertainty matches the ground truth standard deviation, the UCS is 0.990. For wrong predictions, the UCS is smaller.
As the standard deviations do not have an upper bound, UCS slowly converges to zero if the estimated uncertainties are too large. In case of a few outliers (0.01 % of $T$) that successfully target the ground truth but where $\sigma_{\mathrm{pred}} \neq \sigma_{\mathrm{true}}$, UCS is not affected significantly and therefore proves to be robust to outliers.
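The synthetic experiment described above can be approximated with the following sketch in plain Python. Several details are assumptions, since the text does not specify them: the targets are drawn from a uniform distribution on [0, 1], a fixed seed is used, the score is taken as UCS = 1 − A / 0.25 with an increment of 0.1, and three values of the predicted standard deviation are compared. It is meant to reproduce the qualitative behavior only, not the exact numbers of Figure 1.

```python
import math
import random

def simulate_ucs(sigma_pred, sigma_true=0.3, T=10000, seed=0):
    """UCS of a Gaussian forecaster with std sigma_pred on synthetic data."""
    rng = random.Random(seed)
    levels = [j / 10 for j in range(11)]
    cdf_values = []
    for _ in range(T):
        target = rng.uniform(0.0, 1.0)        # assumed range of the uniform targets
        mean = rng.gauss(target, sigma_true)  # ensemble mean scattered around the target
        z = (target - mean) / (sigma_pred * math.sqrt(2.0))
        cdf_values.append(0.5 * (1.0 + math.erf(z)))  # predicted CDF at the target
    observed = [sum(v <= p for v in cdf_values) / T for p in levels]
    f = [abs(o - e) for o, e in zip(observed, levels)]
    area = sum(0.5 * (f[j] + f[j + 1]) * 0.1 for j in range(10))
    return 1.0 - area / 0.25

# Too small, matching, and too large predicted standard deviations.
scores = {s: simulate_ucs(s) for s in (0.1, 0.3, 1.0)}
```

With a matching standard deviation the score is close to 1, whereas both the overconfident (0.1) and the underconfident (1.0) setting score clearly lower.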

EXPERIMENTS
We conducted experiments on the T-LESS (Hodaň et al., 2017) and YCB-V (Xiang et al., 2018) datasets. Both are part of the BOP challenge and include photorealistic rendered training images of randomly sampled cluttered scenes that were added to the datasets as part of the BOP Challenge 2020 (Hodaň et al., 2020) and are used to train the models. While T-LESS consists of 30 industrial parts that are largely textureless and in many cases symmetric, the YCB-V dataset contains 21 objects of daily life. Both datasets include CAD object models. The BOP test dataset of T-LESS is composed of 20 cluttered scenes, for each of which 50 real test images are provided. Two example RGB images are shown in Figure 2. The BOP test images of YCB-V are sampled from 12 of the 92 video scenes of the original dataset. Because not all annotated ground truth targets are visible, we only consider target objects where at least 10 % of the object instance surface is visible and the visible object instance part is represented by at least 1024 pixels in the image. This results in 6423 and 4121 valid ground truth samples for T-LESS and YCB-V, respectively. As these criteria are also used during model training, this test data subset can be considered in-domain. In contrast, test data points of object instances of which less than 10 % is visible and whose visible surface masks have fewer than 1024 pixels form an out-of-domain test dataset. In Section 5.1, we evaluate the quality of the pose estimates of the trained ensemble members for T-LESS and YCB-V on their respective BOP test datasets. In Section 5.2, the estimated ensemble distributions are analyzed.

Evaluation of the Ensemble Pose Estimates
To ensure that the overall quality of the pose estimates of the trained ensembles is not affected by the random initialization, we evaluated each of the ensemble members, which we call the baseline models, separately. We also evaluated the pose average as the mean over all ensemble members. For evaluation, we applied the BOP error metrics MSPD, MSSD, and VSD (Hodaň et al., 2020). Table 1 reports, with and without depth refinement, the scores of the reproduced SurfEmb model provided by the authors, the mean of the scores achieved by the ensemble baseline models, and the scores of the estimated mean poses of the ensemble. Both ensembles trained on T-LESS and YCB-V consist of ten ensemble members each. The scores are defined as the average recall (AR) of the BOP error metrics MSPD, MSSD, and VSD. While the MSPD error measures the perceivable discrepancy and, therefore, is relevant for augmented reality applications, the MSSD error measures the maximum pose error in 3D space and is especially relevant for robot manipulation (Hodaň et al., 2020). Both metrics take object symmetries into account. The VSD is the visible surface discrepancy (Hodaň et al., 2020). Surprisingly, the poses of the randomly initialized ensemble members achieve scores similar to the ones estimated by the provided SurfEmb models for T-LESS and YCB-V that were initialized with weights pretrained on ImageNet. It seems that in this case, pretraining does not improve the quality of the predictions. Furthermore, it can be observed that ensembling the poses slightly improves the quality of the prediction, a phenomenon that is often taken advantage of in knowledge distillation, where the knowledge of an ensemble is compressed into a single model to overcome the ensemble drawback of high computational costs (Hinton et al., 2015).
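To make the symmetry-aware MSSD error mentioned above more concrete, the following NumPy sketch evaluates it for a toy object. It follows the definition in Hodaň et al. (2020), i.e., the minimum over the object symmetries of the maximum distance between corresponding model vertices under the estimated and the ground truth pose. It is an illustration and not the official BOP toolkit implementation; the vertices, poses, and function names are hypothetical.

```python
import numpy as np

def mssd(R_est, t_est, R_gt, t_gt, vertices, symmetries):
    """Min over symmetries S of the max over vertices x of ||P_est x - P_gt S x||."""
    est = vertices @ R_est.T + t_est
    errors = []
    for S_R, S_t in symmetries:
        sym_vertices = vertices @ S_R.T + S_t  # apply symmetry in the object frame
        gt = sym_vertices @ R_gt.T + t_gt
        errors.append(np.linalg.norm(est - gt, axis=1).max())
    return min(errors)

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees."""
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical model vertices, symmetric under a 180 degree rotation about z.
vertices = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
                     [0.0, 2.0, 0.0], [0.0, -2.0, 0.0], [0.0, 0.0, 1.0]])
zero = np.zeros(3)
t = np.array([0.0, 0.0, 100.0])

# The estimate differs from the ground truth exactly by the object symmetry.
with_sym = mssd(rot_z(180.0), t, np.eye(3), t, vertices,
                [(np.eye(3), zero), (rot_z(180.0), zero)])
without_sym = mssd(rot_z(180.0), t, np.eye(3), t, vertices, [(np.eye(3), zero)])
```

With the symmetry taken into account the error vanishes, whereas ignoring it reports an error of 4 units for a pose that is visually indistinguishable from the ground truth.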

Evaluation of the Ensemble Uncertainty
To eliminate possible influences of the 2D object detection on the evaluation of the pose estimation ensemble, the evaluation of the ensemble results is done on object instance image crops based on the ground truth. Figure 3 shows the reliability diagram of the query model ensemble on T-LESS, computed from the ensemble means and standard deviations and evaluated on the targets. The observed confidences of the ensemble queries are estimated based on the pixel-wise mean and the standard deviation of each embedding dimension. As the targets produced by the key models and the predicted queries are not in the same range, they were normalized with their minimum and maximum values, respectively. The ensemble outputs are predicted on the ground truth object instance crops of the 6423 visible samples of the T-LESS test dataset. As the area between the plotted curve of the observed confidence levels and the diagonal that represents a perfect calibration is very small, the query model seems to be very well calibrated. Accordingly, it has a high corresponding calibration score of UCS = 96.0 %. It can be observed that for low expected confidence levels, the observed confidence is slightly larger than the expected value. Analogously, for high expected confidence levels, the observed confidence level is slightly lower than expected. Based on the queries predicted by the ensemble members on the ground truth object instance image crops, the pose results of each ensemble member are computed. Figure 4 shows the reliability diagrams of the estimated orientation, in the form of rotation matrices, and position components on T-LESS and YCB-V. It can be observed that, overall, the T-LESS ensemble seems to be better calibrated than the YCB-V ensemble. Figure 4a shows that the quality of the calibration of the orientation component decreases with the local refinement step, both on T-LESS and YCB-V. The unrefined poses achieve a UCS of 88.5 % and 80.6 %, while the refined estimates score 79.6 % and 69.3 % on T-LESS and YCB-V, respectively. In contrast, the position component is only slightly affected by the local refinement, which decreases the UCS of the unrefined estimates by 3.1 % on T-LESS and by 0.8 % on YCB-V. While the depth refinement does not influence the orientation estimates, it improves the quality of the estimated position component, which also affects its calibration, as shown in Figure 4b. While the UCS of the position component on YCB-V increases from 33.6 % to 44.3 % when the estimates are refined with the depth data, the UCS on T-LESS decreases from 65.2 % to 57.6 %. The reason behind this will be part of future work.
In Figures 5a and 5b, the reliability diagrams of different orientation representations of the pose estimates on the ground truth image crops of T-LESS and YCB-V are shown. The poses are unrefined so that the influence of the orientation representations can be better observed and other influences are reduced as much as possible. The four chosen representations are quaternions, Euler angles, Rodrigues' axis-angle representation, and the rotation matrix. These representations were selected based on their importance in orientation and pose estimation tasks and their interpretability. Out of the four representations, Rodrigues' axis-angle representation achieves the highest UCS

DISCUSSION
The reliability diagrams of the T-LESS query model ensemble in Figure 3 and of the orientation components on both datasets in Figure 4 show that the deep ensembles are well calibrated. This is also reflected in the high UCS values. The almost perfect calibration of the query model ensembles is reduced during the subsequent steps of the SurfEmb pose estimation pipeline. This leads to the conclusion that, while deep ensembles are easy to apply in general, this ease may not transfer to an ensemble of a pose estimator with multiple stages that combines deep learning with classical algorithms. This may be circumvented by using an end-to-end trainable pose estimator where the PnP algorithm is implemented as part of the trainable network architecture, as was done in GDRNet (Wang et al., 2021), or by using error propagation. However, Figure 3 shows that the ensemble results of the query model on the T-LESS dataset are well calibrated, demonstrating that in this case the 2D object detection stage does not need to be included for UQ with deep ensembles. It has to be noted that the position components of the pose estimates are detrimental to the overall calibration. The reliability diagrams of different orientation representations on T-LESS and YCB-V, shown in Figures 5a and 5b, indicate that the choice of the representation influences the imperfections and thus the calibration. Notably, the calibration quality decreases for the quaternion representation, which may be due to the fact that the assumed normal distribution with one standard deviation value per element is not sufficient.
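The assumption of one normal distribution with one standard deviation value per element can be illustrated with a small sketch. It is a toy example with three hypothetical ensemble members and is not a claim about the cause of the quaternion results above. The helper names `ensemble_gaussian` and `quaternion_about_z`, the (w, x, y, z) element order, and the use of the population standard deviation are assumptions made for illustration.

```python
import math
from statistics import fmean, pstdev


def ensemble_gaussian(member_predictions):
    """Element-wise mean and standard deviation over ensemble members.

    member_predictions: list of M equally long lists, one per ensemble member.
    """
    columns = list(zip(*member_predictions))
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) for c in columns]  # population std; an assumption
    return means, stds


def quaternion_about_z(angle):
    """Unit quaternion (w, x, y, z) for a rotation by `angle` radians about z."""
    return [math.cos(angle / 2), 0.0, 0.0, math.sin(angle / 2)]


# Three hypothetical ensemble members that disagree slightly about the angle.
members = [quaternion_about_z(a) for a in (0.1, 0.2, 0.3)]
means, stds = ensemble_gaussian(members)

# Norm of the element-wise mean quaternion.
mean_norm = math.sqrt(sum(m * m for m in means))
```

Every member is a unit quaternion, yet the element-wise mean has a norm slightly below 1. The elements are coupled through the unit-norm constraint, which independent per-element Gaussians cannot express. This is one possible way in which such a model can be too simple for a given representation.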

CONCLUSION
In this work, we applied the state-of-the-art deep learning UQ method of deep ensembles to SurfEmb, one of the top-performing multi-stage deep 6D object pose estimation approaches, and evaluated the result on the T-LESS dataset. The adaptation of SurfEmb's correspondence network to the deep ensemble methodology is straightforward and we find that the ensemble on T-LESS is very well calibrated. However, the following PnP implementation, pose refinement strategies, and pose representations reduce the quality of the estimated predictive uncertainty. Furthermore, we introduced UCS, a novel metric to quantify the estimated uncertainty for regression tasks. UCS is easy to interpret and facilitates a comparison or even a ranking of the different methods. In future work, we want to extend the experiments to other pose estimation methods. Also, we want to investigate the influence of the error propagation of the network ensemble predictions through the PnP(-RANSAC) stage of the pose estimation pipeline. This may be done by using a differentiable PnP implementation like EPro-PnP (Chen et al., 2022) or a deep learning variant like Patch-PnP (Wang et al., 2021).

Figure 1. UCS for simulated uncertainty predictions with a ground truth standard deviation σ_true = 0.3 (dashed line). For a perfect calibration, where the simulated and predicted uncertainty match, UCS is close to 1.

Figure 2. Two examples of RGB images from different scenes of the BOP test dataset of T-LESS.

Figure 3. Reliability diagram of the T-LESS query model ensemble with the optimal ensemble size of eight ensemble members. The dashed gray diagonal line represents perfect calibration.
Locally refined ensemble positions with and without depth

Figure 4. Reliability diagrams of the estimated ensemble orientation and position components on T-LESS and YCB-V. The perfect calibration is represented by the dashed gray line.
Lakshminarayanan et al. (2017) offer a recipe with three prerequisites that baseline models must fulfill in order to obtain a well-calibrated deep ensemble of models that predicts accurate uncertainties. The prerequisites concern i) the model weights initialization scheme, ii) [...] used during model training, and iii) whether a form of adversarial training is applied.
Model Weights Initialization Scheme. The weights of each model in the ensemble are initialized randomly. The randomness causes the models to reach different modes of the loss during training and thus the ensemble can better cover the posterior distribution of predictions. This is one of the main reasons why deep ensembles work well in practice (Fort et al., 2020).

Table 1. Evaluation results of the differently trained SurfEmb models on the BOP test datasets of T-LESS and YCB-V, both without (RGB) and with depth refinement (RGB-D). Shown are the reproduced AR of the models trained and provided by the authors of SurfEmb (Pretrained), the mean AR of the randomly initialized ensemble members (Baseline), and the evaluation results of the mean poses of the ensembles (Ensemble).