Geometry-Guided Depth Correction for Metric Relative Pose Estimation

ICMR'26, Amsterdam, Netherlands


Shibin Xie1, Hao Yin1, Shuting Wang1, Xiaokang Fang1, Liang Jin1, Haotian Liu1, Yanting Zhang1, Shen Cai1*

1Donghua University   

Abstract


In recent years, Monocular Depth Estimation (MDE) has evolved from predicting affine-invariant relative depth to estimating metric-scale (absolute) depth. However, local geometric inconsistencies in single-view depth maps and scale inconsistencies across different views still severely hinder their practical application in 3D matching and relative pose estimation. To address these challenges, we propose a geometry-guided depth correction framework for metric-scale relative pose estimation. Our approach first leverages pre-trained foundation models to extract initial metric depth, semi-dense correspondences, and high-dimensional semantic features from dual-view images. We then introduce a Local Depth Residual Learning (LDRL) module to correct geometric deviations. Finally, the corrected depth of stereo-matched pairs is integrated into a differentiable RANSAC framework to jointly optimize the relative pose with consistent scale. Experiments on ScanNet and 7-Scenes demonstrate that our method achieves superior performance and robustness across various challenging scenarios.



Depth Correction Comparison


The figure below illustrates the key difference between existing depth correction methods (based on affine-invariance assumptions) and our proposed geometry-guided approach. While global affine correction (scale $s$ and translation $t$) or relative scale normalization ($s_g$, $\alpha$, $\beta_1$, $\beta_2$) fails to capture local geometric distortions, our patch-wise depth correction produces smaller errors and stronger local consistency.

[Required Image: Figure 1 - Depth correction comparison (6 columns: reference image, target image, metric depth, least-squares affine correction, MADPose stereo correction, Ours)]

Figure 1: Comparison between depth correction methods under affine-invariance assumptions and our proposed method. Column 3: Metric depth maps estimated by DepthPro; Column 4: Discrepancy with least-squares fitting of scale $s$ and translation $t$; Column 5: Discrepancy with stereo depth corrected using MADPose; Column 6: Discrepancy with our patch-wise depth correction, demonstrating smaller errors and stronger local consistency.
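
For reference, the global affine baseline in Column 4 can be sketched as an ordinary least-squares fit of one scale $s$ and one translation $t$ per depth map. This is a minimal illustration and not the authors' code: the function name `fit_global_affine`, the direction of the fit ($s \cdot d + t$ mapped onto a reference depth), and the toy data are all assumptions.

```python
import numpy as np

def fit_global_affine(d_pred, d_ref):
    """Least-squares s, t such that s * d_pred + t approximates d_ref."""
    d_pred = np.asarray(d_pred, dtype=np.float64).ravel()
    d_ref = np.asarray(d_ref, dtype=np.float64).ravel()
    # Design matrix [d_pred, 1] so the unknowns are (s, t).
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, d_ref, rcond=None)
    return s, t

# Toy data (hypothetical): a depth map that is exactly affine-distorted.
rng = np.random.default_rng(0)
d_ref = rng.uniform(1.0, 4.0, size=200)
d_pred = (d_ref - 0.3) / 1.5
s, t = fit_global_affine(d_pred, d_ref)

# Add a local 20% distortion to a subset of pixels: a single (s, t)
# can no longer explain all pixels, so a residual error remains.
d_local = d_pred.copy()
d_local[:50] *= 1.2
s2, t2 = fit_global_affine(d_local, d_ref)
residual_local = np.abs(s2 * d_local + t2 - d_ref).max()
```

In the toy example the purely affine distortion is recovered exactly, while the locally distorted map leaves a nonzero residual, which is the failure mode the figure attributes to global correction.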



Framework Overview


Our framework combines the complementary priors of multiple vision foundation models to address both intra-view geometric inconsistencies and inter-view scale misalignments in stereo depth estimation. It consists of three core stages:

  • Multi-Prior Extraction: A monocular depth estimation foundation model (DepthPro) obtains initial metric depth maps; DINOv2 extracts high-dimensional semantic feature maps; and a semi-dense matcher (ELoFTR) establishes cross-view pixel correspondences.
  • Local Depth Residual Learning (LDRL): A "Correction Transformer" with alternating Self-Attention and Windowed Cross-Attention (WCA) layers predicts local depth residuals $\Delta s$ for each matched pixel, enabling fine-grained depth correction at the pixel level.
  • Differentiable Geometric Solving: The corrected 3D metric point correspondences are fed into a geometric solver equipped with differentiable RANSAC, enabling end-to-end joint optimization of pose loss and 3D geometric consistency loss.
[Required Image: Figure 2 - Framework overview pipeline showing Multi-Prior Extraction, Depth Correction Transformer with LDRL, and Differentiable RANSAC]

Figure 2: Overview of the proposed framework. First, multiple priors including initial metric depth maps, high-dimensional features, and semi-dense correspondences are extracted using frozen vision foundation models. Subsequently, the depth correction transformer predicts local depth residuals $\Delta s$ to rectify geometric inconsistencies. The corrected metric depths, together with matching patches, are then fed into a differentiable RANSAC solver to estimate the metric relative pose $(\mathbf{R}, \mathbf{t})$.



Local Depth Residual Learning & Windowed Cross-Attention


To overcome geometric inconsistencies in stereo depth, we predict a local multiplicative residual $\Delta s$ for each matched patch instead of global affine parameters. The residual is applied through an exponential mapping, i.e. additively in log-depth, which performs a non-linear rectification and keeps the corrected depth strictly positive:

$\hat{D} = D^{prior} \cdot \exp(\beta \cdot \Delta s),$

where $\Delta s \in [-1, 1]$ and the scaling constant $\beta = \ln(5)$ together give a correction range of $[0.2, 5.0] \times D^{prior}$. The mapping is symmetric in scale ($\Delta s$ and $-\Delta s$ multiply the depth by reciprocal factors) and offers substantial local flexibility, enabling the network to selectively rectify geometric inconsistencies in challenging regions such as object boundaries.
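
The correction formula can be checked numerically with a short sketch. The function name `correct_depth` is hypothetical, and clipping $\Delta s$ to $[-1, 1]$ is an assumption made here for illustration; the text only states the range, not how the network enforces it.

```python
import numpy as np

BETA = np.log(5.0)  # scaling constant beta = ln(5) from the text

def correct_depth(d_prior, delta_s, beta=BETA):
    """Apply D_hat = D_prior * exp(beta * delta_s) element-wise."""
    # Assumption: the residual is bounded by clipping. The actual
    # bounding mechanism (e.g. an output activation) is not specified.
    delta_s = np.clip(delta_s, -1.0, 1.0)
    return np.asarray(d_prior) * np.exp(beta * delta_s)

d_prior = np.array([2.0, 2.0, 2.0, 2.0])
delta_s = np.array([-1.0, 0.0, 0.5, 1.0])
d_hat = correct_depth(d_prior, delta_s)
# delta_s = -1 and +1 give 0.2x and 5x the prior depth respectively.
```

With a prior depth of 2.0, the extreme residuals map to 0.4 and 10.0, matching the stated range of $[0.2, 5.0] \times D^{prior}$, and a zero residual leaves the prior unchanged.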

For the Windowed Cross-Attention (WCA) mechanism, we explicitly constrain the search space of cross-attention using the matching prior. For a query coordinate $\mathbf{x}_i$ in image $I_0$, we define a local spatial window $\mathcal{W}(\mathbf{x}_i, \sigma)$ centered at its matched anchor pixel in the reference image $I_1$. Keys and values are taken only from this window, so feature interactions are restricted to potential correspondence regions:

$\text{WCA}(f_0^i) = \text{Softmax}\left(\frac{f_0^i K_{win}^{i\top}}{\sqrt{d}}\right) V_{win}^i.$
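
A single-query version of this operation might look like the following sketch. It is illustrative only: the learned key and value projections are omitted (keys and values are the raw window features), the window is a square of half-size `radius` standing in for $\sigma$, and all names are hypothetical.

```python
import numpy as np

def windowed_cross_attention(f0, feat1, anchor, radius):
    """Attend from one query feature to a local window of another view.

    f0:     (d,) query feature from image I_0
    feat1:  (H, W, d) feature map of image I_1
    anchor: (row, col) matched pixel in feat1
    radius: half window size (window is clipped at image borders)
    """
    H, W, d = feat1.shape
    r, c = anchor
    r0, r1 = max(r - radius, 0), min(r + radius + 1, H)
    c0, c1 = max(c - radius, 0), min(c + radius + 1, W)
    # Assumption: K_win = V_win = window features (no learned projections).
    window = feat1[r0:r1, c0:c1].reshape(-1, d)
    logits = window @ f0 / np.sqrt(d)
    logits -= logits.max()          # numerical stability for softmax
    attn = np.exp(logits)
    attn /= attn.sum()
    return attn @ window, attn

rng = np.random.default_rng(0)
feat1 = rng.normal(size=(8, 8, 16))
f0 = rng.normal(size=16)
out, attn = windowed_cross_attention(f0, feat1, anchor=(3, 4), radius=1)
```

Restricting attention to the window reduces the number of keys per query from $H \times W$ to at most $(2 \cdot \text{radius} + 1)^2$, which is the practical effect of using the matching prior as a spatial constraint.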



Differentiable Pose Estimation


After obtaining the corrected depth of matched points, the 2D correspondences are lifted into 3D space using the known camera intrinsic matrix $\mathbf{A}$:

$\mathbf{X}_0^i = \hat{d}_0^i \cdot \mathbf{A}^{-1} \bar{\mathbf{x}}_0^i.$
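
The lifting step is a standard pinhole back-projection and can be sketched as follows. The function name `lift_to_3d` and the example intrinsics are assumptions; the depth is treated as distance along the optical axis, which is what the formula implies when $\bar{\mathbf{x}}$ has a unit last coordinate.

```python
import numpy as np

def lift_to_3d(pixels, depths, A):
    """Back-project pixels (N, 2) with depths (N,) using intrinsics A (3, 3)."""
    pixels = np.asarray(pixels, dtype=np.float64)
    depths = np.asarray(depths, dtype=np.float64)
    # Homogeneous pixel coordinates x_bar = (u, v, 1).
    homog = np.hstack([pixels, np.ones((pixels.shape[0], 1))])
    # Row-wise A^{-1} x_bar gives rays with unit z-component.
    rays = homog @ np.linalg.inv(A).T
    return rays * depths[:, None]

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
A = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pts = lift_to_3d([[320, 240], [420, 240]], [2.0, 2.0], A)
```

The pixel at the principal point lifts to a point on the optical axis at depth 2, and the pixel 100 px to its right lifts to $x = 0.4$ at the same depth.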

Drawing on the probabilistic selection paradigm of DSAC, we employ a soft-selection mechanism over pose hypotheses, directly utilizing the confidence scores $\omega_i$ provided by the matcher as weights to solve for the optimal relative pose $(\mathbf{R}, \mathbf{t})$:

$\min_{\mathbf{R}, \mathbf{t}} \sum_{i=1}^{N} \omega_i \cdot \rho(\|(\mathbf{R}\mathbf{X}_0^i + \mathbf{t}) - \mathbf{X}_1^i\|_2).$
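
For intuition, when $\rho$ is taken to be the squared error, the weighted objective has a closed-form solution (the weighted Kabsch/Procrustes alignment). The sketch below shows only that special case; it does not reproduce the robust loss, the hypothesis sampling, or the soft selection described above, and the function name and toy data are hypothetical.

```python
import numpy as np

def weighted_rigid_align(X0, X1, w):
    """Closed-form (R, t) minimizing sum_i w_i * ||R X0_i + t - X1_i||^2."""
    w = np.asarray(w, dtype=np.float64)
    w = w / w.sum()
    mu0, mu1 = w @ X0, w @ X1                 # weighted centroids
    Y0, Y1 = X0 - mu0, X1 - mu1
    H = (Y0 * w[:, None]).T @ Y1              # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (no reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu1 - R @ mu0
    return R, t

def _rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def _rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

# Toy check: recover a known pose from noise-free 3D correspondences.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(20, 3))
R_gt, t_gt = _rot_z(0.3) @ _rot_x(0.2), np.array([0.1, -0.2, 0.5])
X1 = X0 @ R_gt.T + t_gt
w = rng.uniform(0.1, 1.0, size=20)            # stand-in for matcher confidences
R_est, t_est = weighted_rigid_align(X0, X1, w)
```

Because both point sets are in metric units, the recovered translation is also metric, which is why consistent depth scale across the two views matters for this step.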



Experimental Results


1. Metric Relative Pose Estimation on ScanNet-1500

Table 1 reports pose error AUC (%) at thresholds of $5^\circ$, $10^\circ$, and $20^\circ$. Our method (DepthPro + LDRL) significantly outperforms existing metric-scale methods, achieving 33.83% AUC@5 compared to 25.70% for Reloc3r-metric-pose. Notably, using ELoFTR with the original DepthPro depth map yields only 3.64% AUC@5, which indicates that the gain comes from LDRL correcting local geometric deviations.

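| Category | Methods | Depth Source | AUC@5 | AUC@10 | AUC@20 |
|---|---|---|---|---|---|
| Relative | SP + LG + MADPose | MoGe | 23.36 | 43.39 | 61.08 |
| Relative | RoMa + MADPose | MoGe | 34.26 | 56.77 | 73.67 |
| Relative | Reloc3r-512 | None (ViT) | 34.79 | 58.37 | 75.56 |
| Relative | MASt3R | PointMap | 28.01 | 50.24 | 68.83 |
| Relative | DUSt3R | PointMap | 23.81 | 45.91 | 65.57 |
| Metric Scale | ELoFTR | DepthPro | 3.64 | 13.86 | 31.76 |
| Metric Scale | Reloc3r-metric-pose | None (ViT) | 25.70 | 50.20 | 70.07 |
| Metric Scale | Ours | DepthPro + LDRL | 33.83 | 53.98 | 70.15 |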
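
The AUC metric can be computed as the normalized area under the recall-versus-error curve up to each threshold. The sketch below follows a convention that is common in the feature-matching literature; whether the reported numbers use exactly this protocol, and how the per-pair pose error is defined, are assumptions not stated in the text.

```python
import numpy as np

def pose_auc(errors, thresholds):
    """Area under the cumulative recall curve, normalized per threshold."""
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    # Prepend the origin so the curve starts at (0, 0).
    errors = np.concatenate([[0.0], errors])
    recall = np.concatenate([[0.0], recall])
    aucs = []
    for thr in thresholds:
        last = np.searchsorted(errors, thr)
        # Truncate the curve at the threshold, holding the last recall value.
        e = np.concatenate([errors[:last], [thr]])
        r = np.concatenate([recall[:last], [recall[last - 1]]])
        area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)  # trapezoids
        aucs.append(area / thr)
    return aucs

# Hypothetical per-pair pose errors in degrees.
example_aucs = pose_auc([1.0, 3.0, 8.0, 25.0], thresholds=[5, 10, 20])
```

A method whose errors are all zero scores 1.0 at every threshold, and one whose errors all exceed the threshold scores 0.0, so the metric rewards both the fraction of pairs solved and how accurately they are solved.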

2. Geometric Correction Strategy Analysis on ScanNet-1500

Table 2 compares our LDRL against standard feature matchers paired with different depth processing techniques. While global affine correction improves rotation estimation, its improvement on translation metrics remains limited. Our LDRL achieves the best results across all metrics.

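| Methods | Depth Source | Rot. AUC ($5^\circ$/$10^\circ$/$20^\circ$) | Trans. AUC (0.1m/0.5m/1m) | Med. Rot. ($^\circ$) | Med. Trans. (m) |
|---|---|---|---|---|---|
| LoFTR | Monocular Depth | 38.7 / 55.6 / 69.1 | 19.7 / 58.8 / 72.0 | 2.8 | 0.12 |
| SuperGlue | Monocular Depth | 31.7 / 49.1 / 64.6 | 15.3 / 53.9 / 69.1 | 3.8 | 0.16 |
| SIFT | Monocular Depth | 21.5 / 33.7 / 44.0 | 9.9 / 36.8 / 46.8 | 7.1 | 0.30 |
| LoFTR + Global Affine | Global Affine Correction | 52.6 / 69.1 / 79.7 | 17.5 / 61.0 / 75.1 | 1.8 | 0.13 |
| SuperGlue + Global Affine | Global Affine Correction | 43.5 / 62.0 / 75.4 | 14.0 / 57.5 / 73.1 | 2.4 | 0.15 |
| SIFT + Global Affine | Global Affine Correction | 24.4 / 36.3 / 47.1 | 7.3 / 35.2 / 48.5 | 8.0 | 0.38 |
| Map-free (Regress) | - | 25.3 / 44.8 / 62.4 | 7.4 / 47.2 / 64.6 | 4.7 | 0.22 |
| Map-free (Match) | - | 26.5 / 46.5 / 64.2 | 8.6 / 47.7 / 64.6 | 4.3 | 0.21 |
| Ours | DepthPro + LDRL | 59.8 / 74.3 / 83.1 | 39.3 / 74.8 / 83.8 | 1.4 | 0.06 |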

3. Cumulative Depth Error Distribution

Figure 3 compares the percentage of patches satisfying different absolute error thresholds. LDRL (blue bars) outperforms Initial Depth (red bars) across all tolerances, achieving a fourfold improvement at the strict $\leq 0.1$ m threshold (0.44 vs. 0.11).

[Required Image: Figure 3 - Cumulative error distribution bar chart comparing Initial Depth vs with LDRL at thresholds 0.05/0.1/0.15/0.3/0.5/1.0m]

Figure 3: Cumulative error distribution of depth estimation on ScanNet-1500. The bar chart displays the ratio of patches where the absolute depth error falls within specific thresholds. "Initial Depth" represents the raw predictions from DepthPro, and "with LDRL" denotes the corrected depth maps after applying our proposed Local Depth Residual Learning correction.
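
The statistic plotted in Figure 3 is the fraction of samples whose absolute depth error is within each threshold. A minimal sketch, with a hypothetical function name and toy values, using the thresholds listed for the figure:

```python
import numpy as np

def cumulative_error_ratios(d_est, d_gt,
                            thresholds=(0.05, 0.1, 0.15, 0.3, 0.5, 1.0)):
    """Fraction of samples with |d_est - d_gt| <= thr, for each threshold (m)."""
    err = np.abs(np.asarray(d_est, dtype=np.float64)
                 - np.asarray(d_gt, dtype=np.float64))
    return {thr: float(np.mean(err <= thr)) for thr in thresholds}

# Toy depths in meters (hypothetical): errors are about 0.04, 0.12, 0.4, 1.5.
d_gt = np.array([1.0, 2.0, 3.0, 4.0])
d_est = np.array([1.04, 2.12, 3.4, 5.5])
ratios = cumulative_error_ratios(d_est, d_gt)
```

Because the thresholds are nested, the ratios are non-decreasing from left to right, which is why the chart is described as a cumulative distribution.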

4. Metric Pose Estimation on 7-Scenes

Table 4 evaluates our method on the 7-Scenes visual localization task. Despite using only a single reference image, our method achieves an average translation error of 0.05 m and an average rotation error of 1.2°, significantly outperforming all single-pair methods and approaching state-of-the-art multi-view methods.

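Each entry is translation error (m) / rotation error ($^\circ$).

| Category | Methods | Chess | Fire | Heads | Office | Pumpkin | RedKitchen | Stairs | Average |
|---|---|---|---|---|---|---|---|---|---|
| APR | LENS | 0.03/1.3 | 0.10/3.7 | 0.07/5.8 | 0.07/1.9 | 0.08/2.2 | 0.09/2.2 | 0.14/3.6 | 0.08/3.0 |
| APR | PMNet | 0.03/1.3 | 0.04/1.8 | 0.02/1.7 | 0.06/1.7 | 0.07/2.0 | 0.08/2.2 | 0.11/3.0 | 0.06/1.9 |
| APR | DFNet+NeFeS | 0.02/0.6 | 0.02/0.7 | 0.02/1.3 | 0.02/0.6 | 0.02/0.6 | 0.02/0.6 | 0.05/1.3 | 0.02/0.8 |
| APR | Marepo | 0.02/1.2 | 0.02/1.4 | 0.02/2.0 | 0.03/1.3 | 0.04/1.5 | 0.04/1.7 | 0.06/1.7 | 0.03/1.5 |
| Multi-pair | EssNet | 0.13/5.1 | 0.27/10.1 | 0.15/9.9 | 0.21/6.9 | 0.22/6.1 | 0.23/6.9 | 0.32/11.2 | 0.22/8.0 |
| Multi-pair | NC-EssNet | 0.12/5.6 | 0.26/9.6 | 0.14/10.7 | 0.20/6.7 | 0.22/5.7 | 0.22/6.3 | 0.31/7.9 | 0.21/7.5 |
| Multi-pair | Relative PN | 0.13/6.5 | 0.26/12.7 | 0.14/12.3 | 0.21/7.4 | 0.24/6.4 | 0.24/8.0 | 0.27/11.8 | 0.21/9.3 |
| Multi-pair | Relpose-GNN | 0.08/2.7 | 0.21/7.5 | 0.13/8.7 | 0.15/4.1 | 0.15/3.5 | 0.19/3.7 | 0.22/6.5 | 0.16/5.2 |
| Multi-pair | AnchorNet | 0.06/3.9 | 0.15/10.3 | 0.08/10.9 | 0.09/5.2 | 0.10/3.0 | 0.08/4.7 | 0.10/9.3 | 0.09/6.7 |
| Multi-pair | CamNet | 0.04/1.7 | 0.03/1.7 | 0.05/2.0 | 0.04/1.6 | 0.04/1.6 | 0.04/1.6 | 0.04/1.5 | 0.04/1.7 |
| Multi-pair | ExReNet (SN) | 0.06/2.2 | 0.09/3.2 | 0.04/3.3 | 0.07/2.2 | 0.11/2.7 | 0.09/2.6 | 0.33/7.3 | 0.11/3.3 |
| Multi-pair | ExReNet (SUNCG) | 0.05/1.6 | 0.07/2.5 | 0.03/2.7 | 0.06/1.8 | 0.07/2.0 | 0.07/2.1 | 0.19/4.9 | 0.08/2.5 |
| Multi-pair | Reloc3r-224 | 0.03/1.0 | 0.04/1.1 | 0.02/1.2 | 0.05/0.9 | 0.07/1.1 | 0.05/1.2 | 0.12/2.3 | 0.05/1.3 |
| Multi-pair | Reloc3r-512 | 0.03/0.9 | 0.03/0.8 | 0.01/1.0 | 0.04/0.9 | 0.06/1.1 | 0.04/1.3 | 0.07/1.3 | 0.04/1.0 |
| Multi-pair | Map-free (Regress) | 0.09/2.7 | 0.13/4.5 | 0.11/4.8 | 0.11/2.8 | 0.16/3.1 | 0.14/3.5 | 0.18/4.7 | 0.13/3.7 |
| Single-pair | RelocNet | 0.12/4.1 | 0.26/10.4 | 0.14/10.5 | 0.18/5.3 | 0.26/4.2 | 0.23/5.1 | 0.28/7.5 | 0.21/6.7 |
| Single-pair | Map-free (Match) | 0.10/2.9 | 0.12/5.0 | 0.11/5.4 | 0.12/3.0 | 0.16/3.2 | 0.14/3.5 | 0.21/4.5 | 0.14/3.9 |
| Single-pair | Ours | 0.04/1.0 | 0.04/1.1 | 0.02/1.0 | 0.05/1.1 | 0.06/1.4 | 0.05/1.4 | 0.07/1.8 | 0.05/1.2 |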
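
Translation and rotation errors of this kind are typically computed as the Euclidean distance between translations and the geodesic angle between rotations. The sketch below shows those two standard definitions; the exact evaluation protocol behind the table (for example, how per-frame errors are aggregated per scene) is not specified here, and the function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle (degrees) between two rotation matrices."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error_m(t_est, t_gt):
    """Euclidean distance (meters) between two translation vectors."""
    return float(np.linalg.norm(np.asarray(t_est, dtype=np.float64)
                                - np.asarray(t_gt, dtype=np.float64)))

# Toy example: estimate is off by a 2 degree rotation about z and 5 cm.
a = np.radians(2.0)
R_gt = np.eye(3)
R_est = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a), np.cos(a), 0.0],
                  [0.0, 0.0, 1.0]])
rot_err = rotation_error_deg(R_est, R_gt)
trans_err = translation_error_m([0.03, 0.04, 0.0], [0.0, 0.0, 0.0])
```

Measuring translation error in meters is only meaningful when the estimated pose carries metric scale, which is the setting this table evaluates.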


Ablation Study


Table 5 shows LDRL's contribution on both the ScanNet and 7-Scenes datasets. On 7-Scenes, LDRL improves AUC@10 from 20.76% to 64.57% and reduces the median translation error from 0.14 m to 0.05 m.

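| Dataset | Method | AUC@5 ↑ | AUC@10 ↑ | Med. Rot. ($^\circ$) ↓ | Med. Trans. (m) ↓ |
|---|---|---|---|---|---|
| ScanNet | Initial Depth | 3.64 | 13.86 | 3.48 | 0.20 |
| ScanNet | Ours | 33.83 | 53.98 | 1.40 | 0.06 |
| 7-Scenes | Initial Depth | 6.57 | 20.76 | 2.77 | 0.14 |
| 7-Scenes | Ours | 32.59 | 64.57 | 1.21 | 0.05 |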


Conclusion


We present a geometry-guided depth correction framework that addresses local geometric inconsistency and cross-view scale misalignment in monocular depth estimation for metric-scale relative pose estimation. By introducing Local Depth Residual Learning with Windowed Cross-Attention and a differentiable RANSAC framework for joint optimization, our method achieves fine-grained depth correction beyond global affine alignment. Extensive experiments on ScanNet and 7-Scenes demonstrate superior accuracy and robustness in metric-scale pose estimation compared to both end-to-end regression and existing alignment methods, providing a practical solution for high-precision visual localization and 3D reconstruction tasks.

Citation


@inproceedings{xie2026geometry,
  title={Geometry-Guided Depth Correction for Metric Relative Pose Estimation},
  author={Xie, Shibin and Yin, Hao and Wang, Shuting and Fang, Xiaokang and Jin, Liang and Liu, Haotian and Zhang, Yanting and Cai, Shen},
  booktitle={Proceedings of the International Conference on Multimedia Retrieval (ICMR)},
  year={2026},
  organization={ACM}
}