Donghua University
In recent years, Monocular Depth Estimation (MDE) has evolved from predicting affine-invariant relative depth to estimating metric-scale (absolute) depth. However, local geometric inconsistencies in single-view depth maps and scale inconsistencies across different views still severely hinder their practical application in 3D matching and relative pose estimation. To address these challenges, we propose a geometry-guided depth correction framework for metric-scale relative pose estimation. Our approach first leverages pre-trained foundation models to extract initial metric depth, semi-dense correspondences, and high-dimensional semantic features from dual-view images. We then introduce a Local Depth Residual Learning (LDRL) module to correct geometric deviations. Finally, the corrected depth of stereo-matched pairs is integrated into a differentiable RANSAC framework to jointly optimize the relative pose with consistent scale. Experiments on ScanNet and 7-Scenes demonstrate that our method achieves superior performance and robustness across various challenging scenarios.
The figure below illustrates the key difference between existing depth correction methods (based on affine-invariance assumptions) and our proposed geometry-guided approach. While global affine correction (scale $s$ and translation $t$) or relative scale normalization ($s_g$, $\alpha$, $\beta_1$, $\beta_2$) fails to capture local geometric distortions, our patch-wise depth correction produces smaller errors and stronger local consistency.
Figure 1: Comparison between depth correction methods under affine-invariance assumptions and our proposed method. Column 3: Metric depth maps estimated by DepthPro; Column 4: Discrepancy with least-squares fitting of scale $s$ and translation $t$; Column 5: Discrepancy with stereo depth corrected using MADPose; Column 6: Discrepancy with our patch-wise depth correction, demonstrating smaller errors and stronger local consistency.
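Our framework leverages the comprehensive perception of multiple vision foundation models to address both intra-view geometric inconsistencies and inter-view scale misalignments in stereo depth estimation. It consists of three core stages: prior extraction with frozen foundation models, local depth correction, and metric relative pose estimation with a differentiable RANSAC solver.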
Figure 2: Overview of the proposed framework. First, multiple priors including initial metric depth maps, high-dimensional features, and semi-dense correspondences are extracted using frozen vision foundation models. Subsequently, the depth correction transformer predicts local depth residuals $\Delta s$ to rectify geometric inconsistencies. The corrected metric depths, together with matching patches, are then fed into a differentiable RANSAC solver to estimate the metric relative pose $(\mathbf{R}, \mathbf{t})$.
To overcome geometric inconsistencies in stereo depth, we predict a local multiplicative residual $\Delta s$ for each matched patch instead of global affine parameters. We employ an exponential mapping in log-space to perform non-linear depth rectification, ensuring physical positivity:
$\hat{D} = D^{prior} \cdot \exp(\beta \cdot \Delta s),$
where $\Delta s \in [-1, 1]$ and the scaling constant $\beta = \ln(5)$ together give a correction range of $[0.2, 5.0] \times D^{prior}$, since $\exp(\pm \ln 5) = 5^{\pm 1}$. The mapping is symmetric in log-space (residuals of equal magnitude and opposite sign scale the depth by reciprocal factors) and offers substantial local flexibility, enabling the network to selectively rectify geometric inconsistencies in challenging regions such as object boundaries.
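The mapping can be checked numerically with a minimal NumPy sketch. The function name `correct_depth` and the clipping step are illustrative assumptions; only the formula and the value $\beta = \ln(5)$ come from the text.

```python
import numpy as np

BETA = np.log(5.0)  # scaling constant stated in the text: beta = ln(5)

def correct_depth(d_prior, delta_s):
    """Apply D_hat = D_prior * exp(beta * delta_s) element-wise."""
    d_prior = np.asarray(d_prior, dtype=np.float64)
    # Assumption: clip to [-1, 1]; a real network may already bound its output.
    delta_s = np.clip(np.asarray(delta_s, dtype=np.float64), -1.0, 1.0)
    return d_prior * np.exp(BETA * delta_s)

# A 2 m prior with the extreme and neutral residuals.
d_prior = np.array([2.0, 2.0, 2.0])
delta_s = np.array([-1.0, 0.0, 1.0])
corrected = correct_depth(d_prior, delta_s)  # 0.2x, 1x and 5x the prior
```

Because the residual acts multiplicatively, the corrected depth stays positive for any positive prior, and opposite residuals cancel exactly when applied in sequence.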
For the Windowed Cross-Attention (WCA) mechanism, we explicitly constrain the search space of cross-attention using the matching prior. For a query coordinate $\mathbf{x}_i$ in image $I_0$, we define a local spatial window $\mathcal{W}(\mathbf{x}_i, \sigma)$ centered at its matched anchor pixel in the reference image $I_1$. The keys $K_{win}^i$ and values $V_{win}^i$ are gathered only from this window, so that feature interactions are explicitly focused on potential correspondence regions:
$\text{WCA}(f_0^i) = \text{Softmax}\left(\frac{f_0^i K_{win}^{i\top}}{\sqrt{d}}\right) V_{win}^i.$
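As a rough illustration of the formula above, the sketch below computes windowed attention for a single query in NumPy. The square window with half-width `radius`, the projection matrices `w_k` and `w_v`, and the function name are assumptions made for the example, not details given in the text.

```python
import numpy as np

def windowed_cross_attention(f0, feat1, anchor, radius, w_k, w_v):
    """Attend from one query feature f0 (d,) to a square window of
    feat1 (H, W, d) centered at anchor = (row, col)."""
    H, W, d = feat1.shape
    r, c = anchor
    # Clamp the window to the image bounds.
    r0, r1 = max(r - radius, 0), min(r + radius + 1, H)
    c0, c1 = max(c - radius, 0), min(c + radius + 1, W)
    window = feat1[r0:r1, c0:c1].reshape(-1, d)  # (n, d) tokens in the window
    K = window @ w_k                             # keys from the window only
    V = window @ w_v                             # values from the window only
    logits = K @ f0 / np.sqrt(d)                 # scaled dot-product scores
    logits -= logits.max()                       # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum()                           # softmax over window tokens
    return attn @ V, attn

rng = np.random.default_rng(0)
feat1 = rng.normal(size=(8, 8, 4))
f0 = rng.normal(size=4)
eye = np.eye(4)
out, attn = windowed_cross_attention(f0, feat1, (0, 0), 1, eye, eye)
```

Restricting keys and values to the window keeps the cost per query proportional to the window area rather than the full image.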
After obtaining the corrected depth $\hat{d}_0^i$ of each matched point, the 2D correspondences, written in homogeneous pixel coordinates $\bar{\mathbf{x}}_0^i$, are lifted into 3D space using the known camera intrinsic matrix $\mathbf{A}$:
$\mathbf{X}_0^i = \hat{d}_0^i \cdot \mathbf{A}^{-1} \bar{\mathbf{x}}_0^i.$
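The back-projection can be sketched in a few lines of NumPy. The intrinsic values and the helper name `lift_to_3d` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def lift_to_3d(pixels, depths, A):
    """Back-project N pixels (N, 2) with depths (N,) using intrinsics A (3, 3)."""
    pixels = np.asarray(pixels, dtype=np.float64)
    depths = np.asarray(depths, dtype=np.float64)
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])  # homogeneous pixels
    rays = homog @ np.linalg.inv(A).T                        # A^{-1} x_bar per row
    return depths[:, None] * rays                            # scale rays by depth

# Hypothetical pinhole intrinsics: focal length 500 px, principal point (320, 240).
A = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
points = lift_to_3d([[320.0, 240.0], [820.0, 240.0]], [2.0, 2.0], A)
```

The pixel at the principal point lands on the optical axis, while the pixel 500 px to its right lands one focal length off-axis per unit of depth.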
Drawing on the probabilistic selection paradigm of DSAC, we employ a soft-selection mechanism over pose hypotheses, directly utilizing the confidence scores $\omega_i$ provided by the matcher as weights to solve for the optimal relative pose $(\mathbf{R}, \mathbf{t})$:
$\min_{\mathbf{R}, \mathbf{t}} \sum_{i=1}^{N} \omega_i \cdot \rho(\|(\mathbf{R}\mathbf{X}_0^i + \mathbf{t}) - \mathbf{X}_1^i\|_2).$
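For intuition, when $\rho$ is taken to be the squared residual, the weighted objective has a closed-form solution via the weighted Kabsch (SVD) alignment. The sketch below shows only that special case; it is not the differentiable RANSAC pipeline itself, and the choice of $\rho$ and the function name are assumptions.

```python
import numpy as np

def weighted_rigid_align(X0, X1, w):
    """Solve min_{R,t} sum_i w_i ||R X0_i + t - X1_i||^2 in closed form."""
    w = np.asarray(w, dtype=np.float64)
    w = w / w.sum()
    mu0 = (w[:, None] * X0).sum(axis=0)          # weighted centroids
    mu1 = (w[:, None] * X1).sum(axis=0)
    Y0, Y1 = X0 - mu0, X1 - mu1
    H = (w[:, None] * Y0).T @ Y1                 # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    t = mu1 - R @ mu0
    return R, t

# Synthetic check: recover a known 30-degree rotation about z and a translation.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(20, 3))
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
X1 = X0 @ R_true.T + t_true
weights = rng.uniform(0.1, 1.0, size=20)
R_est, t_est = weighted_rigid_align(X0, X1, weights)
```

Since both point sets carry metric depth, the recovered translation is in metres rather than up to scale.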
Table 1 reports pose error AUC (%) on ScanNet at thresholds of $5^\circ$, $10^\circ$, and $20^\circ$. Our method (DepthPro + LDRL) significantly outperforms existing metric-scale methods, achieving 33.83% AUC@5 compared to 25.70% from Reloc3r-metric-pose. Notably, using ELoFTR with the original DepthPro depth map yields only 3.64% AUC@5, demonstrating that LDRL successfully corrects local geometric deviations.
| Category | Methods | Depth Source | AUC@5 | AUC@10 | AUC@20 |
|---|---|---|---|---|---|
| Relative | SP + LG + MADPose | MoGe | 23.36 | 43.39 | 61.08 |
| Relative | RoMa + MADPose | MoGe | 34.26 | 56.77 | 73.67 |
| Relative | Reloc3r-512 | None (ViT) | 34.79 | 58.37 | 75.56 |
| Relative | MASt3R | PointMap | 28.01 | 50.24 | 68.83 |
| Relative | DUSt3R | PointMap | 23.81 | 45.91 | 65.57 |
| Metric Scale | ELoFTR | DepthPro | 3.64 | 13.86 | 31.76 |
| Metric Scale | Reloc3r-metric-pose | None (ViT) | 25.70 | 50.20 | 70.07 |
| Metric Scale | Ours | DepthPro + LDRL | 33.83 | 53.98 | 70.15 |
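For readers unfamiliar with the metric, a pose-error AUC of this kind is commonly computed as the normalized area under the cumulative error curve up to each threshold. The sketch below follows that common recipe; how the per-pair error is defined is not stated here, so the function simply takes a list of per-pair errors in degrees as an assumed input.

```python
import numpy as np

def pose_auc(errors, thresholds):
    """Area under the recall-vs-error curve up to each threshold, in percent."""
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    # Prepend the origin so the curve starts at (0, 0).
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        # Truncate the curve at the threshold, holding the last recall value.
        e = np.concatenate((errors[:last], [t]))
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)  # trapezoid rule
        aucs.append(100.0 * area / t)
    return aucs

# Toy example with four image pairs and errors of 1, 2, 3 and 4 degrees.
toy_aucs = pose_auc([1.0, 2.0, 3.0, 4.0], [5.0, 10.0])
```

Under this definition, a method scores high at the strict threshold only if most pairs have errors well below it, which is why AUC@5 separates the methods more sharply than AUC@20.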
Table 2 compares our LDRL against standard feature matchers paired with different depth processing techniques. Global affine correction improves rotation estimation (for LoFTR, rotation AUC@$5^\circ$ rises from 38.7 to 52.6), but translation metrics change little and the strictest one even degrades (for LoFTR, translation AUC@0.1m drops from 19.7 to 17.5). Our LDRL achieves the best results across all metrics.
| Methods | Depth Source | Rot. AUC@$5^\circ$ | Rot. AUC@$10^\circ$ | Rot. AUC@$20^\circ$ | Trans. AUC@0.1m | Trans. AUC@0.5m | Trans. AUC@1m | Med. Rot. ($^\circ$) | Med. Trans. (m) |
|---|---|---|---|---|---|---|---|---|---|
| LoFTR | Monocular Depth | 38.7 | 55.6 | 69.1 | 19.7 | 58.8 | 72.0 | 2.8 | 0.12 |
| SuperGlue | Monocular Depth | 31.7 | 49.1 | 64.6 | 15.3 | 53.9 | 69.1 | 3.8 | 0.16 |
| SIFT | Monocular Depth | 21.5 | 33.7 | 44.0 | 9.9 | 36.8 | 46.8 | 7.1 | 0.30 |
| LoFTR + Global Affine | Global Affine Correction | 52.6 | 69.1 | 79.7 | 17.5 | 61.0 | 75.1 | 1.8 | 0.13 |
| SuperGlue + Global Affine | Global Affine Correction | 43.5 | 62.0 | 75.4 | 14.0 | 57.5 | 73.1 | 2.4 | 0.15 |
| SIFT + Global Affine | Global Affine Correction | 24.4 | 36.3 | 47.1 | 7.3 | 35.2 | 48.5 | 8.0 | 0.38 |
| Map-free (Regress) | - | 25.3 | 44.8 | 62.4 | 7.4 | 47.2 | 64.6 | 4.7 | 0.22 |
| Map-free (Match) | - | 26.5 | 46.5 | 64.2 | 8.6 | 47.7 | 64.6 | 4.3 | 0.21 |
| Ours | DepthPro + LDRL | 59.8 | 74.3 | 83.1 | 39.3 | 74.8 | 83.8 | 1.4 | 0.06 |
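To make the global affine baseline concrete, the sketch below fits one scale $s$ and one translation $t$ to a depth map by least squares. It illustrates the baseline, not the proposed correction; the reference depth `d_ref` and the toy values are hypothetical.

```python
import numpy as np

def fit_global_affine(d_pred, d_ref):
    """Least-squares fit of s, t minimizing ||s * d_pred + t - d_ref||^2."""
    d_pred = np.asarray(d_pred, dtype=np.float64).ravel()
    d_ref = np.asarray(d_ref, dtype=np.float64).ravel()
    M = np.stack([d_pred, np.ones_like(d_pred)], axis=1)  # design matrix [d, 1]
    (s, t), *_ = np.linalg.lstsq(M, d_ref, rcond=None)
    return s, t

# Case 1: the prediction differs from the reference by an exact affine map.
d_pred = np.array([1.0, 2.0, 3.0, 4.0])
s, t = fit_global_affine(d_pred, 2.0 * d_pred + 0.5)

# Case 2: one sample carries a local distortion that no single (s, t) removes.
d_ref_local = 2.0 * d_pred + 0.5
d_ref_local[2] += 1.0
s_loc, t_loc = fit_global_affine(d_pred, d_ref_local)
residual = s_loc * d_pred + t_loc - d_ref_local  # nonzero everywhere after the fit
```

In the second case the fit spreads the local error over all samples, which is the failure mode that a per-patch residual is meant to avoid.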
Figure 3 compares the percentage of patches satisfying different absolute error thresholds. LDRL (blue bars) outperforms Initial Depth (red bars) across all tolerances, achieving a fourfold improvement at the strict $\leq 0.1m$ threshold (0.44 vs. 0.11).
Figure 3: Cumulative error distribution of depth estimation on ScanNet-1500. The bar chart displays the ratio of patches where the absolute depth error falls within specific thresholds. "Initial Depth" represents the raw predictions from DepthPro, and "with LDRL" denotes the corrected depth maps after applying our proposed Local Depth Residual Learning correction.
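The quantity plotted in Figure 3 can be computed as in the following sketch. The function name and the toy depth values are illustrative assumptions; only the 0.1 m threshold appears in the text, and the other thresholds are placeholders.

```python
import numpy as np

def ratio_within(d_pred, d_gt, thresholds):
    """Fraction of patches whose absolute depth error is within each threshold (m)."""
    err = np.abs(np.asarray(d_pred, dtype=np.float64) - np.asarray(d_gt, dtype=np.float64))
    return {t: float(np.mean(err <= t)) for t in thresholds}

# Toy patch depths against a zero reference, so the values are the errors themselves.
d_gt = np.zeros(4)
d_pred = np.array([0.05, 0.2, 0.6, 1.5])
ratios = ratio_within(d_pred, d_gt, [0.1, 0.5, 1.0])
```

Because the thresholds are nested, the ratios are non-decreasing from the strictest to the loosest tolerance.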
Table 4 evaluates our method on the 7-Scenes visual localization task; each entry lists translation error (m) / rotation error ($^\circ$). Despite using only a single reference image, our method achieves an average translation error of 0.05m and an average rotation error of 1.2°, significantly outperforming all single-pair methods and approaching state-of-the-art multi-view methods.
| Category | Methods | Chess | Fire | Heads | Office | Pumpkin | RedKitchen | Stairs | Average |
|---|---|---|---|---|---|---|---|---|---|
| APR | LENS | 0.03/1.3 | 0.10/3.7 | 0.07/5.8 | 0.07/1.9 | 0.08/2.2 | 0.09/2.2 | 0.14/3.6 | 0.08/3.0 |
| APR | PMNet | 0.03/1.3 | 0.04/1.8 | 0.02/1.7 | 0.06/1.7 | 0.07/2.0 | 0.08/2.2 | 0.11/3.0 | 0.06/1.9 |
| APR | DFNet+NeFeS | 0.02/0.6 | 0.02/0.7 | 0.02/1.3 | 0.02/0.6 | 0.02/0.6 | 0.02/0.6 | 0.05/1.3 | 0.02/0.8 |
| APR | Marepo | 0.02/1.2 | 0.02/1.4 | 0.02/2.0 | 0.03/1.3 | 0.04/1.5 | 0.04/1.7 | 0.06/1.7 | 0.03/1.5 |
| Multi-pair | EssNet | 0.13/5.1 | 0.27/10.1 | 0.15/9.9 | 0.21/6.9 | 0.22/6.1 | 0.23/6.9 | 0.32/11.2 | 0.22/8.0 |
| Multi-pair | NC-EssNet | 0.12/5.6 | 0.26/9.6 | 0.14/10.7 | 0.20/6.7 | 0.22/5.7 | 0.22/6.3 | 0.31/7.9 | 0.21/7.5 |
| Multi-pair | Relative PN | 0.13/6.5 | 0.26/12.7 | 0.14/12.3 | 0.21/7.4 | 0.24/6.4 | 0.24/8.0 | 0.27/11.8 | 0.21/9.3 |
| Multi-pair | Relpose-GNN | 0.08/2.7 | 0.21/7.5 | 0.13/8.7 | 0.15/4.1 | 0.15/3.5 | 0.19/3.7 | 0.22/6.5 | 0.16/5.2 |
| Multi-pair | AnchorNet | 0.06/3.9 | 0.15/10.3 | 0.08/10.9 | 0.09/5.2 | 0.10/3.0 | 0.08/4.7 | 0.10/9.3 | 0.09/6.7 |
| Multi-pair | CamNet | 0.04/1.7 | 0.03/1.7 | 0.05/2.0 | 0.04/1.6 | 0.04/1.6 | 0.04/1.6 | 0.04/1.5 | 0.04/1.7 |
| Multi-pair | ExReNet (SN) | 0.06/2.2 | 0.09/3.2 | 0.04/3.3 | 0.07/2.2 | 0.11/2.7 | 0.09/2.6 | 0.33/7.3 | 0.11/3.3 |
| Multi-pair | ExReNet (SUNCG) | 0.05/1.6 | 0.07/2.5 | 0.03/2.7 | 0.06/1.8 | 0.07/2.0 | 0.07/2.1 | 0.19/4.9 | 0.08/2.5 |
| Multi-pair | Reloc3r-224 | 0.03/1.0 | 0.04/1.1 | 0.02/1.2 | 0.05/0.9 | 0.07/1.1 | 0.05/1.2 | 0.12/2.3 | 0.05/1.3 |
| Multi-pair | Reloc3r-512 | 0.03/0.9 | 0.03/0.8 | 0.01/1.0 | 0.04/0.9 | 0.06/1.1 | 0.04/1.3 | 0.07/1.3 | 0.04/1.0 |
| Multi-pair | Map-free (Regress) | 0.09/2.7 | 0.13/4.5 | 0.11/4.8 | 0.11/2.8 | 0.16/3.1 | 0.14/3.5 | 0.18/4.7 | 0.13/3.7 |
| Single-pair | RelocNet | 0.12/4.1 | 0.26/10.4 | 0.14/10.5 | 0.18/5.3 | 0.26/4.2 | 0.23/5.1 | 0.28/7.5 | 0.21/6.7 |
| Single-pair | Map-free (Match) | 0.10/2.9 | 0.12/5.0 | 0.11/5.4 | 0.12/3.0 | 0.16/3.2 | 0.14/3.5 | 0.21/4.5 | 0.14/3.9 |
| Single-pair | Ours | 0.04/1.0 | 0.04/1.1 | 0.02/1.0 | 0.05/1.1 | 0.06/1.4 | 0.05/1.4 | 0.07/1.8 | 0.05/1.2 |
Table 5 isolates LDRL's contribution on both ScanNet and 7-Scenes by comparing pose estimation with the initial depth against pose estimation with the corrected depth. On 7-Scenes, LDRL improves AUC@10 from 20.76% to 64.57% and reduces the median translation error from 0.14m to 0.05m.
| Dataset | Method | AUC@5 ↑ | AUC@10 ↑ | Med. Rot. ($^\circ$) ↓ | Med. Trans. (m) ↓ |
|---|---|---|---|---|---|
| ScanNet | Initial Depth | 3.64 | 13.86 | 3.48 | 0.20 |
| ScanNet | Ours | 33.83 | 53.98 | 1.40 | 0.06 |
| 7-Scenes | Initial Depth | 6.57 | 20.76 | 2.77 | 0.14 |
| 7-Scenes | Ours | 32.59 | 64.57 | 1.21 | 0.05 |
We present a geometry-guided depth correction framework that addresses local geometric inconsistency and cross-view scale misalignment in monocular depth estimation for metric-scale relative pose estimation. By introducing Local Depth Residual Learning with Windowed Cross-Attention and a differentiable RANSAC framework for joint optimization, our method achieves fine-grained depth correction beyond global affine alignment. Extensive experiments on ScanNet and 7-Scenes demonstrate superior accuracy and robustness in metric-scale pose estimation compared to both end-to-end regression and existing alignment methods, providing a practical solution for high-precision visual localization and 3D reconstruction tasks.
@inproceedings{xie2026geometry,
title={Geometry-Guided Depth Correction for Metric Relative Pose Estimation},
author={Xie, Shibin and Yin, Hao and Wang, Shuting and Fang, Xiaokang and Jin, Liang and Liu, Haotian and Zhang, Yanting and Cai, Shen},
booktitle={Proceedings of the International Conference on Multimedia Retrieval (ICMR)},
year={2026},
organization={ACM}
}