Geometry-Guided Depth Correction for Metric Relative Pose Estimation

ICMR'26, Amsterdam, Netherlands

Shibin Xie¹, Hao Yin¹, Shuting Wang¹, Xiaokang Fang¹, Liang Jin¹, Haotian Liu¹, Yanting Zhang¹, Shen Cai¹*

¹Donghua University

Abstract

In recent years, Monocular Depth Estimation (MDE) has evolved from predicting affine-invariant relative depth to estimating metric-scale (absolute) depth. However, local geometric inconsistencies in single-view depth maps and scale inconsistencies across different views still severely hinder their practical application in 3D matching and relative pose estimation. To address these challenges, we propose a geometry-guided depth correction framework for metric-scale relative pose estimation. Our approach first leverages pre-trained foundation models to extract initial metric depth, semi-dense correspondences, and high-dimensional semantic features from dual-view images. We then introduce a Local Depth Residual Learning (LDRL) module to correct geometric deviations. Finally, the corrected depth of stereo-matched pairs is integrated into a differentiable RANSAC framework to jointly optimize the relative pose with consistent scale. Experiments on ScanNet and 7-Scenes demonstrate that our method achieves superior performance and robustness across various challenging scenarios.

Depth Correction Comparison

Figure 1 illustrates the key difference between existing depth correction methods (based on affine-invariance assumptions) and our proposed geometry-guided approach. Global affine correction (scale $s$ and translation $t$) and relative scale normalization ($s_g$, $\alpha$, $\beta_1$, $\beta_2$) both fail to capture local geometric distortions, whereas our patch-wise depth correction produces smaller errors and stronger local consistency.

Figure 1: Comparison between depth correction methods under affine-invariance assumptions and our proposed method. Column 3: Metric depth maps estimated by DepthPro; Column 4: Discrepancy with least-squares fitting of scale $s$ and translation $t$; Column 5: Discrepancy with stereo depth corrected using MADPose; Column 6: Discrepancy with our patch-wise depth correction, demonstrating smaller errors and stronger local consistency.
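The global affine baseline in Column 4 fits one scale $s$ and one translation $t$ per depth map. A minimal sketch of such a least-squares fit follows, assuming paired prior and reference depths given as flat lists; the function names `fit_global_affine` and `residuals` are illustrative and not taken from the released code.

```python
def fit_global_affine(prior, reference):
    """Least-squares scale s and shift t such that s * prior + t fits reference."""
    n = len(prior)
    if n != len(reference) or n < 2:
        raise ValueError("need at least two paired depth samples")
    mean_p = sum(prior) / n
    mean_r = sum(reference) / n
    var_p = sum((p - mean_p) ** 2 for p in prior)
    if var_p == 0.0:
        raise ValueError("prior depths are constant; scale is undetermined")
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(prior, reference))
    s = cov / var_p               # closed-form slope of the 1D regression
    t = mean_r - s * mean_p       # closed-form intercept
    return s, t


def residuals(prior, reference, s, t):
    """Per-sample discrepancy that remains after the global affine correction."""
    return [r - (s * p + t) for p, r in zip(prior, reference)]


if __name__ == "__main__":
    prior = [1.0, 2.0, 3.0, 4.0]
    reference = [2.5, 4.5, 6.5, 9.5]   # last sample carries a local distortion
    s, t = fit_global_affine(prior, reference)
    print(s, t, residuals(prior, reference, s, t))
```

With a single locally distorted sample, the global fit spreads the error over every sample instead of correcting the distorted one, which is the behavior that motivates a patch-wise correction.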

Framework Overview

Our framework combines the complementary priors of multiple vision foundation models to address both intra-view geometric inconsistencies and inter-view scale misalignments in stereo depth estimation. It consists of three core stages:

Multi-Prior Extraction: A monocular depth estimation foundation model (DepthPro) obtains initial metric depth maps; DINOv2 extracts high-dimensional semantic feature maps; and a semi-dense matcher (ELoFTR) establishes cross-view pixel correspondences.
Local Depth Residual Learning (LDRL): A "Correction Transformer" with alternating Self-Attention and Windowed Cross-Attention (WCA) layers predicts a local depth residual $\Delta s$ for each matched patch, enabling fine-grained, patch-wise depth correction.
Differentiable Geometric Solving: The corrected 3D metric point correspondences are fed into a geometric solver equipped with differentiable RANSAC, enabling end-to-end joint optimization of pose loss and 3D geometric consistency loss.

Figure 2: Overview of the proposed framework. First, multiple priors including initial metric depth maps, high-dimensional features, and semi-dense correspondences are extracted using frozen vision foundation models. Subsequently, the depth correction transformer predicts local depth residuals $\Delta s$ to rectify geometric inconsistencies. The corrected metric depths, together with matching patches, are then fed into a differentiable RANSAC solver to estimate the metric relative pose $(\mathbf{R}, \mathbf{t})$.

Local Depth Residual Learning & Windowed Cross-Attention

To overcome geometric inconsistencies in stereo depth, we predict a local multiplicative residual $\Delta s$ for each matched patch instead of global affine parameters. We employ an exponential mapping in log-space to perform non-linear depth rectification, ensuring physical positivity:

$\hat{D} = D^{prior} \cdot \exp(\beta \cdot \Delta s),$

where $\Delta s \in [-1, 1]$ and the scaling constant is $\beta = \ln(5)$. Because $\exp(\pm \ln 5) = 5^{\pm 1}$, the correction range is $[0.2, 5.0] \times D^{prior}$. The mapping is symmetric in scale ($\Delta s$ and $-\Delta s$ give reciprocal factors) and leaves substantial local flexibility, enabling the network to selectively rectify geometric inconsistencies in challenging regions such as object boundaries.
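The exponential mapping can be checked numerically. The sketch below applies $\hat{D} = D^{prior} \cdot \exp(\beta \cdot \Delta s)$ to a single depth value; the function name `correct_depth` is illustrative, and the clamping of $\Delta s$ is an assumption about how the range $[-1, 1]$ is enforced (the network may bound its output differently).

```python
import math

BETA = math.log(5.0)  # scaling constant from the text: beta = ln(5)


def correct_depth(d_prior, delta_s):
    """Apply the multiplicative log-space correction to one prior depth value."""
    if d_prior <= 0.0:
        raise ValueError("prior depth must be positive")
    # Assumption: out-of-range residuals are clamped to [-1, 1].
    delta_s = max(-1.0, min(1.0, delta_s))
    return d_prior * math.exp(BETA * delta_s)


if __name__ == "__main__":
    d = 2.0
    for ds in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(ds, correct_depth(d, ds))
```

The output is positive for every residual, $\Delta s = 0$ leaves the prior unchanged, and the two extremes scale the prior by factors of 0.2 and 5.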

For the Windowed Cross-Attention (WCA) mechanism, we explicitly constrain the search space of cross-attention using the matching prior. For a query coordinate $\mathbf{x}_i$ in image $I_0$, we define a local spatial window $\mathcal{W}(\mathbf{x}_i, \sigma)$ centered at its anchor pixel, the matched location in the reference image $I_1$. The keys $K_{win}^i$ and values $V_{win}^i$ are taken only from this window, so feature interactions focus on potential correspondence regions:

$\text{WCA}(f_0^i) = \text{Softmax}\left(\frac{f_0^i K_{win}^{i\top}}{\sqrt{d}}\right) V_{win}^i.$
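A single-query version of the WCA formula can be sketched as follows. This is a simplified illustration, assuming a square window whose half-size is $\sigma$ (in feature-map cells), a single attention head, and no learned projections; `window_indices` and `windowed_cross_attention` are hypothetical names, not the released implementation.

```python
import math


def window_indices(anchor, sigma, height, width):
    """(row, col) cells of the square window around the anchor, clipped to the map."""
    r0, c0 = anchor
    rows = range(max(0, r0 - sigma), min(height, r0 + sigma + 1))
    cols = range(max(0, c0 - sigma), min(width, c0 + sigma + 1))
    return [(r, c) for r in rows for c in cols]


def windowed_cross_attention(query, keys, values):
    """Softmax(q K^T / sqrt(d)) V for one query over the keys/values of its window."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(logits)                          # subtract the max for stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim_v = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return out, weights


if __name__ == "__main__":
    cells = window_indices(anchor=(2, 2), sigma=1, height=5, width=5)
    keys = [[1.0, 0.0] for _ in cells]          # toy features inside the window
    values = [[float(i)] for i in range(len(cells))]
    print(windowed_cross_attention([1.0, 0.0], keys, values))
```

Restricting the keys and values to the window keeps the cost per query proportional to the window area rather than to the full feature map.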

Differentiable Pose Estimation

After obtaining the corrected depth of matched points, the 2D correspondences are lifted into 3D space using the known camera intrinsic matrix $\mathbf{A}$ and the homogeneous pixel coordinates $\bar{\mathbf{x}}_0^i$:

$\mathbf{X}_0^i = \hat{d}_0^i \cdot \mathbf{A}^{-1} \bar{\mathbf{x}}_0^i.$
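For a zero-skew pinhole camera, the lifting step has a simple closed form. The sketch below assumes $\mathbf{A} = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]$ and that the depth is measured along the optical axis; the function name `backproject` and the example intrinsics are illustrative.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth to a camera-frame 3D point: depth * A^{-1} [u, v, 1]^T."""
    if fx == 0.0 or fy == 0.0:
        raise ValueError("focal lengths must be non-zero")
    x_norm = (u - cx) / fx   # first component of A^{-1} [u, v, 1]^T
    y_norm = (v - cy) / fy   # second component; the third component is 1
    return (depth * x_norm, depth * y_norm, depth)


if __name__ == "__main__":
    # Hypothetical intrinsics for a 640x480 image.
    print(backproject(420.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

A pixel at the principal point maps onto the optical axis, so any depth error there moves the 3D point only along $z$, while errors at off-center pixels also shift $x$ and $y$.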

Drawing on the probabilistic selection paradigm of DSAC, we employ a soft-selection mechanism over pose hypotheses, directly utilizing the confidence scores $\omega_i$ provided by the matcher as weights to solve for the optimal relative pose $(\mathbf{R}, \mathbf{t})$:

$\min_{\mathbf{R}, \mathbf{t}} \sum_{i=1}^{N} \omega_i \cdot \rho(\|(\mathbf{R}\mathbf{X}_0^i + \mathbf{t}) - \mathbf{X}_1^i\|_2).$
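To make the objective concrete, the sketch below evaluates the weighted cost for one pose hypothesis. The robust kernel $\rho$ is not specified above, so a Huber kernel with an arbitrary threshold is used here purely as an assumption; `huber`, `transform`, and `weighted_cost` are illustrative names, and the sketch only scores a hypothesis rather than reproducing the differentiable RANSAC solver.

```python
import math


def huber(r, delta=0.1):
    """Huber kernel (assumed stand-in for rho): quadratic near zero, linear beyond delta."""
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def transform(R, t, X):
    """Apply the rigid motion R X + t to a 3D point; R is a 3x3 nested list."""
    return tuple(sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3))


def weighted_cost(R, t, pts0, pts1, weights, rho=huber):
    """Sum over i of w_i * rho(|| (R X0_i + t) - X1_i ||_2)."""
    total = 0.0
    for X0, X1, w in zip(pts0, pts1, weights):
        residual = math.dist(transform(R, t, X0), X1)
        total += w * rho(residual)
    return total


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    pts0 = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
    pts1 = [(0.0, 0.0, 2.0), (1.0, 0.0, 3.0)]   # pts0 shifted by 1 m along z
    weights = [1.0, 0.5]                        # matcher confidences
    print(weighted_cost(identity, (0.0, 0.0, 1.0), pts0, pts1, weights))
    print(weighted_cost(identity, (0.0, 0.0, 0.0), pts0, pts1, weights))
```

Because both point sets are in metric units, the residual is a distance in meters, so a hypothesis with the correct rotation but the wrong translation scale is penalized directly.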

Experimental Results

1. Metric Relative Pose Estimation on ScanNet-1500

Table 1 reports pose error AUC (%) at thresholds of $5^\circ$, $10^\circ$, and $20^\circ$. Our method (DepthPro + LDRL) outperforms the existing metric-scale methods, reaching 33.83% AUC@5 versus 25.70% for Reloc3r-metric-pose. Notably, pairing ELoFTR with the uncorrected DepthPro depth map yields only 3.64% AUC@5, which indicates that the gain comes from LDRL correcting local geometric deviations.

Category	Methods	Depth Source	AUC@5	AUC@10	AUC@20
Relative	SP + LG + MADPose	MoGe	23.36	43.39	61.08
	RoMa + MADPose	MoGe	34.26	56.77	73.67
	Reloc3r-512	None (ViT)	34.79	58.37	75.56
	MASt3R	PointMap	28.01	50.24	68.83
	DUSt3R	PointMap	23.81	45.91	65.57
Metric Scale	ELoFTR	DepthPro	3.64	13.86	31.76
	Reloc3r-metric-pose	None (ViT)	25.70	50.20	70.07
	Ours	DepthPro + LDRL	33.83	53.98	70.15
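For readers unfamiliar with the metric, the sketch below shows how an AUC at a given threshold can be computed from per-pair pose errors. It assumes the cumulative-recall protocol that is common in the feature-matching literature; the function name `pose_auc` and the way the per-pair errors are obtained are illustrative assumptions, not taken from the released evaluation code.

```python
import bisect


def pose_auc(errors, thresholds):
    """Area under the recall-vs-error curve up to each threshold, normalized to [0, 1]."""
    if not errors:
        raise ValueError("errors must be non-empty")
    if any(t <= 0 for t in thresholds):
        raise ValueError("thresholds must be positive")
    errs = sorted(errors)
    n = len(errs)
    xs = [0.0] + errs                                # error values, starting at 0
    ys = [0.0] + [(i + 1) / n for i in range(n)]     # recall reached at each error
    aucs = []
    for t in thresholds:
        k = bisect.bisect_left(xs, t)                # number of samples with error < t
        x = xs[:k] + [t]                             # truncate the curve at the threshold
        y = ys[:k] + [ys[k - 1]]
        area = sum((x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2.0
                   for i in range(len(x) - 1))       # trapezoidal rule
        aucs.append(area / t)
    return aucs


if __name__ == "__main__":
    # Hypothetical per-pair pose errors in degrees.
    print(pose_auc([1.0, 3.0, 8.0, 25.0], [5.0, 10.0, 20.0]))
```

Multiplying the returned values by 100 gives percentages on the same scale as the table.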

2. Geometric Correction Strategy Analysis on ScanNet-1500

Table 2 compares our LDRL against standard feature matchers paired with different depth processing techniques. While global affine correction improves rotation estimation, its improvement on translation metrics remains limited. Our LDRL achieves the best results across all metrics.

Methods	Depth Source	Rot. AUC@$5^\circ$	Rot. AUC@$10^\circ$	Rot. AUC@$20^\circ$	Trans. AUC@0.1m	Trans. AUC@0.5m	Trans. AUC@1m	Med. Rot. ($^\circ$)	Med. Trans. (m)
LoFTR	Monocular Depth	38.7	55.6	69.1	19.7	58.8	72.0	2.8	0.12
SuperGlue		31.7	49.1	64.6	15.3	53.9	69.1	3.8	0.16
SIFT		21.5	33.7	44.0	9.9	36.8	46.8	7.1	0.30
LoFTR + Global Affine	Global Affine Correction	52.6	69.1	79.7	17.5	61.0	75.1	1.8	0.13
SuperGlue + Global Affine		43.5	62.0	75.4	14.0	57.5	73.1	2.4	0.15
SIFT + Global Affine		24.4	36.3	47.1	7.3	35.2	48.5	8.0	0.38
Map-free (Regress)	-	25.3	44.8	62.4	7.4	47.2	64.6	4.7	0.22
Map-free (Match)	-	26.5	46.5	64.2	8.6	47.7	64.6	4.3	0.21
Ours	DepthPro + LDRL	59.8	74.3	83.1	39.3	74.8	83.8	1.4	0.06

3. Cumulative Depth Error Distribution

Figure 3 compares the percentage of patches satisfying different absolute error thresholds. LDRL (blue bars) outperforms Initial Depth (red bars) across all tolerances, achieving a fourfold improvement at the strict $\leq 0.1m$ threshold (0.44 vs. 0.11).

Figure 3: Cumulative error distribution of depth estimation on ScanNet-1500. The bar chart shows the ratio of patches whose absolute depth error falls within each threshold (0.05, 0.1, 0.15, 0.3, 0.5, and 1.0 m). "Initial Depth" denotes the raw predictions from DepthPro, and "with LDRL" denotes the depth after applying our proposed Local Depth Residual Learning correction.
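The ratios plotted in Figure 3 are straightforward to compute from per-patch depths. A minimal sketch follows, assuming predicted and ground-truth depths are available as paired lists in meters; the function name `cumulative_ratios` and the sample values are illustrative.

```python
def cumulative_ratios(pred, gt, thresholds=(0.05, 0.1, 0.15, 0.3, 0.5, 1.0)):
    """Fraction of patches whose absolute depth error is within each threshold (meters)."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    n = len(errors)
    return {t: sum(e <= t for e in errors) / n for t in thresholds}


if __name__ == "__main__":
    # Hypothetical per-patch depths in meters.
    pred = [1.0, 1.2, 2.0, 3.0]
    gt = [1.04, 1.0, 2.0, 4.5]
    for threshold, ratio in cumulative_ratios(pred, gt).items():
        print(threshold, ratio)
```

Because each threshold includes all patches counted by the stricter ones, the ratios are non-decreasing from left to right.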

4. Metric Pose Estimation on 7-Scenes

Table 4 evaluates our method on the 7-Scenes visual localization task; each entry gives translation error (m) / rotation error (°). Despite using only a single reference image, our method achieves an average translation error of 0.05m and an average rotation error of 1.2°, significantly outperforming the other single-pair methods and approaching state-of-the-art multi-view methods.

Category	Methods	Chess	Fire	Heads	Office	Pumpkin	RedKitchen	Stairs	Average
APR	LENS	0.03/1.3	0.10/3.7	0.07/5.8	0.07/1.9	0.08/2.2	0.09/2.2	0.14/3.6	0.08/3.0
	PMNet	0.03/1.3	0.04/1.8	0.02/1.7	0.06/1.7	0.07/2.0	0.08/2.2	0.11/3.0	0.06/1.9
	DFNet+NeFeS	0.02/0.6	0.02/0.7	0.02/1.3	0.02/0.6	0.02/0.6	0.02/0.6	0.05/1.3	0.02/0.8
	Marepo	0.02/1.2	0.02/1.4	0.02/2.0	0.03/1.3	0.04/1.5	0.04/1.7	0.06/1.7	0.03/1.5
Multi-pair	EssNet	0.13/5.1	0.27/10.1	0.15/9.9	0.21/6.9	0.22/6.1	0.23/6.9	0.32/11.2	0.22/8.0
	NC-EssNet	0.12/5.6	0.26/9.6	0.14/10.7	0.20/6.7	0.22/5.7	0.22/6.3	0.31/7.9	0.21/7.5
	Relative PN	0.13/6.5	0.26/12.7	0.14/12.3	0.21/7.4	0.24/6.4	0.24/8.0	0.27/11.8	0.21/9.3
	Relpose-GNN	0.08/2.7	0.21/7.5	0.13/8.7	0.15/4.1	0.15/3.5	0.19/3.7	0.22/6.5	0.16/5.2
	AnchorNet	0.06/3.9	0.15/10.3	0.08/10.9	0.09/5.2	0.10/3.0	0.08/4.7	0.10/9.3	0.09/6.7
	CamNet	0.04/1.7	0.03/1.7	0.05/2.0	0.04/1.6	0.04/1.6	0.04/1.6	0.04/1.5	0.04/1.7
	ExReNet (SN)	0.06/2.2	0.09/3.2	0.04/3.3	0.07/2.2	0.11/2.7	0.09/2.6	0.33/7.3	0.11/3.3
	ExReNet (SUNCG)	0.05/1.6	0.07/2.5	0.03/2.7	0.06/1.8	0.07/2.0	0.07/2.1	0.19/4.9	0.08/2.5
	Reloc3r-224	0.03/1.0	0.04/1.1	0.02/1.2	0.05/0.9	0.07/1.1	0.05/1.2	0.12/2.3	0.05/1.3
	Reloc3r-512	0.03/0.9	0.03/0.8	0.01/1.0	0.04/0.9	0.06/1.1	0.04/1.3	0.07/1.3	0.04/1.0
	Map-free (Regress)	0.09/2.7	0.13/4.5	0.11/4.8	0.11/2.8	0.16/3.1	0.14/3.5	0.18/4.7	0.13/3.7
Single-pair	RelocNet	0.12/4.1	0.26/10.4	0.14/10.5	0.18/5.3	0.26/4.2	0.23/5.1	0.28/7.5	0.21/6.7
	Map-free (Match)	0.10/2.9	0.12/5.0	0.11/5.4	0.12/3.0	0.16/3.2	0.14/3.5	0.21/4.5	0.14/3.9
	Ours	0.04/1.0	0.04/1.1	0.02/1.0	0.05/1.1	0.06/1.4	0.05/1.4	0.07/1.8	0.05/1.2
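Each table entry pairs a translation error with a rotation error. The sketch below shows one common way to compute both from an estimated and a ground-truth pose, assuming the translation error is the Euclidean distance in meters and the rotation error is the geodesic angle between the two rotations; the function names are illustrative, and the exact evaluation convention used for the table is an assumption.

```python
import math


def rotation_error_deg(R_est, R_gt):
    """Angle (degrees) of the relative rotation R_gt^T R_est; inputs are 3x3 nested lists."""
    # trace(R_gt^T R_est) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against round-off
    return math.degrees(math.acos(cos_angle))


def translation_error_m(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translations (meters)."""
    return math.dist(t_est, t_gt)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    print(rotation_error_deg(rot_z_90, identity))
    print(translation_error_m((0.03, 0.04, 0.0), (0.0, 0.0, 0.0)))
```

Measuring translation in meters is only meaningful because the estimated pose is metric; a scale-free method would need an angular translation error instead.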

Ablation Study

Table 5 isolates LDRL's contribution on both the ScanNet and 7-Scenes datasets. On 7-Scenes, LDRL improves AUC@10 from 20.76% to 64.57%, and the median translation error drops from 0.14m to 0.05m.

Dataset	Method	AUC@5 ↑	AUC@10 ↑	Med. Rot. ($^\circ$) ↓	Med. Trans. (m) ↓
ScanNet	Initial Depth	3.64	13.86	3.48	0.20
ScanNet	Ours	33.83	53.98	1.40	0.06
7-Scenes	Initial Depth	6.57	20.76	2.77	0.14
7-Scenes	Ours	32.59	64.57	1.21	0.05

Conclusion

We present a geometry-guided depth correction framework that addresses local geometric inconsistency and cross-view scale misalignment in monocular depth estimation for metric-scale relative pose estimation. By introducing Local Depth Residual Learning with Windowed Cross-Attention and a differentiable RANSAC framework for joint optimization, our method achieves fine-grained depth correction beyond global affine alignment. Extensive experiments on ScanNet and 7-Scenes demonstrate superior accuracy and robustness in metric-scale pose estimation compared to both end-to-end regression and existing alignment methods, providing a practical solution for high-precision visual localization and 3D reconstruction tasks.

Citation

@inproceedings{xie2026geometry,
  title={Geometry-Guided Depth Correction for Metric Relative Pose Estimation},
  author={Xie, Shibin and Yin, Hao and Wang, Shuting and Fang, Xiaokang and Jin, Liang and Liu, Haotian and Zhang, Yanting and Cai, Shen},
  booktitle={Proceedings of the International Conference on Multimedia Retrieval (ICMR)},
  year={2026},
  organization={ACM}
}