
License: CC BY 4.0
arXiv:2306.12547v2 [cs.CV] 24 Mar 2024

DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching

Shuzhe Wang*             Juho Kannala*           Daniel Barath†
* Department of Computer Science, Aalto University
† Computer Vision and Geometry Group, ETH Zurich
shuzhe.wang@aalto.fi   juho.kannala@aalto.fi  dbarath@inf.ethz.ch
Abstract

Matching 2D keypoints in an image to a sparse 3D point cloud of the scene without requiring visual descriptors has garnered increased interest due to its low memory requirements, inherent privacy preservation, and reduced need for expensive 3D model maintenance compared to visual descriptor-based methods. However, existing algorithms often compromise on performance, resulting in a significant deterioration compared to their descriptor-based counterparts. In this paper, we introduce DGC-GNN, a novel algorithm that employs a global-to-local Graph Neural Network (GNN) that progressively exploits geometric and color cues to represent keypoints, thereby improving matching accuracy. Our procedure encodes both Euclidean and angular relations at a coarse level, forming the geometric embedding to guide the point matching. We evaluate DGC-GNN on both indoor and outdoor datasets, demonstrating that it not only doubles the accuracy of the state-of-the-art visual descriptor-free algorithm but also substantially narrows the performance gap between descriptor-based and descriptor-free methods. The code and trained models are available at: https://github.com/AaltoVision/DGC-GNN-release.

1 Introduction

Establishing 2D-3D matches plays a crucial role in various computer vision applications, including visual localization [20, 40, 42, 37, 53, 38], 3D reconstruction [48, 8, 43, 25], and Simultaneous Localization and Mapping (SLAM) [15, 31, 30]. Traditional methods for establishing point-to-point matches involve extracting keypoints and descriptors from a query image, then matching the 2D and 3D descriptors using exhaustive search. To circumvent the computationally expensive matching process, some approaches [20, 37] narrow the search space by employing image retrieval methods [33, 1] first to identify the most similar images in the database, and then perform descriptor-based image matching [27, 11, 14, 38, 45] between the query and retrieved images. The 2D-3D correspondences are subsequently established by connecting the 2D-2D image matches with the prebuilt 2D-3D correspondences in the database. Another approach [39] is to build 2D-to-3D matches by searching through all point descriptors with an efficient vocabulary-based method. Sattler et al. [40, 41] further explore the combination of both 2D-3D and 3D-2D search as an active correspondence search step for a faster and more efficient matching process.

Figure 1: 2D-3D matching (shown by green lines) with the proposed DGC-GNN and GoMatch [56]. In this example, DGC-GNN obtains 78 correct matches with 0.02 meters camera translation and 0.24° rotation errors, while GoMatch finds only 17 inliers with a pose error of 0.37 meters and 4.37°.

While descriptor-based algorithms achieve state-of-the-art accuracy, they store and maintain high-dimensional visual descriptors for each point in potentially large 3D point clouds. The stored model often requires orders of magnitude more storage than the point cloud and images alone [56]. These methods are susceptible to privacy attacks [13, 12, 7, 34] and necessitate computationally expensive model maintenance and descriptor update procedures [56] when incorporating new descriptors or points into the model. Several approaches have been proposed to address these limitations. Yang et al. [54] employ learned point selection to sample a subset of the point cloud for scene compression. Other methods [23, 3] directly learn a function that maps 2D pixels to 3D coordinates without explicitly storing the 3D scene. Additionally, [32] introduces an adversarial learning framework to develop content-concealing descriptors that prevent privacy leakage.

Recently, researchers [5, 26] have begun exploring deep learning techniques for cross-domain direct 2D-3D matching and pose estimation without visual descriptors, showcasing the potential of descriptor-free matching through differentiable geometric optimization. The recently proposed GoMatch [56] represents significant progress in descriptor-free 2D-3D keypoint matching, achieving reasonable matching performance on a variety of real-world datasets [24, 46, 22]. GoMatch first identifies keypoints in the query image, which, along with the 3D points from the model, are converted to bearing vectors in the camera coordinate system. The algorithm then employs an attention mechanism [38, 52] to establish reliable 2D-3D correspondences. While GoMatch attains reasonable accuracy, its performance still lags significantly behind its descriptor-based counterparts [38, 37, 41]. Additionally, it relies on geometric cues only from the points and their local neighbors, rendering it incapable of distinguishing geometrically similar structures.

These observations lead us to two critical questions: (1) Is geometry the only information we can utilize? (2) How can we leverage the geometric information derived from the points for matching? In practice, humans identify correspondences between objects by considering global structures and local geometric cues. For example, when matching an image to a point cloud as in Fig. 1, we first locate the building based on its unique structure and then identify the local structure of the roof for matching. Besides geometric cues, the visual context, such as the color information at each point, also provides constraints for 2D-3D matching. Importantly, this color information still preserves privacy, as the RGB data from sparse keypoints is insufficient to reconstruct the scene.

Building upon these observations and the groundwork set by GoMatch, we propose a novel graph-based pipeline, named DGC-GNN, which leverages geometric and color cues in a global-to-local manner for descriptor-free 2D-3D matching. DGC-GNN encodes position and RGB information for each point and extracts a global distance-angular embedding to guide local point matching. Taking inspiration from [45], we employ a cluster-based transformer to constrain information flow within local clusters. We observe, from real-world datasets, that DGC-GNN leads to substantial improvements in the number of correct matches and the accuracy of pose estimation. Notably, it doubles the accuracy of GoMatch, thereby reducing the gap between descriptor-based and descriptor-free methods. In summary, our paper makes the following contributions:

  • We introduce a visual descriptor-free global-to-local GNN for direct 2D-3D keypoint matching. The network leverages multiple cues and incorporates a progressive clustering module to represent the keypoints. This pipeline enhances the accuracy of sparse 2D-3D matching while requiring little memory, preserving privacy, and keeping the cost of 3D model maintenance low.

  • We demonstrate that color information for each point is crucial for 2D-3D matching. By incorporating RGB encoding into our network, we observe significant performance improvements.

  • Extensive experiments on real-world datasets show that DGC-GNN outperforms previous methods by a large margin on both matching and visual localization tasks.

Figure 2: Pipeline overview. For keypoints from the 2D image and 3D points from the point cloud, the proposed DGC-GNN (1) considers the bearing vectors and the color at each bearing vector as input. (2) It extracts the point-wise position and color features with two separate encoders and mixes the features as $\mathbf{f_{p}}$ and $\mathbf{f_{q}}$. (3) The bearing vectors are clustered into $K$ groups, and geometric graphs are built upon the clusters to extract the global-level geometric embeddings $\mathbf{\hat{f}}_{\mathbf{p}}^{gg}$ and $\mathbf{\hat{f}}_{\mathbf{q}}^{gg}$. (4) We then concatenate $\mathbf{\hat{f}}_{\mathbf{p}}^{gg}$ with $\mathbf{f_{p}}$ and $\mathbf{\hat{f}}_{\mathbf{q}}^{gg}$ with $\mathbf{f_{q}}$, and build a local graph at each point as self-attention. A cluster-based attention module is adopted to enhance the local features by restricting message passing to the most related features. A differentiable layer matches and optimizes the improved features to obtain the score matrix $\mathcal{S}$. Finally, an outlier rejection network is applied to prune the matches with low confidence, leading to the final 2D-3D correspondences $\mathcal{M}_{final}$.

2 Visual Descriptor-Free 2D-3D Matching

2.1 Problem Formulation and Notation

Given keypoints $\mathbf{P}=\{\mathbf{p}_{n}\in\mathbb{R}^{2}\;|\;n=1,...,N\}$ from query image $I$ and a database 3D point cloud $\mathbf{Q}=\{\mathbf{q}_{m}\in\mathbb{R}^{3}\;|\;m=1,...,M\}$, where, optionally, each 3D point is associated with a visual descriptor $\mathbf{d}\in\mathbb{R}^{D}$, the task is to find a set $\mathcal{M}_{\mathbf{p,q}}$ of corresponding keypoints such that

$$\mathcal{M}_{\mathbf{p,q}}=\{(n,m)\;|\;\|\pi(\mathbf{q}_{m},\mathbf{R,t,K})-\mathbf{p}_{n}\|_{2}\leq\epsilon\}, \tag{1}$$

where $\pi(\cdot)$ is a mapping that projects a 3D point $\mathbf{q}_{m}$ from world coordinates to the image plane, represented by a camera rotation $\mathbf{R}\in\mathbb{R}^{3\times 3}$, translation $\mathbf{t}\in\mathbb{R}^{3}$, and intrinsic parameter matrix $\mathbf{K}\in\mathbb{R}^{3\times 3}$. Parameter $\epsilon\in\mathbb{R}$ is the threshold specified in pixels. Additionally, we denote the color of point $\mathbf{p}_{n}$ as $\mathbf{c}_{n}=[r,g,b]^{\text{T}}\in[0,1]^{3}$.
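The match set of Eq. (1) can be illustrated with a minimal NumPy sketch. It assumes a standard pinhole projection for $\pi(\cdot)$ and a known pose; the function names `project` and `match_set` and the toy intrinsics are hypothetical and chosen only for illustration, not taken from the released code.

```python
import numpy as np

def project(Q, R, t, K):
    """Project world points Q (M, 3) to pixels (M, 2) with a pinhole camera (assumed form of pi)."""
    cam = Q @ R.T + t                 # world -> camera coordinates
    uvw = cam @ K.T                   # apply the intrinsic matrix
    return uvw[:, :2] / uvw[:, 2:3]   # perspective division

def match_set(P, Q, R, t, K, eps):
    """Return all (n, m) pairs whose reprojection error is at most eps pixels, as in Eq. (1)."""
    proj = project(Q, R, t, K)                                       # (M, 2)
    err = np.linalg.norm(P[:, None, :] - proj[None, :, :], axis=2)   # (N, M) pairwise errors
    n_idx, m_idx = np.nonzero(err <= eps)
    return list(zip(n_idx.tolist(), m_idx.tolist()))

# Toy example: identity pose and made-up intrinsics (illustrative values only).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
Q = np.array([[0.0, 0.0, 2.0], [0.4, -0.2, 4.0]])
P = np.array([[320.0, 240.0], [100.0, 100.0]])
matches = match_set(P, Q, R, t, K, eps=1.0)
```

In this toy setup only the first keypoint lies within one pixel of a projected 3D point. The sketch does not handle points with non-positive depth, which would have to be filtered out before the perspective division.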

Bearing Vector. Similar to [56], we adopt bearing vectors as the keypoint representation for both the 2D and 3D points to alleviate their cross-domain nature and represent them in the same space. The bearing vector is the direction from the camera center to a 3D point in the camera coordinate system. Given an image, a 2D pixel $\mathbf{p}_{n}$ is lifted to a bearing vector as $[\mathbf{b}_{\mathbf{p},n},1]^{\text{T}}=\mathbf{K}^{-1}[\mathbf{p}_{n},1]^{\text{T}}$, $\mathbf{b}_{\mathbf{p},n}\in\mathbb{R}^{2}$, where $\mathbf{K}$ is the intrinsic camera matrix. Given a 3D point $\mathbf{q}_{m}$, the corresponding bearing vector is

$$[\mathbf{b}_{\mathbf{q},m},1]^{\text{T}}=\frac{\mathbf{Rq}_{m}+\mathbf{t}}{[\mathbf{Rq}_{m}+\mathbf{t}]_{z}}, \tag{2}$$

where $\mathbf{R}$ is the camera rotation and $\mathbf{t}$ is its translation in the world coordinate system and subscript $z$ denotes the third component of the 3D vector.
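Both lifting operations can be written in a few lines. The sketch below assumes a standard intrinsic matrix whose last row is $[0, 0, 1]$; the function names `bearing_2d` and `bearing_3d` and the numeric values are hypothetical.

```python
import numpy as np

def bearing_2d(P, K):
    """Lift pixels P (N, 2) to bearing vectors (N, 2) via K^{-1} [p, 1]^T."""
    homog = np.hstack([P, np.ones((len(P), 1))])   # homogeneous pixels, (N, 3)
    rays = homog @ np.linalg.inv(K).T
    # The third component is already 1 for a standard K; dividing keeps the sketch general.
    return rays[:, :2] / rays[:, 2:3]

def bearing_3d(Q, R, t):
    """Bearing vectors (M, 2) of world points Q (M, 3), following Eq. (2)."""
    cam = Q @ R.T + t                  # world -> camera coordinates
    return cam[:, :2] / cam[:, 2:3]    # divide by the z component

# Illustrative values only: a 3D point and the pixel it projects to.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
Q = np.array([[0.4, -0.2, 4.0]])
P = np.array([[370.0, 215.0]])
b_p = bearing_2d(P, K)
b_q = bearing_3d(Q, R, t)
```

A pixel and the 3D point it observes map to the same bearing vector when the pose is correct, which is what places the two modalities in a common space.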

2.2 Network Architecture

The proposed DGC-GNN applies a hierarchical mechanism to leverage color and geometric cues in a global-to-local fashion. The overall pipeline is illustrated in Fig. 2. We initially employ two local feature extractors to encode RGB and position information for each point simultaneously (Sec. 2.2.1). Additionally, we cluster the points based on their distances and generate global graphs to obtain the global-level geometric embeddings (Sec. 2.2.2). Next, we concatenate the local point features with their corresponding global features and input them into the cluster-based local matching module to identify the initial matches (Sec. 2.2.3). Finally, we refine the initial matches with a classification network that filters out matches with low confidence (Sec. 2.3).

2.2.1 Local Feature Extraction

To extract point-wise features from both the 2D keypoint set $\mathbf{P}$ and the 3D point cloud $\mathbf{Q}$, we consider the inputs as bearing vectors equipped with color information: $\mathcal{P}=\{\mathbf{b_{p}},\mathbf{c_{p}}\}$ and $\mathcal{Q}=\{\mathbf{b_{q}},\mathbf{c_{q}}\}$. Two ResNet-style point encoders [18, 5], denoted as $\mathcal{F}_{b}$ and $\mathcal{F}_{c}$, are applied to extract position and color embeddings separately. We then obtain the local point features, $\mathbf{f_{p}}$ and $\mathbf{f_{q}}$, as follows:

$$\mathbf{f_{p}}=\mathcal{F}_{b}(\mathbf{b_{p}})+\mathcal{F}_{c}(\mathbf{c_{p}}),\ \ \mathbf{f_{q}}=\mathcal{F}_{b}(\mathbf{b_{q}})+\mathcal{F}_{c}(\mathbf{c_{q}}). \tag{3}$$

The resulting point-wise features $\mathbf{f_{p}}$ and $\mathbf{f_{q}}$ are matrices in $\mathbb{R}^{N\times d}$ and $\mathbb{R}^{M\times d}$, respectively, where $N$ and $M$ are the numbers of points in $\mathbf{P}$ and $\mathbf{Q}$, and $d$ denotes the dimensionality of the encoded features, e.g., $d=128$.
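The additive fusion of Eq. (3) and the resulting shapes can be sketched as follows. The encoder here is a hypothetical two-layer residual MLP with random weights, used purely as a stand-in for the ResNet-style point encoders; it is not the architecture of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # feature dimensionality, as in the example value above

def make_encoder(in_dim, d, rng):
    """Hypothetical stand-in for a point encoder: two linear layers with a skip connection."""
    W1 = rng.standard_normal((in_dim, d)) * 0.1
    W2 = rng.standard_normal((d, d)) * 0.1
    def encode(x):
        h = np.maximum(x @ W1, 0.0)          # linear layer + ReLU
        return h + np.maximum(h @ W2, 0.0)   # residual connection
    return encode

F_b = make_encoder(2, d, rng)  # position encoder, input: 2D bearing vectors
F_c = make_encoder(3, d, rng)  # color encoder, input: RGB values in [0, 1]

b_p = rng.uniform(-1.0, 1.0, size=(5, 2))   # N = 5 bearing vectors
c_p = rng.uniform(0.0, 1.0, size=(5, 3))    # their colors
f_p = F_b(b_p) + F_c(c_p)                   # Eq. (3): element-wise sum, shape (N, d)
```

Following Eq. (3), the same pair of encoders is applied to the image keypoints and to the 3D points, so `F_b` and `F_c` would be reused unchanged to compute $\mathbf{f_{q}}$.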

2.2.2 Global Geometric Guidance

Global context guidance has demonstrated its effectiveness in various computer vision tasks [49, 23, 55, 36]. Global context helps to differentiate local descriptors from similar structures or patches, thereby reducing ambiguity. However, most existing methods [49, 36] consider the outputs from different encoding layers as global and local features. This approach is not suitable for our scenario, as our input is sparse points. Downsampling the sparse point cloud results in losing distinctive geometric structures. Hence, we adopt cluster-based geometric encoding to extract global embeddings. As shown in Fig. 3 (a) and (c), the input bearing vectors, both in the image and in the point cloud, are first clustered into $X$ groups. The groups represent distinct clusters, each associated with a cluster center as the global position, denoted by $\mathbf{\hat{b}}_{\mathbf{p},x}\in\mathbb{R}^{2}$, $x=1,...,X$. The corresponding global embedding is obtained as the average of the point embeddings within a cluster as $\mathbf{\hat{f}}_{\mathbf{p},x}=\frac{1}{P^{\prime}}\sum_{p^{\prime}=1}^{P^{\prime}}\mathbf{f}_{\mathbf{p},p^{\prime}}$, where $\mathbf{\hat{f}}_{\mathbf{p},x}\in\mathbb{R}^{d}$ and $P^{\prime}$ is the number of points in cluster $x$. The same is conducted on the 3D points to obtain $\mathbf{\hat{b}}_{\mathbf{q}}$ and $\mathbf{\hat{f}}_{\mathbf{q}}$.
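Given a cluster assignment, the global positions and embeddings reduce to per-cluster means. The sketch below assumes that the cluster center is the mean bearing vector of the cluster and that integer labels come from some clustering step that is not shown; the function name `cluster_summary` is hypothetical.

```python
import numpy as np

def cluster_summary(b, f, labels, num_clusters):
    """Per-cluster mean position (X, 2) and mean embedding (X, d).

    b: (N, 2) bearing vectors, f: (N, d) point features,
    labels: (N,) integer cluster index of each point (assumed given).
    """
    centers = np.zeros((num_clusters, b.shape[1]))
    feats = np.zeros((num_clusters, f.shape[1]))
    for x in range(num_clusters):
        mask = labels == x
        if mask.any():                          # leave empty clusters at zero
            centers[x] = b[mask].mean(axis=0)   # assumed definition of the cluster center
            feats[x] = f[mask].mean(axis=0)     # average of the point embeddings in cluster x
    return centers, feats

# Tiny illustrative input: three points in two clusters.
b = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
f = np.array([[1.0, 1.0], [3.0, 5.0], [7.0, 7.0]])
labels = np.array([0, 0, 1])
centers, feats = cluster_summary(b, f, labels, num_clusters=2)
```

The same routine would be applied to the 3D bearing vectors and their features to obtain the point-cloud counterparts.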

Figure 3: Cluster-based geometric encoding. (a) The clusters obtained from bearing vectors $\mathcal{Q}$ of the 3D point cloud are visualized by color. The local graph is created from the neighboring cluster centers. Black 3D points are filtered out from matching. (b) Angular embedding from the global graph to obtain rotation-invariant geometric cues. (c) The clusters obtained from 2D keypoints' bearing vectors $\mathcal{P}$. Similarly, as in 3D, the local graph is created from the neighboring cluster centers.

Global Geometric Graph. To aggregate and extract the geometric relations among the clusters, we propose a novel graph neural network that encodes both distance and angular cues; the basic GNN structure is built upon [19, 56]. In the following, we describe the graph construction for the 2D global point set $\mathcal{\hat{P}}=\{\mathbf{\hat{b}_{p}},\mathbf{\hat{f}_{p}}\}$; the same applies to $\mathcal{\hat{Q}}=\{\mathbf{\hat{b}_{q}},\mathbf{\hat{f}_{q}}\}$. Each cluster center point $\mathbf{\hat{b}}_{\mathbf{p},x}$ is connected to its $k$ nearest neighbours ($k\leq K$) in the coordinate space, and $\xi_{\mathbf{p},(x,y)}$ is the edge between center points $\mathbf{\hat{b}}_{\mathbf{p},x}$ and $\mathbf{\hat{b}}_{\mathbf{p},y}$. We update the feature $\mathbf{\hat{f}}_{\mathbf{p},x}$ using the following equation:

$${}^{(t+1)}\mathbf{\hat{f}}_{\mathbf{p},x}=\max_{\xi_{\mathbf{p},(x,y)}}\mathcal{H}_{g1}\big({}^{(t)}\mathbf{\hat{f}}_{\mathbf{p},x}\oplus({}^{(t)}\mathbf{\hat{f}}_{\mathbf{p},x}-{}^{(t)}\mathbf{\hat{f}}_{\mathbf{p},y})\big), \tag{4}$$

where $\oplus$ denotes concatenation and $\mathcal{H}_{g1}(\cdot)$ is the linear projection with instance normalization [51] and a LeakyReLU function [28]. The $\max$ operator is taken over the $k$ nearest neighbors. The global feature $\mathbf{\hat{f}}_{\mathbf{p},x}$ is updated twice, and calculated as

$$\mathbf{\hat{f}}_{\mathbf{p},x}^{g}=\mathcal{H}_{g2}\big({}^{(0)}\mathbf{\hat{f}}_{\mathbf{p},x}\oplus{}^{(1)}\mathbf{\hat{f}}_{\mathbf{p},x}\oplus{}^{(2)}\mathbf{\hat{f}}_{\mathbf{p},x}\big). \tag{5}$$
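The two updates of Eq. (4) followed by the fusion of Eq. (5) can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: the projections are single random linear layers with LeakyReLU, instance normalization is omitted, separate weights are used for the two updates, and all names (`knn_indices`, `edge_update`, `W_a`, `W_b`, `W_out`) are hypothetical.

```python
import numpy as np

def knn_indices(centers, k):
    """Indices of the k nearest other cluster centers for each center, shape (X, k)."""
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    np.fill_diagonal(dist, np.inf)             # a center is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def edge_update(f, nbrs, W):
    """One step of Eq. (4): max over neighbours y of H(f_x concat (f_x - f_y))."""
    f_x = np.repeat(f[:, None, :], nbrs.shape[1], axis=1)   # (X, k, d)
    f_y = f[nbrs]                                           # (X, k, d) neighbour features
    edge = np.concatenate([f_x, f_x - f_y], axis=2)         # (X, k, 2d)
    return leaky_relu(edge @ W).max(axis=1)                 # (X, d)

rng = np.random.default_rng(0)
X, d, k = 6, 8, 3                                # illustrative sizes only
centers = rng.uniform(-1.0, 1.0, size=(X, 2))    # cluster centers
f0 = rng.standard_normal((X, d))                 # initial cluster embeddings
W_a = rng.standard_normal((2 * d, d)) * 0.1      # weights of the first update (assumed separate)
W_b = rng.standard_normal((2 * d, d)) * 0.1      # weights of the second update
W_out = rng.standard_normal((3 * d, d)) * 0.1    # weights standing in for H_g2

nbrs = knn_indices(centers, k)
f1 = edge_update(f0, nbrs, W_a)
f2 = edge_update(f1, nbrs, W_b)
f_g = leaky_relu(np.concatenate([f0, f1, f2], axis=1) @ W_out)   # Eq. (5)
```

Concatenating the outputs of all update steps keeps both the original cluster embedding and its progressively aggregated neighbourhood context in the final feature.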

$\mathcal{H}_{g2}$ has a similar structure to $\mathcal{H}_{g1}$, but without shared weights. Besides the distance embedding, inspired by [36], we also adopt the angular embedding to obtain rotation-invariant geometric cues for the global representation. To do so, we define the embedding on cluster triplets as shown in Fig. 3 (b). Given bearing vector $\mathbf{\hat{b}}_{\mathbf{p},x}$ and two of its neighbors $\mathbf{\hat{b}}_{\mathbf{p},y}$ and $\mathbf{\hat{b}}_{\mathbf{p},z}$, the angular embedding of $\langle\mathbf{\hat{b}}_{\mathbf{p},x},\mathbf{\hat{b}}_{\mathbf{p},y}\rangle$ w.r.t. $\mathbf{\hat{b}}_{\mathbf{p},z}$ is defined as follows:

$$\mathbf{A}_{x,y}^{z}=\operatorname{sine}\big(\angle(\mathbf{\hat{b}}_{\mathbf{p},z}-\mathbf{\hat{b}}_{\mathbf{p},x},\ \mathbf{\hat{b}}_{\mathbf{p},y}-\mathbf{\hat{b}}_{\mathbf{p},x})/\sigma_{a}\big), \tag{6}$$

where $\operatorname{sine}(\cdot)$ is a sinusoidal function and $\sigma_{a}$ is a controller constant. All $k$ neighbours are considered to obtain the angular embedding $\mathbf{A}_{\mathbf{p}}$. We update the global geometric embedding $\mathbf{\hat{f}}_{\mathbf{p}}^{gg}$ with an angular-aware attention mechanism:
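The rotation invariance of Eq. (6) can be checked numerically for a single triplet. The sketch below assumes a Transformer-style sine/cosine encoding for $\operatorname{sine}(\cdot)$, an unsigned angle, and an arbitrary value for $\sigma_{a}$; none of these specifics are given in the text, and the function names are hypothetical.

```python
import numpy as np

def angle_between(u, v):
    """Unsigned angle in radians between two 2D vectors."""
    cross = u[0] * v[1] - u[1] * v[0]
    dot = u[0] * v[0] + u[1] * v[1]
    return np.arctan2(abs(cross), dot)

def sinusoidal(value, dim):
    """Sine/cosine embedding of a scalar (an assumed form of sine(.)); dim must be even."""
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    out = np.empty(dim)
    out[0::2] = np.sin(value * freqs)
    out[1::2] = np.cos(value * freqs)
    return out

def angular_embedding(b_x, b_y, b_z, sigma_a, dim):
    """Eq. (6): embed the angle at b_x between the directions to b_z and b_y."""
    theta = angle_between(b_z - b_x, b_y - b_x)
    return sinusoidal(theta / sigma_a, dim)

# One triplet of cluster centers (illustrative values) and a rotated copy of it.
b_x, b_y, b_z = np.array([0.1, 0.2]), np.array([0.9, 0.3]), np.array([0.4, 1.0])
c, s = np.cos(0.7), np.sin(0.7)
rot = np.array([[c, -s], [s, c]])
emb = angular_embedding(b_x, b_y, b_z, sigma_a=0.5, dim=16)
emb_rot = angular_embedding(rot @ b_x, rot @ b_y, rot @ b_z, sigma_a=0.5, dim=16)
```

Because the angle depends only on the relative directions within the triplet, `emb` and `emb_rot` coincide up to floating-point error, which is the rotation-invariance property the angular cue is meant to provide.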

$$\mathbf{\hat{f}}_{\mathbf{p}}^{gg}=\operatorname{norm}(\mathbf{\hat{f}}_{\mathbf{p}}^{g}+\operatorname{Att}(\mathbf{\hat{f}}_{\mathbf{p}}^{g},\mathbf{A}_{\mathbf{p}}));\ \ \ \ \mathbf{\hat{f}}_{\mathbf{p}}^{gg}\in\mathbb{R}^{K\times d}, \tag{7}$$

where

$$\begin{split}\operatorname{Att}(\mathbf{\hat{f}}_{\mathbf{p}}^{g},\mathbf{A}_{\mathbf{p}})=\\ (\mathbf{\hat{f}}_{\mathbf{p}}^{g}\mathbf{W^{V}}).\frac{(\mathbf{A}_{\mathbf{p}}\mathbf{W^{A}})(\mathbf{\hat{f}}_{\mathbf{p}}^{g}\mathbf{W^{Q}})^{\text{T}}+(\mathbf{\hat{f}}_{\mathbf{p}}^{g}\mathbf{W^{Q}})(\mathbf{\hat{f}}_{\mathbf{p}}^{g}\mathbf{W^{K}})^{\text{T}}}{\sqrt{dim}}.\end{split}$$

$\mathbf{W^{A}},\mathbf{W^{Q}},\mathbf{W^{K}},\mathbf{W^{V}}\in\mathbb{R}^{d\times d}$ are the projection matrices of each item and LayerNorm [2] is applied to Eq. 7. Each local point feature is associated with its corresponding global embedding by $\mathbf{\widetilde{f}}_{\mathbf{p}}=\mathbf{f}_{\mathbf{p}}\oplus\mathbf{\hat{f}}_{\mathbf{p}}^{gg}$, $\mathbf{\widetilde{f}}_{\mathbf{p}}\in\mathbb{R}^{N\times 2d}$, to obtain $\mathcal{\widetilde{P}}=\{\mathbf{b}_{\mathbf{p}},\mathbf{\widetilde{f}}_{\mathbf{p}}\}$.
The same procedure yields the local and global embedding $\mathcal{\widetilde{Q}}=\{\mathbf{b}_{\mathbf{q}},\mathbf{\widetilde{f}}_{\mathbf{q}}\}$ for the point cloud $\mathcal{\hat{Q}}$.
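The attention of Eq. 7 and the subsequent concatenation can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the geometric embedding is taken to be a pairwise tensor of shape (X, X, d), a row-wise softmax is applied to the scores before they weight the values (Eq. 7 as printed does not show it), LayerNorm is omitted, and the cluster labels that link local points to global nodes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def geometric_attention(f, A, W_A, W_Q, W_K, W_V):
    """f: (X, d) global features; A: (X, X, d) pairwise geometric embedding (assumed shape)."""
    d = f.shape[1]
    q, k, v = f @ W_Q, f @ W_K, f @ W_V      # (X, d) each
    a = A @ W_A                              # (X, X, d)
    geo = np.einsum("id,ijd->ij", q, a)      # geometric term: q_i . a_ij
    feat = q @ k.T                           # feature term:   q_i . k_j
    scores = (geo + feat) / np.sqrt(d)       # numerator of Eq. 7 divided by sqrt(dim)
    return softmax(scores, axis=-1) @ v      # ASSUMPTION: softmax before weighting the values

rng = np.random.default_rng(0)
X, N, d = 10, 50, 8                          # illustrative sizes
f_g = rng.normal(size=(X, d))
A = rng.normal(size=(X, X, d))
W_A, W_Q, W_K, W_V = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
f_gg = geometric_attention(f_g, A, W_A, W_Q, W_K, W_V)

# Attach to every local point the embedding of its cluster (labels are hypothetical).
labels = rng.integers(0, X, size=N)
f_local = rng.normal(size=(N, d))
f_tilde = np.concatenate([f_local, f_gg[labels]], axis=1)   # (N, 2d)
```

The last line mirrors the $\oplus$ operation: each of the $N$ local features is extended from $d$ to $2d$ dimensions.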

2.2.3 Cluster-based Local Matching

After extracting the global geometric embedding, we implement a cluster-based matching module to obtain the initial intra-domain 2D-3D matches. This cluster-based GNN [45] has been shown to be more computationally efficient than its complete-graph counterpart [38]. The network treats the local point features from both $\mathcal{\widetilde{P}}$ and $\mathcal{\widetilde{Q}}$ as a complete set, clusters the features with strong correlations into the same group, and restricts message passing to within each group. In addition to its low computational complexity, we found that the cluster GNN can effectively utilize global-to-local geometric cues, as the clustering operation inherits the property of global graph clustering and forces the network to distinguish ambiguous local features even with similar global embeddings.

| k | Methods | ScanNet [10]: Reproj. AUC (%) @1 / 5 / 10px (↑) | ScanNet [10]: Rotation (°) quantile @25 / 50 / 75% (↓) | ScanNet [10]: Translation (m) quantile @25 / 50 / 75% (↓) | ScanNet [10]: P (%) (↑) | MegaDepth [24]: Reproj. AUC (%) @1 / 5 / 10px (↑) | MegaDepth [24]: Rotation (°) quantile @25 / 50 / 75% (↓) | MegaDepth [24]: Translation quantile @25 / 50 / 75% (↓) | MegaDepth [24]: P (%) (↑) |
|---|---|---|---|---|---|---|---|---|---|
| k=1 | Oracle | 29.13 / 39.83 / 41.34 | 110.10 / 110.19 / 110.40 | 0.01 / 0.01 / 0.03 | - | 34.59 / 85.02 / 92.02 | 10.04 / 10.06 / 10.12 | 0.00 / 0.01 / 0.01 | - |
| k=1 | BPnPNet [5] | 10.00 / 10.00 / 10.02 | 199.17 / 128.90 / 154.68 | 4.35 / 6.82 / 9.86 | 13.60 | 10.22 / 10.63 / 10.89 | 16.13 / 32.01 / 61.58 | 1.67 / 3.17 / 5.44 | 12.95 |
| k=1 | GoMatch [56] | 11.18 / 11.23 / 18.01 | 112.69 / 112.78 / 136.50 | 0.19 / 0.91 / 2.63 | 13.18 | 15.67 / 22.43 / 28.01 | 10.60 / 10.08 / 34.63 | 0.06 / 1.06 / 3.73 | 14.94 |
| k=1 | DGC-GNN | 12.73 / 21.88 / 32.23 | 110.94 / 113.17 / 120.14 | 0.06 / 0.23 / 1.40 | 14.86 | 10.20 / 37.64 / 44.04 | 10.15 / 11.53 / 27.93 | 0.01 / 0.15 / 3.00 | 19.00 |
| k=10 | BPnPNet [5] | 10.00 / 10.00 / 10.03 | 104.68 / 135.94 / 160.54 | 4.67 / 7.30 / 10.92 | 10.84 | 10.36 / 10.72 / 10.97 | 16.63 / 34.69 / 67.77 | 1.64 / 3.30 / 5.97 | 10.74 |
| k=10 | GoMatch [56] | 10.91 / 18.98 / 31.12 | 111.18 / 114.94 / 128.97 | 10.08 / 0.35 / 2.08 | 14.25 | 18.90 / 35.67 / 44.99 | 10.18 / 11.29 / 16.65 | 0.02 / 0.12 / 1.92 | 18.76 |
| k=10 | DGC-GNN | 11.76 / 31.74 / 48.11 | 110.67 / 111.49 / 117.62 | 10.04 / 0.11 / 0.53 | 16.42 | 15.30 / 51.70 / 60.01 | 10.07 / 10.26 / 15.41 | 0.01 / 0.02 / 0.57 | 13.36 |
Table 1: 2D-3D Matching. We present AUC scores for reprojection error thresholds of 1, 5, and 10 pixels; rotation and translation error quantiles at 25, 50, and 75%; and matching precision (P). Parameter $k$ is the number of images retrieved from the database to narrow down the search space. The best results are bold. DGC-GNN nearly doubles the AUC scores of GoMatch and reduces the pose errors to $\approx 33\%$ of those of GoMatch.

Initialization. As an initialization procedure for the cluster attention module, we run the general self- and cross-attention modules proposed in GoMatch [56]. For each local point $\mathbf{b}_{\mathbf{p},n}$, we construct a local graph from its $k^{\prime}$ nearest neighbours in Euclidean space and update the associated feature $\mathbf{\widetilde{f}}_{\mathbf{p},n}\in\mathbb{R}^{2d}$ by Eq. 5. Note that we ignore the angular embedding at this stage because its space complexity of $\mathcal{O}(Nk^{\prime 2})$, where $N$ is the number of local points, makes the memory requirements unaffordable. We then use linear attention [21, 49] as the cross-attention mechanism, which allows each point in one modality to interact with all points from the other modality. This not only facilitates inter-modality interaction during feature matching but also reduces the computational complexity from $\mathcal{O}(N^{2})$ to $\mathcal{O}(N)$.
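The complexity reduction of linear attention can be illustrated with a small NumPy sketch. It assumes the common kernel formulation with the feature map $\phi(x)=\mathrm{elu}(x)+1$; the text only cites [21, 49], so this particular feature map, the function names, and the `eps` constant are assumptions rather than the exact module used here.

```python
import numpy as np

def elu_plus_one(x):
    # Feature map phi(x) = elu(x) + 1, which is positive everywhere.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V, eps=1e-6):
    """Cross-attention in O(N) for fixed feature size: Q is (N, d), K is (M, d), V is (M, d_v)."""
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    kv = phi_k.T @ V                      # (d, d_v): summarises all keys and values once
    z = phi_q @ phi_k.sum(axis=0)         # (N,): per-query normaliser
    return (phi_q @ kv) / (z[:, None] + eps)

def quadratic_reference(Q, K, V, eps=1e-6):
    """Same result computed through the explicit (N, M) similarity matrix."""
    sim = elu_plus_one(Q) @ elu_plus_one(K).T
    return (sim @ V) / (sim.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
f_p = rng.normal(size=(1024, 16))         # 2D keypoint features (illustrative)
f_q = rng.normal(size=(2048, 16))         # 3D point features (illustrative)
updated_p = linear_attention(f_p, f_q, f_q)
```

Because the key-value summary `kv` is computed once and reused for every query, the $N \times M$ similarity matrix is never materialised.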

Cluster-based Attention. After the graph initialization, the features $\mathbf{\widetilde{f}}_{\mathbf{p}}$ and $\mathbf{\widetilde{f}}_{\mathbf{q}}$, coming from the image and the point cloud, respectively, are concatenated and processed in a two-level hierarchical clustering attention module. The hierarchical structure is effective in suppressing erroneous groupings. At the first level, we cluster the feature vectors into $I$ coarse groups. At the second level, each coarse group is divided into several small groups. The local point information exchange is conducted at each level and only within each group to obtain more representative features. After the sparse clustering, each feature vector is transformed back to its original position, and the set is split again into $\mathbf{\widetilde{f}^{\prime}}_{\mathbf{p}}$ and $\mathbf{\widetilde{f}^{\prime}}_{\mathbf{q}}$ to obtain the keypoint features in both the 2D and 3D spaces.
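The concatenate, group, attend, scatter-back, and split steps can be sketched as follows. This is a simplified illustration, not the module of [45]: the group labels are drawn at random as hypothetical stand-ins for the learned clustering, the attention has no learned projections, and all sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_within_groups(feats, labels):
    """Self-attention restricted to points that share a group label."""
    out = feats.copy()
    d = feats.shape[1]
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        x = feats[idx]
        w = softmax(x @ x.T / np.sqrt(d), axis=-1)
        out[idx] = w @ x                  # written back to the original positions
    return out

rng = np.random.default_rng(0)
N, M, d = 6, 5, 4
f_p = rng.normal(size=(N, d))             # 2D features
f_q = rng.normal(size=(M, d))             # 3D features
joint = np.concatenate([f_p, f_q], axis=0)

coarse = rng.integers(0, 2, size=N + M)                # hypothetical coarse groups
fine = coarse * 2 + rng.integers(0, 2, size=N + M)     # fine groups nested in the coarse ones

joint = attention_within_groups(joint, coarse)         # first level
joint = attention_within_groups(joint, fine)           # second level
f_p_new, f_q_new = joint[:N], joint[N:]                # split back into 2D and 3D sets
```

Since a group can contain both 2D and 3D points, message passing inside a group mixes the two modalities while its cost stays bounded by the group size.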

Optimal Transport. We calculate the cost matrix $\mathcal{M}\in\mathbb{R}^{N\times M}$ between the two transformed feature sets using the $L_{2}$ distance between pairs of features. Thus, $\mathcal{M}(n,m)=||\mathbf{\widetilde{f}^{\prime}}_{\mathbf{p},n}-\mathbf{\widetilde{f}^{\prime}}_{\mathbf{q},m}||_{2}$. Following [38], the cost matrix $\mathcal{M}$ is extended to $\mathcal{\bar{M}}$ by adding an additional row and column as dustbins for unmatched points. We then iteratively optimize $\mathcal{\bar{M}}$ by running the Sinkhorn algorithm [47, 9] in a declarative layer to obtain the score matrix $\mathcal{\bar{S}}$. Finally, $\mathcal{\bar{S}}$ is converted to $\mathcal{S}\in\mathbb{R}^{N\times M}$ by dropping the dustbins. The initial 2D-3D match candidates are acquired by mutual top-1 search, thus

$$\mathcal{M}_{init}=\{(\widetilde{n},\widetilde{m})\ |\ \forall(\widetilde{n},\widetilde{m})\in\text{MNN}(\mathcal{S})\}, \tag{8}$$

where MNN is the mutual nearest neighbors operator. The set $\mathcal{M}_{init}$ provides the initial 2D-3D matches, which we further filter in Sec. 2.3 to keep only the accurate correspondences.
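The dustbin-augmented Sinkhorn step and the mutual top-1 search of Eq. 8 can be sketched as below. The sketch uses a log-domain Sinkhorn in the style of [38] with negated costs as scores. The dustbin score `alpha`, the iteration count, and the function names are illustrative assumptions, and this is not necessarily how the declarative layer is implemented.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_with_dustbins(cost, alpha=-1.0, iters=20):
    """Return log-scores of shape (N+1, M+1); the last row and column are dustbins."""
    N, M = cost.shape
    Z = np.full((N + 1, M + 1), alpha)     # dustbin entries share one score (assumed value)
    Z[:N, :M] = -cost                      # low cost means high score
    norm = -np.log(N + M)
    log_mu = np.concatenate([np.full(N, norm), [np.log(M) + norm]])
    log_nu = np.concatenate([np.full(M, norm), [np.log(N) + norm]])
    u, v = np.zeros(N + 1), np.zeros(M + 1)
    for _ in range(iters):
        u = log_mu - logsumexp(Z + v[None, :], axis=1)
        v = log_nu - logsumexp(Z + u[:, None], axis=0)
    return Z + u[:, None] + v[None, :] - norm

def mutual_nearest_neighbors(S):
    """Pairs (n, m) that are each other's best-scoring partner."""
    row_best = S.argmax(axis=1)
    col_best = S.argmax(axis=0)
    return [(n, int(m)) for n, m in enumerate(row_best) if col_best[m] == n]

rng = np.random.default_rng(0)
f_p = rng.normal(size=(5, 8))
f_q = rng.normal(size=(7, 8))
cost = np.linalg.norm(f_p[:, None, :] - f_q[None, :, :], axis=-1)   # L2 cost matrix
S_bar = sinkhorn_with_dustbins(cost)
S = S_bar[:-1, :-1]                                                 # drop the dustbins
matches_init = mutual_nearest_neighbors(S)
```

Points whose best partner does not select them in return are left unmatched, which keeps the initial set conservative before outlier rejection.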

2.3 Outlier Rejection

After obtaining the initial matches, an outlier pruning step removes the incorrect ones. We apply the same outlier rejection network as in GoMatch [56]. It takes as input the concatenated 2D and 3D keypoint features $\mathbf{\widetilde{f}^{\prime}}_{\widetilde{n},\widetilde{m}}=\mathbf{\widetilde{f}^{\prime}}_{\mathbf{p},\widetilde{n}}\oplus\mathbf{\widetilde{f}^{\prime}}_{\mathbf{q},\widetilde{m}}$ and outputs the matching confidence of each matched pair. The final predicted matches are obtained as follows:

$$\mathcal{M}_{final}=\{(\widetilde{n}^{\prime},\widetilde{m}^{\prime})\ |\ \forall\ \text{cls}(\mathbf{\widetilde{f}^{\prime}}_{\widetilde{n},\widetilde{m}}\;|\;(\widetilde{n},\widetilde{m})\in\mathcal{M}_{init})\geq\theta\}, \tag{9}$$

where $\theta$ is the matching confidence threshold.

2.4 Training Loss

We use the same training loss as GoMatch. The loss function $\mathcal{L}$ consists of two terms, the matching loss $\mathcal{L}_{ot}$ and the classification loss $\mathcal{L}_{or}$. The ground truth match set $\mathcal{M}_{gt}$ is estimated by reprojecting the 3D points to the 2D image plane and calculating the pixel distance. We also include point sets $\mathcal{I}$ and $\mathcal{J}$ for the unmatched points in $\mathcal{P}$ and $\mathcal{Q}$, respectively. The matching loss $\mathcal{L}_{ot}$ minimizes the negative log-likelihood of the matching score $\mathcal{\bar{S}}$.

$$\mathcal{L}_{ot}=-\frac{1}{|\mathcal{M}_{gt}|+|\mathcal{I}|+|\mathcal{J}|}\Big(\sum_{(n,m)\in\mathcal{M}_{gt}}\log\mathcal{\bar{S}}_{n,m}+\sum_{i\in\mathcal{I}}\log\mathcal{\bar{S}}_{i,M+1}+\sum_{j\in\mathcal{J}}\log\mathcal{\bar{S}}_{N+1,j}\Big). \tag{10}$$

The classification loss is defined as

$$\mathcal{L}_{or}=-\frac{1}{|\mathcal{M}_{init}|}\sum_{i=1}^{|\mathcal{M}_{init}|}w_{i}\big(y_{i}\log(p_{i})+(1-y_{i})\log(1-p_{i})\big), \tag{11}$$

where $w_{i}$ is the weight balancing the positive and negative samples, $y_{i}$ is the ground truth matching label of the $i$-th correspondence, and $p_{i}$ is the predicted probability that the $i$-th correspondence is a true match. The total loss is the sum of the two terms, $\mathcal{L}=\mathcal{L}_{ot}+\mathcal{L}_{or}$.
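A minimal NumPy sketch of the two loss terms is given below. It assumes that $\mathcal{\bar{S}}$ is stored as a probability matrix of shape $(N+1)\times(M+1)$ whose last row and column are the dustbins, and that the weights $w_i$ are supplied by the caller, since their computation is not specified here. Function and argument names are illustrative.

```python
import numpy as np

def matching_loss(S_bar, gt_matches, unmatched_p, unmatched_q):
    """Eq. 10: negative log-likelihood over matches and dustbin assignments."""
    N, M = S_bar.shape[0] - 1, S_bar.shape[1] - 1
    total = sum(np.log(S_bar[n, m]) for n, m in gt_matches)
    total += sum(np.log(S_bar[i, M]) for i in unmatched_p)   # dustbin column (index M+1 in 1-based notation)
    total += sum(np.log(S_bar[N, j]) for j in unmatched_q)   # dustbin row (index N+1 in 1-based notation)
    count = len(gt_matches) + len(unmatched_p) + len(unmatched_q)
    return float(-total / count)

def classification_loss(p, y, w, eps=1e-7):
    """Eq. 11: weighted binary cross-entropy over the initial matches."""
    p = np.clip(p, eps, 1.0 - eps)        # guard against log(0)
    return float(-np.mean(w * (y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

# Toy example with N = 2 keypoints and M = 3 points.
S_bar = np.full((3, 4), 0.5)
loss_ot = matching_loss(S_bar, gt_matches=[(0, 0)], unmatched_p=[1], unmatched_q=[1, 2])
loss_or = classification_loss(np.array([0.9, 0.2]), np.array([1.0, 0.0]), np.array([1.0, 1.0]))
total_loss = loss_ot + loss_or
```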

3 Experiments

Training. We train the indoor model of DGC-GNN on the ScanNet [10] dataset and the outdoor model on the MegaDepth [24] dataset. We extract up to 1024 keypoints from each training image with the SIFT detector [27]. As in GoMatch, we first select a subset of the point cloud by applying image retrieval approaches [1, 50] to obtain potential images observing the same part of the scene as the input one. We randomly sample the retrieval pairs with a visual overlap of more than 35% on MegaDepth and 65% on ScanNet to ensure enough matches on each pair. For the global geometric embedding, we cluster the 2D/3D bearing vectors into $X=10$ groups, and each cluster center is connected to its $k=4$ nearest neighbors to build the global graph. For the local point graph, we connect each point with its 10 nearest neighbors, and the cluster-based attention is performed twice to force the intra-cluster information exchange.
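The two graph sizes mentioned above can be illustrated with a brute-force nearest-neighbour search. The sketch assumes 3-D unit bearing vectors and uses random vectors in place of real keypoints and cluster centers; excluding a point from its own neighbour list is also an assumption.

```python
import numpy as np

def knn_graph(points, k):
    """Indices of the k nearest neighbours of each point, excluding the point itself."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]

def random_unit_vectors(rng, n):
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(0)
centers = random_unit_vectors(rng, 10)       # stand-ins for the X = 10 cluster centers
global_edges = knn_graph(centers, k=4)       # global graph: (10, 4) neighbour indices

bearings = random_unit_vectors(rng, 1024)    # stand-ins for up to 1024 keypoints
local_edges = knn_graph(bearings, k=10)      # local graph: (1024, 10) neighbour indices
```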

We use the Adam optimizer with a learning rate of 1e-3. We train DGC-GNN on one 32GB Tesla V100 GPU. The model typically requires 50 epochs to converge.

Datasets. We use ScanNet and MegaDepth for training and for evaluating the 2D-3D matching task. As a downstream application, we perform visual localization on the 7Scenes [46] and Cambridge Landmarks [22] datasets. MegaDepth is a popular outdoor dataset with 196 scenes captured around the world. The sparse 3D reconstructions are provided by the COLMAP [43] structure-from-motion software. Following [56], we train our outdoor model on 99 scenes and evaluate it on 53 scenes. ScanNet is a large-scale RGB-D indoor dataset comprising 1613 scans with over 2.5 million images. We randomly select 105 scenes for training and 30 for evaluation. Cambridge Landmarks is a medium-scale outdoor dataset consisting of 6 individual scenes. A structure-from-motion algorithm provides the ground truth camera poses. We follow [22, 56] and evaluate our method on four scenes. 7Scenes is a small indoor dataset with RGB-D images and camera poses provided by a depth SLAM system. We evaluate on the standard test sequences.

Evaluation Protocol. For matching on ScanNet and MegaDepth, we follow [56] and report the AUC score calculated from the reprojection errors. To calculate the errors for the 2D-3D matches in $\mathcal{M}_{final}$, we project the 3D points to the image plane using the ground truth and estimated camera poses. Then, we calculate the $L_{2}$ distance between the ground truth and estimated reprojected 2D points. We use multiple thresholds, 1, 5, and 10 pixels, to evaluate the AUC scores. The camera translation and rotation error quantiles at 25%, 50%, and 75% are also reported. Moreover, we evaluate the matching quality by calculating the matching precision P, which is the ratio of inlier matches after PnP-RANSAC to the number of final matches in $\mathcal{M}_{final}$. For visual localization tasks, we report the median translation (in meters) and rotation (in degrees) camera pose errors.
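The reprojection-based AUC can be sketched as follows. Two points are assumptions rather than statements of the protocol in [56]: the per-query error is taken as the mean distance over the matched 3D points, and the AUC is the area under the cumulative recall curve up to each threshold, normalized by the threshold. Poses are world-to-camera pairs (R, t).

```python
import numpy as np

def project(points_3d, R, t, K):
    """Project world points into the image with pose (R, t) and intrinsics K."""
    cam = points_3d @ R.T + t
    xy = cam[:, :2] / cam[:, 2:3]
    return xy @ K[:2, :2].T + K[:2, 2]

def reprojection_error(points_3d, pose_gt, pose_est, K):
    """Mean L2 distance (pixels) between GT and estimated reprojections (mean is an assumption)."""
    uv_gt = project(points_3d, *pose_gt, K)
    uv_est = project(points_3d, *pose_est, K)
    return float(np.linalg.norm(uv_gt - uv_est, axis=1).mean())

def auc(errors, thresholds):
    """Area under the recall-vs-error curve up to each threshold, normalised to [0, 1]."""
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    errors = np.concatenate([[0.0], errors])
    recall = np.concatenate([[0.0], recall])
    scores = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        e = np.concatenate([errors[:last], [t]])
        r = np.concatenate([recall[:last], [recall[last - 1]]])
        area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)   # trapezoidal rule
        scores.append(float(area / t))
    return scores

K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
pose_gt = (np.eye(3), np.zeros(3))
pose_est = (np.eye(3), np.array([0.02, 0.0, 0.0]))
err = reprojection_error(pts, pose_gt, pose_est, K)
scores = auc([err], thresholds=[1, 5, 10])
```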

3.1 2D-3D Matching

We compare with the two descriptor-free matchers GoMatch [56] and BPnPNet [5]. At inference, we match the keypoints of the query image against the 3D points from the top-$k$ retrieved database images. Following [56], we report the upper bound of the AUC score using the ground truth matches. We refer to these values as Oracle. We select the GT matches by thresholding the reprojection error in normalized image coordinates, using a threshold of $0.001$, to bypass the influence of camera intrinsics during GT selection. This is in contrast to what is done in [56]. Results on GT selected by a pixel threshold are in the supplementary material. We use the official code with the default settings to generate the evaluation dataset on MegaDepth [24] and rerun GoMatch and BPnPNet with the released models. Note that we also tested GoMatch after retraining it on MegaDepth and achieved results similar to those of the released model.
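GT selection in normalized image coordinates can be sketched as below. The keypoints are mapped through the inverse intrinsics so that the 0.001 threshold is independent of focal length. Accepting every pair under the threshold and discarding points behind the camera are assumptions of this sketch, as are the function names.

```python
import numpy as np

def normalize_keypoints(kpts_px, K):
    """Map pixel coordinates to normalized image coordinates via K^-1."""
    homog = np.concatenate([kpts_px, np.ones((len(kpts_px), 1))], axis=1)
    return (homog @ np.linalg.inv(K).T)[:, :2]

def select_gt_matches(kpts_px, points_3d, R, t, K, thresh=1e-3):
    """Pairs (keypoint index, 3D point index) whose normalized reprojection error is below thresh."""
    kn = normalize_keypoints(kpts_px, K)
    cam = points_3d @ R.T + t                       # GT world-to-camera pose
    in_front = cam[:, 2] > 0
    proj = cam[:, :2] / np.where(in_front, cam[:, 2], 1.0)[:, None]
    dist = np.linalg.norm(kn[:, None, :] - proj[None, :, :], axis=-1)
    dist[:, ~in_front] = np.inf                     # points behind the camera never match
    return [(int(n), int(m)) for n, m in np.argwhere(dist < thresh)]

K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
kpts = np.array([[50.0, 50.0], [80.0, 50.0]])
pts = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0], [0.0, 0.0, -2.0]])
gt = select_gt_matches(kpts, pts, np.eye(3), np.zeros(3), K)
```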

Methods G. Emb. C. Att. Color Ang. Reproj. AUC (%) Rotation (°) Translation
Sec. 2.2.2 Sec. 2.2.3 Sec. 2.2.1 Sec. 2.2.2 @1 / 5 / 10px (↑) Quantile @25 / 50 / 75% (↓)
GoMatch [56] 18.90 / 35.67 / 44.99 0.18 / 1.29 / 16.65 0.02 / 0.12 / 1.92
Variants 10.86 / 41.18 / 50.51 0.13 / 0.76 / 13.47 0.01 / 0.07 / 1.62
11.64 / 44.46 / 53.99 0.11 / 0.55 / 19.49 0.01 / 0.05 / 1.05
13.20 / 46.33 / 54.34 0.09 / 0.41 / 19.98 0.01 / 0.03 / 1.19
14.19 / 48.34 / 56.54 0.08 / 0.34 / 19.23 0.01 / 0.03 / 1.03
DGC-GNN 15.30 / 51.70 / 60.01 0.07 / 0.26 / 15.41 0.01 / 0.02 / 0.57
Table 2: Ablation Study. AUC scores thresholded at 1, 5, and 10 pixels; rotation and translation error quantiles at 25, 50, 75% with the proposed components added one by one to the GoMatch pipeline on the MegaDepth dataset.

Matching Results. The results with $k=1$ and $k=10$ are presented in Table 1. Parameter $k$ is the number of retrieved image pairs that are used for evaluation. The proposed method outperforms GoMatch and BPnPNet by a significant margin on both datasets. Specifically, DGC-GNN achieves 10.20 / 37.64 / 44.04% reprojection AUC compared to GoMatch with 5.67 / 22.43 / 28.01% on MegaDepth with $k=1$. DGC-GNN halves the rotation and translation errors of GoMatch at all quantiles and obtains better matching quality. Notably, the performance of DGC-GNN with $k=1$ surpasses that of GoMatch with $k=10$, indicating the effectiveness of our method even with a single view.

Figure 4: Outlier Sensitivity. The AUC scores of BPnPNet [5], GoMatch [56], and the proposed DGC-GNN thresholded at 1, 5, and 10 pixels are plotted as a function of the outlier ratio. Oracle represents the AUC upper bound using ground truth matches.

Sensitivity to Outliers. To evaluate the sensitivity to keypoint outliers, we follow the procedure in GoMatch [56]. The outliers are controlled by the outlier ratio, ranging from 0 to 1, calculated as the number of unmatched keypoints divided by the maximum of the numbers of 2D and 3D points. If the outlier ratio is $0$, all the input 2D and 3D points are selected from the ground truth matches, and no outliers are included in the matching process. When it is $1$, we directly use the keypoints from the query image and 3D points from the top-$k$ retrieved images without any filtering or outlier removal. The results are shown in Fig. 4. Even in the presence of outliers, DGC-GNN outperforms other methods by a large margin. This indicates that our method is more robust to outliers and can handle challenging matching scenarios more effectively than the state-of-the-art.

Ablation Study. We investigate the effectiveness of the different components of DGC-GNN on the 2D-3D matching quality on the MegaDepth dataset [24] with $k=10$. The results are reported in Table 2. We provide results with $k=1$ in the supplementary material. We conduct the ablations by gradually adding the components to the original GoMatch pipeline: global geometric embedding (G. Emb.), cluster attention (C. Att.), color (Color), and angular embedding (Ang.). Incorporating color information into the matching process significantly impacts the performance, resulting in improvements of 2.55 / 3.88 / 2.55% (AUC@1 / 5 / 10px). This demonstrates the importance of color cues for accurate and robust matching. The global-to-local geometric embedding (G. Emb.) and the angular relation embedding (Ang.) substantially improve the matching performance by 1.90 / 5.51 / 5.52% and 1.11 / 3.36 / 3.47%, respectively. This highlights the effectiveness of incorporating global geometric context and local geometric details. The cluster attention mechanism also plays a vital role, improving performance by 0.78 / 3.28 / 3.48%. The best results are obtained when all components are added to the pipeline.


Methods No Desc. Maint. Privacy Cambridge-Landmarks [22] (cm, °) MB used 7Scenes [46] (cm, °) MB used
King’s Hospital Shop St. Mary’s Chess Fire Heads Office Pumpkin Kitchen Stairs
E2E MS-Trans. [44] 83 / 1.47 181 / 2.39 86 / 3.07 162 / 3.99 1171 11 / 4.66 24 / 9.60 14 / 12.19 17 / 5.66 18 / 4.44 17 / 5.94 26 / 8.45 11171
DSAC* [3] 15 / 0.30 121 / 0.40 15 / 0.30 113 / 0.40 1112 12 / 1.10 12 / 1.24 11 / 1.82 13 / 1.15 14 / 1.34 14 / 1.68 13 / 1.16 11196
HSCNet [23] 18 / 0.30 119 / 0.30 16 / 0.30 119 / 0.30 1592 12 / 0.70 12 / 0.90 11 / 0.90 13 / 0.80 14 / 1.00 14 / 1.20 13 / 0.80 11036
DB HybridSC [6] 81 / 0.59 175 / 1.01 19 / 0.54 150 / 0.49 1113 - - - - - - - -
AS [41] 13 / 0.22 120 / 0.36 14 / 0.21 118 / 0.25 1813 13 / 0.87 12 / 1.01 11 / 0.82 14 / 1.15 17 / 1.69 15 / 1.72 4 / 1.01 -
SP [11]+SG [38] 12 / 0.20 115 / 0.30 14 / 0.20 117 / 0.21 3215 12 / 0.85 12 / 0.94 11 / 0.75 13 / 0.92 15 / 1.30 14 / 1.40 15 / 1.47 22977
DF GoMatch [56] 25 / 0.64 283 / 8.14 48 / 4.77 335 / 9.94 1148 14 / 1.65 13 / 3.86 19 / 5.17 11 / 2.48 16 / 3.32 13 / 2.84 89 / 21.12 11302
DGC-GNN 18 / 0.47 75 / 2.83 15 / 1.57 106 / 4.03 1169 13 / 1.43 15 / 1.77 14 / 2.95 16 / 1.61 18 / 1.93 18 / 2.09 71 / 19.5 11355
Table 3: Visual Localization. We report the median pose errors (cm, °) and storage requirements (MB) on the scenes of the 7Scenes [46] and Cambridge-Landmarks [22] datasets. Three groups of methods are shown: end-to-end (E2E), descriptor-based (DB), and descriptor-free (DF). We do not show BPnPNet as it fails on most scenes. The best results are shown in bold in each group.

3.2 Visual Localization

Visual localization estimates the 6 degrees-of-freedom camera pose of an input query image with respect to a known map of the scene. One of the most prominent ways of approaching this problem is to establish 2D-3D correspondences and run robust pose estimation. Following [56], we run the proposed DGC-GNN to obtain matches. For each query image, we match its keypoints with the 3D points from the top-10 retrieved views to build the 2D-3D correspondences. The camera pose is then estimated by PnP-RANSAC [16, 17]. We use two standard datasets, 7Scenes [46] and Cambridge Landmarks [22]. For 7Scenes, we extract the keypoints with the SIFT detector, and the top 10 pairs are retrieved using DenseVLAD [50]. For Cambridge Landmarks, the keypoints are extracted by SuperPoint [11] to ensure consistency with the SuperPoint-based structure-from-motion model. The top 10 pairs are provided by NetVLAD [1].
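Once a pose has been estimated, the translation and rotation errors reported for localization can be computed as in the following sketch. It assumes world-to-camera poses (R, t), measures the translation error between camera centers, and measures the rotation error as the angle of the relative rotation. These conventions and the function names are assumptions, not taken from the evaluation code.

```python
import numpy as np

def pose_errors(R_gt, t_gt, R_est, t_est):
    """Return (translation error, rotation error in degrees) between two poses."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    rot_deg = float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    c_gt, c_est = -R_gt.T @ t_gt, -R_est.T @ t_est      # camera centers in world coordinates
    return float(np.linalg.norm(c_gt - c_est)), rot_deg

def median_errors(per_query):
    """Median translation and rotation error over a list of (t_err, r_err) pairs."""
    e = np.asarray(per_query, dtype=float)
    return float(np.median(e[:, 0])), float(np.median(e[:, 1]))

# Toy example: the estimate is rotated by 90 degrees about the z-axis.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_err, r_err = pose_errors(np.eye(3), np.zeros(3), Rz, np.zeros(3))
```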

3.2.1 Results

In Table 3, we present the 3D model maintenance costs, privacy, storage requirements, and camera pose median errors (cm, °) of standard descriptor-based localization techniques and descriptor-free methods. DGC-GNN consistently outperforms GoMatch on all scenes by a significant margin. On Cambridge Landmarks, the average median error of DGC-GNN is 54 cm / 2.23°, while GoMatch leads to 173 cm / 5.87° error. On 7Scenes, the average error of DGC-GNN is 15 cm / 4.47°, and that of GoMatch is 22 cm / 5.77°. DGC-GNN requires a similar amount of memory to other descriptor-free methods. Also, it inherits their privacy-preserving properties due to not requiring visual descriptors.
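The Cambridge Landmarks averages quoted above can be reproduced from the per-scene medians in Table 3. The short check below copies those four values per method and averages them; the dictionary layout and function name are only illustrative.

```python
# Median errors per Cambridge Landmarks scene, copied from Table 3 as (cm, degrees),
# in the order King's, Hospital, Shop, St. Mary's.
cambridge = {
    "DGC-GNN": [(18, 0.47), (75, 2.83), (15, 1.57), (106, 4.03)],
    "GoMatch": [(25, 0.64), (283, 8.14), (48, 4.77), (335, 9.94)],
}

def average_errors(per_scene):
    """Average the translation and rotation medians over the scenes."""
    n = len(per_scene)
    return sum(t for t, _ in per_scene) / n, sum(r for _, r in per_scene) / n

for name, rows in cambridge.items():
    t_cm, r_deg = average_errors(rows)
    print(f"{name}: {t_cm:.2f} cm / {r_deg:.3f} deg")
```

Up to rounding, the averages agree with the 54 cm / 2.23° and 173 cm / 5.87° stated in the text.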

The trade-off between descriptor-based (DB) and descriptor-free (DF) algorithms is visible from the table. While descriptor-based ones lead to the best accuracy overall, they require excessive memory and descriptor maintenance and are susceptible to privacy attacks. Although the model compression method, HybridSC [6], shows effectiveness in storage saving, it achieves similar performance compared to DGC-GNN on the Cambridge Landmarks dataset while still requiring descriptor maintenance. End-to-end methods (E2E) overcome these problems and achieve accurate results. However, their main limitation is that such approaches must be trained independently on each scene. The proposed DGC-GNN only needs to be trained once, making it more efficient and convenient to use as an off-the-shelf tool.

3.2.2 Generalizability

| Scenes | Trained on MegaDepth [24] (SIFT): GoMatch (SIFT) | Trained on MegaDepth [24] (SIFT): GoMatch (SP) | Trained on MegaDepth [24] (SIFT): DGC-GNN (SIFT) | Trained on MegaDepth [24] (SIFT): DGC-GNN (SP) | Trained on ScanNet [10] (SIFT): DGC-GNN (SIFT) | Trained on ScanNet [10] (SIFT): DGC-GNN (SP) |
|---|---|---|---|---|---|---|
| Chess | 14 / 11.65 | 14 / 11.56 | 13 / 11.41 | 13 / 11.46 | 13 / 11.43 | 14 / 11.51 |
| Fire | 13 / 13.86 | 12 / 13.71 | 15 / 11.81 | 17 / 12.30 | 15 / 11.77 | 16 / 12.03 |
| Heads | 19 / 15.17 | 15 / 13.43 | 15 / 13.13 | 14 / 12.78 | 14 / 12.95 | 14 / 13.02 |
| Office | 11 / 12.48 | 17 / 11.76 | 17 / 11.66 | 17 / 11.66 | 16 / 11.61 | 17 / 11.66 |
| Pumpkin | 16 / 13.32 | 28 / 15.65 | 18 / 12.03 | 12 / 12.75 | 18 / 11.93 | 10 / 12.38 |
| Redkitchen | 13 / 12.84 | 14 / 13.03 | 18 / 12.14 | 10 / 12.36 | 18 / 12.09 | 19 / 12.28 |
| Stairs | 89 / 21.12 | 58 / 13.12 | 83 / 21.53 | 55 / 13.05 | 71 / 19.50 | 58 / 14.32 |
| All | 22 / 15.78 | 18 / 14.61 | 17 / 14.82 | 14 / 13.77 | 15 / 14.47 | 14 / 13.89 |

Table 4: Model Generalizability on Visual Localization Task. We report the translation and rotation median errors (cm / °) on the 7Scenes dataset [46]. We evaluate the models trained on the MegaDepth [24] and ScanNet [10] datasets. The best performance is bold.

Similar to [56], we discuss the generalizability of our DGC-GNN model on the visual localization task across different training and evaluation scenes. Specifically, we investigate the performance of our model when trained on MegaDepth [24] and ScanNet [10] and evaluated on the 7Scenes dataset [46]. We also explore the impact of using different keypoint detectors, namely SIFT [27] and SuperPoint [11], during the evaluation.

These experiments are summarized in Table 4, providing an overview of the performance of DGC-GNN under different training and evaluation conditions. While the best overall performance is achieved by training on the MegaDepth dataset with SIFT features, the results are similar in both training scenarios, showcasing that the proposed method generalizes well to unseen data. While we train on SIFT features, the best results are achieved by using SuperPoint features at inference time. This demonstrates that DGC-GNN is insensitive to the features used and can be utilized off-the-shelf even without retraining to our specific scenario.

4 Conclusion

In conclusion, this paper introduces DGC-GNN, a novel graph-based pipeline for visual descriptor-free 2D-3D matching that effectively leverages geometric and color cues in a global-to-local manner. Our global-to-local procedure encodes both Euclidean and angular relations at a coarse level, forming a geometric embedding to guide the local point matching. By employing a cluster-based transformer, we enable efficient information passing within local clusters, ultimately leading to significant improvements in the number of correct matches and the accuracy of pose estimation. Compared to the state-of-the-art descriptor-free matcher GoMatch [56], the proposed DGC-GNN demonstrates a substantial improvement, doubling the accuracy on real-world and large-scale datasets. Furthermore, it results in significantly increased localization accuracy. These advancements contribute to reducing the gap between descriptor-based and descriptor-free methods while addressing the limitations of descriptor-based ones, such as memory footprint, maintenance costs, and susceptibility to privacy attacks.

Limitations. The primary limitation of our proposed DGC-GNN method lies in its performance being inferior to traditional descriptor-based algorithms. The performance difference can be attributed to the insufficiency of unique 3D structures in the geometry, which hinders the ability of the algorithm to identify distinct matches in real-world scenarios. Although DGC-GNN demonstrates a notable improvement over existing descriptor-free approaches, there remains a performance gap to overcome in order to achieve results on par with or superior to those of descriptor-based methods.

Acknowledgements. This work was supported by the Academy of Finland (grants No. 327911, No. 353138) and the Hasler Stiftung Research Grant via the ETH Zurich Foundation. We acknowledge the computational resources provided by the CSC-IT Center for Science, Finland.

References

  • Arandjelovic et al. [2016] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5297–5307, 2016.
  • Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Brachmann and Rother [2021] Eric Brachmann and Carsten Rother. Visual camera re-localization from RGB and RGB-D images using DSAC. TPAMI, 2021.
  • Bradski [2000] Gary Bradski. The opencv library. Software Tools for the Professional Programmer, 25(11):120–123, 2000.
  • Campbell et al. [2020] Dylan Campbell, Liu Liu, and Stephen Gould. Solving the blind perspective-n-point problem end-to-end with robust differentiable geometric optimization. In Proceedings of the European Conference on Computer Vision (ECCV), page preprint. Springer, 2020.
  • Camposeco et al. [2019] Federico Camposeco, Andrea Cohen, Marc Pollefeys, and Torsten Sattler. Hybrid scene compression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7653–7662, 2019.
  • Chelani et al. [2021] Kunal Chelani, Fredrik Kahl, and Torsten Sattler. How privacy-preserving are line clouds? recovering scene details from 3d lines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15668–15678, 2021.
  • Cui and Tan [2015] Zhaopeng Cui and Ping Tan. Global structure-from-motion by similarity averaging. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 864–872, 2015.
  • Cuturi [2013] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013.
  • Dai et al. [2017] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
  • DeTone et al. [2018] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop, pages 224–236, 2018.
  • Dosovitskiy and Brox [2016a] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. Advances in neural information processing systems, 29, 2016a.
  • Dosovitskiy and Brox [2016b] Alexey Dosovitskiy and Thomas Brox. Inverting visual representations with convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4829–4837, 2016b.
  • Dusmanu et al. [2019] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8092–8101, 2019.
  • Eade and Drummond [2006] Ethan Eade and Tom Drummond. Scalable monocular slam. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), pages 469–476. IEEE, 2006.
  • Fischler and Bolles [1981] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  • Gao et al. [2003] Xiao-Shan Gao, Xiao-Rong Hou, Jianliang Tang, and Hang-Fei Cheng. Complete solution classification for the perspective-three-point problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):930–943, 2003.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Huang et al. [2021] Shengyu Huang, Zan Gojcic, Mikhail Usvyatsov, Andreas Wieser, and Konrad Schindler. Predator: Registration of 3d point clouds with low overlap. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 4267–4276, 2021.
  • Irschara et al. [2009] Arnold Irschara, Christopher Zach, Jan-Michael Frahm, and Horst Bischof. From structure-from-motion point clouds to fast location recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2599–2606. IEEE, 2009.
  • Katharopoulos et al. [2020] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
  • Kendall et al. [2015] Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015.
  • Li et al. [2020] Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, and Juho Kannala. Hierarchical scene coordinate classification and regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11983–11992, 2020.
  • Li and Snavely [2018] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
  • Lindenberger et al. [2021] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5987–5997, 2021.
  • Liu et al. [2020] Liu Liu, Dylan Campbell, Hongdong Li, Dingfu Zhou, Xibin Song, and Ruigang Yang. Learning 2d-3d correspondences to solve the blind perspective-n-point problem. arXiv preprint arXiv:2003.06752, 2020.
  • Lowe [2004] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • Maas et al. [2013] Andrew L Maas, Awni Y Hannun, Andrew Y Ng, et al. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the International Conference of Machine Learning, page 3, 2013.
  • Moré [2006] Jorge J Moré. The levenberg-marquardt algorithm: implementation and theory. In Numerical Analysis: Proceedings of the Biennial Conference, pages 105–116. Springer, 2006.
  • Mur-Artal and Tardós [2017] Raul Mur-Artal and Juan D Tardós. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
  • Mur-Artal et al. [2015] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
  • Ng et al. [2022] Tony Ng, Hyo Jin Kim, Vincent T Lee, Daniel DeTone, Tsun-Yi Yang, Tianwei Shen, Eddy Ilg, Vassileios Balntas, Krystian Mikolajczyk, and Chris Sweeney. Ninjadesc: content-concealing visual descriptors via adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12797–12807, 2022.
  • Nister and Stewenius [2006] David Nister and Henrik Stewenius. Scalable recognition with a vocabulary tree. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), pages 2161–2168. IEEE, 2006.
  • Pan et al. [2023] Linfei Pan, Johannes L Schönberger, Viktor Larsson, and Marc Pollefeys. Privacy preserving localization via coordinate permutations. In Proceedings of the International Conference on Computer Vision, pages 18174–18183, 2023.
  • Pittaluga et al. [2019] Francesco Pittaluga, Sanjeev J Koppal, Sing Bing Kang, and Sudipta N Sinha. Revealing scenes by inverting structure from motion reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 145–154, 2019.
  • Qin et al. [2022] Zheng Qin, Hao Yu, Changjian Wang, Yulan Guo, Yuxing Peng, and Kai Xu. Geometric transformer for fast and robust point cloud registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11143–11152, 2022.
  • Sarlin et al. [2019] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12716–12725, 2019.
  • Sarlin et al. [2020] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4938–4947, 2020.
  • Sattler et al. [2011] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Fast image-based localization using direct 2d-to-3d matching. In International Conference on Computer Vision, pages 667–674. IEEE, 2011.
  • Sattler et al. [2012] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Improving image-based localization by active correspondence search. In European conference on computer vision, pages 752–765. Springer, 2012.
  • Sattler et al. [2016] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE transactions on pattern analysis and machine intelligence, 39(9):1744–1756, 2016.
  • Sattler et al. [2018] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, et al. Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8601–8610, 2018.
  • Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
  • Shavit et al. [2021] Yoli Shavit, Ron Ferens, and Yosi Keller. Learning multi-scene absolute pose regression with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2733–2742, 2021.
  • Shi et al. [2022] Yan Shi, Jun-Xiong Cai, Yoli Shavit, Tai-Jiang Mu, Wensen Feng, and Kai Zhang. Clustergnn: Cluster-based coarse-to-fine graph neural network for efficient feature matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12517–12526, 2022.
  • Shotton et al. [2013] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.
  • Sinkhorn and Knopp [1967] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.
  • Snavely et al. [2006] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. In ACM siggraph, pages 835–846. 2006.
  • Sun et al. [2021] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Torii et al. [2015] Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1808–1817, 2015.
  • Ulyanov et al. [2016] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • Wang et al. [2021] Shuzhe Wang, Zakaria Laskar, Iaroslav Melekhov, Xiaotian Li, and Juho Kannala. Continual learning for image-based camera localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3252–3262, 2021.
  • Yang et al. [2022] Luwei Yang, Rakesh Shrestha, Wenbo Li, Shuaicheng Liu, Guofeng Zhang, Zhaopeng Cui, and Ping Tan. Scenesqueezer: Learning to compress scene for camera relocalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8259–8268, 2022.
  • Yu et al. [2021] Hao Yu, Fu Li, Mahdi Saleh, Benjamin Busam, and Slobodan Ilic. Cofinet: Reliable coarse-to-fine correspondences for robust pointcloud registration. Advances in Neural Information Processing Systems, 34:23872–23884, 2021.
  • Zhou et al. [2022] Qunjie Zhou, Sérgio Agostinho, Aljoša Ošep, and Laura Leal-Taixé. Is geometry enough for matching in visual localization? In Proceedings of the European Conference on Computer Vision (ECCV), pages 407–425. Springer, 2022.

Supplementary Material

Appendix A Training and Evaluation Details

Dataset Generation. The training data generation process for MegaDepth [24] follows the methodology outlined in GoMatch [56]. The undistorted SfM model reconstructions used in MegaDepth are provided by D2Net [14]. For training, we sample up to 500 images from each scene. For each sampled image, we select the top-$k$ co-visible views that have at least 35% image overlap. This ensures that there are enough matches for training. The overlapping score is computed by dividing the number of co-visible 3D points by the total number of points in the training image.
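The overlap criterion above can be sketched as follows. This is a minimal illustration, not the released data pipeline: it assumes each image is represented by the set of 3D point ids it observes, and the names `overlap_score` and `select_covisible_views` as well as the default `k=3` are hypothetical.

```python
from typing import Dict, List, Set


def overlap_score(train_points: Set[int], other_points: Set[int]) -> float:
    """Co-visible 3D points divided by the number of points in the training image."""
    if not train_points:
        return 0.0
    return len(train_points & other_points) / len(train_points)


def select_covisible_views(
    train_id: str,
    visible_points: Dict[str, Set[int]],
    k: int = 3,                # hypothetical default; the value of k is not fixed here
    min_overlap: float = 0.35,  # 35% overlap threshold from the text
) -> List[str]:
    """Return up to k view ids whose overlap with the training image is >= min_overlap."""
    train_points = visible_points[train_id]
    scored = [
        (overlap_score(train_points, points), view_id)
        for view_id, points in visible_points.items()
        if view_id != train_id
    ]
    kept = [(score, view_id) for score, view_id in scored if score >= min_overlap]
    # Highest overlap first; ties are broken by view id for determinism.
    kept.sort(key=lambda item: (-item[0], item[1]))
    return [view_id for _, view_id in kept[:k]]
```

Because the score is normalized by the training image only, it is asymmetric: a view that observes every point of the training image scores 1.0 even if it observes many additional points.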

In the case of ScanNet [10], a similar procedure is conducted. We also sample up to 500 images from each scene for the training set generation. The co-visible images are obtained using the co-visible scores provided by LoFTR [49]. We extract all the co-visible views of a training image with co-visible scores larger than 0.65. Then, we randomly sample the top-$k$ views for training. Since ScanNet is an RGB-D dataset without an SfM reconstruction, we obtain the 3D points for each image by projecting the detected 2D keypoints with valid depth to 3D. By doing this for each image, we reconstruct a sparse 3D point cloud based on the detected 2D keypoints. Note that the correspondence between different co-visible frames is not required in this case.
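Lifting keypoints with valid depth to 3D can be illustrated with a standard pinhole back-projection. The sketch below is an assumption about how such a step may look, not the authors' code: it returns points in the camera frame (a camera-to-world pose would still need to be applied), and the function name and the convention that missing depth is `None` or non-positive are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_keypoints(
    keypoints: Sequence[Tuple[float, float]],
    depths: Sequence[Optional[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Tuple[int, Point3D]]:
    """Lift keypoints (u, v) with valid depth to camera-frame 3D points.

    Returns (keypoint_index, (X, Y, Z)) pairs. Keypoints whose depth is
    missing (None) or non-positive are skipped.
    """
    points = []
    for index, ((u, v), depth) in enumerate(zip(keypoints, depths)):
        if depth is None or depth <= 0.0:
            continue  # no valid depth measurement at this keypoint
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        points.append((index, (x, y, depth)))
    return points
```

Keeping the keypoint index alongside each 3D point preserves the 2D-3D association of the source image, which is why no cross-frame correspondence is needed to build the sparse cloud.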

In total, for MegaDepth, we generate a training set consisting of 25,624 images from 99 scenes and a test set comprising 12,399 images covering 53 scenes. For ScanNet, we create a training set with 52,008 images from 105 scenes. The test set for ScanNet consists of 14,892 query images from 30 scenes. The data generation for 7Scenes [46] and the Cambridge dataset [22] follows the same procedure as in [56].

Figure 5: Points Reprojection and Image Recovery Example.

Method       | Input       | SSIM (↓)
Reprojection | Points      | 0.240
Reprojection | Points+RGB  | 0.258
InvSFM [35]  | Points      | 0.352
InvSFM [35]  | Points+RGB  | 0.375
InvSFM [35]  | Points+SIFT | 0.476

Table 5: SSIM Results. We evaluate the SSIM for point reprojection and image recovery. Adding RGB to points leads only to a slight SSIM increase for both reprojection and image recovery.

Inference. We consider a query with at least 10 keypoints as valid input. The 3D points from the top-$k$ retrieved database images are then matched against the query keypoints with our proposed pipeline. We use the Sinkhorn algorithm [47, 9] to optimize the extended cost matrix $\bar{\mathcal{M}} \in \mathbb{R}^{N+1, M+1}$ iteratively, with up to 20 iterations, to obtain the initial matches. The final matches are obtained by discarding, in the outlier rejection module, the matches with matching confidence $\theta < 0.5$. For the visual localization task, the camera poses are estimated by the P3P solver with RANSAC [16] implemented in OpenCV [4] and then refined with the Levenberg-Marquardt algorithm [29] on the inlier matches, minimizing the reprojection error.
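To make the matching step concrete, the following is a minimal log-domain Sinkhorn sketch over a score matrix extended with a dustbin row and column, in the style popularized by SuperGlue [38]. Several points are assumptions rather than details taken from the paper: the entries are treated as similarity scores (higher is better), the dustbin score is a fixed constant, the loop always runs the full 20 iterations, and the confidences passed to `filter_by_confidence` stand in for the output of the learned outlier rejection module, which is not reproduced here. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Match = Tuple[int, int, float]


def _logsumexp(values: Sequence[float]) -> float:
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sinkhorn_with_dustbin(
    scores: Sequence[Sequence[float]], dustbin_score: float = 0.0, max_iters: int = 20
) -> List[List[float]]:
    """Return an (N+1) x (M+1) soft assignment; the last row/column are dustbins."""
    n, m = len(scores), len(scores[0])
    z = [list(row) + [dustbin_score] for row in scores]
    z.append([dustbin_score] * (m + 1))
    # Log marginals: every real point carries mass 1, the dustbins absorb the rest.
    log_mu = [0.0] * n + [math.log(m)]
    log_nu = [0.0] * m + [math.log(n)]
    u = [0.0] * (n + 1)
    v = [0.0] * (m + 1)
    for _ in range(max_iters):
        u = [log_mu[i] - _logsumexp([z[i][j] + v[j] for j in range(m + 1)])
             for i in range(n + 1)]
        v = [log_nu[j] - _logsumexp([z[i][j] + u[i] for i in range(n + 1)])
             for j in range(m + 1)]
    return [[math.exp(z[i][j] + u[i] + v[j]) for j in range(m + 1)]
            for i in range(n + 1)]


def mutual_matches(plan: Sequence[Sequence[float]]) -> List[Match]:
    """Keep (i, j, probability) where i and j select each other and neither picks a dustbin."""
    n, m = len(plan) - 1, len(plan[0]) - 1
    matches = []
    for i in range(n):
        j = max(range(m + 1), key=lambda col: plan[i][col])
        if j == m:
            continue  # row i is assigned to the dustbin
        best_row = max(range(n + 1), key=lambda row: plan[row][j])
        if best_row == i:
            matches.append((i, j, plan[i][j]))
    return matches


def filter_by_confidence(
    matches: Sequence[Match], confidences: Sequence[float], threshold: float = 0.5
) -> List[Match]:
    """Discard matches whose (externally predicted) confidence is below the threshold."""
    return [match for match, conf in zip(matches, confidences) if conf >= threshold]
```

The dustbins let unmatched keypoints and 3D points shed their mass without being forced into a correspondence, which matters here because many 3D points from the retrieved views are not visible in the query.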

Figure 6: 2D-3D Matching (shown by green lines) with the proposed DGC-GNN and GoMatch [56].

Appendix B Privacy Issue of RGB Points

We investigate the impact on privacy resulting from incorporating RGB information into pixels and points. To assess this, we compute the Structural Similarity Index Measure (SSIM) for 3D points reprojected onto the image plane against the ground truth (GT) images on MegaDepth over 500 images from multiple scenes. Additionally, we recover the images from points + RGB and points + descriptors with InvSFM [35] to calculate the SSIM against the GT images. The findings are detailed in Table 5 and Fig. 5. The addition of RGB data to points results in only a marginal increase in SSIM for both direct reprojection and image reconstruction via InvSFM, significantly less than what is achieved by incorporating SIFT descriptors. It is worth noting that denser point clouds might provide sufficient context, potentially leading to privacy concerns. However, in our setting, we mitigate this risk by limiting the number of keypoints from each database image to a maximum of 1024.
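For readers unfamiliar with SSIM, the sketch below computes a simplified, single-window variant over two grayscale images given as flat intensity lists. It is an illustration only: the standard measure averages the same expression over local windows, the text does not state which implementation or settings were used, and the constants 0.01 and 0.03 are the conventional defaults rather than values from the paper. The name `global_ssim` is hypothetical.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 255.0) -> float:
    """Single-window SSIM between two equally sized grayscale images (flat lists)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and of equal size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Conventional stabilizing constants (assumed, not taken from the paper).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

A value of 1 means the two images are structurally identical, so in the privacy setting a lower SSIM between a recovered image and its ground truth indicates that less of the scene can be reconstructed.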

Appendix C Additional Results

Qualitative Results. More visualizations of inlier matches provided by DGC-GNN and GoMatch on MegaDepth are shown in Fig. 6. DGC-GNN consistently finds more correct matches on multiple scenes, highlighting the effectiveness of the proposed method.

Methods Global C. Att. Color Ang. Cluster Reproj. AUC (%) Rotation (°) Translation
@1 / 5 / 10px (↑) Quantile@25 / 50 / 75% (↓)
GoMatch [56] (w/o OR) 14.47 / 17.95 / 23.42 1.29 / 11.85 / 33.60 0.11 / 1.18 / 3.58
GoMatch [56] 15.67 / 22.43 / 28.01 0.60 / 10.08 / 34.63 0.06 / 1.06 / 3.73
Variants G.Emb K-means 17.68 / 28.41 / 34.36 0.28 / 16.78 / 34.52 0.03 / 0.73 / 3.77
G.Label K-means 17.13 / 27.33 / 33.18 0.31 / 17.34 / 33.63 0.03 / 0.76 / 3.64
G.Emb K-means 18.10 / 30.64 / 37.07 0.24 / 14.48 / 34.30 0.03 / 0.63 / 3.51
G.Emb K-means 19.82 / 35.29 / 41.16 0.17 / 12.88 / 31.74 0.02 / 0.27 / 3.24
G.Emb Mean-shift 10.07 / 36.01 / 43.03 0.16 / 12.15 / 28.99 0.01 / 0.20 / 3.26
DGC-GNN (w/o OR) G.Emb K-means 18.56 / 30.79 / 37.03 0.22 / 14.85 / 30.07 0.02 / 0.47 / 3.10
DGC-GNN G.Emb K-means 10.20 / 37.64 / 44.04 0.15 / 11.53 / 27.93 0.01 / 0.15 / 3.00
Table 6: Additional Ablation Results. AUC scores thresholded at 1, 5, and 10 pixels on $k=1$; rotation and translation error quantiles at 25, 50, 75% with the proposed components added one by one to the GoMatch pipeline.
Figure 7: Qualitative Matching Results of Different Architectures. Panels: (a) GoMatch; (b) +G. Emb, +Cluster Attention; (c) +G. Emb, +Cluster Attention, +Color; (d) DGC-GNN. We visualize the number of inlier matches after the PnP-RANSAC with different architectures (shown by green lines).

Additional Ablation Results. In addition to the ablation results presented in the main paper, we also provide ablation results for single-view matching with $k=1$ on MegaDepth [24]. Furthermore, we conduct two additional ablations to investigate the impact of different component selections. First, we compare the geometric global embedding (G. Emb.) used in the main paper with a global clustering label embedding (G. Label): instead of encoding geometric cues, we encode the label of each global cluster and concatenate it to the local point feature. Second, we compare the K-Means and Mean-Shift clustering algorithms in our pipeline. Finally, we study the effectiveness of the outlier rejection (OR) network.

The results are presented in Table 6. We reach similar conclusions for each component as in the main paper. The global label embedding (G. Label) with cluster attention (C. Att.) performs worse than the geometric embedding (G. Emb.) alone, indicating the superiority of our clustering-based geometric embedding over the label embedding and highlighting the importance of incorporating geometric cues in the embedding process for effective point matching. Regarding the clustering algorithm, we observe only a minor difference between the K-Means and Mean-Shift results, suggesting that our approach is robust to this choice. The results also demonstrate that outlier rejection is an essential post-processing module for achieving good performance. In addition to the numerical results, we visualize the inlier matches (see Fig. 7) to provide further insight into the behavior and performance of the different architectures.
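The pose error statistics reported in these tables can be computed along the following lines. This is a sketch under stated assumptions: the rotation error is taken as the geodesic angle between the estimated and ground-truth rotation matrices, the translation error as the Euclidean distance, and the quantiles use linear interpolation between order statistics. The paper does not specify its exact evaluation code, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix3 = Sequence[Sequence[float]]


def rotation_error_deg(r_est: Matrix3, r_gt: Matrix3) -> float:
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) equals the sum of element-wise products.
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against round-off
    return math.degrees(math.acos(cos_angle))


def translation_error(t_est: Sequence[float], t_gt: Sequence[float]) -> float:
    """Euclidean distance between estimated and ground-truth positions."""
    return math.dist(t_est, t_gt)


def quantile(values: Sequence[float], q: float) -> float:
    """q-quantile (0 <= q <= 1) with linear interpolation between sorted samples."""
    if not values:
        raise ValueError("quantile of an empty sequence is undefined")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def error_quantiles(errors: Sequence[float]) -> List[float]:
    """Quantiles at 25, 50, and 75%, as reported in the tables."""
    return [quantile(errors, q) for q in (0.25, 0.5, 0.75)]
```

Reporting quantiles rather than a mean keeps the summary meaningful when a fraction of queries fail outright, since a few very large errors would otherwise dominate the average.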

Hyperparameter analysis. Besides the component ablations, we also analyze the hyperparameters used in our main pipeline. Here, we add ablations on the number of input keypoints, the number of cluster groups at the coarse level, the number of nearest neighbors used to build the local graph, and the outlier rejection threshold by retraining our DGC-GNN. The results are presented in Table 7. We observe that DGC-GNN with G. Clusters = 10 and Local NN = 10 achieves the best overall performance, and that setting the outlier rejection threshold to 0.7 leads to the best performance. However, the results are stable across different configurations, indicating robustness to the parameter setting.

Methods     | G. Clusters | Local NN | OR Threshold | Reproj. AUC (%) @1 / 5 / 10px (↑) | Rotation (°) Quantile@25 / 50 / 75% (↓) | Translation Quantile@25 / 50 / 75% (↓)
DGC-GNN     | 10          | 10       | 0.5          | 15.30 / 51.70 / 60.01             | 0.07 / 0.26 / 15.41                     | 0.01 / 0.02 / 0.57
HyperParam. | 5           | 10       | 0.5          | 14.73 / 50.12 / 58.26             | 0.08 / 0.28 / 18.76                     | 0.01 / 0.03 / 0.99
            | 15          | 10       | 0.5          | 15.14 / 50.56 / 58.62             | 0.07 / 0.28 / 17.66                     | 0.01 / 0.03 / 0.89
            | 10          | 20       | 0.5          | 14.77 / 49.84 / 57.97             | 0.07 / 0.29 / 18.26                     | 0.01 / 0.03 / 0.90
            | 10          | 30       | 0.5          | 14.75 / 50.95 / 59.45             | 0.08 / 0.28 / 15.48                     | 0.01 / 0.03 / 0.58
            | 10          | 10       | 0.3          | 13.28 / 46.44 / 55.05             | 0.08 / 0.43 / 18.63                     | 0.01 / 0.04 / 0.98
            | 10          | 10       | 0.7          | 16.63 / 56.26 / 64.46             | 0.07 / 0.19 / 12.58                     | 0.01 / 0.02 / 0.27

Table 7: Ablation Study on Hyperparameters. We report the results of ablations with $k=10$ retrieved images on the number of global clusters, the number of nearest-neighbour points used to build the local graph, and different thresholds for outlier rejection. The best results are in bold.

Matching Results in pixel threshold. As mentioned in the main paper, we selected the ground truth matches in normalized image coordinates. The described GT difference only affects the reprojection AUC scores. Here, we present the matching results in Table 5 by selecting the ground truth matches in pixel coordinates with a 1-pixel threshold, as done in [56]. Our conclusions still hold.

Appendix D Model Parameters and Timing

We discuss the model parameters and running time of DGC-GNN in this section. DGC-GNN, which incorporates global geometric embedding and local clustering attention, has around 5.7 million trainable parameters and an estimated model size of 22.6 MB. The average inference time for each image pair over the MegaDepth evaluation queries is 77.8 ms. It roughly breaks down into point encoding (24 ms), global geometric embedding (14 ms), cluster-based attention (22 ms), optimal transport (7 ms), and outlier rejection (8 ms). The measurements are conducted on a 32GB NVIDIA Tesla V100 GPU with a maximum of 1024 keypoints.
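As a sanity check on these figures, the short calculation below relates the parameter count to the model size and sums the per-stage timings. It assumes 32-bit floating-point weights and 1 MB = 10^6 bytes, neither of which is stated in the text; the variable and function names are illustrative.

```python
def model_size_mb(num_params: float, bytes_per_param: int = 4) -> float:
    """Storage in MB (10^6 bytes), assuming float32 weights by default."""
    return num_params * bytes_per_param / 1e6


# Per-stage inference times in milliseconds, as reported in the text.
stage_ms = {
    "point encoding": 24,
    "global geometric embedding": 14,
    "cluster-based attention": 22,
    "optimal transport": 7,
    "outlier rejection": 8,
}

size_from_rounded_count = model_size_mb(5_700_000)  # about 22.8 MB
params_from_reported_size = 22.6e6 / 4              # about 5.65 million parameters
stage_total_ms = sum(stage_ms.values())             # 75 ms
unattributed_ms = 77.8 - stage_total_ms             # about 2.8 ms
```

Under these assumptions, the reported 22.6 MB corresponds to roughly 5.65 million parameters, consistent with "around 5.7 million", and the listed stages account for 75 ms of the 77.8 ms average, consistent with the breakdown being approximate.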