DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching

Shuzhe Wang${}^{*}$, Juho Kannala${}^{*}$, Daniel Barath${}^{\dagger}$

${}^{*}$Department of Computer Science, Aalto University
${}^{\dagger}$Computer Vision and Geometry Group, ETH Zurich
shuzhe.wang@aalto.fi juho.kannala@aalto.fi dbarath@inf.ethz.ch

Abstract

Matching 2D keypoints in an image to a sparse 3D point cloud of the scene without requiring visual descriptors has garnered increased interest due to its low memory requirements, inherent privacy preservation, and reduced need for expensive 3D model maintenance compared to visual descriptor-based methods. However, existing algorithms often compromise on performance, resulting in a significant deterioration compared to their descriptor-based counterparts. In this paper, we introduce DGC-GNN, a novel algorithm that employs a global-to-local Graph Neural Network (GNN) that progressively exploits geometric and color cues to represent keypoints, thereby improving matching accuracy. Our procedure encodes both Euclidean and angular relations at a coarse level, forming the geometric embedding to guide the point matching. We evaluate DGC-GNN on both indoor and outdoor datasets, demonstrating that it not only doubles the accuracy of the state-of-the-art visual descriptor-free algorithm but also substantially narrows the performance gap between descriptor-based and descriptor-free methods. The code and trained models are available at: https://github.com/AaltoVision/DGC-GNN-release.

1 Introduction

Establishing 2D-3D matches plays a crucial role in various computer vision applications, including visual localization [20, 40, 42, 37, 53, 38], 3D reconstruction [48, 8, 43, 25], and Simultaneous Localization and Mapping (SLAM) [15, 31, 30]. Traditional methods for establishing point-to-point matches involve extracting keypoints and descriptors from a query image, then matching the 2D and 3D descriptors using exhaustive search. To circumvent the computationally expensive matching process, some approaches [20, 37] narrow the search space by employing image retrieval methods [33, 1] first to identify the most similar images in the database, and then perform descriptor-based image matching [27, 11, 14, 38, 45] between the query and retrieved images. The 2D-3D correspondences are subsequently established by connecting the 2D-2D image matches with the prebuilt 2D-3D correspondences in the database. Another approach [39] is to build 2D-to-3D matches by searching through all point descriptors with an efficient vocabulary-based method. Sattler et al. [40, 41] further explore the combination of both 2D-3D and 3D-2D search as an active correspondence search step for a faster and more efficient matching process.

Figure 1: 2D-3D matching (shown by green lines) with the proposed DGC-GNN and GoMatch [56]. In this example, DGC-GNN obtains 78 correct matches with a camera translation error of 0.02 meters and a rotation error of $0.24^{\circ}$, while GoMatch finds only 17 inliers with a pose error of 0.37 meters and $4.37^{\circ}$.

While descriptor-based algorithms achieve state-of-the-art accuracy, they store and maintain high-dimensional visual descriptors for each point in potentially large 3D point clouds. The stored model often requires orders of magnitude more storage than the point cloud and images alone [56]. These methods are susceptible to privacy attacks [13, 12, 7, 34] and necessitate computationally expensive model maintenance and descriptor update procedures [56] when incorporating new descriptors or points into the model. Several approaches have been proposed to address these limitations. Yang et al. [54] employ learned point selection to sample a subset of the point cloud for scene compression. Other methods [23, 3] directly learn a function that maps 2D pixels to 3D coordinates without explicitly storing the 3D scene. Additionally, [32] introduces an adversarial learning framework to develop content-concealing descriptors that prevent privacy leakage.

Recently, researchers [5, 26] have begun exploring deep learning techniques for cross-domain direct 2D-3D matching and pose estimation without visual descriptors, showcasing the potential of descriptor-free matching through differentiable geometric optimization. The recently proposed GoMatch [56] represents significant progress in descriptor-free 2D-3D keypoints matching, achieving reasonable matching performance on a variety of real-world datasets [24, 46, 22]. GoMatch first identifies keypoints in the query image, which, along with the 3D points from the model, are converted to bearing vectors in the camera coordinate system. The algorithm employs an attention mechanism [38, 52] to establish reliable 2D-3D correspondences effectively. While GoMatch attains reasonable accuracy, its performance still significantly lags behind its descriptor-based counterparts [38, 37, 41]. Additionally, it relies on geometric cues only from the points and their local neighbors, rendering it incapable of distinguishing geometrically similar structures.

These observations lead us to two critical questions: (1) Is geometry the only information we can utilize? (2) How can we leverage the geometric information derived from the points for matching? In practice, humans identify correspondences between objects by considering global structures and local geometric cues. For example, when matching an image to a point cloud as in Fig. 1, we first locate the building based on its unique structure and then identify the local structure of the roof for matching. Besides geometric cues, the visual context, such as the color information at each point, also provides constraints for 2D-3D matching. Importantly, this color information still preserves privacy, as the RGB data from sparse keypoints is insufficient to reconstruct the scene.

Building upon these observations and the groundwork set by GoMatch, we propose a novel graph-based pipeline, named DGC-GNN, which leverages geometric and color cues in a global-to-local manner for descriptor-free 2D-3D matching. DGC-GNN encodes position and RGB information for each point and extracts a global distance-angular embedding to guide local point matching. Taking inspiration from [45], we employ a cluster-based transformer to constrain information flow within local clusters. We observe, from real-world datasets, that DGC-GNN leads to substantial improvements in the number of correct matches and the accuracy of pose estimation. Notably, it doubles the accuracy of GoMatch, thereby reducing the gap between descriptor-based and descriptor-free methods. In summary, our paper makes the following contributions:

• We introduce a visual descriptor-free global-to-local GNN for direct 2D-3D keypoint matching. The network leverages multiple cues and incorporates a progressive clustering module to represent the keypoints. This pipeline enhances the accuracy of sparse 2D-3D matching while requiring little memory, preserving privacy, and keeping the cost of 3D model maintenance low.

• We demonstrate that color information for each point is crucial for 2D-3D matching. By incorporating RGB encoding into our network, we observe significant performance improvements.

• Extensive experiments on real-world datasets show that DGC-GNN outperforms previous methods by a large margin on both matching and visual localization tasks.

2 Visual Descriptor-Free 2D-3D Matching

2.1 Problem Formulation and Notation

We are given keypoints $\mathbf{P}=\{\mathbf{p}_{n}\in\mathbb{R}^{2}\;|\;n=1,\dots,N\}$ from a query image $I$ and a database 3D point cloud $\mathbf{Q}=\{\mathbf{q}_{m}\in\mathbb{R}^{3}\;|\;m=1,\dots,M\}$, where, optionally, each 3D point is associated with a visual descriptor $\mathbf{d}\in\mathbb{R}^{D}$. The task is to find a set $\mathcal{M}_{\mathbf{p,q}}$ of corresponding keypoints such that

$$\mathcal{M}_{\mathbf{p,q}}=\{(n,m)\;|\;\|\pi(\mathbf{q}_{m},\mathbf{R,t,K})-\mathbf{p}_{n}\|_{2}\leq\epsilon\}, \tag{1}$$

where $\pi(\cdot)$ is a mapping that projects a 3D point $\mathbf{q}_{m}$ from world coordinates to the image plane, represented by a camera rotation $\mathbf{R}\in\mathbb{R}^{3\times 3}$ , translation $\mathbf{t}\in\mathbb{R}^{3}$ , and intrinsic parameter matrix $\mathbf{K}\in\mathbb{R}^{3\times 3}$ . Parameter $\epsilon\in\mathbb{R}$ is the threshold specified in pixels. Additionally, we denote the color of point $\mathbf{p}_{n}$ as $\mathbf{c}_{n}=[r,g,b]^{\text{T}}\in[0,1]^{3}$ .

Bearing Vector. Similar to [56], we adopt bearing vectors as the keypoint representation for both the 2D and 3D points to alleviate their cross-domain nature and represent them in the same space. The bearing vector is the direction from the camera center to a 3D point in the camera coordinate system. Given an image, a 2D pixel $\mathbf{p}_{n}$ is lifted to a bearing vector as $[\mathbf{b}_{\mathbf{p},n},1]^{\text{T}}=\mathbf{K}^{-1}[\mathbf{p}_{n},1]^{\text{T}}$ with $\mathbf{b}_{\mathbf{p},n}\in\mathbb{R}^{2}$, where $\mathbf{K}$ is the intrinsic camera matrix. Given a 3D point $\mathbf{q}_{m}$, the corresponding bearing vector is

$$[\mathbf{b}_{\mathbf{q},m},1]^{\text{T}}=\frac{\mathbf{R}\mathbf{q}_{m}+\mathbf{t}}{[\mathbf{R}\mathbf{q}_{m}+\mathbf{t}]_{z}}, \tag{2}$$

where $\mathbf{R}$ is the camera rotation and $\mathbf{t}$ is its translation in the world coordinate system and subscript $z$ denotes the third component of the 3D vector.
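The two liftings above can be illustrated with a minimal sketch. It assumes a pinhole intrinsic matrix with zero skew, so that $\mathbf{K}^{-1}$ can be applied in closed form; the function names and the numeric values are hypothetical and are not taken from the released code.

```python
def pixel_to_bearing(p, K):
    """Lift a 2D pixel (u, v) to a 2D bearing vector via K^{-1} [u, v, 1]^T.

    Assumption of this sketch: K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    u, v = p
    return ((u - cx) / fx, (v - cy) / fy)


def point_to_bearing(q, R, t):
    """Map a 3D world point to a 2D bearing vector as in Eq. 2."""
    # Transform into the camera frame: q_cam = R q + t.
    q_cam = [sum(R[i][j] * q[j] for j in range(3)) + t[i] for i in range(3)]
    # Divide by the third (z) component so that it becomes 1.
    z = q_cam[2]
    return (q_cam[0] / z, q_cam[1] / z)


# Hypothetical camera: focal length 500 px, principal point (320, 240).
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

b_q = point_to_bearing([1.0, 0.5, 2.0], R, t)  # 3D point seen by the camera
b_p = pixel_to_bearing((570.0, 365.0), K)      # pixel where that point projects
```

In this toy setup the pixel is the projection of the 3D point, so both representations coincide, which is what makes 2D and 3D points comparable in the same space.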

2.2 Network Architecture

The proposed DGC-GNN applies a hierarchical mechanism to leverage color and geometric cues in a global-to-local fashion. The overall pipeline is illustrated in Fig. 2. We initially employ two local feature extractors to encode RGB and position information for each point simultaneously (Sec. 2.2.1). Additionally, we cluster the points based on their distances and generate global graphs to obtain the global-level geometric embeddings (Sec. 2.2.2). Next, we concatenate the local point features with their corresponding global features and input them into the cluster-based local matching module to identify the initial matches (Sec. 2.2.3). Finally, we incorporate a classification network that refines the initial matches by filtering out those with low confidence (Sec. 2.3).

2.2.1 Local Feature Extraction

To extract point-wise features from both the 2D keypoint set $\mathbf{P}$ and the 3D point cloud $\mathbf{Q}$, we consider the inputs as bearing vectors equipped with color information: $\mathcal{P}=\{\mathbf{b_{p},c_{p}}\}$ and $\mathcal{Q}=\{\mathbf{b_{q},c_{q}}\}$. Two ResNet-style point encoders [18, 5], denoted as $\mathcal{F}_{b}$ and $\mathcal{F}_{c}$, are applied to extract position and color embeddings separately. We then obtain the local point features, $\mathbf{f_{p}}$ and $\mathbf{f_{q}}$, as follows:

$$\mathbf{f_{p}}=\mathcal{F}_{b}(\mathbf{b_{p}})+\mathcal{F}_{c}(\mathbf{c_{p}}),\quad\mathbf{f_{q}}=\mathcal{F}_{b}(\mathbf{b_{q}})+\mathcal{F}_{c}(\mathbf{c_{q}}). \tag{3}$$

The resulting point-wise features are matrices $\mathbf{f_{p}}\in\mathbb{R}^{N\times d}$ and $\mathbf{f_{q}}\in\mathbb{R}^{M\times d}$, where $N$ and $M$ are the numbers of points in $\mathbf{P}$ and $\mathbf{Q}$, respectively, and $d$ denotes the dimensionality of the encoded features, e.g., $d=128$.

2.2.2 Global Geometric Guidance

Global context guidance has demonstrated its effectiveness in various computer vision tasks [49, 23, 55, 36]. Global context helps to differentiate local descriptors from similar structures or patches, thereby reducing ambiguity. However, most existing methods [49, 36] consider the outputs from different encoding layers as global and local features. This approach is not suitable for our scenario, as our input consists of sparse points: downsampling a sparse point cloud loses distinctive geometric structures. Hence, we adopt cluster-based geometric encoding to extract global embeddings. As shown in Fig. 3 (a) and (c), the input bearing vectors, both in the image and in the point cloud, are first clustered into $X$ groups. Each group is associated with a cluster center as its global position, denoted by $\mathbf{\hat{b}}_{\mathbf{p},x}\in\mathbb{R}^{2},\ x=1,\dots,X$. The corresponding global embedding is obtained as the average of the point embeddings within the cluster, $\mathbf{\hat{f}}_{\mathbf{p},x}=\frac{1}{P^{\prime}}\sum_{p^{\prime}=1}^{P^{\prime}}\mathbf{f}_{\mathbf{p},p^{\prime}}$, where $\mathbf{\hat{f}}_{\mathbf{p},x}\in\mathbb{R}^{d}$ and $P^{\prime}$ is the number of points in cluster $x$. The same is done for the 3D points to obtain $\mathbf{\hat{b}}_{\mathbf{q}}$ and $\mathbf{\hat{f}}_{\mathbf{q}}$.
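The cluster-level averaging described above can be sketched as follows. This is only an illustration under stated assumptions: the cluster labels are taken as given (the clustering algorithm itself is not part of the sketch), and the cluster center is assumed to be the mean of the member bearing vectors. The function name and the toy data are hypothetical.

```python
def cluster_summaries(bearings, features, labels, num_clusters):
    """Return per-cluster centers (mean bearing) and embeddings (mean feature)."""
    dim_b, dim_f = len(bearings[0]), len(features[0])
    centers = [[0.0] * dim_b for _ in range(num_clusters)]
    embeds = [[0.0] * dim_f for _ in range(num_clusters)]
    counts = [0] * num_clusters

    # Accumulate sums of bearings and point features per cluster.
    for b, f, x in zip(bearings, features, labels):
        counts[x] += 1
        for i in range(dim_b):
            centers[x][i] += b[i]
        for i in range(dim_f):
            embeds[x][i] += f[i]

    # Divide by the number of points P' in each non-empty cluster.
    for x in range(num_clusters):
        if counts[x] > 0:
            centers[x] = [v / counts[x] for v in centers[x]]
            embeds[x] = [v / counts[x] for v in embeds[x]]
    return centers, embeds


# Toy input: three points with 2-D features, assigned to X = 2 clusters.
bearings = [(0.0, 0.0), (0.5, 0.0), (1.0, 1.0)]
features = [[1.0, 3.0], [3.0, 5.0], [10.0, 20.0]]
labels = [0, 0, 1]
centers, embeds = cluster_summaries(bearings, features, labels, num_clusters=2)
```

The same routine would be applied once to the 2D keypoints and once to the 3D points, since both are represented as bearing vectors.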

Global Geometric Graph. To aggregate and extract the geometric relations among the clusters, we propose a novel graph neural network that encodes both distance and angular cues; the basic GNN structure is built upon [19, 56]. In the following, we describe the graph construction for the 2D global point set $\mathcal{\hat{P}}=\{\mathbf{\hat{b}_{p}},\mathbf{\hat{f}_{p}}\}$; the same procedure applies to $\mathcal{\hat{Q}}=\{\mathbf{\hat{b}_{q}},\mathbf{\hat{f}_{q}}\}$. Each cluster center point $\mathbf{\hat{b}}_{\mathbf{p},x}$ is connected to its $k$ nearest neighbours ($k\leq K$) in the coordinate space, and $\xi_{\mathbf{p},(x,y)}$ is the edge between center points $\mathbf{\hat{b}}_{\mathbf{p},x}$ and $\mathbf{\hat{b}}_{\mathbf{p},y}$. We update the feature $\mathbf{\hat{f}}_{\mathbf{p},x}$ using the following equation:

$${}^{(t+1)}\mathbf{\hat{f}}_{\mathbf{p},x}=\max_{\xi_{\mathbf{p},(x,y)}}\mathcal{H}_{g1}\left({}^{(t)}\mathbf{\hat{f}}_{\mathbf{p},x}\oplus\left({}^{(t)}\mathbf{\hat{f}}_{\mathbf{p},x}-{}^{(t)}\mathbf{\hat{f}}_{\mathbf{p},y}\right)\right), \tag{4}$$

where $\oplus$ denotes concatenation and $\mathcal{H}_{g1}(\cdot)$ is a linear projection with instance normalization [51] and a LeakyReLU function [28]. The $\max$ operator is taken over the $k$ nearest neighbors. The global feature $\mathbf{\hat{f}}_{\mathbf{p},x}$ is updated twice, and the result is calculated as

$$\mathbf{\hat{f}}_{\mathbf{p},x}^{g}=\mathcal{H}_{g2}\left({}^{(0)}\mathbf{\hat{f}}_{\mathbf{p},x}\oplus{}^{(1)}\mathbf{\hat{f}}_{\mathbf{p},x}\oplus{}^{(2)}\mathbf{\hat{f}}_{\mathbf{p},x}\right). \tag{5}$$

$\mathcal{H}_{g2}$ has a similar structure to $\mathcal{H}_{g1}$, but without shared weights. Besides the distance embedding, inspired by [36], we also adopt an angular embedding to obtain rotation-invariant geometric cues for the global representation. To do so, we define the embedding on cluster triplets as shown in Fig. 3 (b). Given bearing vector $\mathbf{\hat{b}}_{\mathbf{p},x}$ and two of its neighbors $\mathbf{\hat{b}}_{\mathbf{p},y}$ and $\mathbf{\hat{b}}_{\mathbf{p},z}$, the angular embedding of $\langle\mathbf{\hat{b}}_{\mathbf{p},x},\mathbf{\hat{b}}_{\mathbf{p},y}\rangle$ w.r.t. $\mathbf{\hat{b}}_{\mathbf{p},z}$ is defined as follows:

$$\mathbf{A}_{x,y}^{z}=\operatorname{sine}\left(\angle\left(\mathbf{\hat{b}}_{\mathbf{p},z}-\mathbf{\hat{b}}_{\mathbf{p},x},\ \mathbf{\hat{b}}_{\mathbf{p},y}-\mathbf{\hat{b}}_{\mathbf{p},x}\right)/\sigma_{a}\right), \tag{6}$$

where $\operatorname{sine}(\cdot)$ is a sinusoidal function and $\sigma_{a}$ is a constant controlling the scale of the angle. All $k$ neighbours are considered to obtain the angular embedding $\mathbf{A}_{\mathbf{p}}$. We update the global geometric embedding $\mathbf{\hat{f}}_{\mathbf{p}}^{gg}$ with an angular-aware attention mechanism:

$$\mathbf{\hat{f}}_{\mathbf{p}}^{gg}=\operatorname{norm}\left(\mathbf{\hat{f}}_{\mathbf{p}}^{g}+\operatorname{Att}(\mathbf{\hat{f}}_{\mathbf{p}}^{g},\mathbf{A}_{\mathbf{p}})\right);\quad\mathbf{\hat{f}}_{\mathbf{p}}^{gg}\in\mathbb{R}^{K\times d}, \tag{7}$$

where

$$\operatorname{Att}(\mathbf{\hat{f}}_{\mathbf{p}}^{g},\mathbf{A}_{\mathbf{p}})=(\mathbf{\hat{f}}_{\mathbf{p}}^{g}\mathbf{W^{V}})\cdot\frac{(\mathbf{A}_{\mathbf{p}}\mathbf{W^{A}})(\mathbf{\hat{f}}_{\mathbf{p}}^{g}\mathbf{W^{Q}})^{\text{T}}+(\mathbf{\hat{f}}_{\mathbf{p}}^{g}\mathbf{W^{Q}})(\mathbf{\hat{f}}_{\mathbf{p}}^{g}\mathbf{W^{K}})^{\text{T}}}{\sqrt{dim}}.$$

$\mathbf{W^{A}},\mathbf{W^{Q}},\mathbf{W^{K}},\mathbf{W^{V}}\in\mathbb{R}^{d\times d}$ are the projection matrices of the respective terms, and $\operatorname{norm}$ in Eq. 7 denotes LayerNorm [2]. Each local point feature is associated with its corresponding global embedding by $\mathbf{\widetilde{f}}_{\mathbf{p}}=\mathbf{f}_{\mathbf{p}}\oplus\mathbf{\hat{f}}_{\mathbf{p}}^{gg},\ \mathbf{\widetilde{f}}_{\mathbf{p}}\in\mathbb{R}^{N\times 2d}$ to obtain $\mathcal{\widetilde{P}}=\{\mathbf{b}_{\mathbf{p}},\mathbf{\widetilde{f}}_{\mathbf{p}}\}$. The same procedure yields the local and global embedding $\mathcal{\widetilde{Q}}=\{\mathbf{b}_{\mathbf{q}},\mathbf{\widetilde{f}}_{\mathbf{q}}\}$ for the point cloud $\mathcal{\hat{Q}}$.
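To make the angular term of Eq. 6 concrete, the sketch below computes the angle at a cluster center between two of its neighbors and encodes it. The text only states that $\operatorname{sine}(\cdot)$ is a sinusoidal function, so the transformer-style sin/cos encoding used here, the value of `sigma_a`, and the embedding size are all assumptions of this illustration, not the authors' settings.

```python
import math


def angle_between(bx, by, bz):
    """Unsigned angle (radians) at b_x between the directions to b_z and b_y."""
    v1 = (bz[0] - bx[0], bz[1] - bx[1])
    v2 = (by[0] - bx[0], by[1] - bx[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.atan2(abs(cross), dot)


def sinusoidal_embedding(value, dim):
    """Assumed form: interleaved sin/cos at geometrically spaced frequencies."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000.0 ** (2.0 * i / dim))
        emb.append(math.sin(value * freq))
        emb.append(math.cos(value * freq))
    return emb


def angular_embedding(bx, by, bz, sigma_a=0.5, dim=8):
    """Eq. 6 under the assumptions above; sigma_a and dim are hypothetical."""
    return sinusoidal_embedding(angle_between(bx, by, bz) / sigma_a, dim)


# Right angle at the origin between the x-axis and y-axis neighbors.
theta = angle_between((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
emb = angular_embedding((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
```

Because the angle depends only on difference vectors between cluster centers, it is unchanged when all centers are rotated together in the plane, which is the rotation invariance the angular embedding is meant to provide.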

2.2.3 Cluster-based Local Matching

After extracting the global geometric embedding, we implement a cluster-based matching module to obtain the initial cross-domain 2D-3D matches. This cluster-based GNN [45] has been shown to be more computationally efficient than its complete-graph counterpart [38]. The network treats the local point features from both $\mathcal{\widetilde{P}}$ and $\mathcal{\widetilde{Q}}$ as a single complete set, clusters strongly correlated features into the same group, and restricts message passing to within each group. In addition to its low computational complexity, we found that the cluster GNN can effectively utilize global-to-local geometric cues, as the clustering operation inherits the property of global graph clustering and forces the network to distinguish ambiguous local features even when their global embeddings are similar.

Methods		ScanNet [10]				MegaDepth [24]
		Reproj. AUC (%)	Rotation (${}^{\circ}$)	Translation (m)	P (%) ($\uparrow$)	Reproj. AUC (%)	Rotation (${}^{\circ}$)	Translation	P (%) ($\uparrow$)
		@1 / 5 / 10px ($\uparrow$)	Quantile @25 / 50 / 75% ($\downarrow$)	Quantile @25 / 50 / 75% ($\downarrow$)		@1 / 5 / 10px ($\uparrow$)	Quantile @25 / 50 / 75% ($\downarrow$)	Quantile @25 / 50 / 75% ($\downarrow$)
k=1	Oracle	29.13 / 39.83 / 41.34	110.10 / 110.19 / 110.40	0.01 / 0.01 / 0.03	-	34.59 / 85.02 / 92.02	10.04 / 10.06 / 10.12	0.00 / 0.01 / 0.01	-
	BPnPNet [5]	10.00 / 10.00 / 10.02	199.17 / 128.90 / 154.68	4.35 / 6.82 / 9.86	13.60	10.22 / 10.63 / 10.89	16.13 / 32.01 / 61.58	1.67 / 3.17 / 5.44	12.95
	GoMatch [56]	11.18 / 11.23 / 18.01	112.69 / 112.78 / 136.50	0.19 / 0.91 / 2.63	13.18	5.67 / 22.43 / 28.01	10.60 / 10.08 / 34.63	0.06 / 1.06 / 3.73	14.94
	DGC-GNN	12.73 / 21.88 / 32.23	110.94 / 113.17 / 120.14	0.06 / 0.23 / 1.40	14.86	10.20 / 37.64 / 44.04	10.15 / 11.53 / 27.93	0.01 / 0.15 / 3.00	19.00
k=10	BPnPNet [5]	10.00 / 10.00 / 10.03	104.68 / 135.94 / 160.54	4.67 / 7.30 / 10.92	10.84	10.36 / 10.72 / 10.97	16.63 / 34.69 / 67.77	1.64 / 3.30 / 5.97	10.74
	GoMatch [56]	10.91 / 18.98 / 31.12	111.18 / 114.94 / 128.97	10.08 / 0.35 / 2.08	14.25	8.90 / 35.67 / 44.99	0.18 / 1.29 / 16.65	0.02 / 0.12 / 1.92	18.76
	DGC-GNN	11.76 / 31.74 / 48.11	110.67 / 111.49 / 117.62	10.04 / 0.11 / 0.53	16.42	15.30 / 51.70 / 60.01	0.07 / 0.26 / 15.41	0.01 / 0.02 / 0.57	13.36

Table 1: 2D-3D Matching. We present AUC scores for reprojection error thresholds of 1, 5, and 10 pixels; rotation and translation error quantiles at 25, 50, and 75%; and matching precision. Parameter $k$ is the number of images retrieved from the database to narrow down the search space. The best results are bold. DGC-GNN nearly doubles the AUC scores of GoMatch and reduces the pose errors to ${\approx}33\%$ of those of GoMatch.

Initialization. As an initialization procedure for the cluster attention module, we run the general self- and cross-attention modules proposed in GoMatch [56]. For each local point $\mathbf{b}_{\mathbf{p},n}$, we construct a local graph from its $k^{\prime}$ nearest neighbours in Euclidean space and update the associated feature $\mathbf{\widetilde{f}}_{\mathbf{p},n}\in\mathbb{R}^{2d}$ by Eq. 5. Note that we ignore the angular embedding at this stage because its memory requirement, with space complexity $\mathcal{O}(Nk^{\prime 2})$ where $N$ is the number of local points, is unaffordable. We then use linear attention [21, 49] as the cross-attention mechanism, which allows each point in one modality to interact with all points from the other modality. This not only facilitates inter-modality interaction during feature matching but also reduces the computational complexity from $\mathcal{O}(N^{2})$ to $\mathcal{O}(N)$.

Cluster-based Attention. After the graph initialization, the features $\mathbf{\widetilde{f}}_{\mathbf{p}}$ and $\mathbf{\widetilde{f}}_{\mathbf{q}}$ coming from the image and the point cloud respectively, are concatenated and processed in a two-level hierarchical clustering attention module. The hierarchical structure is effective in suppressing erroneous groupings. At the first level, we cluster the feature vectors into $I$ coarse groups. In the second level, each coarse group is divided into several small groups. The local point information exchange is conducted at each level and only within each group to obtain more representative features. After the sparse clustering, each feature vector is transformed back to its original position and then split again into $\mathbf{\widetilde{f}^{\prime}}_{\mathbf{p}}$ and $\mathbf{\widetilde{f}^{\prime}}_{\mathbf{q}}$ to obtain the keypoints both in the 2D and 3D spaces.

Optimal Transport. We calculate the cost matrix $\mathcal{M}\in\mathbb{R}^{N\times M}$ between the two transformed feature sets using the $L_{2}$ distance between pairs of features, i.e., $\mathcal{M}(n,m)=\|\mathbf{\widetilde{f}^{\prime}}_{\mathbf{p},n}-\mathbf{\widetilde{f}^{\prime}}_{\mathbf{q},m}\|_{2}$. Following [38], the cost matrix $\mathcal{M}$ is extended to $\mathcal{\bar{M}}$ by adding a row and a column as dustbins for unmatched points. We then iteratively optimize $\mathcal{\bar{M}}$ by running the Sinkhorn algorithm [47, 9] in a declarative layer to obtain the score matrix $\mathcal{\bar{S}}$. Finally, $\mathcal{\bar{S}}$ is converted to $\mathcal{S}\in\mathbb{R}^{N\times M}$ by dropping the dustbins. The initial 2D-3D match candidates are acquired by mutual top-1 search, thus

$$\mathcal{M}_{init}=\{(\widetilde{n},\widetilde{m})\ |\ \forall(\widetilde{n},\widetilde{m})\in\text{MNN}(\mathcal{S})\}, \tag{8}$$

where MNN is the mutual nearest neighbors operator. Set $\mathcal{M}_{init}$ provides initial 2D-3D matches that we further filter in Sec. 2.3 to keep the accurate correspondences only.
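The mutual top-1 search of Eq. 8 can be illustrated with a small sketch. It assumes that the score matrix $\mathcal{S}$ has already been computed and that a larger score means a better match; the function name and the toy matrix are hypothetical.

```python
def mutual_nearest_neighbors(S):
    """Return (n, m) pairs that are each other's top-1 entry in score matrix S."""
    num_rows, num_cols = len(S), len(S[0])
    # Best 3D point (column) for every 2D keypoint (row).
    best_col = [max(range(num_cols), key=lambda m: S[n][m]) for n in range(num_rows)]
    # Best 2D keypoint (row) for every 3D point (column).
    best_row = [max(range(num_rows), key=lambda n: S[n][m]) for m in range(num_cols)]
    # Keep a pair only if the choice is mutual.
    return [(n, m) for n, m in enumerate(best_col) if best_row[m] == n]


# Toy 3 x 3 score matrix: rows are 2D keypoints, columns are 3D points.
S = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.3, 0.7],
]
matches = mutual_nearest_neighbors(S)
```

In this example the second keypoint also prefers the first 3D point, but that 3D point prefers the first keypoint, so the second keypoint is left unmatched.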

2.3 Outlier Rejection

After obtaining the initial matches, an outlier pruning step removes the incorrect ones. We apply the same outlier rejection network as in GoMatch [56], which takes as input the concatenated 2D and 3D keypoint features $\mathbf{\widetilde{f}^{\prime}}_{\widetilde{n},\widetilde{m}}=\mathbf{\widetilde{f}^{\prime}}_{\mathbf{p},\widetilde{n}}\oplus\mathbf{\widetilde{f}^{\prime}}_{\mathbf{q},\widetilde{m}}$ and outputs the matching confidence of each matched pair. The final predicted matches are obtained as follows:

$$\mathcal{M}_{final}=\{(\widetilde{n}^{\prime},\widetilde{m}^{\prime})\ |\ \forall\ \text{cls}(\mathbf{\widetilde{f}^{\prime}}_{\widetilde{n},\widetilde{m}}\;|\;(\widetilde{n},\widetilde{m})\in\mathcal{M}_{init})\geq\theta\}, \tag{9}$$

where $\theta$ is the matching confidence threshold.

2.4 Training Loss

We use the same training loss as GoMatch. The loss function $\mathcal{L}$ consists of two terms, the matching loss $\mathcal{L}_{ot}$ and the classification loss $\mathcal{L}_{or}$ . The ground truth match set $\mathcal{M}_{gt}$ is estimated by reprojecting the 3D points to the 2D image plane and calculating the pixel distance. We also include point sets $\mathcal{I}$ and $\mathcal{J}$ for the unmatched points in $\mathcal{P}$ and $\mathcal{Q}$ , respectively. The matching loss $\mathcal{L}_{ot}$ minimizes the negative log-likelihood of the matching score $\mathcal{\bar{S}}$ .

$$\mathcal{L}_{ot}=-\frac{1}{|\mathcal{M}_{gt}|+|\mathcal{I}|+|\mathcal{J}|}\Big(\sum_{(n,m)\in\mathcal{M}_{gt}}\log\mathcal{\bar{S}}_{n,m}+\sum_{i\in\mathcal{I}}\log\mathcal{\bar{S}}_{i,M+1}+\sum_{j\in\mathcal{J}}\log\mathcal{\bar{S}}_{N+1,j}\Big). \tag{10}$$

The classification loss is defined as

$$\mathcal{L}_{or}=-\frac{1}{|\mathcal{M}_{init}|}\sum_{i=1}^{|\mathcal{M}_{init}|}w_{i}\left(y_{i}\log(p_{i})+(1-y_{i})\log(1-p_{i})\right), \tag{11}$$

where $w_{i}$ is the weight balancing the positive and negative samples, $y_{i}$ is the ground truth matching label of the $i$-th correspondence, and $p_{i}$ is the predicted probability that the $i$-th correspondence is a true match. The total loss is the sum of the two terms, $\mathcal{L}=\mathcal{L}_{ot}+\mathcal{L}_{or}$.
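The two loss terms can be written out as a short sketch. It uses plain Python lists with 0-based indices, so the dustbins are the last row and last column of $\mathcal{\bar{S}}$, and it assumes that all entries of $\mathcal{\bar{S}}$ and all probabilities lie strictly between 0 and 1. Function names and numbers are hypothetical.

```python
import math


def matching_loss(S_bar, gt_matches, unmatched_2d, unmatched_3d):
    """Eq. 10: mean negative log-score over matches and dustbin assignments."""
    dustbin_row = len(S_bar) - 1      # index N in 0-based terms
    dustbin_col = len(S_bar[0]) - 1   # index M in 0-based terms
    total = 0.0
    for n, m in gt_matches:
        total += math.log(S_bar[n][m])
    for i in unmatched_2d:            # unmatched 2D keypoints go to the column dustbin
        total += math.log(S_bar[i][dustbin_col])
    for j in unmatched_3d:            # unmatched 3D points go to the row dustbin
        total += math.log(S_bar[dustbin_row][j])
    count = len(gt_matches) + len(unmatched_2d) + len(unmatched_3d)
    return -total / count


def classification_loss(probs, labels, weights):
    """Eq. 11: weighted binary cross-entropy over the initial matches."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        total += w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return -total / len(probs)


# Toy example with N = 2 keypoints and M = 2 points (3 x 3 with dustbins).
S_bar = [
    [0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50],
    [0.25, 0.50, 0.25],
]
l_ot = matching_loss(S_bar, gt_matches=[(0, 0)], unmatched_2d=[1], unmatched_3d=[1])
l_or = classification_loss(probs=[0.5, 0.5], labels=[1, 0], weights=[1.0, 1.0])
total_loss = l_ot + l_or
```

Every selected entry in the toy example equals 0.5, so each term evaluates to $\log 2$.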

3 Experiments

Training. We train the indoor model of DGC-GNN on the ScanNet [10] dataset and the outdoor model on the MegaDepth [24] dataset. We extract up to 1024 keypoints for each training image with the SIFT detector [27]. As in GoMatch, we first select a subset of the point cloud by applying image retrieval approaches [1, 50] to obtain database images that potentially observe the same part of the scene as the input one. We randomly sample the retrieval pairs with a visual overlap of more than 35% on MegaDepth and 65% on ScanNet to ensure enough matches on each pair. For the global geometric embedding, we cluster the 2D/3D bearing vectors into $X=10$ groups, and each cluster center is connected to its $k=4$ nearest neighbors to build the global graph. For the local point graph, we connect each point with its 10 nearest neighbors, and the cluster-based attention is performed twice to enforce intra-cluster information exchange.

We use the Adam optimizer with a learning rate of 1e-3. We train DGC-GNN on one 32GB Tesla V100 GPU. Convergence typically requires 50 epochs.

Datasets. We use ScanNet and MegaDepth for training and for evaluating the 2D-3D matching task. As a downstream application, we perform visual localization on the 7Scenes [46] and Cambridge Landmarks [22] datasets. MegaDepth is a popular outdoor dataset with 196 scenes captured around the world. The sparse 3D reconstructions are provided by the COLMAP [43] structure-from-motion software. Following [56], we train our outdoor model on 99 scenes and evaluate it on 53 scenes. ScanNet is a large-scale RGB-D indoor dataset comprising 1613 scans with over 2.5 million images. We randomly select 105 scenes for training and 30 for evaluation. Cambridge Landmarks is a medium-scale outdoor dataset consisting of 6 individual scenes. A structure-from-motion algorithm provides the ground truth camera poses. We follow [22, 56] and evaluate our method on four scenes. 7Scenes is a small indoor dataset with RGB-D images and camera poses provided by a depth SLAM system. We evaluate on the standard test sequences.

Evaluation Protocol. For matching on ScanNet and MegaDepth, we follow [56] and report the AUC score calculated from the reprojection errors. To calculate the errors for the 2D-3D matches in $\mathcal{M}_{final}$ , we project the 3D points to the image plane using the ground truth and estimated camera poses. Then, we calculate the $L_{2}$ distance of the ground truth and estimated reprojected 2D points. We use multiple thresholds, 1, 5, and 10 pixels, to evaluate the AUC scores. The camera translation and rotation error quantiles at 25%, 50%, and 75% are also reported. Moreover, we evaluate the matching quality by calculating the matching precision P, which is the ratio of inlier matches after PnP-RANSAC to the number of final matches $\mathcal{M}_{final}$ . For visual localization tasks, we report the median translation (in meters) and rotation (in degrees) camera pose errors.
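The reprojection error and the matching precision described above can be sketched as follows. The sketch assumes a zero-skew pinhole camera and represents a pose as a pair `(R, t)`; how the per-point errors are aggregated into the AUC score is not specified here and is therefore left out. All names and values are hypothetical.

```python
import math


def project(q, R, t, K):
    """Project a 3D world point to pixel coordinates (zero-skew pinhole model)."""
    q_cam = [sum(R[i][j] * q[j] for j in range(3)) + t[i] for i in range(3)]
    x, y = q_cam[0] / q_cam[2], q_cam[1] / q_cam[2]
    return (K[0][0] * x + K[0][2], K[1][1] * y + K[1][2])


def reprojection_errors(points_3d, pose_gt, pose_est, K):
    """L2 pixel distance between projections under the GT and estimated poses."""
    errors = []
    for q in points_3d:
        u_gt, v_gt = project(q, pose_gt[0], pose_gt[1], K)
        u_est, v_est = project(q, pose_est[0], pose_est[1], K)
        errors.append(math.hypot(u_gt - u_est, v_gt - v_est))
    return errors


def matching_precision(num_inliers, num_final_matches):
    """P (%): PnP-RANSAC inliers divided by the number of final matches."""
    if num_final_matches == 0:
        return 0.0
    return 100.0 * num_inliers / num_final_matches


K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pose_gt = (I3, [0.0, 0.0, 0.0])
pose_est = (I3, [0.01, 0.0, 0.0])   # estimated pose is off by 1 cm along x

errors = reprojection_errors([[0.0, 0.0, 2.0]], pose_gt, pose_est, K)
precision = matching_precision(num_inliers=19, num_final_matches=100)
```

With a focal length of 500 pixels, a 1 cm translation error at 2 m depth shifts the projection by about 2.5 pixels, which would count as correct at the 5 and 10 pixel thresholds but not at 1 pixel.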

3.1 2D-3D Matching

We compare with the two descriptor-free matchers GoMatch [56] and BPnPNet [5]. At inference, we use the 3D points from the top-$k$ retrieved database images to match with the keypoints from query images. Following [56], we report the upper bound of the AUC score using the ground truth matches. We refer to these values as Oracle. We select the GT matches by thresholding the reprojection error based on normalized image coordinates, using a threshold of $0.001$, to bypass the influence of camera intrinsics during GT selection. This is in contrast to what is done in [56]. Results on GT selected by a pixel threshold are in the supplementary material. We use the official code with the default setting to generate the evaluation dataset on MegaDepth [24] and rerun GoMatch and BPnPNet with the released models. Note that we also tested GoMatch after retraining it on MegaDepth and achieved similar results as with the released model.

Methods	G. Emb.	C. Att.	Color	Ang.	Reproj. AUC (%)	Rotation (${}^{\circ}$)	Translation
	Sec. 2.2.2	Sec. 2.2.3	Sec. 2.2.1	Sec. 2.2.2	@1 / 5 / 10px ($\uparrow$)	Quantile @25 / 50 / 75% ($\downarrow$)	Quantile @25 / 50 / 75% ($\downarrow$)
GoMatch [56]					8.90 / 35.67 / 44.99	0.18 / 1.29 / 16.65	0.02 / 0.12 / 1.92
Variants	✓				10.86 / 41.18 / 50.51	0.13 / 0.76 / 13.47	0.01 / 0.07 / 1.62
	✓	✓			11.64 / 44.46 / 53.99	0.11 / 0.55 / 19.49	0.01 / 0.05 / 1.05
		✓	✓		13.20 / 46.33 / 54.34	0.09 / 0.41 / 19.98	0.01 / 0.03 / 1.19
	✓	✓	✓		14.19 / 48.34 / 56.54	0.08 / 0.34 / 19.23	0.01 / 0.03 / 1.03
DGC-GNN	✓	✓	✓	✓	15.30 / 51.70 / 60.01	0.07 / 0.26 / 15.41	0.01 / 0.02 / 0.57

Table 2: Ablation Study. AUC scores thresholded at 1, 5, and 10 pixels; rotation and translation error quantiles at 25, 50, 75% with the proposed components added one by one to the GoMatch pipeline on the MegaDepth dataset.

Matching Results. The results with $k=1$ and $k=10$ are presented in Table 1. Parameter $k$ is the number of retrieved image pairs that are used for evaluation. The proposed method outperforms GoMatch and BPnPNet by a significant margin on both datasets. Specifically, DGC-GNN achieves 10.20 / 37.64 / 44.04% reprojection AUC compared to GoMatch with 5.67 / 22.43 / 28.01% on MegaDepth with $k=1$. DGC-GNN halves the rotation and translation errors of GoMatch at all quantiles and obtains better matching quality. Notably, the performance of DGC-GNN with $k=1$ surpasses that of GoMatch with $k=10$, indicating the effectiveness of our method even with a single view.

Sensitivity to Outliers. To evaluate the sensitivity to keypoint outliers, we follow the procedure in GoMatch [56]. The outliers are controlled by the outlier ratio, ranging from 0 to 1, calculated as the number of unmatched keypoints divided by the maximum of the numbers of 2D and 3D points. If the outlier ratio is $0$ , all the input 2D and 3D points are selected from the ground truth matches, and no outliers are included in the matching process. When it is $1$ , we directly use the keypoints from the query image and 3D points from the top- $k$ retrieved images without any filtering or outlier removal. The results are shown in Fig. 4. Even in the presence of outliers, DGC-GNN outperforms other methods by a large margin. This indicates that our method is more robust to outliers and can handle challenging matching scenarios more effectively than the state-of-the-art.

Ablation Study. We investigate the effectiveness of different components of DGC-GNN on the 2D-3D matching quality on the MegaDepth dataset [24] with $k=10$. The results are reported in Table 2. We provide results with $k=1$ in the supplementary material. We conduct the ablations by gradually adding the components: global geometric embedding (G. Emb.), cluster attention (C. Att.), Color, and Angular embedding (Ang.) to the original GoMatch pipeline. Incorporating color information into the matching process significantly impacts the performance, resulting in improvements of 2.55 / 3.88 / 2.55% (AUC@1 / 5 / 10px). This demonstrates the importance of color cues for accurate and robust matching. The global-to-local geometric (G. Emb.) and Angular relation embedding (Ang.) substantially improve the matching performance by 1.90 / 5.51 / 5.52% and 1.11 / 3.36 / 3.47%, respectively. This highlights the effectiveness of incorporating global geometric context and local geometric details. The cluster attention mechanism also plays a vital role, improving performance by 0.78 / 3.28 / 3.48%. The best results are obtained when all components are added to the pipeline.
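Three of the gains quoted above can be recomputed directly from consecutive rows of Table 2. The sketch below does this with exact decimal arithmetic; the row values are copied from the table, while the variable and function names are hypothetical.

```python
from decimal import Decimal


def auc_gain(before, after):
    """Element-wise AUC difference (after - before) at 1 / 5 / 10 px."""
    return tuple(Decimal(a) - Decimal(b) for b, a in zip(before, after))


# Reprojection AUC rows from Table 2 (MegaDepth, k = 10), kept as strings
# so that Decimal parses them exactly.
g_emb = ("10.86", "41.18", "50.51")                  # + G. Emb.
g_emb_c_att = ("11.64", "44.46", "53.99")            # + G. Emb. + C. Att.
g_emb_c_att_color = ("14.19", "48.34", "56.54")      # + G. Emb. + C. Att. + Color
full_model = ("15.30", "51.70", "60.01")             # DGC-GNN

gains = {
    "cluster attention": auc_gain(g_emb, g_emb_c_att),
    "color": auc_gain(g_emb_c_att, g_emb_c_att_color),
    "angular": auc_gain(g_emb_c_att_color, full_model),
}

for name, gain in gains.items():
    print(name, " / ".join(str(g) for g in gain))
```

Each gain is the difference between two adjacent ablation rows, so it measures the effect of one component given the components already added before it.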

Group	Methods	No Desc. Maint.	Privacy	Cambridge-Landmarks [22] (cm, ${}^{\circ}$)				MB used	7Scenes [46] (cm, ${}^{\circ}$)							MB used
Group	Methods	No Desc. Maint.	Privacy	King’s	Hospital	Shop	St. Mary’s	MB used	Chess	Fire	Heads	Office	Pumpkin	Kitchen	Stairs	MB used
E2E	MS-Trans. [44]	✓		83 / 1.47	181 / 2.39	86 / 3.07	162 / 3.99	1171	11 / 4.66	24 / 9.60	14 / 12.19	17 / 5.66	18 / 4.44	17 / 5.94	26 / 8.45	11171
E2E	DSAC* [3]	✓		15 / 0.30	121 / 0.40	15 / 0.30	113 / 0.40	1112	12 / 1.10	12 / 1.24	11 / 1.82	13 / 1.15	14 / 1.34	14 / 1.68	13 / 1.16	11196
E2E	HSCNet [23]	✓		18 / 0.30	119 / 0.30	16 / 0.30	119 / 0.30	1592	12 / 0.70	12 / 0.90	11 / 0.90	13 / 0.80	14 / 1.00	14 / 1.20	13 / 0.80	11036
DB	HybridSC [6]	✗	–	81 / 0.59	175 / 1.01	19 / 0.54	150 / 0.49	1113								
DB	AS [41]	✗		13 / 0.22	120 / 0.36	14 / 0.21	118 / 0.25	1813	13 / 0.87	12 / 1.01	11 / 0.82	14 / 1.15	17 / 1.69	15 / 1.72	4 / 1.01	
DB	SP [11]+SG [38]	✗		12 / 0.20	115 / 0.30	14 / 0.20	117 / 0.21	3215	12 / 0.85	12 / 0.94	11 / 0.75	13 / 0.92	15 / 1.30	14 / 1.40	15 / 1.47	22977
DF	GoMatch [56]	✓		25 / 0.64	283 / 8.14	48 / 4.77	335 / 9.94	1148	14 / 1.65	13 / 3.86	19 / 5.17	11 / 2.48	16 / 3.32	13 / 2.84	89 / 21.12	11302
DF	DGC-GNN	✓		18 / 0.47	75 / 2.83	15 / 1.57	106 / 4.03	1169	13 / 1.43	15 / 1.77	14 / 2.95	16 / 1.61	18 / 1.93	18 / 2.09	71 / 19.5	11355

Table 3: Visual Localization. We report the median pose errors (cm, ${}^{\circ}$) and storage requirements (MB) on the scenes of the 7Scenes [46] and Cambridge-Landmarks [22] datasets. Three groups of methods are shown: end-to-end (E2E), descriptor-based (DB), and descriptor-free (DF). We do not show BPnPNet as it fails on most scenes. The best results are shown in bold in each group.

3.2 Visual Localization

Visual localization estimates the 6 degrees-of-freedom camera pose of an input query image with respect to a known map of the scene. One of the most prominent ways of approaching this problem is to establish 2D-3D correspondences and run robust pose estimation. Following [56], we run the proposed DGC-GNN to obtain matches. For each query image, we match its keypoints with the 3D points from the top-10 retrieved views to build the 2D-3D correspondences. The camera pose is then estimated by PnP-RANSAC [16, 17]. We use two standard datasets, 7Scenes [46] and Cambridge Landmarks [22]. For 7Scenes, we extract the keypoints with the SIFT detector, and the top-10 pairs are retrieved using DenseVLAD [50]. For Cambridge Landmarks, the keypoints are extracted by SuperPoint [11] to ensure consistency with the SuperPoint-based structure-from-motion model, and the top-10 pairs are provided by NetVLAD [1].

3.2.1 Results

In Table 3, we present the 3D model maintenance costs, privacy, storage requirements, and median camera pose errors (cm, ${}^{\circ}$) of standard descriptor-based localization techniques and descriptor-free methods. DGC-GNN consistently outperforms GoMatch on all scenes by a significant margin. On Cambridge Landmarks, the average median error of DGC-GNN is 54 cm / 2.23${}^{\circ}$, while GoMatch leads to an error of 173 cm / 5.87${}^{\circ}$. On 7Scenes, the average error of DGC-GNN is 15 cm / 4.47${}^{\circ}$, and that of GoMatch is 22 cm / 5.77${}^{\circ}$. DGC-GNN requires a similar amount of memory to other descriptor-free methods. It also inherits their privacy-preserving properties, since it does not require visual descriptors.
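For reference, the translation and rotation errors reported per scene can be computed as in the following minimal sketch. It uses a common definition of the two errors and is not necessarily identical to the evaluation script; the function names are hypothetical, and we assume world-to-camera poses ($\mathbf{x}_{cam} = R\,\mathbf{x}_{world} + \mathbf{t}$) with translations expressed in meters.

```python
import numpy as np


def pose_errors(R_gt, t_gt, R_est, t_est):
    """Return (translation error in cm, rotation error in degrees).

    Assumptions (not taken from the paper): poses map world to camera
    coordinates and translations are in meters.
    """
    # Camera centers in world coordinates: c = -R^T t.
    c_gt = -R_gt.T @ t_gt
    c_est = -R_est.T @ t_est
    trans_err_cm = 100.0 * np.linalg.norm(c_gt - c_est)
    # Angle of the relative rotation R_gt^T R_est.
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return trans_err_cm, rot_err_deg


def median_errors(per_query_errors):
    """Median translation and rotation error over all queries of a scene."""
    arr = np.asarray(per_query_errors, dtype=float)
    return float(np.median(arr[:, 0])), float(np.median(arr[:, 1]))
```

The per-scene medians are then averaged over the scenes of a dataset to obtain the average median errors quoted in the text.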

The trade-off between descriptor-based (DB) and descriptor-free (DF) algorithms is visible in the table. While descriptor-based methods lead to the best accuracy overall, they require excessive memory and descriptor maintenance and are susceptible to privacy attacks. Although the model compression method HybridSC [6] is effective at reducing storage, its performance on the Cambridge Landmarks dataset is similar to that of DGC-GNN, and it still requires descriptor maintenance. End-to-end methods (E2E) overcome these problems and achieve accurate results. However, their main limitation is that they must be trained independently on each scene. The proposed DGC-GNN only needs to be trained once, making it more efficient and convenient to use as an off-the-shelf tool.

3.2.2 Generalizability

Scenes	Trained on MegaDepth [24] (SIFT)				Trained on ScanNet [10] (SIFT)
Scenes	GoMatch (SIFT)	GoMatch (SP)	DGC-GNN (SIFT)	DGC-GNN (SP)	DGC-GNN (SIFT)	DGC-GNN (SP)
Chess	14 / 11.65	14 / 11.56	13 / 11.41	13 / 11.46	13 / 11.43	14 / 11.51
Fire	13 / 13.86	12 / 13.71	15 / 11.81	17 / 12.30	15 / 11.77	16 / 12.03
Heads	19 / 15.17	15 / 13.43	15 / 13.13	14 / 12.78	14 / 12.95	14 / 13.02
Office	11 / 12.48	17 / 11.76	17 / 11.66	17 / 11.66	16 / 11.61	17 / 11.66
Pumpkin	16 / 13.32	28 / 15.65	18 / 12.03	12 / 12.75	18 / 11.93	10 / 12.38
Redkitchen	13 / 12.84	14 / 13.03	18 / 12.14	10 / 12.36	18 / 12.09	19 / 12.28
Stairs	89 / 21.12	58 / 13.12	83 / 21.53	55 / 13.05	71 / 19.50	58 / 14.32
All	22 / 15.78	18 / 14.61	17 / 14.82	14 / 13.77	15 / 14.47	14 / 13.89

Table 4: Model Generalizability on Visual Localization Task. We report the translation and rotation median error (cm / ${}^{\circ}$) on the 7Scenes dataset [46]. We evaluate the models trained on the MegaDepth [24] and ScanNet [10] datasets. The best performance is bold.

Similar to [56], we discuss the generalizability of our DGC-GNN model on the visual localization task across different training and evaluation scenes. Specifically, we investigate the performance of our model when trained on MegaDepth [24] and ScanNet [10] and evaluated on the 7Scenes dataset [46]. We also explore the impact of using different keypoint detectors, namely SIFT [27] and SuperPoint [11], during the evaluation.

These experiments are summarized in Table 4, providing an overview of the performance of DGC-GNN under different training and evaluation conditions. While the best overall performance is achieved by training on the MegaDepth dataset with SIFT features, the results are similar in both training scenarios, showcasing that the proposed method generalizes well to unseen data. Although we train on SIFT features, the best results are achieved by using SuperPoint features at inference time. This demonstrates that DGC-GNN is insensitive to the keypoint detector used and can be applied off-the-shelf, without retraining for a specific scenario.

4 Conclusion

In conclusion, this paper introduces DGC-GNN, a novel graph-based pipeline for visual descriptor-free 2D-3D matching that effectively leverages geometric and color cues in a global-to-local manner. Our global-to-local procedure encodes both Euclidean and angular relations at a coarse level, forming a geometric embedding to guide the local point matching. By employing a cluster-based transformer, we enable efficient information passing within local clusters, ultimately leading to significant improvements in the number of correct matches and the accuracy of pose estimation. Compared to the state-of-the-art descriptor-free matcher GoMatch [56], the proposed DGC-GNN demonstrates a substantial improvement, doubling the accuracy on real-world and large-scale datasets. Furthermore, it results in significantly increased localization accuracy. These advancements contribute to reducing the gap between descriptor-based and descriptor-free methods while addressing the limitations of descriptor-based ones, such as memory footprint, maintenance costs, and susceptibility to privacy attacks.

Limitations. The primary limitation of our proposed DGC-GNN method lies in its performance being inferior to traditional descriptor-based algorithms. The performance difference can be attributed to the insufficiency of unique 3D structures in the geometry, which hinders the ability of the algorithm to identify distinct matches in real-world scenarios. Although DGC-GNN demonstrates a notable improvement over existing descriptor-free approaches, there remains a performance gap to overcome in order to achieve results on par with or superior to those of descriptor-based methods.

Acknowledgements. This work was supported by the Academy of Finland (grants No. 327911, No. 353138) and the Hasler Stiftung Research Grant via the ETH Zurich Foundation. We acknowledge the computational resources provided by the CSC-IT Center for Science, Finland.

References

Arandjelovic et al. [2016] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5297–5307, 2016.
Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Brachmann and Rother [2021] Eric Brachmann and Carsten Rother. Visual camera re-localization from RGB and RGB-D images using DSAC. TPAMI, 2021.
Bradski [2000] Gary Bradski. The opencv library. Software Tools for the Professional Programmer, 25(11):120–123, 2000.
Campbell et al. [2020] Dylan Campbell, Liu Liu, and Stephen Gould. Solving the blind perspective-n-point problem end-to-end with robust differentiable geometric optimization. In Proceedings of the European Conference on Computer Vision (ECCV), page preprint. Springer, 2020.
Camposeco et al. [2019] Federico Camposeco, Andrea Cohen, Marc Pollefeys, and Torsten Sattler. Hybrid scene compression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7653–7662, 2019.
Chelani et al. [2021] Kunal Chelani, Fredrik Kahl, and Torsten Sattler. How privacy-preserving are line clouds? recovering scene details from 3d lines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15668–15678, 2021.
Cui and Tan [2015] Zhaopeng Cui and Ping Tan. Global structure-from-motion by similarity averaging. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 864–872, 2015.
Cuturi [2013] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013.
Dai et al. [2017] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
DeTone et al. [2018] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop, pages 224–236, 2018.
Dosovitskiy and Brox [2016a] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. Advances in neural information processing systems, 29, 2016a.
Dosovitskiy and Brox [2016b] Alexey Dosovitskiy and Thomas Brox. Inverting visual representations with convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4829–4837, 2016b.
Dusmanu et al. [2019] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8092–8101, 2019.
Eade and Drummond [2006] Ethan Eade and Tom Drummond. Scalable monocular slam. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), pages 469–476. IEEE, 2006.
Fischler and Bolles [1981] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
Gao et al. [2003] Xiao-Shan Gao, Xiao-Rong Hou, Jianliang Tang, and Hang-Fei Cheng. Complete solution classification for the perspective-three-point problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):930–943, 2003.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Huang et al. [2021] Shengyu Huang, Zan Gojcic, Mikhail Usvyatsov, Andreas Wieser, and Konrad Schindler. Predator: Registration of 3d point clouds with low overlap. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 4267–4276, 2021.
Irschara et al. [2009] Arnold Irschara, Christopher Zach, Jan-Michael Frahm, and Horst Bischof. From structure-from-motion point clouds to fast location recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2599–2606. IEEE, 2009.
Katharopoulos et al. [2020] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
Kendall et al. [2015] Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015.
Li et al. [2020] Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, and Juho Kannala. Hierarchical scene coordinate classification and regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11983–11992, 2020.
Li and Snavely [2018] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
Lindenberger et al. [2021] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5987–5997, 2021.
Liu et al. [2020] Liu Liu, Dylan Campbell, Hongdong Li, Dingfu Zhou, Xibin Song, and Ruigang Yang. Learning 2d-3d correspondences to solve the blind perspective-n-point problem. arXiv preprint arXiv:2003.06752, 2020.
Lowe [2004] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
Maas et al. [2013] Andrew L Maas, Awni Y Hannun, Andrew Y Ng, et al. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the International Conference of Machine Learning, page 3, 2013.
Moré [2006] Jorge J Moré. The levenberg-marquardt algorithm: implementation and theory. In Numerical Analysis: Proceedings of the Biennial Conference, pages 105–116. Springer, 2006.
Mur-Artal and Tardós [2017] Raul Mur-Artal and Juan D Tardós. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
Mur-Artal et al. [2015] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
Ng et al. [2022] Tony Ng, Hyo Jin Kim, Vincent T Lee, Daniel DeTone, Tsun-Yi Yang, Tianwei Shen, Eddy Ilg, Vassileios Balntas, Krystian Mikolajczyk, and Chris Sweeney. Ninjadesc: content-concealing visual descriptors via adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12797–12807, 2022.
Nister and Stewenius [2006] David Nister and Henrik Stewenius. Scalable recognition with a vocabulary tree. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), pages 2161–2168. IEEE, 2006.
Pan et al. [2023] Linfei Pan, Johannes L Schönberger, Viktor Larsson, and Marc Pollefeys. Privacy preserving localization via coordinate permutations. In Proceedings of the International Conference on Computer Vision, pages 18174–18183, 2023.
Pittaluga et al. [2019] Francesco Pittaluga, Sanjeev J Koppal, Sing Bing Kang, and Sudipta N Sinha. Revealing scenes by inverting structure from motion reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 145–154, 2019.
Qin et al. [2022] Zheng Qin, Hao Yu, Changjian Wang, Yulan Guo, Yuxing Peng, and Kai Xu. Geometric transformer for fast and robust point cloud registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11143–11152, 2022.
Sarlin et al. [2019] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12716–12725, 2019.
Sarlin et al. [2020] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4938–4947, 2020.
Sattler et al. [2011] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Fast image-based localization using direct 2d-to-3d matching. In International Conference on Computer Vision, pages 667–674. IEEE, 2011.
Sattler et al. [2012] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Improving image-based localization by active correspondence search. In European conference on computer vision, pages 752–765. Springer, 2012.
Sattler et al. [2016] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE transactions on pattern analysis and machine intelligence, 39(9):1744–1756, 2016.
Sattler et al. [2018] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, et al. Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8601–8610, 2018.
Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
Shavit et al. [2021] Yoli Shavit, Ron Ferens, and Yosi Keller. Learning multi-scene absolute pose regression with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2733–2742, 2021.
Shi et al. [2022] Yan Shi, Jun-Xiong Cai, Yoli Shavit, Tai-Jiang Mu, Wensen Feng, and Kai Zhang. Clustergnn: Cluster-based coarse-to-fine graph neural network for efficient feature matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12517–12526, 2022.
Shotton et al. [2013] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.
Sinkhorn and Knopp [1967] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.
Snavely et al. [2006] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. In ACM SIGGRAPH, pages 835–846, 2006.
Sun et al. [2021] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
Torii et al. [2015] Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1808–1817, 2015.
Ulyanov et al. [2016] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
Wang et al. [2021] Shuzhe Wang, Zakaria Laskar, Iaroslav Melekhov, Xiaotian Li, and Juho Kannala. Continual learning for image-based camera localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3252–3262, 2021.
Yang et al. [2022] Luwei Yang, Rakesh Shrestha, Wenbo Li, Shuaicheng Liu, Guofeng Zhang, Zhaopeng Cui, and Ping Tan. Scenesqueezer: Learning to compress scene for camera relocalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8259–8268, 2022.
Yu et al. [2021] Hao Yu, Fu Li, Mahdi Saleh, Benjamin Busam, and Slobodan Ilic. Cofinet: Reliable coarse-to-fine correspondences for robust pointcloud registration. Advances in Neural Information Processing Systems, 34:23872–23884, 2021.
Zhou et al. [2022] Qunjie Zhou, Sérgio Agostinho, Aljoša Ošep, and Laura Leal-Taixé. Is geometry enough for matching in visual localization? In Proceedings of the European Conference on Computer Vision (ECCV), pages 407–425. Springer, 2022.

Supplementary Material

Appendix A Training and Evaluation Details

Dataset Generation. The training data generation process for MegaDepth [24] follows the methodology outlined in GoMatch [56]. The undistorted SfM model reconstructions used in MegaDepth are provided by D2Net [14]. For training, we sample up to 500 images from each scene. For each sampled image, we select the top-$k$ co-visible views that have at least 35% image overlap. This ensures that there are enough matches for training. The overlap score is computed by dividing the number of co-visible 3D points by the total number of points in the training image.
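The overlap score and the view selection above can be sketched as follows. This is an illustrative example rather than the actual data generation code: the function names and the dictionary layout are hypothetical, and ranking the remaining views by descending overlap to obtain the top-$k$ is our assumption.

```python
def overlap_score(train_point_ids, view_point_ids):
    """Co-visible 3D points divided by the number of points in the training image."""
    train = set(train_point_ids)
    if not train:
        return 0.0
    return len(train & set(view_point_ids)) / len(train)


def select_covisible_views(train_point_ids, candidate_views, k, min_overlap=0.35):
    """Keep views with at least `min_overlap` overlap and return the k best.

    `candidate_views` maps a view name to the ids of the 3D points it observes
    (hypothetical layout). Ties are broken by view name for determinism.
    """
    scored = [
        (overlap_score(train_point_ids, ids), name)
        for name, ids in candidate_views.items()
    ]
    kept = [(score, name) for score, name in scored if score >= min_overlap]
    kept.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in kept[:k]]


views = {
    "a": range(0, 8),              # overlap 0.8
    "b": range(0, 3),              # overlap 0.3, below the 35% threshold
    "c": range(0, 5),              # overlap 0.5
    "d": [0, 1, 2, 3, 20, 21],     # overlap 0.4
}
selected = select_covisible_views(range(10), views, k=2)
```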

In the case of ScanNet [10], a similar procedure is conducted. We also sample up to 500 images from each scene for the training set generation. The co-visible images are obtained using the co-visible scores provided by LoFTR [49]. We extract all the co-visible views of a training image with co-visible scores larger than 0.65. Then, we randomly sample the top-$k$ views for training. Since ScanNet is an RGB-D dataset without an SfM reconstruction, we obtain the 3D points for each image by projecting the detected 2D keypoints with valid depth to 3D. By doing this for each image, we reconstruct a sparse 3D point cloud based on the detected 2D keypoints. Note that the correspondence between different co-visible frames is not required in this case.
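Lifting keypoints with valid depth to 3D, as described above, amounts to a standard pinhole back-projection. The sketch below illustrates this step under stated assumptions: the function name is hypothetical, the depth map is assumed to be in metric units with zeros (or non-finite values) marking invalid pixels, depth is read at the nearest pixel, and the optional pose is assumed to map camera to world coordinates.

```python
import numpy as np


def backproject_keypoints(keypoints, depth, K, R_c2w=None, t_c2w=None):
    """Lift 2D keypoints (x, y) with valid depth to 3D points.

    Returns the (V, 3) points and a boolean mask of length N marking which
    keypoints had a valid depth value.
    """
    kpts = np.asarray(keypoints, dtype=float)
    cols = np.round(kpts[:, 0]).astype(int)
    rows = np.round(kpts[:, 1]).astype(int)
    h, w = depth.shape
    inside = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    z = np.zeros(len(kpts))
    z[inside] = depth[rows[inside], cols[inside]]
    valid = inside & np.isfinite(z) & (z > 0)
    # Pinhole model: x = (u - cx) / fx * z, y = (v - cy) / fy * z.
    x = (kpts[valid, 0] - K[0, 2]) / K[0, 0] * z[valid]
    y = (kpts[valid, 1] - K[1, 2]) / K[1, 1] * z[valid]
    points = np.stack([x, y, z[valid]], axis=1)
    if R_c2w is not None and t_c2w is not None:
        points = points @ R_c2w.T + t_c2w  # camera frame -> world frame
    return points, valid
```

Applying this to every database image and transforming the points with the known camera poses yields a sparse point cloud without requiring cross-frame correspondences.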

In total, for MegaDepth, we generate a training set consisting of 25,624 images from 99 scenes and a test set comprising 12,399 images covering 53 scenes. For ScanNet, we create a training set with 52,008 images from 105 scenes. The test set for ScanNet consists of 14,892 query images from 30 scenes. The data generation of 7Scenes [46] and Cambridge dataset [22] follows the same procedure in [56].

Method	Reprojection		InvSFM [35]
Method	Points	Points+RGB	Points	Points+RGB	Points+SIFT
SSIM ( $\downarrow$ )	0.240	0.258	0.352	0.375	0.476

Table 5: SSIM Results. We evaluate the SSIM for point reprojection and image recovery. Adding RGB to the points leads to only a slight SSIM increase for both reprojection and image recovery.

Inference. We consider a query with at least 10 keypoints as valid input. The 3D points from the top-$k$ retrieved database images are then matched against the query keypoints with our proposed pipeline. We use the Sinkhorn algorithm [47, 9] to optimize the extended cost matrix $\bar{\mathcal{M}}\in\mathbb{R}^{(N+1)\times(M+1)}$ iteratively, with up to 20 iterations, to obtain the initial matches. The final matches are obtained by filtering out the matches whose matching confidence $\theta$ is below 0.5 in the outlier rejection module. For the visual localization task, the camera poses are estimated by the P3P solver with RANSAC [16] implemented in OpenCV [4] and then refined by the Levenberg-Marquardt algorithm [29] on the inlier matches, minimizing the reprojection error.
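To make the matching step more concrete, the following sketch runs a log-domain Sinkhorn normalization on a score matrix extended by one dustbin row and column, extracts mutual best matches, and filters them by a confidence threshold. It follows the widely used SuperGlue-style formulation [38] and is an assumption about the general mechanism, not the released implementation: all function names are hypothetical, the dustbin score is a fixed constant here (it is typically learned), and the per-match confidences that would come from the outlier rejection network are passed in as a plain array.

```python
import numpy as np


def _logsumexp(a, axis):
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def sinkhorn_with_dustbin(scores, dustbin_score=0.0, num_iters=20):
    """Return the (N+1, M+1) soft assignment for an (N, M) score matrix."""
    n, m = scores.shape
    ext = np.full((n + 1, m + 1), dustbin_score, dtype=float)
    ext[:n, :m] = scores
    # Each real point carries unit mass; each dustbin can absorb the other side.
    log_mu = np.log(np.concatenate([np.ones(n), [m]]))
    log_nu = np.log(np.concatenate([np.ones(m), [n]]))
    u, v = np.zeros(n + 1), np.zeros(m + 1)
    for _ in range(num_iters):
        u = log_mu - _logsumexp(ext + v[None, :], axis=1)
        v = log_nu - _logsumexp(ext + u[:, None], axis=0)
    return np.exp(ext + u[:, None] + v[None, :])


def mutual_matches(assignment):
    """Initial matches: mutual argmax over the non-dustbin block."""
    inner = assignment[:-1, :-1]
    best_col, best_row = inner.argmax(axis=1), inner.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(best_col) if best_row[j] == i]


def filter_matches(matches, confidences, threshold=0.5):
    """Final matches: drop matches whose confidence is below the threshold."""
    return [mt for mt, c in zip(matches, confidences) if c >= threshold]
```

In this sketch, a point that is assigned mostly to the dustbin simply never wins the mutual argmax, which is how unmatched keypoints are handled before the learned outlier rejection is applied.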

Appendix B Privacy Issue of RGB Points

We investigate the impact on privacy of incorporating RGB information into pixels and points. To assess this, we compute the Structural Similarity Index Measure (SSIM) between the 3D points reprojected onto the image plane and the ground truth (GT) images on MegaDepth, over 500 images from multiple scenes. Additionally, we recover the images from points + RGB and points + descriptors with InvSFM [35] and calculate the SSIM against the GT images. The findings are detailed in Table 5 and Fig. 5. The addition of RGB data to points results in only a marginal increase in SSIM for both direct reprojection and image reconstruction via InvSFM, significantly less than what is achieved by incorporating SIFT descriptors. It is worth noting that denser point clouds might provide sufficient context, potentially leading to privacy concerns. However, in our setting, we mitigate this risk by limiting the number of keypoints from each database image to a maximum of 1024.

Appendix C Additional Results

Qualitative Results. More visualizations of inlier matches provided by DGC-GNN and GoMatch on MegaDepth are shown in Fig. 6. DGC-GNN consistently finds more correct matches on multiple scenes, highlighting the effectiveness of the proposed method.

Methods	Global	C. Att.	Color	Ang.	Cluster	Reproj. AUC (%)	Rotation ( ${}^{\circ}$ )	Translation
Methods	Global	C. Att.	Color	Ang.	Cluster	@1 / 5 / 10px ( $\uparrow$ )	Quantile@25 / 50 / 75% ( $\downarrow$ )
GoMatch [56] (w/o OR)						14.47 / 17.95 / 23.42	1.29 / 11.85 / 33.60	0.11 / 1.18 / 3.58
GoMatch [56]						15.67 / 22.43 / 28.01	0.60 / 10.08 / 34.63	0.06 / 1.06 / 3.73
Variants	G.Emb				K-means	17.68 / 28.41 / 34.36	0.28 / 16.78 / 34.52	0.03 / 0.73 / 3.77
	G.Label	✓			K-means	17.13 / 27.33 / 33.18	0.31 / 17.34 / 33.63	0.03 / 0.76 / 3.64
	G.Emb	✓			K-means	18.10 / 30.64 / 37.07	0.24 / 14.48 / 34.30	0.03 / 0.63 / 3.51
	G.Emb	✓	✓		K-means	19.82 / 35.29 / 41.16	0.17 / 12.88 / 31.74	0.02 / 0.27 / 3.24
	G.Emb	✓	✓	✓	Mean-shift	10.07 / 36.01 / 43.03	0.16 / 12.15 / 28.99	0.01 / 0.20 / 3.26
DGC-GNN (w/o OR)	G.Emb	✓	✓	✓	K-means	18.56 / 30.79 / 37.03	0.22 / 14.85 / 30.07	0.02 / 0.47 / 3.10
DGC-GNN	G.Emb	✓	✓	✓	K-means	10.20 / 37.64 / 44.04	0.15 / 11.53 / 27.93	0.01 / 0.15 / 3.00

Table 6: Additional Ablation Results. AUC scores thresholded at 1, 5, and 10 pixels on $k=1$; rotation and translation error quantiles at 25, 50, 75% with the proposed components added one by one to the GoMatch pipeline.

Additional Ablation Results. In addition to the ablation results presented in the main paper, we also provide ablation results for single-view matching with $k=1$ on MegaDepth [24]. Furthermore, we conduct additional ablations to investigate the impact of different component selections. First, we compare the geometric global embedding (G. Emb.) used in the main paper with a global clustering label embedding (G. Label): instead of encoding geometric cues, we encode the label of each global cluster and concatenate it to the local point feature. Second, we explore the choice of clustering algorithm by comparing K-Means and Mean-Shift in our pipeline. Finally, we study the effectiveness of the outlier rejection (OR) network.

The results are presented in Table 6. We reach similar conclusions for each component as in the main paper. The global label embedding (G. Label) with cluster attention (C. Att.) performs even worse than the geometric embedding (G. Emb.) alone, indicating the superiority of our clustering-based geometric embedding over the label embedding and highlighting the importance of incorporating geometric cues in the embedding process for effective point matching. Regarding the clustering algorithm, we observe only a minor difference between the K-Means and Mean-Shift results, suggesting that our approach is robust to this choice. The results also demonstrate that outlier rejection is an essential post-processing module for achieving good performance. In addition to the numerical results, we visualize the inlier matches (see Fig. 7) to provide further insight into the behavior and performance of the different architectures.

Hyperparameter Analysis. Besides the component ablations, we also analyze the hyperparameters used in our main pipeline. Here, we add ablations on the number of input keypoints, the number of cluster groups at the coarse level, the number of nearest neighbors used to build the local graph, and the outlier rejection threshold by retraining our DGC-GNN. The results are presented in Table 7. With the outlier rejection threshold fixed at 0.5, DGC-GNN with G. Clusters = 10 and Local NN = 10 achieves the best overall performance. Raising the outlier rejection threshold to 0.7 gives the best results in Table 7. Overall, the results are stable across different configurations, indicating robustness to the parameter setting.

Methods	G. Clusters	Local NN	OR Threshold	Reproj. AUC (%)	Rotation ( ${}^{\circ}$ )	Translation
Methods	G. Clusters	Local NN	OR Threshold	@1 / 5 / 10px ( $\uparrow$ )	Quantile@25 / 50 / 75% ( $\downarrow$ )
DGC-GNN	10	10	0.5	15.30 / 51.70 / 60.01	0.07 / 0.26 / 15.41	0.01 / 0.02 / 0.57
HyperParam.	5	10	0.5	14.73 / 50.12 / 58.26	0.08 / 0.28 / 18.76	0.01 / 0.03 / 0.99
	15	10	0.5	15.14 / 50.56 / 58.62	0.07 / 0.28 / 17.66	0.01 / 0.03 / 0.89
	10	20	0.5	14.77 / 49.84 / 57.97	0.07 / 0.29 / 18.26	0.01 / 0.03 / 0.90
	10	30	0.5	14.75 / 50.95 / 59.45	0.08 / 0.28 / 15.48	0.01 / 0.03 / 0.58
	10	10	0.3	13.28 / 46.44 / 55.05	0.08 / 0.43 / 18.63	0.01 / 0.04 / 0.98
	10	10	0.7	16.63 / 56.26 / 64.46	0.07 / 0.19 / 12.58	0.01 / 0.02 / 0.27

Table 7: Ablation Study on Hyperparameters. We report the results of ablations with $k=10$ retrieved images on the number of global clusters, the number of nearest neighbour points for local graph build, and different thresholds for outlier rejection. The best results are bold.

Matching Results in pixel threshold. As mentioned in the main paper, we selected the ground truth matches in normalized image coordinates. This difference in ground truth selection only affects the reprojection AUC scores. Here, we present the matching results in Table 5 obtained by selecting the ground truth matches in pixel coordinates with a $1$ pixel threshold, as done in [56]. Our conclusions still hold.
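Selecting ground truth matches in pixel coordinates can be illustrated as follows: the 3D points are projected with the ground truth pose, and a keypoint-point pair is kept if the reprojection error is within the pixel threshold. This is a minimal sketch under our own assumptions, namely a world-to-camera pose $(R, \mathbf{t})$, a pinhole intrinsic matrix $K$ with last row $(0, 0, 1)$, and a mutual nearest-neighbour check; the function name is hypothetical and the exact selection rule of [56] may differ.

```python
import numpy as np


def ground_truth_matches(kpts_2d, pts_3d, K, R, t, px_thresh=1.0):
    """Return (2D index, 3D index) pairs with reprojection error <= px_thresh."""
    cam = pts_3d @ R.T + t                      # world -> camera coordinates
    in_front = cam[:, 2] > 1e-8                 # ignore points behind the camera
    proj = cam @ K.T
    z = np.where(in_front, proj[:, 2], 1.0)     # avoid dividing by invalid depth
    uv = proj[:, :2] / z[:, None]
    # Pairwise pixel distances between keypoints (N) and projections (M).
    dists = np.linalg.norm(kpts_2d[:, None, :] - uv[None, :, :], axis=2)
    dists[:, ~in_front] = np.inf
    nn_3d = dists.argmin(axis=1)
    nn_2d = dists.argmin(axis=0)
    return [
        (i, int(j))
        for i, j in enumerate(nn_3d)
        if nn_2d[j] == i and dists[i, j] <= px_thresh
    ]
```

Selecting the matches in normalized image coordinates corresponds to the same procedure applied after removing the intrinsics, with the threshold expressed in normalized units.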

Appendix D Model Parameters and Timing

We discuss the model parameters and running time of DGC-GNN in this section. DGC-GNN incorporates global geometric embedding and local clustering attention, and has around 5.7 million trainable parameters and an estimated model size of 22.6 MB. The average inference time for each image pair over the MegaDepth evaluation queries is 77.8 ms. It roughly breaks down into point encoding (24 ms), global geometric embedding (14 ms), cluster-based attention (22 ms), optimal transport (7 ms), and outlier rejection (8 ms). The measurements are conducted on a 32GB NVIDIA Tesla V100 GPU with a maximum of 1024 keypoints.
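As a quick sanity check of these figures, the arithmetic can be reproduced as below. The sketch assumes that the parameters are stored as 32-bit floats (4 bytes each), which is our assumption and not stated above; the helper name is hypothetical.

```python
def model_size_mb(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate model size in MB, assuming float32 parameters."""
    return num_params * bytes_per_param / 1e6


# About 5.7 million parameters give roughly 22.8 MB, in line with the
# reported 22.6 MB given that the parameter count is rounded.
approx_size = model_size_mb(5.7e6)

# Per-stage timings (ms) as listed in the text.
stage_ms = {
    "point encoding": 24,
    "global geometric embedding": 14,
    "cluster-based attention": 22,
    "optimal transport": 7,
    "outlier rejection": 8,
}
listed_total = sum(stage_ms.values())   # 75 ms
unlisted_ms = 77.8 - listed_total       # remainder relative to the 77.8 ms average
```

The listed stages account for 75 ms of the 77.8 ms average, which is consistent with the breakdown being approximate.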