Given a partial 3D reconstruction produced by running structure from motion on Internet images capturing large-scale landmarks, such as the front or the rear façade of the Milan Cathedral depicted above, we present a technique for grounding this reconstruction in a complete 3D reference model of the scene. Reference models are constructed from pseudo-synthetic renderings extracted from Google Earth Studio. As illustrated above, our approach enables merging disjoint partial 3D reconstructions into a unified model.
Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction in a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models.
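As a rough sketch, a Gaussian augmented with a semantic feature might be stored as below. The field names and the 64-dimensional feature size are our illustrative assumptions, not the paper's exact parameterization (full 3DGS additionally stores spherical-harmonic color coefficients):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SemanticGaussian:
    """One 3D Gaussian augmented with a semantic feature vector.

    Illustrative sketch: field names and the feature dimension (64)
    are assumptions, not the paper's exact parameterization.
    """
    mean: np.ndarray       # (3,) center in world coordinates
    log_scale: np.ndarray  # (3,) anisotropic scales, stored in log-space
    rotation: np.ndarray   # (4,) unit quaternion (w, x, y, z)
    opacity: float         # opacity logit (passed through a sigmoid)
    color: np.ndarray      # (3,) RGB (SH coefficients in full 3DGS)
    # Semantic descriptor shared across the real/pseudo-synthetic domains.
    feature: np.ndarray = field(default_factory=lambda: np.zeros(64))
```

During alignment, only the global transformation is optimized; every `SemanticGaussian` in the reference model stays frozen.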
Below, we show the alignment produced by our optimization-based method, which aligns a partial reconstruction of the Freiburg Cathedral to a 3DGS reference model generated from Google Earth Studio. Four images from the partial reconstruction are shown in the top half, while the bottom half shows the corresponding renderings from the Gaussian Splatting model. As illustrated, our inverse optimization-based approach predicts precise transformations, even in the presence of inaccurate initializations.
Given a 3DGS reference model (left) and a set of Internet images (right), we propose an inverse optimization scheme that predicts a global 6DoF+scale alignment T while keeping the parameters of the 3DGS model fixed. We obtain an initial transformation T (in red) using a traditional SfM technique. During optimization, we compute a semantic feature loss Lsem and backpropagate it to update T (converging to the rendered view in green after N steps).
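The inverse optimization described above can be illustrated with a toy Sim(3) alignment. This sketch is a simplified stand-in, not the actual method: it replaces the rendered semantic feature loss Lsem with a point-to-point loss, and replaces backpropagation through a 3DGS renderer with finite-difference gradients; all names and hyperparameters are our assumptions:

```python
import numpy as np

def rotation_matrix(w):
    """Axis-angle vector -> 3x3 rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def transform(theta, pts):
    """Apply the 7-parameter Sim(3) transform: s * R @ p + t."""
    w, log_s, t = theta[:3], theta[3], theta[4:]
    return np.exp(log_s) * pts @ rotation_matrix(w).T + t

def loss(theta, partial, reference):
    """Toy stand-in for Lsem: mean squared point-to-point distance."""
    return np.mean(np.sum((transform(theta, partial) - reference) ** 2, axis=1))

def align(partial, reference, lr=0.1, steps=500, eps=1e-5):
    """Gradient descent on the 7 Sim(3) parameters.

    Central finite differences stand in for backpropagating the loss
    through a differentiable renderer; the reference stays fixed.
    """
    theta = np.zeros(7)  # identity initialization
    for _ in range(steps):
        grad = np.zeros(7)
        for i in range(7):
            d = np.zeros(7)
            d[i] = eps
            grad[i] = (loss(theta + d, partial, reference)
                       - loss(theta - d, partial, reference)) / (2 * eps)
        theta -= lr * grad
    return theta

# Synthetic example: recover a known scale, rotation, and translation.
rng = np.random.default_rng(0)
partial = rng.uniform(-1, 1, size=(100, 3))
w_true = np.array([0.0, 0.0, 0.2])           # 0.2 rad about the z-axis
s_true, t_true = 1.1, np.array([0.3, -0.2, 0.1])
reference = s_true * partial @ rotation_matrix(w_true).T + t_true

theta = align(partial, reference)
```

In the actual method, the gradient comes from rendering the frozen semantic 3DGS model under the current T and comparing feature maps; the toy loss here only mimics the outer optimization loop.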
We construct the WikiEarth benchmark, which augments existing partial 3D reconstructions (meta-images) from the WikiScenes dataset with Google Earth Studio renderings. The dataset includes 33 partial reconstructions (meta-images) from 23 different scenes. Reconstructions of four landmarks from our benchmark are shown below. The blue frustums depict the rendered images from Google Earth Studio, and the red frustums depict the images from WikiScenes.
Below we compare performance on the WikiEarth benchmark against multiple baselines. We report the average rotation and translation errors (ΔR, ΔT) across all meta-images, the Meta-image Transformation Accuracy (MTA), and the percentage of alignment outliers (O%). We compare our method to the COLMAP baseline, to an SfM pipeline that uses SuperPoint as the feature extractor and LightGlue as the feature matcher (SP+LG), and to the distributed camera-model SfM method gDLS+++. We also compare against recent feed-forward methods. Additional details are provided in the Experiments section of the paper.
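For reference, per-meta-image pose errors of the kind reported as ΔR and ΔT can be computed as below. This is the standard geodesic-angle/Euclidean formulation and an assumption on our part; the exact metric definitions (including the MTA and O% thresholds) are given in the paper:

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    # Clip guards against floating-point values slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translations."""
    return np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))
```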
Our method achieves low errors as the number of images in each meta-image increases, with rotation and translation errors plateauing at approximately six images. The system is also robust to initialization noise, successfully aligning meta-images when the initialization deviates by up to 10 degrees in each rotation parameter.