Scene Grounding in the Wild

1Tel Aviv University, 2Cornell University

CVPR 2026

Overview

Given a partial 3D reconstruction produced by running structure from motion on Internet images capturing a large-scale landmark, such as the front or rear façade of the Milan Cathedral depicted above, we present a technique for grounding this reconstruction in a complete 3D reference model of the scene. Reference models are constructed from pseudo-synthetic renderings extracted from Google Earth Studio. As illustrated above, our approach enables merging partial, disjoint 3D reconstructions into a unified model.

Abstract

Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models.

Optimization Visualization

Below, we show the alignment produced by our optimization-based method, which aligns a partial reconstruction of the Freiburg Cathedral to a 3DGS reference model generated from Google Earth Studio. The top half shows four images from the partial reconstruction, and the bottom half shows the corresponding renderings from the Gaussian Splatting model. As illustrated below, our inverse optimization-based approach predicts precise transformations, even in the presence of inaccurate initializations.



Alignment Visualization

Below we provide a sample of alignment results before and after our optimization scheme. Each visualization shows the input Internet image in the lower half and the rendered image from the reference model in the upper half. Initializations are obtained with COLMAP. Randomly selected samples over all scenes and all considered baseline models are provided in the Interactive Visualizations.

Wells Cathedral

Initialization

Initialization + Ours

Brussels Cathedral

Initialization

Initialization + Ours

Freiburg Cathedral

Initialization

Initialization + Ours

Bordeaux Cathedral

Initialization

Initialization + Ours

Metz Cathedral

Initialization

Initialization + Ours

How Does it Work?

Given a 3DGS reference model (left) and a set of Internet images (right), we propose an inverse optimization scheme that predicts a global 6DoF+scale alignment T while keeping the parameters of the 3DGS model fixed. We obtain an initial transformation T (shown in red) using a traditional SfM technique. During optimization, we compute a semantic feature loss Lsem and backpropagate it to update T, converging to the rendered view (shown in green) after N steps.
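The optimization loop above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it optimizes a global similarity transform T = (scale, rotation, translation) so that transformed source points match fixed reference points, with a simple point-to-point L2 loss standing in for the semantic feature loss Lsem, and finite-difference gradients standing in for backpropagation through a 3DGS renderer. All names and numeric settings here are illustrative assumptions.

```python
# Hypothetical sketch of the inverse alignment loop: optimize a global
# similarity transform T = (s, R(omega), t) while the reference stays fixed.
# A point-to-point L2 loss stands in for the semantic feature loss Lsem.
import numpy as np

def rotation(omega):
    """Axis-angle vector -> rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.eye(3)
    k = omega / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def loss(params, src, ref):
    """Mean squared distance after applying T = (exp(log_s), R(omega), t)."""
    s, omega, t = np.exp(params[0]), params[1:4], params[4:7]
    return np.mean(np.sum((s * src @ rotation(omega).T + t - ref) ** 2, axis=1))

def align(src, ref, steps=1000, lr=0.05, eps=1e-5):
    p = np.zeros(7)  # [log-scale, axis-angle (3), translation (3)], identity init
    for _ in range(steps):
        g = np.zeros_like(p)          # finite-difference gradient of the loss
        for i in range(7):
            d = np.zeros(7); d[i] = eps
            g[i] = (loss(p + d, src, ref) - loss(p - d, src, ref)) / (2 * eps)
        p -= lr * g                   # gradient step on T only; ref is fixed
    return p

# Toy example: recover a known scale/rotation/translation.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
R_true = rotation(np.array([0.0, 0.0, 0.3]))
ref = 1.5 * src @ R_true.T + np.array([0.2, -0.1, 0.4])
p = align(src, ref)
print(loss(p, src, ref))  # should be near zero after convergence
```

In the actual method, the loss is computed between semantic features of the rendered 3DGS view and the input image, and gradients flow through the differentiable renderer rather than through finite differences.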


Method

The WikiEarth Benchmark

We construct the WikiEarth benchmark, which augments partial 3D reconstructions (meta-images) from the WikiScenes dataset with Google Earth Studio renderings. The dataset includes 33 partial reconstructions from 23 different scenes. Reconstructions of four landmarks from our benchmark are shown below. The blue frustums depict the rendered images from Google Earth Studio, and the red frustums depict the images from WikiScenes.


Results

Below we compare performance on the WikiEarth benchmark against multiple baselines. We report the average rotation and translation errors (ΔR, ΔT) across all meta-images, the Meta-image Transformation Accuracy (MTA), and the percentage of alignment outliers (O%). We compare our method to the COLMAP baseline, to an SfM pipeline that uses SuperPoint as the feature extractor and LightGlue as the feature matcher (SP+LG), and to the distributed camera-model SfM method gDLS+++. We also compare against recent feed-forward methods. Additional details are provided in the Experiments section of the paper.
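The rotation and translation errors reported above can be computed in the standard way for pose evaluation; the sketch below shows the common geodesic rotation error and Euclidean translation error between an estimated and a ground-truth pose. The exact definitions used in the paper (and the MTA and O% metrics) may differ; this is an illustrative assumption, not the benchmark's evaluation code.

```python
# Hypothetical sketch of standard pose-error metrics (not the paper's exact code).
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic distance between two rotation matrices, in degrees."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error(t_est, t_gt):
    """Euclidean distance between camera/model translations."""
    return np.linalg.norm(t_est - t_gt)

# Example: a 5-degree rotation about the z-axis and a small translation offset.
theta = np.radians(5.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
print(rotation_error_deg(Rz, np.eye(3)))                          # ~5.0
print(translation_error(np.array([0.1, 0.0, 0.0]), np.zeros(3)))  # 0.1
```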

Comparison with Baselines

Method Analysis

Our method maintains low errors as the number of images in each meta-image increases, with rotation and translation errors plateauing at approximately six images. The system is also robust to initialization noise, successfully aligning meta-images under initialization errors of up to 10 degrees in each rotation parameter.



BibTeX