Lang3D-XL

Language Embedded 3D Gaussians for Large-scale Scenes

Tel Aviv University, The Hebrew University of Jerusalem, Cornell University

SIGGRAPH Asia 2025


TL;DR We introduce Lang3D-XL, a method for efficiently distilling language features into large-scale, in-the-wild 3D Gaussian scenes.

Our method leverages an extremely compact semantic bottleneck (just three dimensions per Gaussian) and a novel Attenuated Downsampler module to efficiently distill language features into large-scale 3D Gaussian scenes. As illustrated above, Lang3D-XL enables open-vocabulary querying at interactive speeds.
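To get a feel for why a three-dimensional bottleneck matters at scale, here is a rough back-of-the-envelope sketch comparing per-Gaussian storage for three floats against storing a full language embedding on every Gaussian. The Gaussian count, the 512-dimensional embedding width, and float32 storage are illustrative assumptions, not figures reported for Lang3D-XL.

```python
BYTES_PER_FLOAT32 = 4


def feature_memory_mib(num_gaussians: int, dims_per_gaussian: int) -> float:
    """Memory in MiB for one float32 feature vector per Gaussian."""
    return num_gaussians * dims_per_gaussian * BYTES_PER_FLOAT32 / 2**20


# Hypothetical scene size and embedding width (assumptions, not from the paper).
NUM_GAUSSIANS = 5_000_000
FULL_EMBED_DIM = 512   # assumed width of a full language embedding
BOTTLENECK_DIM = 3     # three dimensions per Gaussian, as stated above

full = feature_memory_mib(NUM_GAUSSIANS, FULL_EMBED_DIM)
compact = feature_memory_mib(NUM_GAUSSIANS, BOTTLENECK_DIM)

print(f"full embedding per Gaussian: {full:,.1f} MiB")
print(f"3-D bottleneck per Gaussian: {compact:,.1f} MiB")
print(f"reduction factor: {full / compact:.1f}x")
```

Under these assumptions the saving depends only on the ratio of the two widths (512 / 3), so it holds for any number of Gaussians.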


Abstract


Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for more intuitive human-computer interaction, such as querying or editing scenes using natural language, and could improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to semantic feature misalignment and inefficiency in memory and runtime. We propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These features are rendered and then passed through a multi-resolution, feature-based hash encoder, which significantly improves efficiency in both runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations that address the semantic misalignment of ground-truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.


3D Localization Results


Queries shown: Portals, Pinnacles* (two examples), Statues, Stone Staircase.
*Pinnacles are pyramidal or conical crowning ornaments found at the tops of architectural elements.

Queries shown: Ornamental Arches, Pillars*, Rose Windows† (two examples), Balustrades‡.
*Pillars are vertical structural support elements, typically cylindrical; ornamental arches are curved structural elements that span openings between pillars or walls.
†A rose window is a large, circular stained glass window featuring intricate patterns that resemble the petals of a rose.
‡A balustrade is a row of small columns (balusters) topped by a rail, serving as a barrier.

Queries shown: Mihrab*, Domes, Turrets, Calligraphy Panels†, Muqarnas‡.
*A mihrab is a semicircular niche in the wall of a mosque.
†Calligraphy panels are decorative architectural elements featuring artistic handwritten text.
‡Muqarnas are a form of three-dimensional decoration in Islamic architecture.

Queries shown: Reliefs* (two examples), Windows, Semi Circular Arch, Pediment†.
*Reliefs are sculptural forms raised from a background, typically a flat surface.
†A pediment is a triangular or curved decorative element typically found above the entrance of classical buildings, supported by columns.


We illustrate localization results of Lang3D-XL on diverse architectural elements across multiple landmarks from the HolyScenes dataset. In particular, our approach handles esoteric architectural terminology while remaining spatially precise across varied lighting conditions, viewpoints, and architectural styles. The segmentation masks (shown as color overlays) accurately capture the boundaries and extent of each queried architectural feature, demonstrating the effectiveness of our approach for localizing semantic concepts in large-scale scenes.
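One common way to turn a text query into a mask like those described above is to threshold the cosine similarity between per-pixel language features and the query's text embedding. The sketch below illustrates that general idea only; it is not necessarily the procedure used to produce these results. The function names, the 0.5 threshold, and the toy 3-D vectors (stand-ins for real rendered features and an encoded query) are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def query_mask(pixel_features, text_embedding, threshold=0.5):
    """Binary mask: 1 where a pixel's feature aligns with the text query.

    pixel_features: H x W grid (nested lists) of feature vectors.
    threshold: hypothetical cut-off on cosine similarity.
    """
    return [
        [1 if cosine(feat, text_embedding) >= threshold else 0 for feat in row]
        for row in pixel_features
    ]


# Toy 2x2 feature map; real features would be high-dimensional embeddings.
features = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]
text = [1.0, 0.0, 0.0]  # stand-in for an encoded query such as "rose window"

mask = query_mask(features, text)
print(mask)  # [[1, 1], [0, 0]]
```

In practice the threshold would be tuned per dataset, or replaced by a relative relevance score against a set of negative prompts.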


How does it work?


Method Overview


🌍 Given in-the-wild images of large-scale scenes, we first reconstruct the scene using 3D Gaussian Splatting, augmented with learnable low-dimensional semantic bottleneck features.

🎨 We render these low-dimensional features and process them through a multi-resolution hash encoder that operates in feature space, enabling similar features across different spatial locations to share representations.

🔍 The hash encoder outputs high-dimensional CLIP and DINOv2 features via a shallow MLP, which are then processed by our novel Attenuated Downsampler to mitigate semantic misalignments.

⚡ This enables interactive open-vocabulary querying of large-scale, in-the-wild scenes, while maintaining high localization accuracy.

📋 See our paper for details on how Lang3D-XL achieves high-fidelity language embedding in challenging in-the-wild scenes.
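The idea of hashing in feature space rather than in 3D position can be sketched as follows. This is a toy forward pass under several assumptions that are not taken from the paper: nearest-cell lookup without interpolation, randomly initialized (untrained) tables, features normalized to [0, 1], a single linear layer standing in for the shallow MLP, and a 16-dimensional output standing in for a real embedding width. The class and function names are hypothetical.

```python
import random

# Large primes for spatial-style hashing of integer grid cells.
PRIMES = (1, 2654435761, 805459861)


class FeatureHashEncoder:
    """Toy multi-resolution hash encoder keyed by a 3-D feature vector."""

    def __init__(self, levels=4, base_res=4, table_size=2**10, entry_dim=2, seed=0):
        rng = random.Random(seed)
        self.resolutions = [base_res * 2**level for level in range(levels)]
        self.table_size = table_size
        # One table of small vectors per level (learnable in a real system).
        self.tables = [
            [[rng.uniform(-1e-2, 1e-2) for _ in range(entry_dim)]
             for _ in range(table_size)]
            for _ in range(levels)
        ]

    def _index(self, feature, resolution):
        # Quantize each channel (assumed in [0, 1]) to a grid cell, then hash.
        h = 0
        for value, prime in zip(feature, PRIMES):
            cell = max(0, min(int(value * resolution), resolution - 1))
            h ^= cell * prime
        return h % self.table_size

    def encode(self, feature):
        # Concatenate the looked-up entry from every resolution level.
        out = []
        for res, table in zip(self.resolutions, self.tables):
            out.extend(table[self._index(feature, res)])
        return out


def make_linear(in_dim, out_dim, seed=1):
    """Random linear map standing in for the shallow MLP decoder."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return lambda x: [sum(w * v for w, v in zip(row, x)) for row in weights]


encoder = FeatureHashEncoder()
decode = make_linear(in_dim=4 * 2, out_dim=16)

pixel_a = [0.20, 0.70, 0.10]  # rendered bottleneck feature at one pixel
pixel_b = [0.20, 0.70, 0.10]  # same feature rendered at a distant pixel

# The encoder never sees position, so equal features share one representation.
assert encoder.encode(pixel_a) == encoder.encode(pixel_b)

high_dim = decode(encoder.encode(pixel_a))
print(len(high_dim))  # 16
```

Because the lookup key is the rendered feature itself, two regions with the same bottleneck value map to the same table entries wherever they sit in the scene, which is the sharing behavior described in the overview.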


BibTeX

@misc{krakovsky2025lang3dxllanguageembedded3d,
  title={Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes},
  author={Shai Krakovsky and Gal Fiebelman and Sagie Benaim and Hadar Averbuch-Elor},
  year={2025},
  eprint={2512.07807},
  archivePrefix={arXiv},
  primaryClass={cs.GR},
  url={https://arxiv.org/abs/2512.07807},
}