TL;DR We introduce Lang3D-XL, a method for efficiently distilling language features into large-scale,
in-the-wild 3D Gaussian scenes.
Our method leverages an extremely compact semantic bottleneck (just three dimensions per Gaussian) and a novel Attenuated Downsampler module. As illustrated above, Lang3D-XL enables
open-vocabulary querying at interactive speeds.
Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to semantic misalignment in the ground-truth 2D features and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are rendered and then passed through a multi-resolution, feature-based hash encoder, significantly improving efficiency in both runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground-truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
We illustrate localization results of Lang3D-XL over diverse architectural elements across multiple
landmarks from the HolyScenes dataset. Notably, our approach localizes even esoteric architectural
terminology with precise spatial accuracy across varied lighting conditions, viewpoints,
and architectural styles. The segmentation masks (shown as color overlays) accurately capture the boundaries and extent
of each queried architectural feature, demonstrating the effectiveness of our approach for
localizing semantic concepts over large-scale scenes.
🌍 Given in-the-wild images of large-scale scenes, we first reconstruct the scene using 3D Gaussian Splatting,
augmented with learnable low-dimensional semantic bottleneck features.
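To make this concrete, here is a minimal PyTorch sketch of a Gaussian container augmented with a per-Gaussian bottleneck. The 3-dimensional bottleneck size follows the paper; the class and attribute names are illustrative assumptions, and the standard 3DGS parameters are abbreviated.

```python
import torch
import torch.nn as nn

class SemanticGaussians(nn.Module):
    """3D Gaussians augmented with a low-dimensional semantic bottleneck (sketch)."""
    def __init__(self, num_gaussians: int, bottleneck_dim: int = 3):
        super().__init__()
        # Standard 3DGS parameters (scales, rotations, opacities, and SH color
        # coefficients are omitted here for brevity).
        self.means = nn.Parameter(torch.zeros(num_gaussians, 3))
        # Learnable semantic bottleneck: one tiny vector per Gaussian,
        # rasterized alongside color when rendering a view.
        self.semantics = nn.Parameter(torch.zeros(num_gaussians, bottleneck_dim))
```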
🎨 We render these low-dimensional features and process them through a multi-resolution hash encoder
that operates in feature space, enabling similar features across different spatial locations to share representations.
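Below is a hedged sketch of what such a feature-space hash encoder might look like: the rendered bottleneck vectors (rather than spatial positions, as in Instant-NGP) are quantized at several resolutions and XOR-hashed into embedding tables, so nearby feature values collide into shared entries. The resolutions, table size, and nearest-cell lookup are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FeatureHashEncoder(nn.Module):
    PRIMES = (1, 2654435761, 805459861)  # standard spatial-hashing primes

    def __init__(self, in_dim=3, num_levels=8, table_size=2**16,
                 feat_dim=4, base_res=16, growth=1.5):
        super().__init__()
        self.resolutions = [int(base_res * growth**l) for l in range(num_levels)]
        self.tables = nn.ModuleList(
            nn.Embedding(table_size, feat_dim) for _ in range(num_levels))
        self.table_size = table_size
        self.in_dim = in_dim

    def forward(self, f):                       # f: (B, 3) rendered bottleneck features
        f = torch.sigmoid(f)                    # squash into [0, 1]^3 before quantizing
        outs = []
        for res, table in zip(self.resolutions, self.tables):
            cell = (f * res).long()             # quantize each dim at this resolution
            h = torch.zeros_like(cell[:, 0])
            for d in range(self.in_dim):        # XOR-hash the integer cell coordinates
                h = h ^ (cell[:, d] * self.PRIMES[d])
            outs.append(table(h % self.table_size))
        return torch.cat(outs, dim=-1)          # (B, num_levels * feat_dim)
```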
🔍 A shallow MLP decodes the hash encoder's output into high-dimensional CLIP and DINOv2 features, which are
then processed by our novel Attenuated Downsampler to mitigate semantic misalignments.
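The sketch below illustrates this decoding path under stated assumptions: a two-layer MLP lifts the encoded features to CLIP/DINOv2 dimensionality, and a downsampler pools the rendered feature map with learned per-pixel attenuation weights so that misaligned pixels contribute less. The attenuation design here is our illustrative reading of the Attenuated Downsampler, not the paper's actual module; dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDecoder(nn.Module):
    """Shallow MLP lifting hash-encoded features to CLIP + DINOv2 spaces (sketch)."""
    def __init__(self, in_dim=32, clip_dim=512, dino_dim=384, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, clip_dim + dino_dim))
        self.clip_dim = clip_dim

    def forward(self, enc):                        # enc: (B, in_dim)
        out = self.mlp(enc)
        return out[..., :self.clip_dim], out[..., self.clip_dim:]

class AttenuatedDownsampler(nn.Module):
    """Downsampling with learned per-pixel attenuation (illustrative guess)."""
    def __init__(self, dim, factor=8):
        super().__init__()
        self.attn = nn.Conv2d(dim, 1, kernel_size=1)  # per-pixel attenuation logits
        self.factor = factor

    def forward(self, feat):                       # feat: (B, C, H, W)
        w = torch.sigmoid(self.attn(feat))         # down-weight misaligned pixels
        k = self.factor
        num = F.avg_pool2d(feat * w, k)            # attenuation-weighted average...
        den = F.avg_pool2d(w, k).clamp_min(1e-6)   # ...normalized per output cell
        return num / den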
⚡ This enables interactive open-vocabulary querying of large-scale, in-the-wild scenes, while maintaining high localization accuracy.
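For intuition, querying reduces to a per-pixel cosine similarity between the decoded CLIP feature map and the CLIP embedding of the text query. The sketch below assumes `rendered_clip` and `text_emb` are precomputed, and uses a plain similarity threshold for brevity (rather than relevancy scoring against canonical negative phrases, as in prior feature-distillation work); the threshold value is arbitrary.

```python
import torch
import torch.nn.functional as F

def query_scene(rendered_clip: torch.Tensor,    # (C, H, W) per-pixel CLIP features
                text_emb: torch.Tensor,         # (C,) CLIP embedding of the query
                threshold: float = 0.25) -> torch.Tensor:
    feat = F.normalize(rendered_clip, dim=0)    # cosine similarity via unit vectors
    txt = F.normalize(text_emb, dim=0)
    sim = torch.einsum('chw,c->hw', feat, txt)  # per-pixel relevancy map
    return sim > threshold                      # boolean segmentation mask (H, W)
```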
📋 See our paper for details on how Lang3D-XL achieves high-fidelity language embedding in challenging in-the-wild scenes.
@misc{krakovsky2025lang3dxllanguageembedded3d,
title={Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes},
author={Shai Krakovsky and Gal Fiebelman and Sagie Benaim and Hadar Averbuch-Elor},
year={2025},
eprint={2512.07807},
archivePrefix={arXiv},
primaryClass={cs.GR},
url={https://arxiv.org/abs/2512.07807},
}