TL;DR We introduce Lang3D-XL, a method for efficiently distilling language features into large-scale,
in-the-wild 3D Gaussian scenes.
Our method leverages an extremely compact semantic bottleneck (just three dimensions per Gaussian) and a novel Attenuated Downsampler module. As illustrated above, Lang3D-XL enables
open-vocabulary querying at interactive speeds.
Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to semantic misalignment in the ground-truth 2D features and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are rendered and then passed through a multi-resolution, feature-based hash encoder, significantly improving efficiency in both runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground-truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
We illustrate localization results of Lang3D-XL over diverse architectural elements across multiple
landmarks from the HolyScenes dataset. Notably, our approach localizes even esoteric architectural
terminology with precise spatial accuracy across varied lighting conditions, viewpoints,
and architectural styles. The segmentation masks (shown as color overlays) accurately capture the boundaries and extent
of each queried architectural feature, demonstrating the effectiveness of our approach for
localizing semantic concepts over large-scale scenes.
🌍 Given in-the-wild images of large-scale scenes, we first reconstruct the scene using 3D Gaussian Splatting,
augmented with learnable low-dimensional semantic bottleneck features.
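To make this concrete, here is a minimal PyTorch sketch of a Gaussian container augmented with a per-Gaussian bottleneck. The 3-dimensional bottleneck size follows the paper; the class and attribute names are illustrative assumptions, and the standard 3DGS parameters are abbreviated.

```python
import torch
import torch.nn as nn

class SemanticGaussians(nn.Module):
    """3D Gaussians augmented with a low-dimensional semantic bottleneck (sketch)."""
    def __init__(self, num_gaussians: int, bottleneck_dim: int = 3):
        super().__init__()
        # Standard 3DGS parameters (scales, rotations, opacities, and SH color
        # coefficients are omitted here for brevity).
        self.means = nn.Parameter(torch.zeros(num_gaussians, 3))
        # Learnable semantic bottleneck: one tiny vector per Gaussian,
        # rasterized alongside color when rendering a view.
        self.semantics = nn.Parameter(torch.zeros(num_gaussians, bottleneck_dim))
```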
🎨 We render these low-dimensional features and process them through a multi-resolution hash encoder
that operates in feature space, enabling similar features across different spatial locations to share representations.
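Below is a hedged sketch of what such a feature-space hash encoder might look like: the rendered bottleneck vectors (rather than spatial positions, as in Instant-NGP) are quantized at several resolutions and XOR-hashed into embedding tables, so nearby feature values collide into shared entries. The resolutions, table size, and nearest-cell lookup are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FeatureHashEncoder(nn.Module):
    PRIMES = (1, 2654435761, 805459861)  # standard spatial-hashing primes

    def __init__(self, in_dim=3, num_levels=8, table_size=2**16,
                 feat_dim=4, base_res=16, growth=1.5):
        super().__init__()
        self.resolutions = [int(base_res * growth**l) for l in range(num_levels)]
        self.tables = nn.ModuleList(
            nn.Embedding(table_size, feat_dim) for _ in range(num_levels))
        self.table_size = table_size
        self.in_dim = in_dim

    def forward(self, f):                       # f: (B, 3) rendered bottleneck features
        f = torch.sigmoid(f)                    # squash into [0, 1]^3 before quantizing
        outs = []
        for res, table in zip(self.resolutions, self.tables):
            cell = (f * res).long()             # quantize each dim at this resolution
            h = torch.zeros_like(cell[:, 0])
            for d in range(self.in_dim):        # XOR-hash the integer cell coordinates
                h = h ^ (cell[:, d] * self.PRIMES[d])
            outs.append(table(h % self.table_size))
        return torch.cat(outs, dim=-1)          # (B, num_levels * feat_dim)
```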
🔍 A shallow MLP decodes the hash encoder's output into high-dimensional CLIP and DINOv2 features, which are
then processed by our novel Attenuated Downsampler to mitigate semantic misalignments.
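The sketch below illustrates this decoding path under stated assumptions: a two-layer MLP lifts the encoded features to CLIP/DINOv2 dimensionality, and a downsampler pools the rendered feature map with learned per-pixel attenuation weights so that misaligned pixels contribute less. The attenuation design here is our illustrative reading of the Attenuated Downsampler, not the paper's actual module; dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDecoder(nn.Module):
    """Shallow MLP lifting hash-encoded features to CLIP + DINOv2 spaces (sketch)."""
    def __init__(self, in_dim=32, clip_dim=512, dino_dim=384, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, clip_dim + dino_dim))
        self.clip_dim = clip_dim

    def forward(self, enc):                        # enc: (B, in_dim)
        out = self.mlp(enc)
        return out[..., :self.clip_dim], out[..., self.clip_dim:]

class AttenuatedDownsampler(nn.Module):
    """Downsampling with learned per-pixel attenuation (illustrative guess)."""
    def __init__(self, dim, factor=8):
        super().__init__()
        self.attn = nn.Conv2d(dim, 1, kernel_size=1)  # per-pixel attenuation logits
        self.factor = factor

    def forward(self, feat):                       # feat: (B, C, H, W)
        w = torch.sigmoid(self.attn(feat))         # down-weight misaligned pixels
        k = self.factor
        num = F.avg_pool2d(feat * w, k)            # attenuation-weighted average...
        den = F.avg_pool2d(w, k).clamp_min(1e-6)   # ...normalized per output cell
        return num / den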
⚡ This enables interactive open-vocabulary querying of large-scale, in-the-wild scenes, while maintaining high localization accuracy.
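For intuition, querying reduces to a per-pixel cosine similarity between the decoded CLIP feature map and the CLIP embedding of the text query. The sketch below assumes `rendered_clip` and `text_emb` are precomputed, and uses a plain similarity threshold for brevity (rather than relevancy scoring against canonical negative phrases, as in prior feature-distillation work); the threshold value is arbitrary.

```python
import torch
import torch.nn.functional as F

def query_scene(rendered_clip: torch.Tensor,    # (C, H, W) per-pixel CLIP features
                text_emb: torch.Tensor,         # (C,) CLIP embedding of the query
                threshold: float = 0.25) -> torch.Tensor:
    feat = F.normalize(rendered_clip, dim=0)    # cosine similarity via unit vectors
    txt = F.normalize(text_emb, dim=0)
    sim = torch.einsum('chw,c->hw', feat, txt)  # per-pixel relevancy map
    return sim > threshold                      # boolean segmentation mask (H, W)
```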
📋 See our paper for details on how Lang3D-XL achieves high-fidelity language embedding in challenging in-the-wild scenes.
@misc{krakovsky2025lang3dxllanguageembedded3d,
title={Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes},
author={Shai Krakovsky and Gal Fiebelman and Sagie Benaim and Hadar Averbuch-Elor},
year={2025},
eprint={2512.07807},
archivePrefix={arXiv},
primaryClass={cs.GR},
url={https://arxiv.org/abs/2512.07807},
}