4-LEGS

4D Language Embedded Gaussian Splatting

Tel Aviv University, Google Research



TL;DR: Our method grounds spatio-temporal features in a 4D Gaussian Splatting representation.

This allows localizing actions in both time and space. Above we illustrate our method: given input multi-view videos capturing a dynamic 3D scene, we optimize 4-LEGS, a 4D Language Embedded Gaussian Splatting representation of the dynamic scene. We then localize a text query in both space and time using the mean relevancy score and the extracted relevancy maps. These spatio-temporal maps enable various highlight effects, such as automatically rendering a bullet-time display at a slower speed around the moment matching the input query.


Abstract

The emergence of neural representations has revolutionized our means for digitally viewing a wide range of 3D scenes, enabling the synthesis of photorealistic images rendered from novel views. Recently, several techniques have been proposed for connecting these low-level representations with the high-level semantic understanding embodied within the scene. These methods elevate the rich semantic understanding from 2D imagery to 3D representations, distilling high-dimensional spatial features into 3D space. In our work, we are interested in connecting language with a dynamic modeling of the world. We show how to lift spatio-temporal features to a 4D representation based on 3D Gaussian Splatting. This enables an interactive interface where the user can spatio-temporally localize events in the video from text prompts. We demonstrate our system on public 3D video datasets of people and animals performing various actions.


Examples of 4-LEGS Text-Prompted Video Editing Applications







Select a video editing application, then select one of the text prompts to view the edit enabled by our method's spatio-temporal grounding.
As illustrated in this interactive visualization, 4-LEGS enables interactive text-conditioned video editing by localizing text queries in both space and time.
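
To make the editing idea concrete, here is a minimal sketch (not the authors' implementation) of one such effect: slowing playback around the moments a text query is most relevant. The per-frame relevancy curve, the threshold, and the slowdown factor are assumed placeholders standing in for quantities the method would produce.

# Sketch: retime a video so that frames with high relevancy to the text
# query play in slow motion. `relevancy` is a hypothetical per-frame score
# in [0, 1] (e.g., the mean relevancy over the Gaussians at each time).
import numpy as np

def slow_motion_highlight(frames, relevancy, max_slowdown=4, threshold=0.5):
    """frames: list of images; relevancy: (T,) array aligned with frames.
    Returns a new frame list where relevant moments are repeated,
    i.e., played back up to `max_slowdown` times slower."""
    relevancy = np.asarray(relevancy, dtype=np.float32)
    out = []
    for frame, r in zip(frames, relevancy):
        # Repeat relevant frames more often -> slower apparent playback.
        repeats = 1 if r < threshold else int(round(1 + (max_slowdown - 1) * r))
        out.extend([frame] * repeats)
    return out

# Example usage with dummy data:
# frames = [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(100)]
# relevancy = np.exp(-((np.arange(100) - 60) ** 2) / (2 * 5.0 ** 2))  # peak at t=60
# edited = slow_motion_highlight(frames, relevancy)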


How does it work?


Overview


🌍 Given multiple videos capturing a dynamic 3D scene, we first extract pixel-aligned spatio-temporal language features at multiple scales using a pretrained video-text model.
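
As a rough illustration of this step, the sketch below gathers pixel-aligned features by embedding spatio-temporal crops at several window sizes and scattering the embeddings back to the pixels they cover. The encoder function, windowing scheme, and feature dimension are assumptions standing in for the pretrained video-text model, not the paper's exact procedure.

# Sketch (assumptions, not the authors' code): slide spatial windows of several
# sizes over the video, embed each spatio-temporal crop, and scatter the
# embeddings back to the covered pixels, averaging overlaps.
import torch
import torch.nn.functional as F

FEAT_DIM = 512

def encode_clip(clip: torch.Tensor) -> torch.Tensor:
    """Hypothetical pretrained video-text encoder: (T, 3, h, w) -> (FEAT_DIM,)."""
    return torch.randn(FEAT_DIM)  # stand-in for a real model

def pixel_aligned_features(video: torch.Tensor, scales=(0.25, 0.5)) -> torch.Tensor:
    """video: (T, 3, H, W) -> averaged multi-scale feature map (FEAT_DIM, H, W)."""
    T, _, H, W = video.shape
    feat_sum = torch.zeros(FEAT_DIM, H, W)
    count = torch.zeros(1, H, W)
    for s in scales:
        win_h, win_w = int(H * s), int(W * s)
        for y in range(0, H - win_h + 1, win_h // 2):
            for x in range(0, W - win_w + 1, win_w // 2):
                crop = video[:, :, y:y + win_h, x:x + win_w]
                emb = F.normalize(encode_clip(crop), dim=0)
                feat_sum[:, y:y + win_h, x:x + win_w] += emb[:, None, None]
                count[:, y:y + win_h, x:x + win_w] += 1
    return feat_sum / count.clamp(min=1)

# features = pixel_aligned_features(torch.rand(8, 3, 128, 128))  # -> (512, 128, 128)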

💡 We average these multi-scale features to produce per-pixel spatio-temporal features, which are encoded into a more compact latent space used to supervise the optimization of a 4D language embedded Gaussian Splatting representation.
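
A plausible way to realize this compression is a small autoencoder over the extracted features, whose latents become the supervision targets for the per-Gaussian embeddings. The architecture and latent size below are assumptions for illustration only.

# Sketch (assumed architecture and sizes): compress high-dimensional language
# features into a compact latent space with a small MLP autoencoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAutoencoder(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Train to reconstruct the extracted features; the compact latents `z` then
# serve as the supervision target for the 4D language embedded Gaussians.
model = FeatureAutoencoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
features = torch.randn(1024, 512)   # dummy stand-in for extracted features
for _ in range(10):                 # a few illustrative optimization steps
    recon, z = model(features)
    loss = F.mse_loss(recon, features)
    optim.zero_grad()
    loss.backward()
    optim.step()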

🔍 During inference, given an input language query, 4-LEGS localizes the query in time by computing a relevancy score over the volumetric language features distilled onto the Gaussians, and in space by rendering relevancy maps in real time.
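
For intuition, here is a hedged sketch of a simple relevancy computation: cosine similarity between the query embedding and each Gaussian's language feature, with the mean score per timestep used for temporal localization. The exact scoring function used by 4-LEGS is defined in the paper; the tensors and sizes here are dummy placeholders.

# Sketch (assumed scoring): relevancy of a text query against per-Gaussian
# language features at every timestep.
import torch
import torch.nn.functional as F

def relevancy_scores(gaussian_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """gaussian_feats: (T, N, D) per-timestep, per-Gaussian features.
    query: (D,) text embedding. Returns (T, N) cosine-similarity relevancy."""
    g = F.normalize(gaussian_feats, dim=-1)
    q = F.normalize(query, dim=-1)
    return (g * q).sum(dim=-1)

# Temporal localization: pick the timestep with the highest mean relevancy.
# (Spatially, the per-Gaussian scores would be rendered into relevancy maps.)
feats = torch.randn(30, 1000, 16)   # dummy: 30 timesteps, 1000 Gaussians, 16-D latents
query = torch.randn(16)             # dummy text embedding in the same latent space
scores = relevancy_scores(feats, query)        # (30, 1000)
t_star = scores.mean(dim=1).argmax().item()    # most relevant timestep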

📋 See our paper for more details on our 4D language embedded Gaussians and how we apply them to enable an interactive interface for text-conditioned video editing tasks.


BibTeX

@misc{fiebelman20244legs4dlanguageembedded,
    title={4-LEGS: 4D Language Embedded Gaussian Splatting},
    author={Gal Fiebelman and Tamir Cohen and Ayellet Morgenstern and Peter Hedman and Hadar Averbuch-Elor},
    year={2024},
    eprint={2410.10719},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgements

This work was partially funded by Google through a TAU-Google grant.