TL;DR Our method adds structural guidance to 3D diffusion models, enabling the generation of text-conditioned 3D shapes that adhere to task-specific structural priors derived from auxiliary guidance shapes (denoted by Guidance above). Our approach supports different 3D editing tasks, such as semantic shape editing (performing semantic modifications over the input guidance shape; text colored in green above), text-conditional abstraction-to-3D (transforming primitive-based abstractions into highly expressive shapes; text colored in blue above) and 3D stylization (stylizing uncolored guidance shapes according to target text prompts; text colored in red above), all within seconds.
We are witnessing rapid progress in automatically generating and manipulating 3D assets due to the availability of pretrained text-to-image diffusion models. However, time-consuming optimization procedures are required for synthesizing each sample, hindering their potential for democratizing 3D content creation. In contrast, 3D diffusion models now train on million-scale 3D datasets, yielding high-quality text-conditional 3D samples within seconds. In this work, we present Spice·E, a neural network that adds structural guidance to 3D diffusion models, extending their usage beyond text-conditional generation. At its core, our framework introduces a cross-entity attention mechanism that allows multiple entities (in particular, paired input and guidance 3D shapes) to interact via their internal representations within the denoising network. We utilize this mechanism to learn task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing and text-conditional abstraction-to-3D, which transforms primitive-based abstractions into highly expressive shapes. Extensive experiments demonstrate that Spice·E achieves state-of-the-art performance on these tasks while often being considerably faster than alternative methods. Importantly, this is accomplished without tailoring our approach to any specific task.
🌍 We finetune a transformer-based 3D diffusion model, Shap·E, which is pretrained on a large dataset of 3D assets paired with text, to enable structural control over the generated 3D shapes.
💡 The diffusion model (in gray) is modified to operate over latent vectors from multiple entities: a conditional guidance shape \(\mathbf{X}_c\) and an input 3D shape \(\mathbf{X}_{in}\). Its self-attention layers are replaced with our proposed cross-entity attention mechanism (in red), which mixes the two latent representations by carefully combining their query features, allowing the network to learn task-specific structural priors while preserving the model's generative capabilities.
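To make the mechanism concrete, here is a minimal PyTorch sketch of one plausible cross-entity attention layer, in which the input-shape tokens query a shared context built from both entities' tokens. The class and variable names (CrossEntityAttention, x_in, x_c) and the specific way the queries and context are combined are illustrative assumptions, not the paper's exact formulation:

import torch
import torch.nn as nn

class CrossEntityAttention(nn.Module):
    # Illustrative sketch: the input-shape tokens query a shared
    # key/value context built from both entities' tokens, so the
    # guidance shape can steer the denoised latents.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x_in, x_c):
        # x_in, x_c: (batch, tokens, dim) latent sequences of the two entities.
        ctx = torch.cat([x_in, x_c], dim=1)  # shared key/value context
        out, _ = self.attn(x_in, ctx, ctx)   # queries come from x_in only
        return out

# Toy usage with random latents:
layer = CrossEntityAttention(dim=512)
x_in = torch.randn(1, 1024, 512)  # noisy input-shape latents
x_c = torch.randn(1, 1024, 512)   # clean guidance-shape latents
print(layer(x_in, x_c).shape)     # torch.Size([1, 1024, 512])

Note that a layer of this form keeps the token dimensions unchanged, so it can replace a pretrained block's self-attention and finetuning can start from the pretrained weights.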
🔍 During inference, Spice·E receives a guidance shape in addition to a target text prompt, enabling the generation of 3D shapes (represented as either a neural radiance field or a signed distance and texture field, i.e., an STF) conditioned on both high-level text directives and low-level structural constraints.
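As a rough illustration of this dual conditioning (not the released Spice·E API), the sampling loop below feeds both conditions to the denoiser at every step; the model and scheduler interfaces and the keyword names text_emb and guidance are hypothetical placeholders:

import torch

@torch.no_grad()
def sample_with_structural_guidance(model, scheduler, text_emb, x_c,
                                    shape=(1, 1024, 512)):
    # Hypothetical interfaces: `model` is the finetuned denoiser and
    # `scheduler` a standard diffusion noise scheduler. At every
    # denoising step the model sees the noisy latents x_t alongside
    # the text embedding and the (clean) guidance-shape latents x_c,
    # so the final sample respects both conditions.
    x_t = torch.randn(shape)
    for t in scheduler.timesteps:
        eps = model(x_t, t, text_emb=text_emb, guidance=x_c)
        x_t = scheduler.step(eps, t, x_t)
    # x_t is a latent; decoding it with the pretrained decoder yields
    # a NeRF or an STF-based textured mesh.
    return x_t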
📋 See our paper for more details on our cross-entity attention mechanism and how we apply it to incorporate structural priors in pretrained 3D diffusion models.
@misc{sella2024spicee,
title={Spice-E: Structural Priors in 3D Diffusion using Cross-Entity Attention},
author={Etai Sella and Gal Fiebelman and Noam Atia and Hadar Averbuch-Elor},
year={2024},
eprint={2311.17834},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
We thank Peter Hedman, Ron Mokadi, Daniel Garibi, Itai Lang and Or Patashnik for helpful discussions.
This work was supported by the Israel Science Foundation (grant no. 2510/23) and by the Alon Scholarship.