Spice·E

Structural Priors in 3D Diffusion using Cross-Entity Attention

SIGGRAPH 2024 (Conference Proceedings)
*Denotes equal contribution
Tel-Aviv University
[Teaser figure: pairs of Guidance shapes and generated Outputs for the prompts "A Birthday Cupcake", "It has a Pagoda like feature", "A chair made out of bubblegum", "It has less spindles in the backrest" and "A Witch's Hat".]


TL;DR Our method adds structural guidance to 3D diffusion models.

This allows for generating text-conditional 3D shapes that enforce task-specific structural priors from auxiliary guidance shapes (denoted by Guidance above). Our approach supports several 3D editing tasks, all completed within seconds: semantic shape editing (performing semantic modifications over the input guidance shape; text colored in green above), text-conditional Abstraction-to-3D (transforming primitive-based abstractions into highly expressive shapes; text colored in blue above) and 3D stylization (stylizing uncolored guidance shapes according to target text prompts; text colored in red above).


Abstract

We are witnessing rapid progress in automatically generating and manipulating 3D assets due to the availability of pretrained text-image diffusion models. However, time-consuming optimization procedures are required for synthesizing each sample, hindering their potential for democratizing 3D content creation. Conversely, 3D diffusion models now train on million-scale 3D datasets, yielding high-quality text-conditional 3D samples within seconds. In this work, we present Spice·E, a neural network that adds structural guidance to 3D diffusion models, extending their usage beyond text-conditional generation. At its core, our framework introduces a cross-entity attention mechanism that allows multiple entities (in particular, paired input and guidance 3D shapes) to interact via their internal representations within the denoising network. We utilize this mechanism for learning task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing and text-conditional abstraction-to-3D, which transforms primitive-based abstractions into highly expressive shapes. Extensive experiments demonstrate that Spice·E achieves state-of-the-art performance on these tasks while often being considerably faster than alternative methods. Importantly, this is accomplished without tailoring our approach to any specific task.


Examples of Text-conditional Abstraction-to-3D Results



           
           


Select a primitive-based guidance abstraction and then one of the text prompts to view structure- and text-conditional tables generated with our method.
As this interactive visualization illustrates, Spice·E enforces structural priors while conveying the target text prompt.

How does it work?


Overview


🌍 We finetune a transformer-based diffusion model (exemplified by Shap·E, which is pretrained on a large dataset of text-conditional 3D assets) to enable structural control over the generated 3D shapes.

💡 The diffusion model (in gray) is modified to use latent vectors from multiple entities: a conditional guidance shape \(\mathbf{X}_c\) and an input 3D shape \(\mathbf{X}_{in}\). Its self-attention layers are replaced with our proposed cross-entity attention mechanism (in red), which mixes the two latent representations by carefully combining their query functions. This allows the model to learn task-specific structural priors while preserving its generative capabilities.
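To make the idea of combining queries concrete, here is a minimal NumPy sketch of one plausible reading of cross-entity attention. It is not the authors' implementation: the function names, the single-head layout, the shared query projection and especially the zero-initialized mixing matrix `w_mix` are assumptions made for illustration. The sketch forms queries for both the input latent and the guidance latent, adds the projected guidance query to the input query, and attends over keys and values of the input latent. With `w_mix` set to zero the layer reduces to ordinary self-attention, which is one way a finetuned model could start from the pretrained behavior.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Plain single-head self-attention over the tokens of one entity."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def cross_entity_attention(x_in, x_c, w_q, w_k, w_v, w_mix):
    """Hypothetical cross-entity attention.

    x_in:  latent tokens of the input shape,    shape (n_tokens, dim)
    x_c:   latent tokens of the guidance shape, shape (n_tokens, dim)
    w_mix: assumed mixing projection applied to the guidance query;
           all zeros means the guidance has no influence.
    """
    q_in = x_in @ w_q
    q_c = x_c @ w_q
    q = q_in + q_c @ w_mix          # combine the two query streams
    k, v = x_in @ w_k, x_in @ w_v   # keys and values come from the input latent
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


# Toy data; sizes are arbitrary and chosen only for the demonstration.
rng = np.random.default_rng(0)
n_tokens, dim = 8, 16
x_in = rng.normal(size=(n_tokens, dim))
x_c = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

baseline = self_attention(x_in, w_q, w_k, w_v)
at_init = cross_entity_attention(x_in, x_c, w_q, w_k, w_v, np.zeros((dim, dim)))
guided = cross_entity_attention(x_in, x_c, w_q, w_k, w_v, 0.5 * np.eye(dim))
```

Here `at_init` matches `baseline` because the guidance query is multiplied by zero, while `guided` differs once `w_mix` is non-zero. The paper describes the actual mechanism and how it is trained.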

🔍 During inference, Spice·E receives a guidance shape in addition to a target text prompt, enabling the generation of 3D shapes (represented as either a neural radiance field or a signed texture field) conditioned on both high-level text directives and low-level structural constraints.

📋 See our paper for more details on our cross-entity attention mechanism and how we apply it for incorporating structural priors in pretrained 3D diffusion models.


BibTeX

@misc{sella2024spicee,
    title={Spice-E : Structural Priors in 3D Diffusion using Cross-Entity Attention},
    author={Etai Sella and Gal Fiebelman and Noam Atia and Hadar Averbuch-Elor},
    year={2024},
    eprint={2311.17834},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgements

We thank Peter Hedman, Ron Mokadi, Daniel Garibi, Itai Lang and Or Patashnik for helpful discussions.
This work was supported by the Israel Science Foundation (grant no. 2510/23) and by the Alon Scholarship.