Natural language offers a highly intuitive interface for enabling localized, fine-grained edits of 3D shapes. However, prior works face challenges in preserving global coherence while locally modifying the input 3D shape. We introduce an inpainting-based framework for editing shapes represented as point clouds. Our approach leverages foundation 3D diffusion models for localized shape edits, adding structural guidance through partial conditional shapes to preserve global identity. To enhance identity preservation within edited regions, we propose an inference-time coordinate blending algorithm. This algorithm balances reconstruction of the full shape with inpainting over progressive noise levels, enabling seamless blending of original and edited shapes without requiring costly and inaccurate inversion. Extensive experiments demonstrate that our method outperforms existing techniques across multiple metrics, measuring both fidelity to the original shape and adherence to textual prompts.
Choose an input shape (top row) and an editing instruction (bottom row) to sample localized, fine-grained edits produced by our method.
For many additional results, please refer to the supplementary material.
🗂️ To construct our training dataset, we use ShapeTalk, a collection of point cloud shape pairs and corresponding text prompts that describe geometric differences between them.
From this dataset, we derive a specialized subset called l-ShapeTalk by using a fine-tuned LLaMA 3 model to extract the specific part mentioned in each prompt.
This enables us to retain only samples where the edited part is clearly identified, allowing accurate edit mask generation via a PointNet-based segmentation model.
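A minimal sketch of this filtering and mask-generation step is shown below. The helper functions (`extract_part_with_llama`, `segment_parts_pointnet`) and the per-point label format are hypothetical stand-ins for the fine-tuned LLaMA 3 extractor and the PointNet-based segmentation model described above, not released code.

```python
import numpy as np

# Hypothetical helpers standing in for the fine-tuned LLaMA 3 part extractor and the
# PointNet-based part segmentation model; the module and function names are illustrative.
from part_extraction import extract_part_with_llama     # prompt -> part name, or None (assumed)
from part_segmentation import segment_parts_pointnet    # point cloud -> per-point part labels (assumed)

def build_l_shapetalk_sample(source_pc, prompt, shape_category):
    """Return (masked_source, edit_mask, prompt), or None if the edited part is unclear."""
    part_name = extract_part_with_llama(prompt)          # e.g. "armrest"; None if ambiguous
    if part_name is None:
        return None                                       # keep only samples with a clearly identified part

    part_labels = segment_parts_pointnet(source_pc, shape_category)
    edit_mask = np.asarray(part_labels) == part_name      # boolean edit mask over the points
    if not edit_mask.any():
        return None

    masked_source = source_pc[~edit_mask]                 # partial shape with the edited part removed
    return masked_source, edit_mask, prompt
```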
🧪 We fine-tune a transformer-based diffusion model equipped with a cross-entity attention mechanism, following the approach of Spice·E, on top of the pretrained Point·E architecture.
✏️ Our model, Inpaint-E, is trained to take a partial point cloud with a masked region and a corresponding text prompt as input, and to generate a completed shape that reflects the requested edit.
📉 The model is supervised using a denoising objective: it predicts the noise added to the ground truth shape, conditioned on the prompt and the masked input.
🔁 To support our inference-time algorithm, we occasionally replace the masked point cloud with a complete one and the editing instruction with a null prompt. This mechanism trains the model to reconstruct the guidance shape with high fidelity when required.
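The sketch below illustrates one training step under this scheme. It assumes a DDPM-style scheduler with an `add_noise` method, a model that accepts a guidance point cloud and a text embedding, and an illustrative reconstruction probability `P_RECONSTRUCT`; these interfaces and values are assumptions for exposition, not the actual implementation.

```python
import torch
import torch.nn.functional as F

P_RECONSTRUCT = 0.1   # fraction of steps trained as pure reconstruction (illustrative value)

def training_step(model, scheduler, shape, masked_shape, prompt_emb, null_emb):
    """One denoising step: predict the noise added to the ground-truth shape,
    conditioned on the prompt and the (possibly replaced) guidance point cloud."""
    # Occasionally swap in the complete shape and a null prompt so the model
    # also learns to reconstruct the guidance shape with high fidelity.
    if torch.rand(1).item() < P_RECONSTRUCT:
        guidance, cond = shape, null_emb
    else:
        guidance, cond = masked_shape, prompt_emb

    t = torch.randint(0, scheduler.num_timesteps, (shape.shape[0],), device=shape.device)
    noise = torch.randn_like(shape)
    noisy = scheduler.add_noise(shape, noise, t)          # forward diffusion q(x_t | x_0)

    pred_noise = model(noisy, t, guidance=guidance, text_cond=cond)
    return F.mse_loss(pred_noise, noise)
```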
🎯 During inference, we aim to generate an edited shape that matches the prompt while preserving the original structure.
🧱 To achieve this, we introduce a coordinate blending technique that combines two types of denoising: inpainting and reconstruction.
🧩 First, we run the model using the full input shape and a null text prompt to reconstruct the original shape from noise.
✏️ At a specific timestep \( t = t_r \), we switch to editing mode, initiating a separate branch: the model receives the masked shape and the full prompt, generating a modified version of the target part.
🌀 We then blend the outputs from the two paths—keeping the edited part from inpainting and the rest from reconstruction—ensuring smooth, high-quality edits without affecting the rest of the shape.
⚖️ This process avoids using inversion, which is often slow and inaccurate, and leads to better identity preservation across the unedited regions.
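The following sketch puts the blending procedure together. It again uses assumed interfaces: a simplified scheduler whose `step` directly returns the sample at the next timestep, a model signature matching the training sketch above, and a per-point boolean `edit_mask`; it is one possible reading of the procedure, not the reference implementation.

```python
import torch

def blended_sampling(model, scheduler, full_shape, masked_shape, edit_mask,
                     prompt_emb, null_emb, t_r):
    """Coordinate-blending sketch: a reconstruction branch (full shape, null prompt)
    denoises from pure noise; from timestep t_r onward an inpainting branch
    (masked shape, edit prompt) is added, and the two are blended per point."""
    x = torch.randn_like(full_shape)                       # start from Gaussian noise
    for t in scheduler.timesteps:                          # descending, e.g. T-1, ..., 0
        # Reconstruction prediction keeps the unedited geometry faithful to the input.
        eps_rec = model(x, t, guidance=full_shape, text_cond=null_emb)
        x_rec = scheduler.step(eps_rec, t, x)

        if t <= t_r:
            # Inpainting prediction generates the requested edit inside the masked region.
            eps_inp = model(x, t, guidance=masked_shape, text_cond=prompt_emb)
            x_inp = scheduler.step(eps_inp, t, x)
            # Blend coordinates: edited part from inpainting, the rest from reconstruction.
            x = torch.where(edit_mask.unsqueeze(-1), x_inp, x_rec)
        else:
            x = x_rec
    return x
```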
@misc{sella2025blendedpointclouddiffusion,
title={Blended Point Cloud Diffusion for Localized Text-guided Shape Editing},
author={Etai Sella and Noam Atia and Ron Mokady and Hadar Averbuch-Elor},
year={2025},
eprint={2507.15399},
archivePrefix={arXiv},
primaryClass={cs.GR},
url={https://arxiv.org/abs/2507.15399},
}