Vinedresser3D: Agentic Text-guided 3D Editing

The Hong Kong University of Science and Technology
University of Illinois Urbana-Champaign
CVPR 2026
Vinedresser3D teaser

TL;DR: Vinedresser3D is an agentic framework for 3D editing. It uses a multimodal LLM to interpret editing prompts, and performs precise, mask-free 3D editing directly in the latent space of a native 3D generative model. Vinedresser3D achieves state-of-the-art results in both automatic metrics and human preference.

Abstract

Text-guided 3D editing aims to modify existing 3D assets using natural-language instructions. Current methods struggle to jointly understand complex prompts, automatically localize edits in 3D, and preserve unedited content. We introduce Vinedresser3D, an agentic framework for high-quality text-guided 3D editing that operates directly in the latent space of a native 3D generative model. Given a 3D asset and an editing prompt, Vinedresser3D uses a multimodal large language model to infer rich descriptions of the original asset, identify the parts to edit and the edit type (addition, modification, deletion), and generate decomposed structural and appearance-level text guidance. The agent then selects an informative view and applies an image editing model to obtain visual guidance. Finally, an inversion-based rectified-flow inpainting pipeline with an interleaved sampling module performs editing in the 3D latent space, enforcing prompt alignment while maintaining 3D coherence and preserving unedited regions. Experiments on diverse 3D edits demonstrate that Vinedresser3D outperforms prior baselines in both automatic metrics and human preference studies, while enabling precise, coherent, and mask-free 3D editing.

Approach

Pipeline of the method

Given a 3D asset and an editing prompt, Vinedresser3D first uses a multimodal large language model (MLLM) to interpret the instruction: it describes the original asset, determines the edit type (addition, modification, or deletion), and generates decomposed textual and visual guidance. A segmentation model then partitions the asset into parts, and the MLLM identifies the specific region to edit. With this guidance and region in hand, an inversion-based editing pipeline carries out the edit directly in 3D latent space.
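The agent's planning stage can be sketched as a small data structure. This is an illustrative sketch only: the field names, the JSON format, and the `parse_plan` helper are assumptions for exposition, not the paper's actual interface.

```python
import json
from dataclasses import dataclass

# Hypothetical plan produced by the MLLM; field names are illustrative.
@dataclass
class EditPlan:
    asset_description: str   # MLLM's description of the original asset
    edit_type: str           # "addition" | "modification" | "deletion"
    target_part: str         # segmented part the MLLM selects for editing
    structure_prompt: str    # decomposed structural text guidance
    appearance_prompt: str   # decomposed appearance-level text guidance

def parse_plan(mllm_json: str) -> EditPlan:
    """Validate an MLLM response (assumed to be JSON) into an edit plan."""
    fields = json.loads(mllm_json)
    if fields["edit_type"] not in {"addition", "modification", "deletion"}:
        raise ValueError(f"unknown edit type: {fields['edit_type']}")
    return EditPlan(**fields)

# Example: a stubbed MLLM response for "give the chair a velvet cushion".
response = json.dumps({
    "asset_description": "a plain wooden chair with four legs",
    "edit_type": "addition",
    "target_part": "seat",
    "structure_prompt": "a wooden chair with a thick cushion on the seat",
    "appearance_prompt": "a soft red velvet cushion",
})
plan = parse_plan(response)
```

Downstream, the `structure_prompt` and `appearance_prompt` would condition the structural and appearance stages of the latent editor, while `target_part` restricts where the edit is applied.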

Inversion-based editing pipeline

Specifically, the pipeline first inverts the original 3D asset back to structured noise, conditioned on the original description. It then performs the edit through inpainting, alternating between the Trellis-text and Trellis-image denoisers at each timestep and using the new textual guidance and the edited visual guidance as conditions.
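The interleaved inpainting loop above can be sketched as a toy rectified-flow sampler. This is a minimal NumPy sketch under stated assumptions: a simple Euler integrator, elementwise velocity predictors standing in for the Trellis-text and Trellis-image models, and a precomputed inversion trajectory used to lock the unedited region. None of these details are claimed to match the paper's implementation.

```python
import numpy as np

def interleaved_inpaint_sample(z_T, z_orig_traj, mask, v_text, v_image, n_steps=10):
    """Toy interleaved rectified-flow inpainting sampler.

    z_T          -- inverted noise latent, shape (d,)
    z_orig_traj  -- latents of the original asset at each timestep
                    (recorded during inversion), length n_steps + 1
    mask         -- 1 where editing is allowed, 0 where content is kept
    v_text, v_image -- velocity predictors v(z, t), alternated per step
    """
    z = z_T.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt                        # integrate from t=1 (noise) to t=0 (data)
        v = v_text if i % 2 == 0 else v_image   # interleave the two conditioners
        z = z - dt * v(z, t)                    # Euler step of the flow ODE
        # Inpainting: outside the edit mask, snap back to the inverted trajectory
        # so unedited regions reproduce the original asset exactly.
        z = mask * z + (1.0 - mask) * z_orig_traj[i + 1]
    return z

# Toy usage: edit the first two latent dimensions, keep the rest.
z_T = np.ones(4)
mask = np.array([1.0, 1.0, 0.0, 0.0])
z_orig_traj = [np.full(4, 0.5) for _ in range(11)]  # stand-in inversion trajectory
out = interleaved_inpaint_sample(z_T, z_orig_traj, mask,
                                 v_text=lambda z, t: z,
                                 v_image=lambda z, t: 0.5 * z)
```

The masked blend after each step is what makes the method mask-free for the user but mask-aware internally: the agent derives the edit region automatically, and the sampler guarantees the complement stays on the original asset's trajectory.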

Editing Results

Citation

@article{chi2025vinedresser3d,
  author    = {Chi, Yankuan and Li, Xiang and Huang, Zixuan and Rehg, James M.},
  title     = {Vinedresser3D: Agentic Text-guided 3D Editing},
  journal   = {arXiv},
  year      = {2025}
}