ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment

Stanford University
Teaser Image

We introduce a novel text-driven framework for autoregressive 3D indoor scene synthesis, completion, and editing, supporting object addition, removal, and swapping via natural language prompts. The teaser shows a final scene produced by this process.

How ReSpace Works

At the heart of our method lies a Structured Scene Representation (SSR) in JSON format, together with short object prompts for object addition in natural language.
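
To make the idea of a JSON scene representation concrete, here is a minimal Python sketch of what such a structure could look like. The field names (`room_type`, `bounds_bottom`, `objects`, `desc`, `size`, `pos`, `rot_y_deg`) and the helper `add_object` are assumptions for illustration only and may not match the actual SSR schema.

```python
import json

# Hypothetical SSR: field names are illustrative, not the paper's schema.
scene = {
    "room_type": "bedroom",
    # Room boundary as a floor polygon in meters; it need not be rectangular.
    "bounds_bottom": [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 3.0], [0.0, 0.0, 3.0]],
    "objects": [],
}


def add_object(ssr, desc, size, pos, rot_y_deg):
    """Return a new SSR with one more object; the input SSR is left unchanged."""
    obj = {"desc": desc, "size": size, "pos": pos, "rot_y_deg": rot_y_deg}
    return {**ssr, "objects": ssr["objects"] + [obj]}


scene = add_object(scene, "gray minimalist nightstand", [0.5, 0.5, 0.4], [0.3, 0.0, 0.3], 0.0)

# Serializing to text is what lets a language model read and extend the scene.
ssr_text = json.dumps(scene)
```

Because the whole scene is plain text, adding an object amounts to appending one more JSON entry, which is what allows editing to be framed as next-token prediction.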

ReSpace Architecture

For single object addition, an object prompt is fed, together with the existing SSR, into SG-LLM, a specially trained model for spatial reasoning and object placement. We train SG-LLM in two stages: supervised fine-tuning (SFT) followed by preference alignment with GRPO.

To go from text to mesh-based scene, we employ a sampling engine for 3D assets that matches geometry and semantics for a queried object.
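
One way to picture matching on both geometry and semantics is the toy retrieval sketch below. It is not the actual sampling engine: the catalog, the word-overlap similarity (a stand-in for a learned text embedding), the `size_weight` parameter, and the deterministic arg-max (rather than sampling) are all assumptions made for illustration.

```python
import math

# Hypothetical asset catalog; a real one would be built from a dataset such as 3D-FUTURE.
CATALOG = [
    {"id": "asset_a", "desc": "gray minimalist nightstand", "size": (0.5, 0.5, 0.4)},
    {"id": "asset_b", "desc": "traditional three-door wardrobe", "size": (1.8, 2.2, 0.6)},
    {"id": "asset_c", "desc": "white wooden nightstand", "size": (0.9, 0.6, 0.5)},
]


def word_overlap(a, b):
    """Jaccard similarity on words: a toy stand-in for a learned text embedding."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


def size_distance(s1, s2):
    """Euclidean distance between two (width, height, depth) tuples in meters."""
    return math.dist(s1, s2)


def retrieve(desc, size, catalog=CATALOG, size_weight=1.0):
    """Pick the asset with the best trade-off between semantic and geometric match."""
    def score(asset):
        semantic = word_overlap(desc, asset["desc"])
        geometric = size_distance(size, asset["size"])
        return semantic - size_weight * geometric

    return max(catalog, key=score)


best = retrieve("minimalist gray nightstand", (0.5, 0.5, 0.4))
```

In this toy catalog the query resolves to `asset_a`, which matches on both wording and size; `asset_c` shares a word but has different dimensions, so it scores lower.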

For object removal and full scene synthesis, we leverage a zero-shot LLM to edit the SSR directly in text space (removal) and to generate object prompt lists for addition (full scenes) that get passed autoregressively into SG-LLM.
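
The control flow described here can be sketched as follows. Both model calls are placeholders: `plan_prompts` stands in for the zero-shot LLM and `place_object` for SG-LLM, and both return dummy values so that the loop can run. The function names and SSR fields are hypothetical.

```python
def plan_prompts(scene_request):
    """Placeholder for the zero-shot LLM that turns a request into object prompts."""
    return ["plush cushions bed", "gray minimalist nightstand", "modern lamp"]


def place_object(ssr, prompt):
    """Placeholder for SG-LLM: returns one placed object given the scene so far.

    A real model would predict size, position, and rotation; these are dummies.
    """
    index = len(ssr["objects"])
    return {"desc": prompt, "size": [1.0, 1.0, 1.0], "pos": [float(index), 0.0, 0.0], "rot": 0.0}


def synthesize(ssr, scene_request):
    """Add objects one at a time; each step sees every object placed before it."""
    for prompt in plan_prompts(scene_request):
        ssr = {**ssr, "objects": ssr["objects"] + [place_object(ssr, prompt)]}
    return ssr


def remove_object(ssr, desc):
    """Removal as a direct edit of the SSR (done by a zero-shot LLM in the source)."""
    return {**ssr, "objects": [o for o in ssr["objects"] if o["desc"] != desc]}


empty = {"room_type": "bedroom", "objects": []}
full = synthesize(empty, "cozy bedroom")
edited = remove_object(full, "modern lamp")
```

The point of the sketch is the data flow: each call to the placement model receives the SSR that already contains all earlier objects, which is what makes full scene synthesis autoregressive.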

We further propose a novel metric for layout violations, Voxelization-Based Loss (VBL), which uses voxelized meshes instead of 3D bounding boxes. VBL is simply the number of voxels that lie out of bounds plus the number of voxels shared between intersecting objects. A lower VBL indicates a higher-quality scene.
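
As a rough illustration of this definition, the sketch below counts violating voxels on boolean occupancy grids with NumPy. It assumes the objects have already been voxelized onto a shared grid and that intersections are counted once per object pair; the function name `vbl` and both conventions are assumptions, and the toy example uses a 2D grid for brevity.

```python
from itertools import combinations

import numpy as np


def vbl(room_mask, object_grids):
    """Count out-of-bounds voxels plus pairwise-intersecting voxels.

    room_mask: bool array, True where a voxel lies inside the room.
    object_grids: list of bool arrays of the same shape, one per object.
    """
    out_of_bounds = sum(int(np.count_nonzero(g & ~room_mask)) for g in object_grids)
    intersections = sum(
        int(np.count_nonzero(a & b)) for a, b in combinations(object_grids, 2)
    )
    return out_of_bounds + intersections


# Toy example on a 4x4 grid whose last column lies outside the room.
room = np.ones((4, 4), dtype=bool)
room[:, 3] = False

obj_a = np.zeros((4, 4), dtype=bool)
obj_a[0:2, 0:2] = True  # 4 voxels, fully inside the room

obj_b = np.zeros((4, 4), dtype=bool)
obj_b[1:3, 1:4] = True  # 6 voxels: 2 outside the room, 1 shared with obj_a

score = vbl(room, [obj_a, obj_b])
```

Here `score` is 3: two voxels of `obj_b` fall outside the room and one voxel is shared by both objects. Working on occupancy grids rather than boxes means a chair tucked under a table is not penalized unless their actual geometry overlaps.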

Abstract

Scene synthesis and editing have emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scenes either oversimplify object semantics through one-hot class encodings (e.g., 'chair' or 'table'), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. In contrast, LLM-based methods enable richer semantics via natural language (e.g., 'modern studio with light wood furniture') but do not support editing, remain limited to rectangular layouts, or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for text-driven 3D indoor scene synthesis and editing using autoregressive language models. Our approach features a compact structured scene representation with explicit room boundaries that frames scene editing as a next-token prediction task. We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment, enabling a specially trained language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene-level composition. For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition. We further introduce a novel voxelization-based evaluation that captures fine-grained geometry beyond 3D bounding boxes. Experimental results show that ReSpace surpasses the state of the art on object addition while remaining competitive on full scene synthesis.

Experiments: Qualitative Results

We conduct experiments on the 3D-FRONT and 3D-FUTURE datasets, covering single-object addition and removal as well as full scene synthesis. For both settings, we compare quantitatively against two baselines, ATISS and MiDiffusion. More details are in the paper.

Instructions (object addition)

Full Scene Synthesis (BoN=1)

Prompt List (8): “plush cushions bed”, “gray minimalist nightstand”, “traditional three-door wardrobe”, “dual-tone table”, “decorative padded beige seat chair”, “white planter”, “wooden and metal bookcase”, “modern lamp”

Prompt List (11): “modern brown leather sofa”, “grey corner table with crossbar”, “low-profile floor lamp”, “simple white coffee table”, “blue velvet armchair”, “wooden wall art”, “classic pendant lamp”, “low storage cabinet”, “mid-century wooden side table”, “tv stand”, “hanging plant”

Full Scene Synthesis (BoN=8)

Framework Overview

ReSpace Architecture

Video (Summary)

BibTeX

If using our dataset/model or if you found our work useful, please consider citing us as follows:

@article{bucher2025respace,
  title={ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment},
  author={Bucher, Martin JJ and Armeni, Iro},
  journal={arXiv preprint arXiv:2506.02459},
  year={2025}
}