ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing

How ReSpace Works

At the heart of our method lies a Structured Scene Representation (SSR) in JSON format, together with short natural-language object prompts for object addition.
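To make the idea of a JSON-based scene representation concrete, here is a minimal sketch of what such a structure could look like. The field names (`room_type`, `bounds`, `objects`, `desc`, `size`, `pos`, `rot_deg`) and the helpers `add_object` and `to_text` are hypothetical and chosen only for illustration; the actual SSR schema is defined in the paper.

```python
import copy
import json

# Hypothetical SSR for a small bedroom. All keys are illustrative assumptions,
# not the schema used by ReSpace.
scene = {
    "room_type": "bedroom",
    # Floor polygon as (x, z) corners in metres, i.e. an explicit room boundary.
    "bounds": [[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]],
    "objects": [
        {"desc": "plush cushions bed", "size": [1.8, 0.9, 2.0],
         "pos": [2.0, 0.0, 1.0], "rot_deg": 0.0},
    ],
}


def add_object(ssr, obj):
    """Return a copy of the scene with one more object; the input is not mutated."""
    new_ssr = copy.deepcopy(ssr)
    new_ssr["objects"].append(obj)
    return new_ssr


def to_text(ssr):
    """Serialize the scene to the text form a language model would read."""
    return json.dumps(ssr, sort_keys=True)


nightstand = {"desc": "gray minimalist nightstand", "size": [0.5, 0.5, 0.4],
              "pos": [0.6, 0.0, 0.3], "rot_deg": 0.0}
updated = add_object(scene, nightstand)
```

Because the scene is plain text, adding or removing an object amounts to editing this structure, which is what allows scene manipulation to be framed as text prediction.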

For single object addition, an object prompt is fed, together with the existing SSR, into SG-LLM, a model trained specifically for spatial reasoning and object placement. Rejection Sampling Fine-Tuning (RFT) outperforms DPO and GRPO after pure SFT.

To go from text (SSR) to mesh-based scene, we employ a sampling engine for 3D assets that matches geometry and semantics for a queried object.
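One simple way to match both geometry and semantics is to score each candidate asset by the similarity of a text embedding and by how closely its bounding-box size fits the requested size. The sketch below assumes this scoring rule, precomputed embeddings, and a deterministic top-1 choice; the names `pick_asset`, `emb`, `size`, and `size_weight` are hypothetical, and the engine described in the paper may sample or score differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def size_distance(a, b):
    """Euclidean distance between two (width, height, depth) size triples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_asset(query_emb, query_size, catalog, size_weight=1.0):
    """Return the asset with the best semantic match, penalized by size mismatch.

    Assumed scoring rule: cosine similarity minus a weighted size distance.
    """
    def score(asset):
        semantic = cosine(query_emb, asset["emb"])
        geometric = size_distance(query_size, asset["size"])
        return semantic - size_weight * geometric

    return max(catalog, key=score)


# Toy catalog with 2-D "embeddings" standing in for real text embeddings.
catalog = [
    {"id": "chair_a", "emb": [1.0, 0.0], "size": [0.5, 0.9, 0.5]},
    {"id": "chair_b", "emb": [1.0, 0.0], "size": [2.0, 0.9, 2.0]},
    {"id": "lamp", "emb": [0.0, 1.0], "size": [0.5, 0.9, 0.5]},
]
best = pick_asset([1.0, 0.0], [0.5, 0.9, 0.5], catalog)
```

Here `chair_a` wins because it matches the query in both meaning and size, while `chair_b` is penalized for being too large and `lamp` for being semantically unrelated.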

For object removal and full scene synthesis, we leverage a zero-shot LLM to edit the SSR directly in text space (removal) and to generate object prompt lists for addition (full scenes) that get passed autoregressively into SG-LLM.
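The autoregressive part of full scene synthesis can be sketched as a loop in which each prompt is placed given the scene built so far. In this sketch, `place_object` is a hypothetical stand-in for a call to SG-LLM, and `dummy_placer` is a toy replacement used only so that the example runs; neither reflects the real model interface.

```python
import copy


def synthesize_scene(empty_scene, prompts, place_object):
    """Add objects one at a time; each placement sees all earlier placements."""
    scene = copy.deepcopy(empty_scene)
    scene.setdefault("objects", [])
    for prompt in prompts:
        # Stand-in for SG-LLM: takes the current scene and one object prompt,
        # returns a new object entry for the scene.
        new_object = place_object(scene, prompt)
        scene["objects"].append(new_object)
    return scene


def dummy_placer(scene, prompt):
    """Toy placer: lines objects up along x, one metre apart."""
    x = float(len(scene["objects"]))
    return {"desc": prompt, "pos": [x, 0.0, 0.0]}


prompts = ["plush cushions bed", "gray minimalist nightstand", "modern lamp"]
result = synthesize_scene({"room_type": "bedroom", "objects": []}, prompts, dummy_placer)
```

The key property is that the scene passed to the placer grows with every step, so later objects can be positioned relative to earlier ones.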

Experiments: Human Evaluations

We conducted human evaluation studies using Bradley-Terry analysis and evaluated on a rectangular-only subset of 100 scenes.

1. ReSpace: 75.3%
2. Mi-Diff: 56.7%
3. ATISS: 45.7%
4. LayoutVLM: 39.9%
5. LayoutGPT: 31.2%
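For readers unfamiliar with Bradley-Terry analysis, the sketch below fits Bradley-Terry strengths to pairwise preference counts with the standard minorization-maximization update. It is a generic illustration, not the authors' analysis script: the function name `bradley_terry`, the `wins` matrix layout, and the toy counts are all assumptions, and the counts are not data from the study.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is how often method i was preferred over method j.
    Returns strengths normalized to sum to 1; under the model,
    P(i preferred over j) = p[i] / (p[i] + p[j]).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Sum over opponents of (comparisons with j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [value / norm for value in updated]
    return p


# Toy counts for three hypothetical methods A, B, C (not results from the study).
toy_wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = bradley_terry(toy_wins)
```

With only two methods the fit reduces to the empirical win rate: 3 wins against 1 loss gives strengths of 0.75 and 0.25.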

Experiments: Qualitative Results

We conduct experiments on the 3D-FRONT and 3D-FUTURE datasets in two settings: single object addition and removal, and full scene synthesis. In both settings, we compare quantitatively against two baselines, ATISS and MiDiffusion. More details are in the paper.

Instructions (object addition)

Full Scene Synthesis (BoN=1)

Prompt List (8): “plush cushions bed”, “gray minimalist nightstand”, “traditional three-door wardrobe”, “dual-tone table”, “decorative padded beige seat chair”, “white planter”, “wooden and metal bookcase”, “modern lamp”

Prompt List (11): “modern brown leather sofa”, “grey corner table with crossbar”, “low-profile floor lamp”, “simple white coffee table”, “blue velvet armchair”, “wooden wall art”, “classic pendant lamp”, “low storage cabinet”, “mid-century wooden side table”, “tv stand”, “hanging plant”

Full Scene Synthesis (BoN=8)

Abstract

Scene synthesis and editing has emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scene generation either oversimplify object semantics through one-hot class encodings (e.g., 'chair' or 'table'), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. LLM-based methods enable richer semantics via natural language, but lack editing functionality, are limited to rectangular layouts, or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for autoregressive text-driven 3D indoor scene synthesis and editing. Our approach features a compact structured scene representation with explicit room boundaries that enables asset-agnostic deployment and frames scene manipulation as a next-token prediction task, supporting object addition, removal, and swapping via natural language. We employ supervised fine-tuning with a preference alignment stage to train a specialized language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene-level composition. We further introduce a voxelization-based evaluation metric capturing fine-grained geometric violations beyond 3D bounding boxes. In our experiments, ReSpace surpasses the state of the art on object addition and achieves superior human-perceived quality on full scene synthesis, despite not being trained on it.

BibTeX

If you use our dataset or model, or find our work useful, please cite us as follows:

@article{bucher2025respace,
  title={ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing},
  author={Bucher, Martin JJ and Armeni, Iro},
  journal={arXiv preprint arXiv:2506.02459},
  year={2025}
}