Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation.
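The closed-loop reason-act-reflect design described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `Scene` dataclass, the three tool functions, the scoring rule, and the stopping threshold are placeholders chosen for illustration and are not SceneWeaver's actual API. A simple rule stands in for the language model-based planner, and only two of the three evaluation criteria (physical plausibility and semantic alignment) are mimicked.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Scene:
    """Toy scene state: object names plus a count of unresolved collisions."""
    objects: List[str] = field(default_factory=list)
    collisions: int = 0


# Hypothetical tools: each takes a scene and returns an updated scene.
def init_layout(scene: Scene) -> Scene:
    # Pretend a generative initializer produced a rough layout with one collision.
    return Scene(objects=["bed", "wardrobe"], collisions=1)


def add_details(scene: Scene) -> Scene:
    # Pretend a detail tool added small objects without touching collisions.
    return Scene(objects=scene.objects + ["lamp", "book"], collisions=scene.collisions)


def fix_physics(scene: Scene) -> Scene:
    # Pretend a physics tool resolved every collision.
    return Scene(objects=list(scene.objects), collisions=0)


TOOLS: Dict[str, Callable[[Scene], Scene]] = {
    "init_layout": init_layout,
    "add_details": add_details,
    "fix_physics": fix_physics,
}


def reflect(scene: Scene, min_objects: int = 4) -> Dict[str, float]:
    """Self-evaluation stand-in: scores in [0, 1] for two criteria."""
    return {
        "physical": 1.0 if scene.collisions == 0 else 0.0,
        "semantic": min(1.0, len(scene.objects) / min_objects),
    }


def plan(scene: Scene, scores: Dict[str, float]) -> str:
    """Rule-based stand-in for the planner: choose a tool for the weakest aspect."""
    if not scene.objects:
        return "init_layout"
    if scores["physical"] < 1.0:
        return "fix_physics"
    return "add_details"


def synthesize(max_steps: int = 5, threshold: float = 1.0) -> Tuple[Scene, List[str]]:
    """Iterate reflect -> plan -> act until all scores reach the threshold."""
    scene = Scene()
    history: List[str] = []
    for _ in range(max_steps):
        scores = reflect(scene)                      # reflect
        if scene.objects and min(scores.values()) >= threshold:
            break                                    # good enough, stop refining
        tool = plan(scene, scores)                   # reason
        scene = TOOLS[tool](scene)                   # act
        history.append(tool)
    return scene, history


if __name__ == "__main__":
    final_scene, used_tools = synthesize()
    print(used_tools, final_scene)
```

The loop stops either when every score reaches the threshold or when the step budget is exhausted, so a weak tool set cannot make it run forever.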
Generated scene examples: Kitchen, Bedroom, Children Room, Meeting Room, Kitchen, Bathroom, Office, Office, Restaurant, Gym, Meeting Room, Living Room.
We show some examples of scene generation with room structures such as windows and doors, which we omit in the main experiments.
Our code can easily export the generated scenes as USD files and load them into Isaac Sim.
Through Apple Vision Pro, we remotely control a Unitree G1 humanoid robot to perform object interactions.
Three key advantages of SceneWeaver for embodied AI applications:
✓ High-fidelity simulation with preserved textures and geometric details.
✓ Robust physical interactions guaranteed by collision-free and boundary-constrained object placement.
✓ Task-aligned scene layouts that adapt to diverse embodied AI requirements through controllable synthesis.
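As a rough illustration of what "collision-free and boundary-constrained" placement means, the sketch below checks axis-aligned 2D footprints against each other and against a rectangular room. The `Box` type, the helper functions, and the example layout are all hypothetical and are not part of SceneWeaver; a real pipeline would work with full 3D geometry rather than 2D rectangles.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned 2D footprint given by its min/max corners (metres)."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the footprints overlap with positive area (touching edges are allowed)."""
    return (a.x_min < b.x_max and b.x_min < a.x_max
            and a.y_min < b.y_max and b.y_min < a.y_max)


def inside(box: Box, room: Box) -> bool:
    """True if the footprint lies entirely within the room rectangle."""
    return (room.x_min <= box.x_min and box.x_max <= room.x_max
            and room.y_min <= box.y_min and box.y_max <= room.y_max)


def check_layout(objects: List[Box], room: Box) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return the colliding object pairs and the objects that leave the room."""
    collisions = [(a.name, b.name) for a, b in combinations(objects, 2) if overlaps(a, b)]
    out_of_bounds = [o.name for o in objects if not inside(o, room)]
    return collisions, out_of_bounds


# Hypothetical 4 m x 3 m room with three objects; the shelf sticks out of the room.
room = Box("room", 0.0, 0.0, 4.0, 3.0)
layout = [
    Box("bed", 0.0, 0.0, 2.0, 1.5),
    Box("desk", 2.0, 0.0, 3.2, 0.6),
    Box("shelf", 3.5, 2.0, 4.3, 3.0),
]
collisions, out_of_bounds = check_layout(layout, room)
print(collisions, out_of_bounds)
```

A layout passes this check only when both returned lists are empty.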
Results of different Initializers for bedroom generation.
Results of Add Crowd. We show samples of crowded shelf generation in two scenes: bookstore and kitchen.
Results of Add 2D Guidance. We show samples of crowded shelf generation in two scenes: bookstore and kitchen.
@inproceedings{yang2025sceneweaver,
title={SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent},
author={Yang, Yandan and Jia, Baoxiong and Zhang, Shujie and Huang, Siyuan},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}