SPARCS: Simulation-Ready Cluttered Scene Estimation via Physics-Aware Joint Shape and Pose Optimization

¹UIUC  ²HKU  ³Meta Reality Labs

Quick Overview

TL;DR: We enable Simulation-ready Physics-Aware Reconstruction for Cluttered Scenes (SPARCS) from a single RGBD image via Joint Shape and Pose Optimization.

Here we show an example of how state-of-the-art vision models fail to create simulation-ready cluttered scenes because they are not physics-aware, whereas our framework enforces multi-contact physics constraints and reconstructs scenes that can be used directly for downstream robot planning and policy learning.

Abstract

Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Unfortunately, existing methods struggle in cluttered environments, often exhibiting prohibitive computational cost, poor robustness, and restricted generality when scaling to multiple interacting objects. We propose a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical constraints. Our method is built on two key technical innovations. First, we leverage the recently introduced shape-differentiable contact model, whose global differentiability permits joint optimization over object geometry and pose while modeling inter-object contacts. Second, we exploit the structured sparsity of the augmented Lagrangian Hessian to derive an efficient linear system solver whose computational cost scales favorably with scene complexity. Building on this formulation, we develop an end-to-end real-to-sim scene estimation pipeline that integrates learning-based object initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement. Experiments on cluttered scenes with up to 5 objects and 22 convex hulls demonstrate that our approach robustly reconstructs physically valid, simulation-ready object shapes and poses.
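The paper's solver details are not reproduced on this page, but the sparsity argument in the abstract can be illustrated with a minimal sketch. The function name `al_newton_step`, the block sizes, and the penalty form below are our own illustrative assumptions: the Hessian of the unconstrained objective is block-diagonal (one block per object), and each contact-constraint row of the Jacobian couples at most two objects, so the augmented-Lagrangian Newton system stays sparse and cheap to factor as the scene grows.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def al_newton_step(H_blocks, J, rho, grad):
    """Illustrative Newton step on an augmented Lagrangian:
    solve (H + rho * J^T J) dx = -grad, where H is block-diagonal
    (one dense block per object's shape/pose variables) and J is the
    sparse contact Jacobian (each contact touches <= 2 objects)."""
    # Assemble the block-diagonal Hessian directly in sparse form,
    # never materializing the dense n-by-n matrix.
    H = sp.block_diag(H_blocks, format="csc")
    A = (H + rho * (J.T @ J)).tocsc()
    # A sparse direct factorization exploits the structure automatically.
    return spla.spsolve(A, -grad)

# Toy scene: 3 objects with 6 variables each, 4 sparse contact rows.
rng = np.random.default_rng(0)
blocks = [(lambda M: M @ M.T + 6 * np.eye(6))(rng.standard_normal((6, 6)))
          for _ in range(3)]
J = sp.random(4, 18, density=0.3, random_state=0, format="csr")
grad = rng.standard_normal(18)
dx = al_newton_step(blocks, J, 2.0, grad)
```

The point of the sketch is only the data layout: because `A` is assembled and factored as a sparse matrix, the per-step cost is governed by the number of contacts rather than the square of the total variable count.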

Method

Method overview

An overview of SPARCS: Given a single RGBD image observation of a cluttered scene, we use SAM3D and FoundationPose to derive an initial estimate of object shapes and poses. However, these initial estimates can violate physical constraints and are not simulation-ready (red). Our method jointly adjusts shape and pose parameters to enforce physics constraints while minimizing a perceptual loss, leading to simulation-ready results (green).
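The refinement stage described above, minimizing a perceptual loss subject to physics constraints, can be sketched as a standard augmented-Lagrangian loop. Everything below is a toy stand-in, not the paper's implementation: `f` plays the role of the perceptual loss, `g` a single scalar contact constraint (g(x) <= 0, e.g. non-penetration), and the 1-D variable stands in for the full shape/pose parameter vector.

```python
import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, g, x0, rho=10.0, iters=20):
    """Toy augmented-Lagrangian loop for min f(x) s.t. g(x) <= 0.
    Alternates an unconstrained inner solve with a multiplier update."""
    x, lam = np.atleast_1d(np.asarray(x0, float)), 0.0
    for _ in range(iters):
        def L(x):
            # Standard AL term for an inequality constraint.
            t = max(0.0, lam + rho * g(x))
            return f(x) + (t * t - lam * lam) / (2.0 * rho)
        x = minimize(L, x).x          # inner unconstrained minimization
        lam = max(0.0, lam + rho * g(x))  # dual (multiplier) update
    return x

# Toy instance: the unconstrained optimum x = 1 "penetrates" the
# constraint x >= 1.5, so the solver is pushed to the contact boundary.
f = lambda x: float((x[0] - 1.0) ** 2)   # "perceptual loss"
g = lambda x: 1.5 - x[0]                 # "non-penetration": want x >= 1.5
x_opt = augmented_lagrangian(f, g, [0.0])
```

The design point this illustrates: the perceptual objective alone would pull the estimate into a physically invalid configuration, and the constraint term moves it to the nearest physically valid one.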

Results

We first test our framework on five self-created, highly cluttered scenes with complicated support relationships.

Scenes

Single-view RGBD image

Violation

Physics violation from SAM3D + FoundationPose (red: penetration, blue: floating)

Visual-only estimation causes the simulation to collapse

Ours

Our estimate (click to check the simulated contact forces)

Our estimate achieves negligible drift and long-term stability in MuJoCo

Results (Extended)

We also select cluttered scenes from the YCB-V dataset to show the generalization of our framework.

Scenes

Single-view RGBD image

Ours

Our estimate (click to check the simulated contact forces)

Our estimate achieves negligible drift and long-term stability in MuJoCo

BibTeX

@article{huang2026simulation,
  title={Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization},
  author={Huang, Wei-Cheng and Han, Jiaheng and Ye, Xiaohan and Pan, Zherong and Hauser, Kris},
  journal={arXiv preprint arXiv:2602.20150},
  year={2026}
}