OpenCabinet

Generative Diffusion Policy for Mobile Manipulation in RoboCasa365

CS 188: Robotics Final Project, University of California, Los Angeles

Problem Statement

Standard behavior cloning models often fail in complex manipulation tasks by "averaging" divergent expert trajectories, leading to suboptimal movements.

Our objective: Build a robot capable of identifying handles, synchronizing 7-DOF arm and mobile base movements, and successfully opening cabinet doors in randomized kitchen environments.

Successful Target Split Rollout (Seed 019)

System Design & Methodology

1D Convolutional U-Net

Our model utilizes an ActionChunkUNet1D architecture. We implemented Action Chunking with a 16-step prediction horizon and 8-step execution horizon to mitigate the "stuttery" movements typical of standard BC.
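The receding-horizon execution loop described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `policy`, `env`, and their methods are hypothetical stand-ins for the trained ActionChunkUNet1D and the RoboCasa environment wrapper.

```python
import numpy as np

PRED_HORIZON = 16   # steps predicted per denoising pass
EXEC_HORIZON = 8    # steps actually executed before re-planning

def rollout(policy, env, obs, max_steps=700):
    """Receding-horizon execution: predict a 16-step chunk, execute 8.

    Executing only the first half of each chunk keeps the policy
    reactive while the overlap between chunks smooths out the
    "stuttery" single-step behavior of standard BC.
    """
    for _ in range(max_steps // EXEC_HORIZON):
        chunk = policy.predict(obs)           # shape: (PRED_HORIZON, action_dim)
        for action in chunk[:EXEC_HORIZON]:   # discard the stale tail
            obs, done = env.step(action)
            if done:
                return True
    return False
```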

Contact-Phase Importance Sampling

Using a WeightedRandomSampler, we prioritized precision during the grasping bottleneck. Training frames within 0.05m of the handle were assigned higher weights, forcing the optimization to focus on successful clamping and initial pulling.
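A minimal sketch of this weighting scheme with PyTorch's `WeightedRandomSampler`; the 5x contact weight is a hypothetical value for illustration, not the ratio used in training.

```python
import torch
from torch.utils.data import WeightedRandomSampler

CONTACT_RADIUS = 0.05   # metres; frames this close to the handle are upweighted
CONTACT_WEIGHT = 5.0    # hypothetical ratio; tune on validation rollouts

def make_sampler(eef_to_handle_dist):
    """Build per-frame sampling weights that emphasise the grasping bottleneck."""
    dist = torch.as_tensor(eef_to_handle_dist, dtype=torch.float32)
    weights = torch.where(
        dist < CONTACT_RADIUS,
        torch.full_like(dist, CONTACT_WEIGHT),  # contact-phase frames
        torch.ones_like(dist),                  # free-space frames
    )
    return WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
```

The sampler is then passed to the training `DataLoader`, so contact-phase frames appear roughly five times as often per epoch as free-space frames.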

State-Space Augmentation

  • Handle Geometry: 3D coordinates and relative distance to EEF.
  • Temporal Awareness: 50 diffusion steps for denoising.
  • Stall Detection: Logic to resample action chunks if hinge angles remain static.
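The stall-detection logic in the last bullet can be sketched as a check over recent hinge readings; the window size and motion threshold here are illustrative assumptions, not the project's tuned values.

```python
import numpy as np

STALL_WINDOW = 40        # steps of history to inspect (hypothetical)
STALL_THRESHOLD = 1e-3   # radians; minimum hinge motion to count as progress

def is_stalled(hinge_history):
    """True if the cabinet hinge barely moved over the last window.

    On a stall, the runner would discard the current action chunk and
    resample a fresh one (e.g., from a new diffusion noise seed).
    """
    if len(hinge_history) < STALL_WINDOW:
        return False
    recent = np.asarray(hinge_history[-STALL_WINDOW:])
    return bool(recent.max() - recent.min() < STALL_THRESHOLD)
```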

Quantitative & Qualitative Results

Success vs. Failure Distribution

[Chart: success vs. failure distribution across seeds]

Steps to Completion (Success Cases)

[Chart: steps-to-completion histogram for successful rollouts]

  • Final success rate: 45%
  • Step range (successes): 161–329
  • Randomized seeds: 20

Success defined as max_door_open > 0.90 within 700 steps.
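Assuming a per-step log of door openness, the success criterion reduces to a one-line check (the function and argument names are illustrative):

```python
def is_success(door_open_history, threshold=0.90, max_steps=700):
    """Success: max door openness exceeds the threshold within the step budget."""
    window = door_open_history[:max_steps]
    return bool(window) and max(window) > threshold
```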

Failure Analysis

Failure Case Analysis (Seed 000)

Common Failure Modes

  • 01. Grasp Slippage: Occurs when the robot reaches the handle but fails to maintain friction lock during high-torque pulling phases (e.g., Test 5 reached 0.81 openness).
  • 02. Workspace Singularities: Randomized base initializations occasionally forced the arm into kinematic limits.

Discussion & Future Work

Reachability Maps

Implementing reachability maps so the PandaOmron can dynamically reposition its base before initiating a reach, reducing the kinematic strain on the arm.

Vision-Language Models (VLM)

Moving beyond pre-recorded demonstrations toward real-time visual environment understanding for more adaptive movements.