OpenCabinet

Generative Diffusion Policy for Mobile Manipulation in RoboCasa365

CS 188: Robotics Final Project, University of California, Los Angeles

Problem Statement

Standard behavior cloning models often fail in complex manipulation tasks by "averaging" divergent expert trajectories, leading to suboptimal movements.

Our objective: Build a robot capable of identifying handles, synchronizing 7-DOF arm and mobile base movements, and successfully opening cabinet doors in randomized kitchen environments.

Successful Target Split Rollout (Seed 019)

System Design & Methodology

1D Convolutional U-Net

Our model utilizes an ActionChunkUNet1D architecture. We implemented Action Chunking with a 16-step prediction horizon and 8-step execution horizon to mitigate the "stuttery" movements typical of standard BC.
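The receding-horizon execution loop described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `policy`, `env`, and their methods are hypothetical stand-ins for the trained ActionChunkUNet1D and the RoboCasa environment wrapper.

```python
import numpy as np

PRED_HORIZON = 16   # steps predicted per denoising pass
EXEC_HORIZON = 8    # steps actually executed before re-planning

def rollout(policy, env, obs, max_steps=700):
    """Receding-horizon execution: predict a 16-step chunk, execute 8.

    Executing only the first half of each chunk keeps the policy
    reactive while the overlap between chunks smooths out the
    "stuttery" single-step behavior of standard BC.
    """
    for _ in range(max_steps // EXEC_HORIZON):
        chunk = policy.predict(obs)           # shape: (PRED_HORIZON, action_dim)
        for action in chunk[:EXEC_HORIZON]:   # discard the stale tail
            obs, done = env.step(action)
            if done:
                return True
    return False
```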

Contact-Phase Importance Sampling

Using a WeightedRandomSampler, we prioritized precision during the grasping bottleneck. Training frames within 0.05m of the handle were assigned higher weights, forcing the optimization to focus on successful clamping and initial pulling.
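A minimal sketch of this weighting scheme with PyTorch's `WeightedRandomSampler`; the 5x contact weight is a hypothetical value for illustration, not the ratio used in training.

```python
import torch
from torch.utils.data import WeightedRandomSampler

CONTACT_RADIUS = 0.05   # metres; frames this close to the handle are upweighted
CONTACT_WEIGHT = 5.0    # hypothetical ratio; tune on validation rollouts

def make_sampler(eef_to_handle_dist):
    """Build per-frame sampling weights that emphasise the grasping bottleneck."""
    dist = torch.as_tensor(eef_to_handle_dist, dtype=torch.float32)
    weights = torch.where(
        dist < CONTACT_RADIUS,
        torch.full_like(dist, CONTACT_WEIGHT),  # contact-phase frames
        torch.ones_like(dist),                  # free-space frames
    )
    return WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
```

The sampler is then passed to the training `DataLoader`, so contact-phase frames appear roughly five times as often per epoch as free-space frames.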

State-Space Augmentation

  • Handle Geometry: 3D coordinates and relative distance to EEF.
  • Temporal Awareness: 50 diffusion steps for denoising.
  • Stall Detection: Logic to resample action chunks if hinge angles remain static.
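The stall-detection logic in the last bullet can be sketched as a check over recent hinge readings; the window size and motion threshold here are illustrative assumptions, not the project's tuned values.

```python
import numpy as np

STALL_WINDOW = 40        # steps of history to inspect (hypothetical)
STALL_THRESHOLD = 1e-3   # radians; minimum hinge motion to count as progress

def is_stalled(hinge_history):
    """True if the cabinet hinge barely moved over the last window.

    On a stall, the runner would discard the current action chunk and
    resample a fresh one (e.g., from a new diffusion noise seed).
    """
    if len(hinge_history) < STALL_WINDOW:
        return False
    recent = np.asarray(hinge_history[-STALL_WINDOW:])
    return bool(recent.max() - recent.min() < STALL_THRESHOLD)
```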

Quantitative & Qualitative Results

Success vs. Failure Distribution

[Chart: success vs. failure distribution across seeds]

Steps to Completion (Success Cases)

[Chart: steps-to-completion histogram for successful rollouts]

  • Final success rate: 45%
  • Step range (successes): 161–329
  • Randomized seeds: 20

Success defined as max_door_open > 0.90 within 700 steps.
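Assuming a per-step log of door openness, the success criterion reduces to a one-line check (the function and argument names are illustrative):

```python
def is_success(door_open_history, threshold=0.90, max_steps=700):
    """Success: max door openness exceeds the threshold within the step budget."""
    window = door_open_history[:max_steps]
    return bool(window) and max(window) > threshold
```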

Failure Analysis

Failure Case Analysis (Seed 000)

Common Failure Modes

  • 01. Grasp Slippage: Occurs when the robot reaches the handle but fails to maintain friction lock during high-torque pulling phases (e.g., Test 5 reached 0.81 openness).
  • 02. Workspace Singularities: Randomized base initializations occasionally forced the arm into kinematic limits.

Discussion & Future Work

Reachability Maps

Implementing reachability maps so the PandaOmron can dynamically reposition its base before initiating a reach, reducing the kinematic strain on the arm.

Vision-Language Models (VLM)

Moving beyond pre-recorded demonstrations toward real-time visual environment understanding for more adaptive movements.