OBEYED-VLA: Clutter-Resistant Vision-Language-Action Models through Object-Centric and Geometry Grounding
This is the official code for the perception grounding module of OBEYED-VLA. OBEYED-VLA decouples perception from control using frozen VLM-based object-centric grounding plus masked-depth geometric grounding, then fine-tunes a VLA only on clean single-object demos.
This codebase exposes two main entrypoints:
- `perception_service_batch.py`: batch processing to generate training data (mask selection, depth, and overlays).
- `perception_service_fastapi.py`: FastAPI service for deployment.
- This codebase does not currently contain the real-world robot interface or the action-reasoning policy. Our robot interface is adapted from diffusion-policy to work with a UR10e arm; for action reasoning, we employ openpi.
- Ensure you have at least 24 GB of GPU memory for YOLO + DepthAnythingV2.
- Qwen3-VL 8B-Instruct requires one A6000 GPU; if you do not have one, you can change the code to use OpenAI's API instead.
- Paths are relative; keep `weights/yolo11l-seg.pt` and the downloaded DepthAnythingV2 checkpoints in their expected locations.
- If you modify weight paths, update `YOLO11SegEngineConfig` or pass a custom path.
- Install uv by following its official installation instructions.
- From the repo root, resolve dependencies from `pyproject.toml` with `uv sync` (creates/uses a Python 3.10 environment automatically).
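To confirm the synced environment can actually see the GPU (see the memory note above), a quick check like the one below can help. This is only a sketch and assumes PyTorch is among the resolved dependencies:

```bash
# Sketch: verify the uv-managed environment detects a CUDA GPU.
uv run python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```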
- YOLO segmentation: default path `weights/yolo11l-seg.pt` (included).
- DepthAnythingV2: download the metric depth weights (not bundled):

  ```bash
  mkdir -p DepthAnythingV2/checkpoints
  wget -O DepthAnythingV2/checkpoints/depth_anything_v2_metric_hypersim_vitl.pth \
    https://huggingface.co/LiheYoung/Depth-Anything-V2/resolve/main/checkpoints/depth_anything_v2_metric_hypersim_vitl.pth
  ```

  Adjust the filename if you switch encoders (`vits`, `vitb`, `vitl`, `vitg`).
- Cutie: if weights are missing, place them under `Cutie/weights/`:

  ```bash
  mkdir -p Cutie/weights
  wget -O Cutie/weights/cutie-base-mega.pth \
    https://huggingface.co/SHI-Labs/Cutie/resolve/main/weights/cutie-base-mega.pth
  ```

  Update to the latest links from the Cutie repo if needed.
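Once the downloads finish, a quick check like the sketch below (using the default paths listed above) confirms everything is where the code expects it:

```bash
# Sketch: confirm all expected checkpoints are in place before running the services.
for f in \
  weights/yolo11l-seg.pt \
  DepthAnythingV2/checkpoints/depth_anything_v2_metric_hypersim_vitl.pth \
  Cutie/weights/cutie-base-mega.pth; do
  [ -f "$f" ] && echo "OK      $f" || echo "MISSING $f"
done
```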
- The perception service queries a hosted VLM; update `VLM_MODEL` and `BASE_URL` in `perception_core.py` to match your served model and endpoint.
- We use Qwen3-VL (8B-Instruct) and serve it via vLLM; adapt to your deployment as needed.
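As a smoke test, the served VLM can be queried directly through its OpenAI-compatible chat-completions endpoint. The sketch below uses placeholder values: the URL and model name must match whatever you set for `BASE_URL` and `VLM_MODEL` in `perception_core.py`. If you fall back to OpenAI's API instead, point the request at their endpoint and supply a real API key:

```bash
# Sketch: smoke-test the served VLM via the OpenAI-compatible API.
# URL and model name are placeholders; match them to BASE_URL and VLM_MODEL.
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-VL-8B-Instruct",
        "messages": [{"role": "user", "content": "Reply with OK if you are up."}]
      }'
```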
- Script: `batch_processing.sh`
- Defaults: reads videos under `assets/grocery_items` and writes outputs there.
- Usage: `./batch_processing.sh <subject1> [subject2 ...]`
- Env toggles:
  - `VIDEO_ROOT` (default `assets/grocery_items`)
  - `OUT_ROOT` (default `assets/grocery_items`)
  - `OVERWRITE=1|0`, `PER_EPISODE_DIR=1|0`, `DRY_RUN=1|0`
- Internally calls `perception_service_batch.py` to generate perception-grounded data for VLA policy training.
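For example, a dry run over two subjects with all toggles set explicitly might look like the sketch below; the subject names are placeholders:

```bash
# Sketch: dry-run the batch pipeline for two example subjects with explicit toggles.
VIDEO_ROOT=assets/grocery_items OUT_ROOT=assets/grocery_items \
OVERWRITE=1 PER_EPISODE_DIR=1 DRY_RUN=1 \
./batch_processing.sh apple banana
```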
- Entry: `perception_service_fastapi.py`
- Run (from repo root): `uv run uvicorn perception_service_fastapi:app --host 0.0.0.0 --port 8000`
- Accepts image uploads and returns processed outputs using the same perception core.
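A request against the running service could look like the sketch below; the route and form-field name are assumptions, so check `perception_service_fastapi.py` for the actual endpoint definition:

```bash
# Sketch: upload an image to the perception service and save the response.
# "/process" and the "file" field are illustrative; use the routes defined in
# perception_service_fastapi.py.
curl -X POST "http://localhost:8000/process" \
  -F "file=@path/to/image.jpg" \
  -o response.json
```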