We propose RoboVIP, a multi-view, inpainting-based video diffusion model conditioned on identity references, used to augment robot manipulation data in both simulation and real-world robot setups.
🔥 Update | 🔧 Installation | 💻 Inference Augmentation | 🧩 Dataset Preprocessing | 🔥 Train
- Release the paper
- Release the Video Diffusion Model weights and Inference Code
- Less GPU-memory-intensive (<80GB) version of the Bridge RLDS augmentation
- Release the preprocessing code of the dataset
- Release the training code for the Video Diffusion Model
- Release the simulation testing
- Release the training code for simulation
⭐ If you like RoboVIP, please help ⭐⭐star⭐⭐ this repo. Thanks! 🤗
# Main Install
conda create -n robovip python=3.10
conda activate robovip
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# Auxiliary packages
python -m spacy download en_core_web_sm
# Install ffmpeg
conda install ffmpeg
The inference code we provide is threefold:
(1) RoboVIP on Droid testing split
(2) Bridge dataset in RLDS structure for preparing datasets used in simulation training
(3) Franka real-robot deployment datasets in the LeRobot structure
We upload the Droid testing dataset used in the video diffusion model comparison, the Visual Identity Pools for Bridge and Droid (used for augmentation), and our Franka real-robot data to Hugging Face.
# Download the necessary Data
hf download HikariDawn/RoboVIP_data --repo-type dataset --local-dir RoboVIP_data # ~41GB before unzip
cd RoboVIP_data
unzip evaluation_Droid_dataset.zip
unzip grouped_ID_Bridge_filter.zip
unzip grouped_ID_Droid.zip
unzip real_robot_sample.zip
cd ../ # Back to the main directory
# Prepare other weights (like SAM2)
mkdir weights
cd weights
wget https://huggingface.co/facebook/sam2.1-hiera-large/resolve/main/sam2.1_hiera_large.pt
cd ../ # Back to the main directory
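As a quick sanity check that the SAM2 checkpoint loads, a minimal sketch (assuming the official sam2 package is installed; the config path below is the one shipped with that package):
# Hypothetical sanity check: load the downloaded SAM2 checkpoint (assumes the `sam2` package is installed).
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "weights/sam2.1_hiera_large.pt"        # downloaded above
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"    # config name bundled with the sam2 package
device = "cuda" if torch.cuda.is_available() else "cpu"

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint, device=device))
print("SAM2 loaded:", type(predictor).__name__)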
After downloading the necessary data and checkpoints, you are ready to run RoboVIP on the Droid testing split we made.
python inference_code/test_RoboVIP_on_Droid.py
This runs the Droid testing dataset we used. We provide 400 test cases for Droid, of which the first 300 are used. The script automatically loads our RoboVIP model weights (Droid variant) from Hugging Face.
NOTE: Read the code carefully before executing and modify the part needed to fit your computing environment.
Run with care: some scripts automatically shutil.rmtree the output folder. Double-check paths to avoid deleting the wrong directory.
You need to first download the Bridge dataset in RLDS format; we reuse the HF dataset repo by another contributor:
# Prepare Bridge Dataset in RLDS form
hf download shihao1895/bridge-rlds --repo-type dataset --local-dir bridge_rlds # ~124GB
Then, execute:
python inference_code/augment_bridge_rlds_for_simulation.py
This uses up to 77GB of GPU memory (it loads many models at the same time) with the 8B version of Qwen3-VL (not the 32B version). We used Qwen3-VL-32B-Instruct in our paper, but that requires more than 100GB of GPU memory in total (not friendly to most GPU devices). You can save more GPU memory by enabling int4 quantization for Qwen, or by replacing Qwen with the Cosmos-Reason-7B model.
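For example, a minimal, hedged sketch of loading a Qwen VLM in int4 with transformers + bitsandbytes might look like the following (the repo id and model class are assumptions; adapt them to the checkpoint and transformers version you actually use):
# Hypothetical sketch: load the VLM captioner in int4 to reduce GPU memory.
# Assumes `transformers` and `bitsandbytes` are installed; repo id and model class are assumptions.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

model_id = "Qwen/Qwen3-VL-8B-Instruct"  # assumption: replace with the checkpoint you actually use

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # int4 weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)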
The script automatically loads our RoboVIP model weights (Bridge variant) from Hugging Face. The visual identity pool is curated from the Bridge dataset (already prepared if you followed the RoboVIP_data instructions above).
To store the augmented data, we slightly modify the RLDS structure: an extra feature key, "augmented", is added, and its value is compressed JPEG bytes. Besides the RLDS files, this program also stores the generated videos (with the conditioning inputs) in another folder, named "visual_results_for_simulation".
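As an illustration only, decoding one augmented frame could look like the sketch below (the exact key nesting inside the RLDS episode is an assumption based on the description above):
# Hypothetical sketch: decode one augmented frame stored as compressed JPEG bytes.
# The exact key layout of the modified RLDS episode is an assumption.
import io
from PIL import Image

def decode_augmented_frame(step: dict) -> Image.Image:
    jpg_bytes = step["augmented"]  # extra key added by the augmentation script
    return Image.open(io.BytesIO(jpg_bytes)).convert("RGB")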
Note: Please check the settings inside carefully before executing it to avoid an incorrect shutil.rmtree!
This part augments Franka-collected real-world robot data stored as LeRobot-style h5py files (downloaded earlier from the RoboVIP_data Hugging Face repo).
Execute:
python inference_code/augment_franka_lerobot_for_real_robot.py
Since the real-robot task is only cube picking, we directly pass the word "cube" to RoboSAM (open-vocabulary segmentation) to accelerate inference and save memory. The visual identity pool is curated from the Droid dataset (already prepared if you followed the RoboVIP_data instructions above). Unlike Bridge in step (2), this real-robot version does not generate all videos at once with the RoboVIP video diffusion model (there are too many frames). Instead, we generate at most 49 frames per call and run the model multiple times to complete one episode, as sketched below.
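Conceptually, the chunking looks like this minimal sketch (names are illustrative, not the script's actual variables):
# Hypothetical sketch of per-episode chunking: at most 49 frames per diffusion call.
MAX_FRAMES = 49

def chunk_episode(frames):
    """Yield consecutive windows of at most MAX_FRAMES frames from one episode."""
    for start in range(0, len(frames), MAX_FRAMES):
        yield frames[start:start + MAX_FRAMES]

# Each chunk is passed through the RoboVIP video diffusion model in turn,
# and the generated chunks are concatenated to cover the full episode.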
We will also store the generated videos (with the conditioning inputs) to a folder named "visual_results_for_real_robot".
Note: Please check the settings inside carefully before executing it to avoid an incorrect shutil.rmtree!
The dataset preprocessing code can be found in the preprocess folder. It covers Basic Parameter Filter, Caption, Robot + Interact Object Segmentation, and Visual Identity Image Curation. We also provide our visual identity code and its corresponding filtering techniques.
To train the model, you first need to curate and preprocess the dataset following the instructions above.
You then need to modify csv_train_folder_paths, csv_validation_folder_path, and any other settings you want to change under the config/ folder.
# Train Droid with 8GPU
accelerate launch --config_file config/accelerate_config/8gpu.yaml train_code/train_wan_v2v_withID_lora.py --config_path config/train_Wan_14B_MV2MV_withID_Droid.yaml
# Train Bridge with 8GPU
accelerate launch --config_file config/accelerate_config/8gpu.yaml train_code/train_wan_v2v_withID_lora.py --config_path config/train_Wan_14B_MV2MV_withID_Brdige.yaml
The default setting we used is 8 GPUs with a per-GPU batch size of 1 (but 4 gradient accumulation steps). Using gradient accumulation to control the effective batch size (instead of the per-GPU batch size setting) works well for inputs with a variable number of frames / variable resolutions, where no attention mask is needed. Training takes 3-4 days on top H-series GPUs. We believe that training for a shorter time is also fine.
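For reference, the effective batch size is 8 GPUs x 1 sample x 4 accumulation steps = 32. With Accelerate, this pattern typically looks like the sketch below (illustrative placeholders, not our actual training loop):
# Hypothetical sketch of gradient accumulation with Accelerate (placeholder model/data, not our training loop).
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)   # 8 GPUs x 1 per GPU x 4 steps = 32 effective
model = torch.nn.Linear(8, 1)                              # placeholder for the LoRA-wrapped video model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 8), batch_size=1)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):      # gradients sync / optimizer steps only every 4th micro-batch
        loss = model(batch).pow(2).mean()    # placeholder loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()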
If you make use of our work, please cite our paper.
@article{wang2026robovip,
title={RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation},
author={Wang, Boyang and Zhang, Haoran and Zhang, Shujie and Hao, Jinkun and Jia, Mingda and Lv, Qi and Mao, Yucheng and Lyu, Zhaoyang and Zeng, Jia and Xu, Xudong and others},
journal={arXiv preprint arXiv:2601.05241},
year={2026}
}
RoboVIP is built on diffusers and RoboEngine. We appreciate the authors for sharing their awesome codebase.
