
RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

Paper Website

We propose RoboVIP, a multi-view inpainting-based video diffusion model conditioned on identity references, which augments robot manipulation data in both simulation and real-world robot setups.

🔥 Update | 🔧 Installation | 💻 Inference Augmentation | 🧩 Dataset Preprocessing | 🔥Train

Update 🔥🔥🔥

  • Release the paper
  • Release the Video Diffusion Model weights and Inference Code
  • A less GPU-memory-intensive version (<80GB) for Bridge RLDS
  • Release the preprocessing code of the dataset
  • Release the training code for the Video Diffusion Model
  • Release the simulation testing
  • Release the training code for simulation

If you like RoboVIP, please help ⭐⭐star⭐⭐ this repo. Thanks! 🤗

Installation 🔧

# Main Install
conda create -n robovip python=3.10
conda activate robovip
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

# Auxiliary packages
python -m spacy download en_core_web_sm

# Install ffmpeg 
conda install ffmpeg

Inference and Augment the Data ⚡

The inference code we provide is threefold:
(1) RoboVIP on Droid testing split
(2) Bridge dataset in RLDS structure for preparing datasets used in simulation training
(3) Franka real-robot deployment datasets in the LeRobot structure

Download Testing Data and other weights needed

We upload the Droid testing dataset used in our video diffusion model comparison, the Visual Identity Pools for Bridge and Droid (used for augmentation), and our Franka real-robot data to Hugging Face.

# Download the necessary Data 
hf download HikariDawn/RoboVIP_data --repo-type dataset --local-dir RoboVIP_data        # ~41GB before unzip
cd RoboVIP_data
unzip evaluation_Droid_dataset.zip 
unzip grouped_ID_Bridge_filter.zip
unzip grouped_ID_Droid.zip
unzip real_robot_sample.zip
cd ../          # Back to the main directory

# Prepare other weights (like SAM2)
mkdir weights
cd weights
wget https://huggingface.co/facebook/sam2.1-hiera-large/resolve/main/sam2.1_hiera_large.pt
cd ../          # Back to the main directory

(1) Regular Multi-view Video Generation Inference for Droid

After downloading the necessary data and checkpoints, you are ready to run RoboVIP on the Droid testing split we made.

python inference_code/test_RoboVIP_on_Droid.py

This runs the Droid testing dataset we used. We provide 400 testing cases for Droid, of which the first 300 are used. The script automatically loads our RoboVIP model weights (Droid variant) from Hugging Face.

NOTE: Read the code carefully before executing it, and modify the parts needed to fit your computing environment.
Run with care: some scripts automatically shutil.rmtree the output folder. Double-check paths to avoid deleting the wrong directory.

(2) Augment Bridge (RLDS structure) inference

First download the Bridge dataset in RLDS format; we reuse the HF dataset repo from another contributor:

# Prepare Bridge Dataset in RLDS form
hf download shihao1895/bridge-rlds --repo-type dataset --local-dir bridge_rlds     # ~124GB 

Then, execute:

python inference_code/augment_bridge_rlds_for_simulation.py

This uses up to 77GB of GPU memory (it loads many models at the same time) with the 8B version of Qwen3-VL (not the 32B version). We used Qwen3-VL-32B-Instruct in our paper, but that configuration costs more than 100GB of GPU memory in total (not friendly to most GPU devices). You can save more GPU memory by enabling int4 quantization for Qwen, or by replacing Qwen with the Cosmos-Reason-7B model.
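The rough arithmetic behind these numbers: weight memory scales linearly with parameter count and bits per parameter (activations, KV cache, and the other loaded models come on top of this). A back-of-the-envelope helper, ours for illustration:

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate GPU memory (GB) for model weights alone:
    1e9 * params_billion * bits / 8 bytes, divided by 1e9 bytes per GB."""
    return params_billion * bits_per_param / 8

# Weights only, approximate:
# weight_memory_gb(8, 16)  -> 16.0  (8B model in bf16)
# weight_memory_gb(8, 4)   ->  4.0  (8B model in int4)
# weight_memory_gb(32, 16) -> 64.0  (32B model in bf16)
```

This is why the 32B model pushes total usage past 100GB once the other models are resident, while int4 quantization cuts the Qwen weights to roughly a quarter.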

This Bridge variant automatically loads our RoboVIP model weights (Bridge variant) from Hugging Face. The visual identity pool is curated from the Bridge dataset (already prepared if you followed the previous RoboVIP_data instructions).

For the augmented data, we slightly modify the RLDS structure to store it: each step gains an extra key, "augmented", whose value is compressed JPEG bytes. In addition to the RLDS files, this program also stores the generated videos (with the conditioning inputs) in a separate folder named "visual_results_for_simulation".
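A downstream consumer only needs to read that extra key from each step. A minimal sketch (the "augmented" key is from this repo; the validation logic is our own):

```python
import io

def get_augmented_frame(step: dict) -> bytes:
    """Extract the augmented observation from a modified RLDS step.

    The 'augmented' key (added by the augmentation script) holds compressed
    JPEG bytes; decoding to pixels would use e.g. PIL.Image.open(io.BytesIO(raw)).
    Here we validate and return the raw bytes.
    """
    raw = step.get("augmented")
    if raw is None:
        raise KeyError("step has no 'augmented' key; was it produced by the augmentation script?")
    # JPEG streams start with the SOI marker 0xFFD8.
    if not raw.startswith(b"\xff\xd8"):
        raise ValueError("'augmented' value does not look like JPEG bytes")
    return raw
```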

Note: Please check the settings inside carefully before executing to avoid an incorrect 'shutil.rmtree'!

(3) Augment Franka Real-Robot in the LeRobot structure

This part augments Franka-collected real-world robot data stored as LeRobot h5py files (automatically downloaded from the previous RoboVIP_data Hugging Face repo).
Execute:

python inference_code/augment_franka_lerobot_for_real_robot.py

Since the real-robot task is only cube picking, we directly pass the word "cube" to RoboSAM (open-vocabulary segmentation) to accelerate inference and save memory. The visual identity pool is curated from the Droid dataset (already prepared if you followed the previous RoboVIP_data instructions). Compared to Bridge in step (2), this real-robot version does not generate all videos in one pass of the RoboVIP video diffusion model (there are too many frames). Instead, we generate at most 49 frames per call and run the model multiple times to complete one episode.
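The chunked generation can be sketched as a simple frame-range splitter (the 49-frame cap matches the description above; the function itself is illustrative, not the repo's code):

```python
def chunk_episode(num_frames: int, max_frames: int = 49):
    """Split an episode into consecutive [start, end) frame ranges,
    each covering at most `max_frames` frames."""
    chunks = []
    start = 0
    while start < num_frames:
        end = min(start + max_frames, num_frames)
        chunks.append((start, end))
        start = end
    return chunks

# A 120-frame episode needs three diffusion calls:
# chunk_episode(120) -> [(0, 49), (49, 98), (98, 120)]
```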

We will also store the generated videos (with the conditioning inputs) to a folder named "visual_results_for_real_robot".

Note: Please check the settings inside carefully before executing to avoid an incorrect 'shutil.rmtree'!


Dataset Preprocess 🧩

The dataset preprocessing code can be found in the preprocess folder. It covers basic parameter filtering, captioning, robot + interacting-object segmentation, and visual identity image curation. We also provide our visual identity code and its corresponding filtering techniques.


Train 🔥

To train the model, you first need to curate and preprocess the dataset following the instructions above.
You also need to modify csv_train_folder_paths, csv_validation_folder_path, and any other settings that require changes under the config/ folder.

# Train Droid with 8GPU
accelerate launch --config_file config/accelerate_config/8gpu.yaml  train_code/train_wan_v2v_withID_lora.py --config_path config/train_Wan_14B_MV2MV_withID_Droid.yaml

# Train Bridge with 8GPU
accelerate launch --config_file config/accelerate_config/8gpu.yaml  train_code/train_wan_v2v_withID_lora.py --config_path config/train_Wan_14B_MV2MV_withID_Brdige.yaml

The default setting we used is 8 GPUs with a batch size of 1 per GPU (and 4 gradient accumulation steps). Using gradient accumulation to control the effective batch size (instead of the per-GPU batch size setting) works well with variable numbers of frames / variable-resolution inputs, since no attention mask is needed. Training takes 3-4 days on top H-series GPUs. We believe shorter training is also fine.
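The effective batch size under that setting is just the product of the three factors. A quick sanity check (helper is ours, for illustration):

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Samples contributing to each optimizer step when gradients are
    accumulated over `grad_accum_steps` micro-batches on each of `num_gpus` GPUs."""
    return per_gpu_batch * num_gpus * grad_accum_steps

# Default above: 1 sample/GPU, 8 GPUs, 4 accumulation steps.
# effective_batch_size(1, 8, 4) -> 32
```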


Citation 📚

If you make use of our work, please cite our paper.

@article{wang2026robovip,
  title={RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation},
  author={Wang, Boyang and Zhang, Haoran and Zhang, Shujie and Hao, Jinkun and Jia, Mingda and Lv, Qi and Mao, Yucheng and Lyu, Zhaoyang and Zeng, Jia and Xu, Xudong and others},
  journal={arXiv preprint arXiv:2601.05241},
  year={2026}
}

Acknowledgment 🤗

RoboVIP is built on diffusers and RoboEngine. We appreciate the authors for sharing their awesome codebase.
