This folder contains the preprocessing code for RoboVIP.
NOTE
- The code provided here is the sample pipeline we used in curation. You need to modify settings inside each file (e.g., input/output directories).
- Read the code carefully before executing and modify the part needed to fit your computing environment.
- Run with care: some scripts automatically shutil.rmtree the output folder. Double-check paths to avoid deleting the wrong directory.
- This preprocessing codebase still needs improvement to be more user friendly (e.g., automatic weight download).
We store the dataset metadata in a folder of CSV files.
The metadata includes the absolute video path (mp4), fps, number of frames, width, height, text prompt, segmentation path, visual identity path, etc.
In our curation, we split the large dataset into multiple small sub-CSVs and then use multiple GPUs to process them in parallel. A sample of the code we use can be found in 'csv_merge_then_split.py'.
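Conceptually, the split is just sharding one metadata CSV into N smaller ones, one per GPU worker. A minimal sketch (file names and shard count are illustrative; the actual logic lives in 'csv_merge_then_split.py'):

```python
# Sketch: shard one large metadata CSV into per-GPU sub-CSVs.
import pandas as pd

def split_csv(csv_path: str, num_shards: int, out_prefix: str) -> list:
    """Split csv_path into num_shards round-robin shards; return shard paths."""
    df = pd.read_csv(csv_path)
    paths = []
    for i in range(num_shards):
        shard = df.iloc[i::num_shards]  # round-robin keeps shard sizes balanced
        out_path = f"{out_prefix}_{i}.csv"
        shard.to_csv(out_path, index=False)
        paths.append(out_path)
    return paths
```

Each worker then processes only its own shard, so the per-video steps below parallelize trivially.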
In this paper, we mainly involve the Bridge and Droid datasets.
For Bridge, we use the Open X-Embodiment (OXE) version. Sample code that transforms the raw downloaded files into mp4 videos + CSV files can be found in 'tfrecord_to_csv_BridgeV1.py' and 'tfrecord_to_csv_BridgeV2.py'.
However, for Droid, since the resolution of the OXE version is too low, we directly download the original files from here. We download the 5.6TB <Raw DROID dataset, non-stereo HD video only> and use 'raw_to_csv_Droid.py' to process it into the format we want.
NOTE: The code we provide here is a sample of the download processing. You might need to check and modify it based on your environment.
First, we filter videos based on their metadata (fps, number of frames, width/height, etc.). This ensures we do not keep outlier cases across videos, and also serves as a sanity check that each video is readable.
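The metadata part of this filter can be sketched as a few threshold checks over the CSV columns described above. The thresholds here are illustrative stand-ins; the real values (and the per-file readability check, e.g. opening each video) live in 'preprocess/filter_basic.py':

```python
# Sketch: filter metadata rows by fps / frame count / resolution thresholds.
# Threshold values below are illustrative, not the ones used in curation.
import pandas as pd

def filter_basic(df: pd.DataFrame,
                 min_fps: float = 5, max_fps: float = 60,
                 min_frames: int = 16, min_side: int = 128) -> pd.DataFrame:
    keep = (
        df["fps"].between(min_fps, max_fps)
        & (df["num_frames"] >= min_frames)
        & (df[["width", "height"]].min(axis=1) >= min_side)  # shorter side check
    )
    return df[keep].reset_index(drop=True)
```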
Please check the setting inside carefully and then Execute:
python preprocess/filter_basic.py

Next, in our curation pipeline, we run multi-view captioning over the dataset.
You can also move this caption step to the end of the curation pipeline, since captioning does not filter anything and keeps 100% of the data.
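Structurally, the caption pass is just an annotation loop over the metadata CSV. In this sketch, caption_fn is a hypothetical stand-in for the Qwen-based captioner in 'caption_qwen_MV.py'; any callable mapping a video path to a string works:

```python
# Sketch: annotate every row of the metadata CSV with a caption.
# caption_fn is a hypothetical placeholder for the actual Qwen captioner.
import pandas as pd

def caption_dataset(csv_path: str, out_path: str, caption_fn) -> None:
    df = pd.read_csv(csv_path)
    # No rows are dropped: captioning only annotates, keeping 100% of the data.
    df["text_prompt"] = [caption_fn(p) for p in df["video_path"]]
    df.to_csv(out_path, index=False)
```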
Please check the setting inside carefully and then Execute:
python preprocess/caption_qwen_MV.py

This step segments both the robot mask and the objects it interacts with.
We handle different datasets differently.
For Droid, we focus more on adaptation to the wrist view, which Bridge and Fractal do not have.
Please check the setting inside carefully and then Execute:
python preprocess/segment_RobObj_Bridge.py # For Bridge & Fractal
python preprocess/segment_RobObj_Droid.py # For any dataset with a wrist view, like Droid / Libero

This file requires the OneFormer environment to be installed. Please check here.
Please check the setting inside carefully and then Execute:
conda activate oneformer # Please first install Oneformer environment
python preprocess/OneFormer_segment_visual_identity.py

After this step, you will see a dataset structure similar to the 'evaluation_Droid_dataset/' folder inside 'RoboVIP_data/'. Note that the preprocessing code stores absolute paths to the data, while the CSV file in 'evaluation_Droid_dataset/' has been edited to use relative paths.
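The absolute-to-relative path edit mentioned above can be sketched as a simple rewrite over the path columns of a metadata CSV. Column names and the dataset root here are illustrative assumptions:

```python
# Sketch: rewrite absolute paths in a metadata CSV to paths relative to a
# dataset root (as was done for the released evaluation_Droid_dataset CSV).
import os
import pandas as pd

def to_relative_paths(csv_path: str, out_path: str, root: str,
                      path_cols=("video_path",)) -> None:
    df = pd.read_csv(csv_path)
    for col in path_cols:  # path_cols is an assumed column list
        df[col] = df[col].apply(lambda p: os.path.relpath(p, root))
    df.to_csv(out_path, index=False)
```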
The following section collects the visual identity pool used for downstream augmentation.
We curate a large set of images classified by the type names output by panoptic segmentation, so that we can balance the diversity of identities at the inference stage.
A sample of the curated results can be found in our RoboVIP_data HF.
First, we need to copy all visual identity images across the dataset into one folder.
Please check the setting inside carefully and then Execute:
python preprocess/collect_visual_identity_pool/prepare_visual_identity_pool.py

The output folder is organized as:
Visual_ID_Pool/
|-- Classification_Type_NameA/
| |-- image_name1.png
| |-- image_name2.png
|-- Classification_Type_NameB/
...

We apply filtering criteria to drop low-quality visual identity images.
We do this for the Bridge dataset used for simulation augmentation, so that we can curate a more concentrated but higher-quality identity pool.
For Droid and Bridge used in regular Video Diffusion Model training, we do not apply the following filtering.
We score visual identity images by Size (Resolution), Image Quality Assessment (IQA), Clarity, and Completeness (CLIP) criteria.
Please check the setting inside carefully and then Execute:
python preprocess/collect_visual_identity_pool/scoring_pool.py

This will rank the scores across all images in the pool and drop the lowest XX%. The newly filtered ID images will be stored in a new directory.
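The rank-and-drop step can be sketched as follows; the score values and drop fraction are illustrative stand-ins for the outputs of 'scoring_pool.py' and the actual XX% threshold:

```python
# Sketch: rank pooled identity images by combined score and drop the
# lowest fraction. Survivors would then be copied to a new directory.
def filter_lowest(scores: dict, drop_frac: float) -> list:
    """Return image names that survive, i.e. all but the lowest drop_frac."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # best first
    n_keep = len(ranked) - int(len(ranked) * drop_frac)
    return ranked[:n_keep]
```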
Please check the setting inside carefully and then Execute:
python preprocess/collect_visual_identity_pool/filter_score_with_new_store.py