This folder contains the preprocessing code for RoboVIP.
NOTE
- The code provided here is the sample pipeline we used in curation. You need to modify settings inside each file (e.g., input/output directories).
- Read the code carefully before executing and modify the part needed to fit your computing environment.
- Run with care: some scripts automatically shutil.rmtree the output folder. Double-check paths to avoid deleting the wrong directory.
- This preprocessing codebase still needs improvement to be more user friendly (e.g., automatic weight download).
We store the dataset metadata in a folder of CSV files.
The metadata includes the absolute video path (mp4), fps, number of frames, width, height, text prompt, segmentation path, visual identity path, etc.
In our curation, we split the large dataset into multiple small sub-CSVs and then use multiple GPUs to process them in parallel. A sample of the code we use can be found in 'csv_merge_then_split.py'.
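Conceptually, the split is just sharding one metadata CSV into N smaller ones, one per GPU worker. A minimal sketch (file names and shard count are illustrative; the actual logic lives in 'csv_merge_then_split.py'):

```python
# Sketch: shard one large metadata CSV into per-GPU sub-CSVs.
import pandas as pd

def split_csv(csv_path: str, num_shards: int, out_prefix: str) -> list:
    """Split csv_path into num_shards round-robin shards; return shard paths."""
    df = pd.read_csv(csv_path)
    paths = []
    for i in range(num_shards):
        shard = df.iloc[i::num_shards]  # round-robin keeps shard sizes balanced
        out_path = f"{out_prefix}_{i}.csv"
        shard.to_csv(out_path, index=False)
        paths.append(out_path)
    return paths
```

Each worker then processes only its own shard, so the per-video steps below parallelize trivially.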
In this paper, we mainly involve the Bridge and Droid datasets.
For Bridge, we use the Open X-Embodiment (OXE) version. Sample code that transforms the raw downloaded files into mp4 videos + CSV files can be found in 'tfrecord_to_csv_BridgeV1.py' and 'tfrecord_to_csv_BridgeV2.py'.
However, for Droid, since the resolution of the OXE version is too low, we directly download the original files from here. We download the 5.6TB <Raw DROID dataset, non-stereo HD video only> and use 'raw_to_csv_Droid.py' to process it into the format we want.
NOTE: The code we provide here is a sample of the download processing. You might need to check and modify it based on your environment.
First, we filter videos based on their metadata (fps, number of frames, width/height, etc.). This ensures we do not keep outlier cases across videos, and also serves as a sanity check that each video is readable.
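The metadata part of this filter can be sketched as a few threshold checks over the CSV columns described above. The thresholds here are illustrative stand-ins; the real values (and the per-file readability check, e.g. opening each video) live in 'preprocess/filter_basic.py':

```python
# Sketch: filter metadata rows by fps / frame count / resolution thresholds.
# Threshold values below are illustrative, not the ones used in curation.
import pandas as pd

def filter_basic(df: pd.DataFrame,
                 min_fps: float = 5, max_fps: float = 60,
                 min_frames: int = 16, min_side: int = 128) -> pd.DataFrame:
    keep = (
        df["fps"].between(min_fps, max_fps)
        & (df["num_frames"] >= min_frames)
        & (df[["width", "height"]].min(axis=1) >= min_side)  # shorter side check
    )
    return df[keep].reset_index(drop=True)
```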
Please check the setting inside carefully and then Execute:
python preprocess/filter_basic.py

Next, in our curation pipeline, we run multi-view captioning over the dataset.
You can also move this caption step to the end of the curation pipeline, since captioning does not filter anything and keeps 100% of the data.
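Structurally, the caption pass is just an annotation loop over the metadata CSV. In this sketch, caption_fn is a hypothetical stand-in for the Qwen-based captioner in 'caption_qwen_MV.py'; any callable mapping a video path to a string works:

```python
# Sketch: annotate every row of the metadata CSV with a caption.
# caption_fn is a hypothetical placeholder for the actual Qwen captioner.
import pandas as pd

def caption_dataset(csv_path: str, out_path: str, caption_fn) -> None:
    df = pd.read_csv(csv_path)
    # No rows are dropped: captioning only annotates, keeping 100% of the data.
    df["text_prompt"] = [caption_fn(p) for p in df["video_path"]]
    df.to_csv(out_path, index=False)
```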
Please check the setting inside carefully and then Execute:
python preprocess/caption_qwen_MV.py

This step segments both the robot mask and the objects it interacts with.
We handle different datasets differently.
For Droid, we focus more on adaptation to the wrist view, which Bridge and Fractal do not have.
Please check the setting inside carefully and then Execute:
python preprocess/segment_RobObj_Bridge.py # For Bridge & Fractal
python preprocess/segment_RobObj_Droid.py # For any dataset with a wrist view, like Droid / Libero

This file requires the OneFormer environment to be installed. Please check here.
Please check the setting inside carefully and then Execute:
conda activate oneformer # Please first install Oneformer environment
python preprocess/OneFormer_segment_visual_identity.py

After this step, you will see a dataset structure similar to the 'evaluation_Droid_dataset/' folder inside 'RoboVIP_data/'. Note that the preprocessing code stores absolute paths to the data, while the CSV file in 'evaluation_Droid_dataset/' has been edited to use relative paths.
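The absolute-to-relative path edit mentioned above can be sketched as a simple rewrite over the path columns of a metadata CSV. Column names and the dataset root here are illustrative assumptions:

```python
# Sketch: rewrite absolute paths in a metadata CSV to paths relative to a
# dataset root (as was done for the released evaluation_Droid_dataset CSV).
import os
import pandas as pd

def to_relative_paths(csv_path: str, out_path: str, root: str,
                      path_cols=("video_path",)) -> None:
    df = pd.read_csv(csv_path)
    for col in path_cols:  # path_cols is an assumed column list
        df[col] = df[col].apply(lambda p: os.path.relpath(p, root))
    df.to_csv(out_path, index=False)
```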
The following section collects the visual identity pool used for downstream augmentation.
We curate a large set of images classified by the type names output by panoptic segmentation, so that we can balance the diversity of identities at the inference stage.
A sample of the curated results can be found in our RoboVIP_data HF.
First, we need to copy all visual identity images across the dataset into one folder.
Please check the setting inside carefully and then Execute:
python preprocess/collect_visual_identity_pool/prepare_visual_identity_pool.py

The output folder is organized as:
Visual_ID_Pool/
|-- Classification_Type_NameA/
| |-- image_name1.png
| |-- image_name2.png
|-- Classification_Type_NameB/
...

We apply filtering criteria to drop low-quality visual identity images.
We do this for the Bridge dataset used for simulation augmentation, so that we can curate a more concentrated but higher-quality identity pool.
For Droid and Bridge used in regular Video Diffusion Model training, we do not apply the following filtering.
We score visual identity images by Size (Resolution), Image Quality Assessment (IQA), Clarity, and Completeness (CLIP) criteria.
Please check the setting inside carefully and then Execute:
python preprocess/collect_visual_identity_pool/scoring_pool.py

This will rank the scores across all images in the pool and drop the lowest XX%. The newly filtered ID images will be stored in a new directory.
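The rank-and-drop step can be sketched as follows; the score values and drop fraction are illustrative stand-ins for the outputs of 'scoring_pool.py' and the actual XX% threshold:

```python
# Sketch: rank pooled identity images by combined score and drop the
# lowest fraction. Survivors would then be copied to a new directory.
def filter_lowest(scores: dict, drop_frac: float) -> list:
    """Return image names that survive, i.e. all but the lowest drop_frac."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # best first
    n_keep = len(ranked) - int(len(ranked) * drop_frac)
    return ranked[:n_keep]
```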
Please check the setting inside carefully and then Execute:
python preprocess/collect_visual_identity_pool/filter_score_with_new_store.py