Official implementation of Dexterous World Models.
TL;DR: DWM is a scene-action-conditioned video diffusion model for simulating embodied dexterous actions in a given static 3D scene.
- **April 3, 2026:** Code release. We also release the DWM Wan version.
- **February 21, 2026:** DWM was accepted to CVPR 2026.
```bash
git clone --recursive https://github.com/snuvclab/dwm
cd dwm
# If you already cloned without submodules, run:
# git submodule update --init --recursive
conda create -n dwm python=3.10 -y
conda activate dwm
pip install -r requirements.txt
```

All commands below assume you run them from the repository root.
See the preprocessing guides:
The expected processed sample structure is:
```
<processed_root>/
└── <sample>/
    ├── videos/
    │   └── <stem>.mp4
    ├── videos_static/
    │   └── <stem>.mp4
    ├── videos_hands/
    │   └── <stem>.mp4
    ├── prompts/
    │   └── <stem>.txt
    ├── prompts_rewrite/
    │   └── <stem>.txt
    ├── video_latents/
    │   └── <stem>.pt
    ├── static_video_latents/
    │   └── <stem>.pt
    ├── hand_video_latents/
    │   └── <stem>.pt
    └── prompt_embeds_rewrite/
        └── <stem>.pt
```
You may place processed data under any root directory you prefer. Training and inference paths can be configured through the example YAML or CLI overrides.
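Before launching training, it can be useful to verify that each processed sample actually contains the subdirectories shown above. The helper below is a minimal sketch (not part of the released code); `missing_dirs` is a hypothetical name, and the expected layout is taken directly from the tree above.

```python
from pathlib import Path

# Subdirectories each processed sample is expected to contain,
# per the directory tree in this README.
EXPECTED_DIRS = [
    "videos", "videos_static", "videos_hands",
    "prompts", "prompts_rewrite",
    "video_latents", "static_video_latents", "hand_video_latents",
    "prompt_embeds_rewrite",
]

def missing_dirs(sample_dir: str) -> list[str]:
    """Return the expected subdirectories missing under one processed sample."""
    root = Path(sample_dir)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Running it over every `<sample>` directory under your processed root surfaces incomplete samples before they cause a mid-training failure.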
The main training guide is:
Public example config and launcher:
- `training/cogvideox/configs/examples/dwm_cogvideox_5b_lora.yaml`
- `training/cogvideox/examples/train_static_hand_concat.sh`
Example smoke run:
```bash
bash training/cogvideox/examples/train_static_hand_concat.sh \
    --debug \
    --override data.data_root=/path/to/processed_root \
    --override logging.report_to=none
```

Inference supports either a dataset file or a single sample. Example launcher:
```bash
bash training/cogvideox/examples/infer_static_hand_concat.sh \
    --checkpoint_path outputs/<date>/<experiment> \
    --data_root /path/to/processed_root \
    --dataset_file dataset_files/trumans_test.txt \
    --output_dir outputs_infer/dwm_cogvideox_dataset
```

Example dataset files based on the train and test splits used for the paper models are available under `dataset_files/`:

- `dataset_files/trumans_train.txt`
- `dataset_files/taste_rob_train.txt`
- `dataset_files/trumans_test.txt`
- `dataset_files/taste_rob_test.txt`
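If you build your own splits, a dataset file can be assembled programmatically. The sketch below assumes (this is an assumption, not documented behavior) that a dataset file lists one sample path per line, relative to `data_root`; `load_dataset_file` is a hypothetical helper, not part of the released code.

```python
from pathlib import Path

def load_dataset_file(dataset_file: str, data_root: str) -> list[Path]:
    """Read sample paths (assumed one per line, relative) and resolve them under data_root.

    Blank lines are skipped. The one-path-per-line format is an assumption
    about the dataset_files/*.txt layout, not a documented guarantee.
    """
    root = Path(data_root)
    lines = Path(dataset_file).read_text().splitlines()
    return [root / line.strip() for line in lines if line.strip()]
```

Check the released `dataset_files/*.txt` against your processed root with a loader like this before pointing `--dataset_file` at a custom split.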
Single-sample inference:
```bash
python training/cogvideox/inference.py \
    --checkpoint_path outputs/<date>/<experiment> \
    --experiment_config training/cogvideox/configs/examples/dwm_cogvideox_5b_lora.yaml \
    --data_root /path/to/processed_root \
    --video <relative/path/to/videos/00000.mp4> \
    --output_dir outputs_infer/dwm_cogvideox_single
```

- The default 5B training path typically needs an 80 GB-class GPU.
- Relative dataset paths in training and inference are resolved inside `data_root`.
- If you use a custom processed-data root, update `data.data_root` in the example config or pass it via CLI overrides.
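The `--override` flags above take dotted `key=value` pairs such as `data.data_root=/path/to/processed_root`. A minimal sketch of how such an override maps onto a nested config dict (a plausible mechanism, not the repository's actual implementation; `apply_override` is a hypothetical name):

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply a dotted 'key.path=value' override to a nested config dict.

    Sketch of the common pattern behind '--override data.data_root=/path';
    intermediate dicts are created as needed, and values stay strings.
    """
    key_path, value = override.split("=", 1)
    keys = key_path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config
```

This is why `data.data_root` on the command line and the `data:` / `data_root:` keys in the example YAML refer to the same setting.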
We thank the contributors to VideoX-Fun, finetrainers, CogVideo, and Wan for open-sourcing their work.
If you find this repository useful, please cite:
```bibtex
@inproceedings{kim2026dwm,
    title={Dexterous World Models},
    author={Kim, Byungjun and Kim, Taeksoo and Lee, Junyoung and Joo, Hanbyul},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2026}
}
```