DeepSeek V4 RL Roadmap #1046

@yueming-yuan


Miles PR: #1045

Megatron for Miles PR: #28

Usage

Docker:

H200/B200 (cu129 x86):

docker pull radixark/miles:deepseek-v4

GB300 (cu130 arm64):
Coming soon

Single-node, layer-pruned model smoke test:

This run does not produce readable model output. It is only a minimal check that the infrastructure works end to end.

docker run -it --name flash-smoke --gpus all --privileged --network=host --ipc=host --shm-size=16g --ulimit memlock=-1 radixark/miles:deepseek-v4 bash
# inside the container
rm -rf /root/miles && git clone --depth 1 --branch deepseek-v4 https://github.com/yueming-yuan/miles.git /root/miles && cd /root/miles && git pull --ff-only origin deepseek-v4 && pip install -e . --no-deps
rm -rf /root/sglang && git clone --depth 1 --branch deepseek_v4 https://github.com/sgl-project/sglang.git /root/sglang && cd /root/sglang && git pull --ff-only origin deepseek_v4 && pip install -e python --no-deps
cd /root/miles && python scripts/run_deepseek_v4.py full-train --model-name DeepSeek-V4-Flash-FP8-4layer --num-nodes 1 --num-gpus-per-node 8

DeepSeek-V4-Flash (284B)

The exact command may differ depending on your cluster setup. See scripts/run_deepseek_v4.py for the full training configuration, parallelism settings, and RL features, and adjust them as needed.
By default, the run requires 8 H200 nodes with 8 GPUs each, shared storage for /root/models and /root/datasets, working NCCL/Gloo networking, and a Ray cluster: either the local Ray head managed by the script, or an existing external cluster enabled via MILES_SCRIPT_EXTERNAL_RAY=1 and RAY_ADDRESS.

cd /root/miles
python scripts/run_deepseek_v4.py full-train \
    --model-name DeepSeek-V4-Flash-FP8 \
    --num-nodes 8 \
    --num-gpus-per-node 8
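
Before launching the multi-node run, it can help to check the prerequisites listed above on each node. The following is a minimal sketch, not part of miles: the `preflight` function name and its arguments are hypothetical, and it assumes only what the description states (shared /root/models and /root/datasets, and RAY_ADDRESS being required when MILES_SCRIPT_EXTERNAL_RAY=1). It does not verify NCCL/Gloo connectivity or GPU counts.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not shipped with miles).
# Usage: preflight [models_dir] [datasets_dir]
# Returns 0 if the basic prerequisites look satisfied, 1 otherwise.
preflight() {
    local models_dir="${1:-/root/models}"
    local datasets_dir="${2:-/root/datasets}"
    local status=0
    local dir

    # The shared storage paths must be visible on this node.
    for dir in "$models_dir" "$datasets_dir"; do
        if [ ! -d "$dir" ]; then
            echo "missing shared directory: $dir" >&2
            status=1
        fi
    done

    # An external Ray cluster needs an address to connect to.
    if [ "${MILES_SCRIPT_EXTERNAL_RAY:-0}" = "1" ] && [ -z "${RAY_ADDRESS:-}" ]; then
        echo "MILES_SCRIPT_EXTERNAL_RAY=1 but RAY_ADDRESS is not set" >&2
        status=1
    fi

    return "$status"
}
```

One way to use it is to gate the launch on the check, for example `preflight && python scripts/run_deepseek_v4.py full-train ...`. When attaching to an existing Ray cluster, export MILES_SCRIPT_EXTERNAL_RAY=1 and set RAY_ADDRESS to your cluster's address first. The expected address format depends on how the script connects to Ray and is not specified here.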

DeepSeek-V4-Pro (1.6T)

TODO, ETA Apr 29/30

Verified Model Sizes

  • DeepSeek-V4-Flash 284B
  • DeepSeek-V4-Pro 1.6T

Low Precisions

  • FP8 Training
  • FP4 QAT training recipe
    • MXFP4/NVFP4 support
    • FP4 QAT

Kernels

@Zhichenzzz

  • Integration of TileKernels
    • mHC forward & backward
  • Compressor kernels

Features / Optimizations

  • Anticipatory Routing
  • Ring-style CSA Context Parallel
