Conversation
amaslenn left a comment
@AgrawalAmey thanks a lot for your contribution! And sorry for the late feedback.
@@ -0,0 +1,23 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
For new files, please set a single year value (the diagnostic we have today is misleading).
def _get_sbatch_directives(self, args: Dict[str, Any], output_path: Path) -> Dict[str, str]:
    sbatch_directives = super()._get_sbatch_directives(args, output_path)
    # TODO(Amey): We probably need to figure out what to do with cpus-per-task, mem-per-cpu
This can be set with SlurmSystem.extra_sbatch_args. The downside is that it is set per System, so all tests in a scenario will have it.
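For reference, a rough sketch of the trade-off described here: because the extra args live on the System, every generated job in the scenario picks them up. The merge logic and example values below are assumptions for illustration, not CloudAI's actual code path.

```python
from typing import Dict, List

# Set once per System (field name per the comment above; values are examples).
extra_sbatch_args: List[str] = ["--cpus-per-task=32", "--mem-per-cpu=4G"]


def apply_system_extras(job_directives: Dict[str, str]) -> Dict[str, str]:
    """Merge system-wide args into a job's directives; every test in the scenario gets them."""
    merged = dict(job_directives)
    for arg in extra_sbatch_args:
        key, _, value = arg.lstrip("-").partition("=")
        merged[key] = value
    return merged


apply_system_extras({"job-name": "ray_test"})
# -> {'job-name': 'ray_test', 'cpus-per-task': '32', 'mem-per-cpu': '4G'}
```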
Basically, I want this to be dynamic, as a fraction of total resources.
Since we have to set the tasks per worker to 1 for Ray, we need to ensure that all the resources are made available to that process.
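To make the "fraction of total resources" idea concrete, here is a minimal sketch. The helper name, its parameters, and the per-node totals are hypothetical and not taken from CloudAI or this PR.

```python
from typing import Dict


def ray_resource_directives(
    base: Dict[str, str],
    cpus_per_node: int,
    mem_mb_per_node: int,
    resource_fraction: float = 1.0,
) -> Dict[str, str]:
    """Give the single Ray task per node a fraction of the node's CPUs and memory."""
    directives = dict(base)
    cpus = max(1, int(cpus_per_node * resource_fraction))
    directives["cpus-per-task"] = str(cpus)
    directives["mem-per-cpu"] = f"{int(mem_mb_per_node * resource_fraction) // cpus}M"
    return directives


# Example: a 64-core, 512 GB node, handing everything to the one Ray task.
ray_resource_directives({}, cpus_per_node=64, mem_mb_per_node=512_000)
# -> {'cpus-per-task': '64', 'mem-per-cpu': '8000M'}
```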
template_path = script_dir / "slurm_ray_container_template.sh.jinja"
template = Template(template_path.read_text())
conda_activate_command = f"conda activate {tdef.cmd_args.conda_env} && " if tdef.cmd_args.conda_env else ""
Please help me understand this part. Isn't the env for Ray already prepared inside the container? Why is this extra env needed?
In CloudAI we have the concept of installables: items that should be "installed" before a run (done with cloudai install ...). Examples: docker images, git repos with Python scripts (in which case we can create a venv for them), etc. Repos can be mounted into a container to make their files available.
Essentially, this is supposed to be an optional parameter to activate a specific environment if required. For instance, in the Vajra nightly perf test container, we have multiple envs for vllm, vajra, sglang etc.
I'm concerned that SlurmRayContainer is becoming too Vajra-specific. This shouldn't be a blocker, but if we can generalize it, that would be great. I don't have a good idea so far.
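One possible way to keep the knob while avoiding the conda-only shape, sketched with invented names (env_activation_cmd is not an existing CloudAI field): accept a full activation command rather than a conda env name, so the strategy stays framework-agnostic.

```python
from typing import Optional


def activation_prefix(env_activation_cmd: Optional[str]) -> str:
    """Return a shell prefix such as 'conda activate vajra && ', or '' if unset."""
    return f"{env_activation_cmd} && " if env_activation_cmd else ""


activation_prefix("conda activate vajra")                 # -> 'conda activate vajra && '
activation_prefix("source /opt/venvs/vllm/bin/activate")  # plain venvs fit the same field
activation_prefix(None)                                   # -> ''
```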
src/cloudai/workloads/slurm_ray_container/slurm_ray_container_template.sh.jinja (resolved)
        ),
        SlurmContainerCommandGenStrategy,
    ),
    "slurm_ray_container": lambda: create_test_run(
Please also update fixture.params for this one, otherwise this case will not run.
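For clarity, a minimal sketch of the kind of change being requested; the fixture and test names below are placeholders, not copied from the actual test module.

```python
import pytest


@pytest.fixture(
    params=[
        "slurm_container",
        "slurm_ray_container",  # without this entry the new case is never parametrized
    ]
)
def case_name(request) -> str:
    return request.param


def test_command_gen(case_name: str) -> None:
    # The real test would look up the factory dict shown in the diff above.
    assert case_name
```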
Summary
This PR adds a new test category that runs Ray applications with Slurm.
Tested with:
Generated sbatch file: