Conversation
amaslenn left a comment
@AgrawalAmey thanks a lot for your contribution! And sorry for the late feedback.
@@ -0,0 +1,23 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
For new files, please set a single year value (the diagnostic we have today is misleading).
def _get_sbatch_directives(self, args: Dict[str, Any], output_path: Path) -> Dict[str, str]:
    sbatch_directives = super()._get_sbatch_directives(args, output_path)
    # TODO(Amey): We probably need to figure out what to do with cpus-per-task, mem-per-cpu
This can be set with SlurmSystem.extra_sbatch_args. The downside is that it is set per System, so all tests in a scenario will have it.
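For reference, a rough sketch of the trade-off described here: because the extra args live on the System, every generated job in the scenario picks them up. The merge logic and example values below are assumptions for illustration, not CloudAI's actual code path.

```python
from typing import Dict, List

# Set once per System (field name per the comment above; values are examples).
extra_sbatch_args: List[str] = ["--cpus-per-task=32", "--mem-per-cpu=4G"]


def apply_system_extras(job_directives: Dict[str, str]) -> Dict[str, str]:
    """Merge system-wide args into a job's directives; every test in the scenario gets them."""
    merged = dict(job_directives)
    for arg in extra_sbatch_args:
        key, _, value = arg.lstrip("-").partition("=")
        merged[key] = value
    return merged


apply_system_extras({"job-name": "ray_test"})
# -> {'job-name': 'ray_test', 'cpus-per-task': '32', 'mem-per-cpu': '4G'}
```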
Basically, I want this to be dynamic, as a fraction of total resources.
Since we have to set the tasks per worker to 1 for Ray, we need to ensure that all the resources are made available to that process.
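To make the "fraction of total resources" idea concrete, here is a minimal sketch. The helper name, its parameters, and the per-node totals are hypothetical and not taken from CloudAI or this PR.

```python
from typing import Dict


def ray_resource_directives(
    base: Dict[str, str],
    cpus_per_node: int,
    mem_mb_per_node: int,
    resource_fraction: float = 1.0,
) -> Dict[str, str]:
    """Give the single Ray task per node a fraction of the node's CPUs and memory."""
    directives = dict(base)
    cpus = max(1, int(cpus_per_node * resource_fraction))
    directives["cpus-per-task"] = str(cpus)
    directives["mem-per-cpu"] = f"{int(mem_mb_per_node * resource_fraction) // cpus}M"
    return directives


# Example: a 64-core, 512 GB node, handing everything to the one Ray task.
ray_resource_directives({}, cpus_per_node=64, mem_mb_per_node=512_000)
# -> {'cpus-per-task': '64', 'mem-per-cpu': '8000M'}
```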
template_path = script_dir / "slurm_ray_container_template.sh.jinja"
template = Template(template_path.read_text())
conda_activate_command = f"conda activate {tdef.cmd_args.conda_env} && " if tdef.cmd_args.conda_env else ""
Please help me understand this part. Isn't the env for Ray already prepared inside the container? Why is this extra env needed?
In CloudAI we have the concept of installables: items that should be "installed" before a run (done with cloudai install ...). Examples: docker images, git repos with Python scripts (in which case we can create a venv for them), etc. Repos can be mounted into a container to make their files available.
Essentially, this is supposed to be an optional parameter to activate a specific environment if required. For instance, in the Vajra nightly perf test container, we have multiple envs for vllm, vajra, sglang etc.
I'm concerned that SlurmRayContainer is becoming too Vajra-specific. This shouldn't be a blocker, but if we can generalize it, that would be great. I don't have a good idea so far.
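One possible way to keep the knob while avoiding the conda-only shape, sketched with invented names (env_activation_cmd is not an existing CloudAI field): accept a full activation command rather than a conda env name, so the strategy stays framework-agnostic.

```python
from typing import Optional


def activation_prefix(env_activation_cmd: Optional[str]) -> str:
    """Return a shell prefix such as 'conda activate vajra && ', or '' if unset."""
    return f"{env_activation_cmd} && " if env_activation_cmd else ""


activation_prefix("conda activate vajra")                 # -> 'conda activate vajra && '
activation_prefix("source /opt/venvs/vllm/bin/activate")  # plain venvs fit the same field
activation_prefix(None)                                   # -> ''
```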
src/cloudai/workloads/slurm_ray_container/slurm_ray_container_template.sh.jinja (resolved)
        ),
        SlurmContainerCommandGenStrategy,
    ),
    "slurm_ray_container": lambda: create_test_run(
Please also update fixture.params for this one, otherwise this case will not run.
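For clarity, a minimal sketch of the kind of change being requested; the fixture and test names below are placeholders, not copied from the actual test module.

```python
import pytest


@pytest.fixture(
    params=[
        "slurm_container",
        "slurm_ray_container",  # without this entry the new case is never parametrized
    ]
)
def case_name(request) -> str:
    return request.param


def test_command_gen(case_name: str) -> None:
    # The real test would look up the factory dict shown in the diff above.
    assert case_name
```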
Summary
This PR adds a new test category that runs Ray applications with Slurm.
Tested with:
Generated sbatch file: