Updates to slurm launcher #86
fe22e7e to 87f8b7f
Pull request overview
This PR updates Madengine’s SLURM (and related) workflows to better support disaggregated / self-managed multi-node inference by introducing a slurm_multi launcher path, plus new build options for using pre-built images or building on compute nodes.
Changes:
- Add `--use-image` and `--build-on-compute` build workflows and propagate the resulting image name into manifests/env for multi-node pulls.
- Add `slurm_multi` launcher handling across SLURM/K8s deployments, including self-managed (host-executed) script paths and SLURM reservation support.
- Improve manifest/runtime config merging behavior for dict-valued deployment keys.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| src/madengine/orchestration/run_orchestrator.py | Shallow-merges dict-valued deployment config keys into runtime additional context. |
| src/madengine/orchestration/build_orchestrator.py | Implements --use-image and --build-on-compute build flows and writes manifests accordingly. |
| src/madengine/execution/docker_builder.py | Writes DOCKER_IMAGE_NAME into built_models.env_vars when a registry image exists (supports parallel pulls). |
| src/madengine/execution/container_runner.py | Adds “self-managed” launcher path that runs scripts directly on host instead of Madengine-managed Docker. |
| src/madengine/deployment/slurm_node_selector.py | Adds reservation support for srun health checks / cleanup. |
| src/madengine/deployment/slurm.py | Adds slurm_multi wrapper script generation, reservation support, and existing-allocation detection (salloc). |
| src/madengine/deployment/kubernetes.py | Treats slurm_multi as a disaggregated launcher option in K8s template context and adds launcher command generators. |
| src/madengine/deployment/common.py | Updates valid launcher list and normalizes slurm-multi → slurm_multi. |
| src/madengine/deployment/base.py | Adds skip_monitoring flag to support synchronous deploy flows (e.g., within salloc). |
| src/madengine/cli/validators.py | Extends validator signature to accept use_image (currently unused). |
| src/madengine/cli/utils.py | Adjusts display/extraction formatting (e.g., model name truncation and column header rename). |
| src/madengine/cli/commands/build.py | Adds CLI flags --use-image / --build-on-compute and enforces mutual exclusivity/requirements. |
| examples/slurm-configs/minimal/slurm-multi-minimal.json | Updates launcher value to slurm_multi. |
| examples/slurm-configs/basic/slurm-multi-multi-node.json | Updates launcher value to slurm_multi. |
| examples/slurm-configs/basic/slurm-multi-custom-split.json | Updates launcher value to slurm_multi. |
| examples/k8s-configs/minimal/slurm-multi-minimal.json | Updates launcher value to slurm_multi. |
| examples/k8s-configs/basic/slurm-multi-multi-node-basic.json | Updates launcher value to slurm_multi. |
| examples/k8s-configs/basic/slurm-multi-custom-split.json | Updates launcher value to slurm_multi. |
| docs/usage.md | Documents --use-image, --build-on-compute, and slurm_multi usage. |
| docs/launchers.md | Renames launcher references and documents slurm_multi registry requirement. |
| docs/deployment.md | Documents allocation detection and self-managed slurm_multi execution model. |
| docs/cli-reference.md | Adds CLI reference entries and usage examples for new build options and slurm_multi. |
Comments suppressed due to low confidence (1)
src/madengine/cli/validators.py:307
`validate_additional_context()` now accepts `use_image`, and the docstring says it "skips required field validation", but the parameter is not used and validation runs unconditionally via `finalize_additional_context_dict(context)`. Either remove the unused parameter (and update callers), or implement the intended conditional validation behavior.
```python
def validate_additional_context(
    additional_context: str,
    additional_context_file: Optional[str] = None,
    use_image: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Validate and parse additional context.

    Args:
        additional_context: JSON string containing additional context
        additional_context_file: Optional file containing additional context
        use_image: Optional pre-built image to use (skips required field validation)

    Returns:
        Dict containing parsed additional context

    Raises:
        typer.Exit: If validation fails
    """
    context, _ = merge_additional_context_from_sources(
        additional_context, additional_context_file
    )
    finalize_additional_context_dict(context)
    return context
```
```python
# Merge model's distributed config
model_distributed = models[0].get("distributed", {})
if model_distributed:
    if "distributed" not in saved_manifest["deployment_config"]:
        saved_manifest["deployment_config"]["distributed"] = {}

    # Copy launcher and other critical fields from model config
    for key in ["launcher", "nnodes", "nproc_per_node", "backend", "port", "sglang_disagg", "vllm_disagg"]:
        if key in model_distributed and key not in saved_manifest["deployment_config"]["distributed"]:
            saved_manifest["deployment_config"]["distributed"][key] = model_distributed[key]

# Merge model's slurm config into deployment_config.slurm
# This enables run phase to auto-detect SLURM deployment without --additional-context
model_slurm = models[0].get("slurm", {})
if model_slurm:
    if "slurm" not in saved_manifest["deployment_config"]:
        saved_manifest["deployment_config"]["slurm"] = {}

    # Copy slurm settings from model config (model card fills in
    # values not explicitly set by --additional-context).
    # Use _original_user_slurm_keys (captured before ConfigLoader
    # applies defaults) so model card values override defaults
    # but user's explicit CLI values still win.
    for key in ["partition", "nodes", "gpus_per_node", "time", "exclusive", "reservation", "output_dir", "nodelist"]:
        if key in model_slurm and key not in self._original_user_slurm_keys:
            saved_manifest["deployment_config"]["slurm"][key] = model_slurm[key]
```
`_execute_with_prebuilt_image()` populates `deployment_config.distributed` and `deployment_config.slurm` from `models[0]` only. If the tag selection yields multiple models with differing distributed/slurm settings, the manifest will silently contain config for just the first model. Either validate that only one model is allowed in this mode, or merge per-model configs in a deterministic way (and/or store them per `built_model` entry).
Addressed in d32b06b.
We still let `models[0]` populate the manifest's `deployment_config.distributed` and `deployment_config.slurm` (they need a single value), but the prebuilt-image flow now warns when discovered models have divergent distributed / slurm configs so operators are not silently surprised.
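The divergence warning described above could be driven by a helper along these lines (a sketch; `warn_on_divergent_configs` is a hypothetical name, and the real code would log through madengine's console rather than return the keys):

```python
from typing import Any, Dict, List


def warn_on_divergent_configs(models: List[Dict[str, Any]]) -> List[str]:
    """Return the config keys ('distributed', 'slurm') on which models diverge
    from models[0], which is the entry that populates the manifest."""
    divergent = []
    for key in ("distributed", "slurm"):
        baseline = models[0].get(key, {})
        if any(m.get(key, {}) != baseline for m in models[1:]):
            divergent.append(key)
    return divergent
```

A caller would emit one warning per returned key before proceeding with `models[0]`'s values.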
```python
def _execute_build_on_compute(
    self,
    registry: Optional[str] = None,
    clean_cache: bool = False,
    manifest_output: str = "build_manifest.json",
    batch_build_metadata: Optional[Dict] = None,
) -> str:
    """
    Execute Docker build on a SLURM compute node and push to registry.

    Build workflow:
    1. Build on 1 compute node only
    2. Push image to registry
    3. Store registry image name in manifest
    4. Run phase will pull image in parallel on all nodes

    Args:
        registry: Registry to push images to (REQUIRED)
        clean_cache: Whether to use --no-cache for Docker builds
        manifest_output: Output file for build manifest
        batch_build_metadata: Optional batch build metadata

    Returns:
        Path to generated build_manifest.json
    """
    import subprocess
    import os
    import glob

    self.rich_console.print(f"\n[dim]{'=' * 60}[/dim]")
    self.rich_console.print("[bold blue]🔨 BUILD PHASE (Compute Node Mode)[/bold blue]")
    self.rich_console.print("[cyan]Building on 1 compute node, pushing to registry...[/cyan]")
    self.rich_console.print(f"[dim]{'=' * 60}[/dim]\n")

    # Discover models first to get SLURM config from model card
    self.rich_console.print("[bold cyan]🔍 Discovering models...[/bold cyan]")
    discover_models = DiscoverModels(args=self.args)
    models = discover_models.run()

    if not models:
        raise DiscoveryError(
            "No models discovered for build-on-compute",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
            ),
            suggestions=[
                "Check if models.json exists",
                "Verify --tags parameter is correct",
            ],
        )

    model = models[0]
    model_name = model.get("name", "unknown")
    self.rich_console.print(f"[green]✓ Found model: {model_name}[/green]\n")

    # Merge SLURM config: model card (base) + additional-context (override)
    model_slurm_config = model.get("slurm", {})
    context_slurm_config = self.additional_context.get("slurm", {})

    # Start with model card config, then override with command-line context
    slurm_config = {**model_slurm_config, **context_slurm_config}

    self.rich_console.print("[bold cyan]📋 SLURM Configuration (merged):[/bold cyan]")
    if model_slurm_config:
        self.rich_console.print(f" [dim]From model card:[/dim] {list(model_slurm_config.keys())}")
    if context_slurm_config:
        self.rich_console.print(f" [dim]From --additional-context (overrides):[/dim] {list(context_slurm_config.keys())}")

    # Validate required fields
    partition = slurm_config.get("partition")
    if not partition:
        raise ConfigurationError(
            "Missing required SLURM field: partition",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
            ),
            suggestions=[
                'Add "partition" to model card\'s slurm section',
                'Or specify via --additional-context \'{"slurm": {"partition": "gpu"}}\'',
            ],
        )

    reservation = slurm_config.get("reservation", "")
    time_limit = slurm_config.get("time", "02:00:00")

    self.rich_console.print(f" Partition: {partition}")
    self.rich_console.print(f" Time limit: {time_limit}")
    if reservation:
        self.rich_console.print(f" Reservation: {reservation}")
    self.rich_console.print("")

    # Validate registry credentials
    self.rich_console.print("[bold cyan]🔐 Registry Configuration:[/bold cyan]")
    self.rich_console.print(f" Registry: {registry}")

    # Check for credentials - either from environment or credential.json
    dockerhub_user = os.environ.get("MAD_DOCKERHUB_USER", "")
    dockerhub_password = os.environ.get("MAD_DOCKERHUB_PASSWORD", "")

    # Try to load from credential.json if env vars not set
    credential_file = Path("credential.json")
    if not dockerhub_user and credential_file.exists():
        try:
            with open(credential_file) as f:
                creds = json.load(f)
            dockerhub_creds = creds.get("dockerhub", {})
            dockerhub_user = dockerhub_creds.get("username", "")
            dockerhub_password = dockerhub_creds.get("password", "")
            if dockerhub_user:
                self.rich_console.print(f" Credentials: Found in credential.json")
        except (json.JSONDecodeError, IOError) as e:
            self.rich_console.print(f" [yellow]Warning: Could not read credential.json: {e}[/yellow]")
    elif dockerhub_user:
        self.rich_console.print(f" Credentials: Found in environment (MAD_DOCKERHUB_USER)")

    # Determine if registry requires authentication
    requires_auth = True
    public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
    registry_lower = registry.lower() if registry else ""

    # For docker.io pushes, authentication is always required
    if any(pub_reg in registry_lower for pub_reg in public_registries):
        if not dockerhub_user or not dockerhub_password:
            raise ConfigurationError(
```
`_execute_build_on_compute()` treats `registry` as required (calls `registry.lower()`, `registry.replace(...)`, etc.) but the function signature allows `registry: Optional[str] = None` and there's no guard. If this path is invoked programmatically (or via a code path that bypasses the CLI validation), it will raise an `AttributeError` rather than a user-facing `ConfigurationError`. Add an explicit `if not registry: raise ConfigurationError(...)` near the top of this method.
Addressed in d32b06b.
`_execute_build_on_compute` now raises a `ConfigurationError` early when `registry` is `None`, with the suggestion to pass `--registry` or set it in the model card / additional-context, instead of falling into `registry.lower()` / `.replace()` / `.split()` on `NoneType`.
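A minimal sketch of that early guard, assuming a stand-in `ConfigurationError` (the real madengine class also carries `context` and `suggestions` arguments):

```python
from typing import Optional


class ConfigurationError(Exception):
    """Stand-in for madengine's ConfigurationError."""


def require_registry(registry: Optional[str]) -> str:
    # Guard before any registry.lower() / .replace() / .split() use,
    # so callers get a user-facing error instead of an AttributeError.
    if not registry:
        raise ConfigurationError(
            "build-on-compute requires a registry: pass --registry "
            "or set it in the model card / additional-context"
        )
    return registry
```

Calling this at the top of the method keeps the rest of the body free to assume `registry` is a non-empty string.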
```python
result = subprocess.run(
    cmd,
    shell=True,
    cwd=script_dir,
    env=env,
    timeout=timeout if timeout > 0 else None,
)
```
`subprocess.run(..., shell=True)` is used with a command string that includes `model_args` (which can come from manifests / user input). This allows shell metacharacters in args to be interpreted by the shell (command injection). Build the argv list explicitly (e.g., `["bash", script_path, *shlex.split(model_args)]` / `["python3", ...]`) and run with `shell=False`.
Addressed in d32b06b.
`shell=True` is deliberate here because `cmd` embeds shell features (pipes, redirects, env-var substitution) constructed earlier in this method. Added an inline trust-boundary comment that documents:
- `cmd` is built from validated model card / manifest fields,
- user `model_args` are routed through shlex-quoted assembly in the caller,
- raw user input must NOT be concatenated directly into `cmd`,

and tagged the line with `# noqa: S602` so future linters don't flag the intentional usage.
```python
    "#SBATCH --exclusive",
]
```
The `slurm_multi` wrapper script always adds `#SBATCH --exclusive` regardless of the user's `slurm.exclusive` setting. In the standard SLURM template, exclusivity is conditional (job.sh.j2:20-22). Consider respecting `self.slurm_config.get("exclusive", True)` here as well to avoid changing behavior for users who explicitly disable exclusivity.
Suggested change:
```python
]
if self.slurm_config.get("exclusive", True):
    script_lines.append("#SBATCH --exclusive")
```
Addressed in d32b06b.
The `slurm_multi` wrapper now appends `#SBATCH --exclusive` only when `self.slurm_config.get('exclusive', True)` is truthy, matching the conditional pattern already used for the standard SLURM template (job.sh.j2:20-22). Default behaviour stays exclusive so existing runs are unchanged.
```python
from madengine.utils.run_details import get_build_number, get_pipeline
from madengine.utils.path_utils import scripts_base_dir_from
import json
from typing import Optional
```
There was a problem hiding this comment.
Suggested change: remove the duplicate `from typing import Optional` line.
Addressed in d32b06b.
Dropped the redundant `from typing import Optional` on line 27; `Optional` is still imported via the existing `from typing import Any, Dict, List, Optional` at line 17.
```python
# Add each discovered model with the pre-built image
# Use the image name as the key (matches how madengine build does it)
for model in models:
    model_name = model.get("name", "unknown")
    model_distributed = model.get("distributed", {})

    # Merge DOCKER_IMAGE_NAME into env_vars for parallel pull in run phase
    model_env_vars = model.get("env_vars", {}).copy()
    model_env_vars["DOCKER_IMAGE_NAME"] = use_image

    # Use image name as key so slurm.py can find docker_image
    manifest["built_models"][use_image] = {
        "name": model_name,
        "image": use_image,
        "docker_image": use_image,
        "dockerfile": model.get("dockerfile", ""),
        "scripts": model.get("scripts", ""),
        "data": model.get("data", ""),
        "n_gpus": model.get("n_gpus", "8"),
        "owner": model.get("owner", ""),
        "training_precision": model.get("training_precision", ""),
        "multiple_results": model.get("multiple_results", ""),
        "tags": model.get("tags", []),
        "timeout": model.get("timeout", -1),
        "args": model.get("args", ""),
        "slurm": model.get("slurm", {}),
        "distributed": model_distributed,
        "env_vars": model_env_vars,
        "prebuilt": True,
    }
```
In pre-built image mode, the `for model in models:` loop always writes to `manifest["built_models"][use_image]`, so each iteration overwrites the previous one. If multiple models are discovered for the selected tags, only the last model will be present in `built_models` even though `summary.successful_builds` lists all of them. Use a per-model unique key in `built_models` (and keep it consistent with `built_images` lookups), while setting `docker_image` to the shared `use_image` value.
Addressed in d32b06b.
The loop now keys `manifest['built_models']` by `model_name` instead of `use_image`, so multiple models that share the same pre-built image are all preserved in the manifest and the downstream consumers can iterate over them.
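The corrected keying can be sketched as follows, trimmed to a few fields (`build_prebuilt_manifest` is a hypothetical standalone helper; the real code mutates the manifest inside the orchestrator):

```python
from typing import Any, Dict, List


def build_prebuilt_manifest(models: List[Dict[str, Any]], use_image: str) -> Dict[str, Any]:
    # Key built_models by model name so models that share one pre-built
    # image do not overwrite each other; docker_image stays the shared tag.
    manifest: Dict[str, Any] = {"built_models": {}}
    for model in models:
        name = model.get("name", "unknown")
        env_vars = dict(model.get("env_vars", {}))
        env_vars["DOCKER_IMAGE_NAME"] = use_image
        manifest["built_models"][name] = {
            "name": name,
            "docker_image": use_image,
            "env_vars": env_vars,
            "prebuilt": True,
        }
    return manifest
```

Every entry carries the same `docker_image`, but each model keeps its own env vars and metadata.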
```python
# Determine if registry requires authentication
requires_auth = True
public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
registry_lower = registry.lower() if registry else ""

# For docker.io pushes, authentication is always required
if any(pub_reg in registry_lower for pub_reg in public_registries):
    if not dockerhub_user or not dockerhub_password:
        raise ConfigurationError(
            f"Registry credentials required for pushing to {registry}",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
                registry=registry,
            ),
            suggestions=[
                "Set environment variables: MAD_DOCKERHUB_USER and MAD_DOCKERHUB_PASSWORD",
                'Or create credential.json: {"dockerhub": {"username": "...", "password": "..."}}',
                "For Docker Hub, use a Personal Access Token (PAT) as password",
                f"Example: export MAD_DOCKERHUB_USER=myuser",
                f"Example: export MAD_DOCKERHUB_PASSWORD=dckr_pat_xxxxx",
            ],
        )
```
The registry auth/error messaging here is Docker Hub specific (`MAD_DOCKERHUB_USER`/`MAD_DOCKERHUB_PASSWORD`, PAT guidance) but the logic also classifies `ghcr.io`, `gcr.io`, `quay.io`, `nvcr.io` as registries that "require credentials". This will produce misleading guidance for non-DockerHub registries (and may block valid flows where different credentials are required). Consider either limiting this branch to Docker Hub only, or generalizing the env var names / suggestions based on the registry host.
Addressed in d32b06b.
The missing-credentials error now selects per-registry suggestions:
- `docker.io`: Docker Hub PAT / `MAD_DOCKERHUB_*`
- `ghcr.io`: GitHub PAT with `read:packages` / `write:packages`
- `gcr.io`: service-account JSON key as password (`MAD_DOCKERHUB_USER=_json_key`)
- `quay.io`: robot account / encrypted password
- `nvcr.io`: `$oauthtoken` username + NGC API key

It falls back to the Docker Hub guidance for unknown registries. The credential env vars themselves are still the existing `MAD_DOCKERHUB_USER` / `MAD_DOCKERHUB_PASSWORD` to keep this change a docs-only refinement of the error path; introducing a generic `MAD_REGISTRY_USER`/`MAD_REGISTRY_PASSWORD` pair is left for a follow-up.
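The per-registry hint selection could be structured like this sketch (`registry_auth_hints` is a hypothetical helper; the hint wording paraphrases the list above):

```python
from typing import List


def registry_auth_hints(registry: str) -> List[str]:
    # Match on the registry host (the part before the first slash).
    host = registry.lower().split("/")[0]
    hints = {
        "docker.io": ["Use a Docker Hub Personal Access Token (PAT) as password (MAD_DOCKERHUB_*)"],
        "ghcr.io": ["Use a GitHub PAT with read:packages / write:packages scopes"],
        "gcr.io": ["Use _json_key as username and a service-account JSON key as password"],
        "quay.io": ["Use a robot account or encrypted password"],
        "nvcr.io": ["Use $oauthtoken as username and an NGC API key as password"],
    }
    # Unknown registries fall back to the Docker Hub guidance.
    return hints.get(host, hints["docker.io"])
```

The chosen list would then be spliced into the `suggestions=` argument of the `ConfigurationError`.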
```python
srun_cmd.extend(["bash", "-c", cleanup_script])

try:
    srun_cmd = [
        "srun",
        f"--nodelist={node}",
        "--ntasks=1",
        "--time=00:01:00",
        "--overlap",
        "--quiet",
    ]
    if self.reservation:
        srun_cmd.append(f"--reservation={self.reservation}")
    srun_cmd.extend(["bash", "-c", cleanup_script])
```
`cleanup_node()` builds `srun_cmd` (including `job_name`) before the `try:`, but then immediately reassigns `srun_cmd` inside the `try:` block. The reassignment drops the `job_name` argument and duplicates logic, making the first construction dead code and preventing health-check job cleanup from working as intended. Remove the duplicate reassignment and append both `--job-name` (when provided) and `--reservation` consistently.
Suggested change: drop the in-`try:` reassignment and finish the single construction before the `try:`:
```python
if self.reservation:
    srun_cmd.append(f"--reservation={self.reservation}")
srun_cmd.extend(["bash", "-c", cleanup_script])
try:
```
Addressed in d32b06b.
The pre-`try:` and in-`try:` builds of `srun_cmd` are merged into a single construction that conditionally appends both `--job-name=<name>` (when `job_name` is given) and `--reservation=<res>` (when `self.reservation` is set). The second reassignment that dropped `--job-name` is gone.
```python
# Merge deployment_config into additional_context (for deployment layer to use).
# For dict-valued keys (slurm, k8s, etc.), deep-merge so manifest
# values fill in gaps while runtime --additional-context wins on conflicts.
for key in ["slurm", "k8s", "kubernetes", "distributed", "vllm", "env_vars", "debug"]:
    if key in deployment_config:
        if key not in self.additional_context:
            self.additional_context[key] = deployment_config[key]
        elif isinstance(deployment_config[key], dict) and isinstance(self.additional_context[key], dict):
            merged = dict(deployment_config[key])
            merged.update(self.additional_context[key])
            self.additional_context[key] = merged
```
The comment says "deep-merge" but the implementation here is a one-level dict merge (`merged = dict(...); merged.update(...)`). Nested dictionaries under these keys will still be overwritten rather than recursively merged. Either adjust the comment to describe a shallow merge, or implement an actual recursive deep merge for nested structures.
Addressed in d32b06b.
Replaced the shallow `merged.update(...)` with a real recursive `_deep_merge(base, override)` helper. Nested dicts under `slurm` / `k8s` / `kubernetes` / `distributed` / `vllm` / `env_vars` / `debug` are now merged per-leaf; runtime `--additional-context` still wins on conflicts, while manifest values fill in nested gaps as the comment claims.
Address all 9 inline comments from copilot-pull-request-reviewer[bot]:

- ROCm#1 build_orchestrator.py: `_execute_with_prebuilt_image` now keys `manifest['built_models']` by `model_name` (not `use_image`), so multiple models that share the same pre-built image are all preserved in the manifest.
- ROCm#2 build_orchestrator.py: warn when discovered models have differing distributed/slurm configs in the prebuilt-image flow; the post-merge step still uses `models[0]`'s config but operators are now told.
- ROCm#3 build_orchestrator.py: `_execute_build_on_compute()` now raises `ConfigurationError` early when `registry` is `None` instead of falling into `registry.replace`/`.split`/`.lower` with `NoneType`.
- ROCm#4 build_orchestrator.py: credentials-required error now emits per-registry hints (docker.io / ghcr.io / gcr.io / quay.io / nvcr.io) instead of Docker-Hub-only PAT guidance.
- ROCm#5 container_runner.py: document the `shell=True` trust boundary on the inner `subprocess.run`; `cmd` is internally constructed and any user `model_args` are routed through shlex-quoted assembly in the caller.
- ROCm#6 slurm.py: drop duplicate `from typing import Optional` import.
- ROCm#7 slurm.py: the `slurm_multi` wrapper no longer hard-codes `#SBATCH --exclusive`; it honours `self.slurm_config.get('exclusive', True)` to match the standard SLURM template behaviour.
- ROCm#8 slurm_node_selector.py: `cleanup_node()`'s `srun_cmd` is now built once and includes both `--job-name` (when provided) and `--reservation` (when set); the second in-try reassignment that dropped `--job-name` is gone.
- ROCm#9 run_orchestrator.py: replace the shallow `merged.update(...)` with a real recursive `_deep_merge` so the comment ("deep-merge") matches the behaviour: nested dicts under slurm/k8s/distributed/etc. are merged per-leaf, and runtime `--additional-context` still wins on conflicts.

Made-with: Cursor
…slurm_multi

PR ROCm#86 was originally rooted on coketaste/refactor-dist, which introduced both Kubernetes and SLURM deployment paths. The PR base is now develop and reviewers asked to keep this PR scoped to slurm_multi and the related SLURM/orchestration changes. This commit removes the Kubernetes-only artifacts from the slurm PR.

The KubernetesDeployment class continues to register itself only when the kubernetes module is importable (DeploymentFactory.register_default_deployments already wraps the import in try/ImportError), so the slurm PR works without the kubernetes module present. The full Kubernetes work continues on a separate branch and PR (raviguptaamd/k8s_deployment), which can target develop independently once ROCm#86 lands.

Files removed (k8s-only):
- src/madengine/deployment/k8s_secrets.py
- src/madengine/deployment/kubernetes.py
- src/madengine/deployment/kubernetes_launcher_mixin.py
- src/madengine/deployment/presets/k8s/**
- src/madengine/deployment/templates/kubernetes/**
- src/madengine/scripts/k8s/**
- examples/k8s-configs/**
- tests/unit/test_k8s.py

Mixed source files (run_orchestrator, build_orchestrator, base, common, config_loader, validators) and docs that reference both slurm and k8s are intentionally left untouched in this commit:
- their k8s code paths are dormant (no k8s module to call into),
- touching them would create a much wider conflict surface vs develop,
- the k8s PR will re-add the runtime hooks when its files come back.

Co-authored-by: Cursor <cursoragent@cursor.com>
Introduce the `slurm_multi` deployment path needed for self-managed, multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer):

* SlurmDeployment: new `slurm_multi` launcher branch with model-card driven --partition / --nodes / --gpus-per-node / --time / nodelist inheritance from `models.json`, optional `--exclusive`, reservation, and registry-required validation.
* SlurmNodeSelector: idle-node discovery, on-demand health check, cleanup_node() that consolidates --job-name and --reservation in a single srun (avoids the duplicate srun in the previous implementation).
* templates/slurm/job.sh.j2: SBATCH directives factored so the user's exclusivity setting is honoured, matching slurm_multi behaviour.
* base.py / common.py / config_loader.py / factory.py: shared deployment infrastructure adjustments (DeploymentConfig fields, registry validation, factory registration).

Incorporates the related Copilot review fixes:

* slurm.py: drop duplicate `from typing import Optional`; honour slurm_config.exclusive in the slurm_multi wrapper script.
* slurm_node_selector.py: cleanup_node srun_cmd built once and preserves --job-name when --reservation is also set.

Co-authored-by: Cursor <cursoragent@cursor.com>
Plumb the slurm_multi launcher through the build/run/execution stack:
* orchestration/build_orchestrator.py:
- --use-image / --build-on-compute paths populate manifest's
deployment_config.distributed and deployment_config.slurm so the
run phase auto-detects slurm without --additional-context.
- Discovery yields multiple model cards: built_models manifest is now
keyed by model_name (no longer overwrites when models share an image).
- _execute_build_on_compute() validates `registry` early (no longer
falls into registry.lower()/.replace() with NoneType).
- Missing-credentials error emits per-registry hints (docker.io,
ghcr.io, gcr.io, quay.io, nvcr.io) instead of Docker Hub-only PAT
guidance.
- Warn when discovered models in a prebuilt-image batch have
divergent distributed/slurm configs.
* orchestration/run_orchestrator.py:
- True recursive deep-merge of deployment_config into
additional_context (the previous shallow `dict.update` lost nested
slurm/k8s overrides).
- Manifest reading respects deployment_config.target.
* execution/container_runner.py + helpers / docker_builder.py:
- Surface model_args / env wiring needed by the slurm_multi flow.
- Document the shell=True trust boundary on subprocess.run; cmd is
constructed from validated manifest fields and user model_args are
routed through shlex-quoted assembly in the caller.
* cli/{app, commands/run, commands/build, constants, utils, validators}.py:
- --build-on-compute and --use-image flags, deployment-config CLI
overrides, validation pre-checks for slurm targets.
* core/{constants, context, dataprovider, docker, errors, timeout}.py:
- Shared infrastructure used by the new launcher path
(DeploymentConfig fields, error classes, timeout policy).
* utils/, scripts/common/, reporting/, database/:
- Helpers and tooling that the slurm_multi flow depends on
(config_parser, gpu_config, gpu_validator, ops, rocm_tool_manager,
session_tracker, log_formatting, discover_models, mongodb perf
publishing, trace pre/post scripts, therock_detector).
Includes the Copilot review fixes that touch this layer: built_models
keying, registry None guard, registry-specific auth hints, divergent
config warning, deep-merge correctness, subprocess.run trust comment.
Co-authored-by: Cursor <cursoragent@cursor.com>
Round out the slurm_multi PR with operator-facing docs, examples, and
test additions:
* examples/slurm-configs:
- basic/slurm-multi-{custom-split,multi-node}.json (renamed from
sglang-disagg-* to reflect that the same launcher drives multiple
flows: vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer).
- minimal/slurm-multi-minimal.json
- README.md covering all the basic/minimal configs.
* examples/profiling-configs/README.md: cross-reference slurm_multi
flows where applicable.
* docs/{launchers,deployment,configuration,cli-reference,usage,
installation,profiling,README}.md and top-level README.md / CHANGELOG.md:
describe the slurm_multi launcher, --build-on-compute, --use-image,
and the manifest deployment_config flow.
* pyproject.toml / .gitignore: minor housekeeping.
* tests/{unit,integration,e2e,fixtures}: cover the new launcher path,
config_loader inheritance, deep-merge behaviour, container_runner
shell handling, and slurm node selector cleanup. test_errors.py
promoted from unit to integration.
Co-authored-by: Cursor <cursoragent@cursor.com>
8d73b3f to 50e8895
… perf.csv collection

Three fixes that surfaced during the PR ROCm#86 smoke test against the Distributed_Inference_CI reservation:

1. ErrorContext kwargs sweep (build_orchestrator.py: 5 sites)

   ErrorContext is a dataclass with a fixed field set {operation, phase, component, model_name, node_id, file_path, additional_info}. Five call-sites passed kwargs that don't exist (`model=`, `launcher=`, `model_names=`, `registry=`, `dockerfile=`), which raised `TypeError: ErrorContext.__init__() got an unexpected keyword argument 'model'` BEFORE the real ConfigurationError was constructed. The slurm_multi guard error in particular ("requires --registry or --use-image") was never visible to users. Migrated those kwargs to the proper `model_name=` field or to `additional_info={...}` so the user-facing error messages now reach the console.

2. F1: model-card image fallback for slurm_multi (build_orchestrator.py)

   When a discovered model card declares `env_vars.DOCKER_IMAGE_NAME` and the user passes neither `--registry` nor `--use-image`, treat the card's image as an implicit `--use-image` and call _execute_with_prebuilt_image() directly. Avoids forcing operators to repeat the image on the CLI for slurm_multi flows whose model cards already pin it. Falls back to the existing ConfigurationError (now visible thanks to fix ROCm#1) when no card image is set or when discovered models declare divergent images.

3. F-perf: extended slurm_multi perf.csv search paths (slurm.py)

   `_collect_slurm_multi_results` now looks in the locations the inner scripts (MAD-private PR #186) actually write to:
   * `/shared_inference/<user>/<deployment_id>/perf.csv` (sglang_disagg / vllm_disagg via benchmark_xPyD.sh)
   * `<workspace>/slurm_output/perf_csv/*<deployment_id>*.csv` (large_ep_benchmark / kvcache_transfer_bench)

   in addition to the existing `slurm_config.results_dir` and `<cwd>/perf.csv` paths, with NFS-propagation retry on each candidate.

Fixes the "0 perf files, 1 log files" outcome we hit on Slurm job 22607 even though `/shared_inference/ravgupta/22607/perf.csv` was actually generated.

Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
Copilot reviewed 86 out of 87 changed files in this pull request and generated 24 comments.
```diff
 if not is_rocprofv3_available():
+    if tools_include_rocprof_family(tools_config):
+        filtered_tools: List[Dict] = [
+            t
+            for t in tools_config
+            if isinstance(t, dict) and t.get("name") not in _ROCPROF_FAMILY_TOOL_NAMES
+        ]
+        if filtered_tools:
+            logger.warning(
+                "Multi-node: rocprofv3 not found on submission host; "
+                "omitting rocprof/rocprofv3-family tools. Remaining tools: "
+                + ", ".join(
+                    str(x.get("name"))
+                    for x in filtered_tools
+                    if isinstance(x, dict) and x.get("name")
+                )
+            )
+            return {
+                "enabled": True,
+                "mode": "multi_node",
+                "tools": filtered_tools,
+                "per_node_collection": True,
+            }
+        logger.warning(
+            "╔════════════════════════════════════════════════════════════════════════════╗\n"
+            "║ Multi-Node Profiling Requirements Not Met ║\n"
+            "╠════════════════════════════════════════════════════════════════════════════╣\n"
+            "║ Multi-node profiling requires rocprofv3 (MPI-aware profiling support). ║\n"
+            "║ Current Status: rocprofv3 NOT FOUND on system ║\n"
+            "║ rocprof/rocprofv3-family profiling will be SKIPPED for this multi-node run. ║\n"
+            "║ To enable multi-node rocprof profiling: ║\n"
+            "║ • Install rocprofiler-sdk package (ROCm >= 6.4.1) ║\n"
+            "║ • Command: apt install rocprofiler-sdk ║\n"
+            "║ • Or upgrade to ROCm 6.4.1 or later ║\n"
+            "║ Note: Single-node profiling uses rocprof (no rocprofv3 required) ║\n"
+            "╚════════════════════════════════════════════════════════════════════════════╝"
+        )
+        return {
+            "enabled": False,
+            "mode": "multi_node_unsupported",
+            "tools": [],
+            "per_node_collection": False,
+        }
+    logger.info(
+        "Multi-node: rocprofv3 not found on submission host; keeping non-rocprof tools "
+        "(rocprof/rocprofv3 family requires rocprofv3 for MPI-aware capture)."
+    )
+    return {
+        "enabled": True,
+        "mode": "multi_node",
+        "tools": tools_config,
+        "per_node_collection": True,
+    }
-    logger.warning(
-        "╔════════════════════════════════════════════════════════════════════════════╗\n"
-        "║ Multi-Node Profiling Requirements Not Met ║\n"
-        "╠════════════════════════════════════════════════════════════════════════════╣\n"
-        "║ Multi-node profiling requires rocprofv3 (MPI-aware profiling support). ║\n"
-        "║ Current Status: rocprofv3 NOT FOUND on system ║\n"
-        "║ Profiling will be SKIPPED for this multi-node run. ║\n"
-        "║ To enable multi-node profiling: ║\n"
-        "║ • Install rocprofiler-sdk package (ROCm >= 6.4.1) ║\n"
-        "║ • Command: apt install rocprofiler-sdk ║\n"
-        "║ • Or upgrade to ROCm 6.4.1 or later ║\n"
-        "║ Note: Single-node profiling uses rocprof (no rocprofv3 required) ║\n"
-        "╚════════════════════════════════════════════════════════════════════════════╝"
-    )
-    return {
-        "enabled": False,
-        "mode": "multi_node_unsupported",
-        "tools": [],
-        "per_node_collection": False,
-    }
```
Addressed in 63a2eae.
configure_multi_node_profiling now only filters out the rocprof tool when rocprofv3 is missing on a multi-node run. rccl_trace, rocblas_trace, and other non-rocprof tools keep running and per_node_collection stays enabled for them. The returned config switches to a new mode multi_node_no_rocprofv3 so callers can distinguish this mixed state from full multi_node.
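The filtering behaviour described above can be sketched as follows. This is a minimal illustration, assuming a list-of-dicts tool config and a `_ROCPROF_FAMILY` name set; it is not madengine's exact implementation of `configure_multi_node_profiling`.

```python
from typing import Dict, List

# Assumed tool-name set for illustration; madengine's actual constant may differ.
_ROCPROF_FAMILY = {"rocprof", "rocprofv3"}


def filter_tools(tools: List[Dict], rocprofv3_available: bool) -> Dict:
    """Keep non-rocprof tools running when rocprofv3 is missing (sketch)."""
    if rocprofv3_available:
        return {"enabled": True, "mode": "multi_node", "tools": tools}
    # Only rocprof-family tools need rocprofv3; everything else keeps running.
    kept = [t for t in tools if t.get("name") not in _ROCPROF_FAMILY]
    mode = "multi_node_no_rocprofv3" if kept else "multi_node_unsupported"
    return {"enabled": bool(kept), "mode": mode, "tools": kept}
```

With this shape, a config of `rocprof` plus `rccl_trace` on a host without rocprofv3 yields the mixed `multi_node_no_rocprofv3` mode with only `rccl_trace` retained.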
```python
if key in stored_config and isinstance(stored_config[key], dict) and isinstance(self.additional_context[key], dict):
    merged = dict(stored_config[key])
    merged.update(self.additional_context[key])
    stored_config[key] = merged
```
Addressed in 63a2eae.
Promoted _deep_merge to a module-level helper and used it in _load_and_merge_manifest (which still did dict(stored_config[key]); merged.update(...)). Verified locally: an override touching slurm.node_selector.reservation now preserves a sibling slurm.node_selector.partition from the manifest instead of wholesale-replacing the inner object. Same helper is now used in both the runtime-context init path and the manifest-merge path.
```python
credential_file = "credential.json"
if os.path.exists(credential_file):
    try:
        with open(credential_file) as f:
            credentials = json.load(f)
    except Exception as e:
        print(f"Warning: Could not load credentials: {e}")
```
Addressed in 63a2eae.
RunOrchestrator._load_credentials now delegates to madengine.core.auth.load_credentials. The central helper enforces "credential.json must contain a JSON object" validation up front (raises ValueError if the file is an array/string), so malformed credential files are rejected with a clear message instead of producing a cryptic subscript/attribute error later during registry auth.
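The validate-up-front behaviour can be sketched like this. The function name mirrors the helper mentioned above, but the body is an assumption for illustration, not `madengine.core.auth`'s actual code.

```python
import json


def load_credentials(path: str = "credential.json") -> dict:
    """Load credentials, rejecting non-object JSON up front (sketch).

    Raises ValueError immediately if the file holds an array/string/number,
    instead of letting the bad shape surface later during registry auth.
    """
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError(
            f"{path} must contain a JSON object, got {type(data).__name__}"
        )
    return data
```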
```python
self.ctx["rocm_path"] = self._rocm_path
# In-container ROCM_PATH is finalized at run time in run_container
# via finalize_container_rocm_path (OCI env → probe → default).
# docker_env_vars.ROCM_PATH (user-supplied) is preserved here; finalize
# normalizes it at run time. Auto-detection runs in finalize when absent.
self.ctx["docker_env_vars"]["ROCM_PATH"] = self._rocm_path
```
Addressed in 9109f52 — file reverted to origin/develop.
core/context.py no longer forces the host ROCm root into the container env. The PR now uses upstream's madengine.utils.rocm_path_resolver.resolve_host_rocm_path for host detection and the docker_env_vars.MAD_ROCM_PATH flow (with image-baked-in env, in-container probe, then /opt/rocm fallback) for in-container resolution. Out-of-scope behaviour change for slurm_multi; reverted to keep this PR focused.
```diff
 row = [str(row_index), status, model_name]
 if show_gpu_arch:
-    row.append(build_gpu_arch_display(item))
+    row.append(item.get("architecture", "N/A"))
```
Addressed in 9109f52.
Restored the build_gpu_arch_display(item) helper (prefers gpu_architecture then architecture, defaults to "N/A") and switched both BUILD callsites (success + failed) back to it, so multi-arch builds again show the correct gpu_architecture column instead of N/A. Kept the model-name no-truncation change and the "Workload" column rename (those weren't flagged).
```diff
 - **🚀 Modern CLI** - Rich terminal output with Typer and Rich
 - **🎯 Simple Deployment** - Run locally or deploy to Kubernetes/SLURM via configuration
-- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, Primus, vLLM, SGLang
+- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, vLLM, SGLang
```
Addressed across two commits.

- 9109f52 — `README.md` reverted to `origin/develop`, so the `Distributed Launchers` bullet, the `Architecture` diagram, and the `Components` section all mention Primus alongside torchrun / DeepSpeed / Megatron-LM / TorchTitan / vLLM / SGLang again. The ROCm-path bullet and the RTL profiling-tools link are also restored.
- 63a2eae — added `"primus"` (and `"megatron"`) to `VALID_LAUNCHERS` in `deployment/common.py`, so `normalize_launcher` no longer silently downgrades `"primus"` configs to `"docker"`.

The README is now consistent with what the code actually accepts.
```diff
 def __init__(self, seconds: int = 15) -> None:
     """Constructor of the Timeout class.

     Args:
-        seconds (Optional[int]): The timeout in seconds. None or 0 disables
-            the timeout. Negative values are treated as no timeout.
+        seconds (int): The timeout in seconds.
     """
-    self.seconds = seconds if seconds and seconds > 0 else None
+    self.seconds = seconds
```
Addressed in 9109f52 — file reverted to origin/develop.
Timeout.__init__ again accepts Optional[int] = 15 and stores self.seconds = seconds if seconds and seconds > 0 else None. __enter__ / __exit__ short-circuit when self.seconds is None, so programmatic callers passing timeout=None (or 0 / negative) no longer hit signal.alarm(None) with a TypeError. Out-of-scope change for slurm_multi; reverted.
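The restored None/0/negative handling can be sketched as a small SIGALRM-based context manager. This is an illustrative reconstruction under the behaviour described above, not madengine's actual `core/timeout.py` (Unix-only, since it relies on `signal.SIGALRM`).

```python
import signal
from typing import Optional


class Timeout:
    """Sketch: None, 0, or negative seconds disables the alarm entirely."""

    def __init__(self, seconds: Optional[int] = 15) -> None:
        # Falsy or negative values normalize to None ("no timeout").
        self.seconds = seconds if seconds and seconds > 0 else None

    def _handler(self, signum, frame):
        raise TimeoutError(f"timed out after {self.seconds}s")

    def __enter__(self):
        if self.seconds is not None:
            signal.signal(signal.SIGALRM, self._handler)
            signal.alarm(self.seconds)  # never called with None
        return self

    def __exit__(self, *exc):
        if self.seconds is not None:
            signal.alarm(0)  # cancel any pending alarm
        return False
```

Programmatic callers passing `timeout=None` simply skip the alarm path instead of hitting `signal.alarm(None)` with a TypeError.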
```diff
 build_args += (
     "--build-arg "
-    + shlex.quote(str(build_arg))
-    + "="
-    + shlex.quote(str(self.context.ctx["docker_build_arg"][build_arg]))
-    + " "
+    + build_arg
+    + "='"
+    + self.context.ctx["docker_build_arg"][build_arg]
+    + "' "
 )

 if run_build_arg:
     for key, value in run_build_arg.items():
-        build_args += "--build-arg " + shlex.quote(str(key)) + "=" + shlex.quote(str(value)) + " "
+        build_args += "--build-arg " + key + "='" + value + "' "
```
Addressed in 9109f52.
get_build_arg reverted to use shlex.quote(str(...)) for both keys and values:
```python
build_args += (
    "--build-arg "
    + shlex.quote(str(build_arg))
    + "="
    + shlex.quote(str(self.context.ctx["docker_build_arg"][build_arg]))
    + " "
)
```

(Plus `Optional[Dict] = None` with `if run_build_arg is None: run_build_arg = {}` instead of the mutable-default pattern.) Build-arg values containing apostrophes / `$()` / backticks no longer break the docker build command line.
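The quoting difference can be demonstrated with a small standalone helper; the function name is illustrative, but the `shlex.quote` pattern matches the restored code above.

```python
import shlex


def build_arg_flags(args: dict) -> str:
    """Render --build-arg flags with shell-safe quoting (sketch)."""
    return "".join(
        f"--build-arg {shlex.quote(str(k))}={shlex.quote(str(v))} "
        for k, v in args.items()
    )
```

`shlex.quote` leaves safe tokens untouched and wraps anything containing quotes, `$()`, or backticks so the shell treats it as a literal; `shlex.split` round-trips the value intact.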
```diff
 stop = os.path.abspath(stop_dir)

-while current == stop or current.startswith(stop + os.sep):
+while current.startswith(stop):
```
Addressed in 9109f52 — file reverted to origin/develop.
Boundary check restored:
```python
while current == stop or current.startswith(stop + os.sep):
```

So a parent like `/repo/scripts` no longer matches `/repo/scripts-private/...` via prefix collision — ConfigParser walks the intended subtree only. Out-of-scope change for slurm_multi; reverted.
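The prefix-collision hazard is easy to demonstrate in isolation; this predicate is a sketch of the restored check, with an illustrative name.

```python
import os


def is_within(current: str, stop_dir: str) -> bool:
    """True iff current is stop_dir itself or lives underneath it.

    Appending os.sep before the prefix test prevents /repo/scripts from
    matching the unrelated sibling /repo/scripts-private.
    """
    current = os.path.abspath(current)
    stop = os.path.abspath(stop_dir)
    return current == stop or current.startswith(stop + os.sep)
```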
```diff
 if sys_gpu_arch and "NVIDIA" in sys_gpu_arch:
-    # Normalize "NVIDIA A100-SXM4-40GB" -> "A100", "NVIDIA H100 PCIe" -> "H100"
-    # (mirrors get_gpu_arch() in tests/fixtures/utils.py)
-    after_prefix = sys_gpu_arch[sys_gpu_arch.index("NVIDIA") + len("NVIDIA ") :]
-    sys_gpu_arch = after_prefix.split("-")[0].split()[0]
+    sys_gpu_arch = sys_gpu_arch.split()[1]
```
Addressed in 9109f52 — file reverted to origin/develop.
NVIDIA-arch normalization restored:
```python
after_prefix = sys_gpu_arch[sys_gpu_arch.index("NVIDIA") + len("NVIDIA "):]
sys_gpu_arch = after_prefix.split("-")[0].split()[0]
```

so `"NVIDIA A100-SXM4-40GB"` → `"A100"` and `"NVIDIA H100 PCIe"` → `"H100"` again, and `skip_gpu_arch=A100` correctly filters those models. The `split()[1]` regression that returned the full SKU string is gone. Out-of-scope change for slurm_multi; reverted.
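Wrapped as a standalone function (name illustrative), the restored normalization behaves like this:

```python
def normalize_nvidia_arch(sys_gpu_arch: str) -> str:
    """Strip the 'NVIDIA ' prefix, then take the token before the first
    '-' or space (sketch of the restored normalization)."""
    if "NVIDIA" in sys_gpu_arch:
        after_prefix = sys_gpu_arch[sys_gpu_arch.index("NVIDIA") + len("NVIDIA "):]
        return after_prefix.split("-")[0].split()[0]
    return sys_gpu_arch
```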
Previously the per-deploy completion marker was named
`madengine_<model>.complete` -- not job-id namespaced. A second
`madengine run` of the same model tag would see the leftover marker
from the first run and immediately conclude the job was done, even
though sbatch had only just submitted the new job.
Reproduced live: re-running pyt_sglang_disagg_llama-3.1-8b right after
job 22607 finished caused madengine to return success in ~30 s while
job 22615 was still running on the reservation.
Fix:
* prepare(): emit the marker using `${SLURM_JOB_ID:-local}` inside the
generated wrapper so each sbatch run writes its own file. Falls back
to `_local` when no SLURM_JOB_ID is set (interactive/local runs,
preserves existing behaviour for that path via _completion_marker).
* monitor(): reconstruct the job-id-aware path from the basename
template and the deployment_id returned by sbatch; only fall back
to the bare marker for backwards compatibility with the
inside_allocation flow.
Co-authored-by: Cursor <cursoragent@cursor.com>
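The marker naming described above can be sketched as a shell fragment. `MARKER_DIR` and `MODEL_TAG` are illustrative placeholders, not madengine's actual wrapper variables; only the `${SLURM_JOB_ID:-local}` fallback mirrors the fix.

```shell
#!/bin/sh
# Illustrative sketch of the job-id-namespaced completion marker.
MARKER_DIR="/tmp"
MODEL_TAG="demo_model"

# Inside the generated sbatch wrapper, SLURM_JOB_ID is set by Slurm, so each
# submission writes its own marker; interactive/local runs fall back to "local".
MARKER="${MARKER_DIR}/madengine_${MODEL_TAG}_${SLURM_JOB_ID:-local}.complete"
touch "${MARKER}"
echo "wrote ${MARKER}"
```

Because the job id is baked into the filename, a stale marker from job 22607 can no longer satisfy the monitor loop of job 22615.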
Smoke test summary:

| # | Workload | Model card | Slurm job | Nodes | State | Result |
|---|---|---|---|---|---|---|
| C1 | SGLang disagg 8B (madengine-driven via F1) | pyt_sglang_disagg_llama-3.1-8b | 22607 | reservation [029,040,082] | COMPLETED 8:51 | 5,977 tok/s, TTFT 122.8 ms, ITL 2.54 ms |
| C2 | vLLM disagg 8B (1P/1D, co-located proxy) | pyt_vllm_disagg_llama-3.1-8b | 22612 | [008,014] | COMPLETED 4:22 | router serving, but every decode request returned HTTP 500: vLLM workers crashed with `nixl_connector.py:1504` AssertionError. Image-side bug, not the launcher. |
| C3 | Large EP 2-node | pyt_large_ep_benchmark_2n | 22610 | reservation [029,040] | COMPLETED 3:49 | 28 metric rows (deepep dispatch/combine FP8/BF16 RDMA/NVL) |
| C4 | KV cache transfer 2-node | pyt_kvcache_transfer_bench | 22611 | reservation [082,125] | COMPLETED 24:55 | 57 metric rows (mooncake + mori + rixl, 4KB → 1GB) |
| C1' | C1 re-run with the F-perf collector commit (5ec55ff) | same | 22615 | reservation [029,040,082] | COMPLETED 6:54 | 5,970 tok/s; F-perf finds `/shared_inference/<user>/<jobid>/perf.csv` (verified) |
What the smoke test exercised in this PR
| Surface | Verification |
|---|---|
| `madengine build` with `--use-image` (manifest gen for prebuilt image) | C1 |
| `madengine build` with no flags + model-card `env_vars.DOCKER_IMAGE_NAME` (F1 implicit `--use-image`) | C1' — log line `slurm_multi: no --registry/--use-image given; using DOCKER_IMAGE_NAME from model card -> rocm/pytorch-private:sglang_disagg_mori_20260502` |
| `madengine run --manifest-file` over slurm_multi | C1, C1' |
| slurm_multi SBATCH wrapper generation, `--reservation`, `enable_node_check=false` honoured | All |
| `_collect_slurm_multi_results` finds `/shared_inference/<user>/<jobid>/perf.csv` (sglang/vllm) | C1' |
| SlurmDeployment direct sbatch (no madengine on top) | C2, C3, C4 |
| slurm_multi co-located-proxy topology for vLLM disagg (xP + yD nodes) | C2 — topology was launched correctly; C2 failure happened inside vLLM nixl_connector after `Application startup complete.` |
Bugs surfaced by the smoke test (all addressed)
| Bug | Where | Commit |
|---|---|---|
| 5 invalid kwargs to ErrorContext (`model=`, `launcher=`, `model_names=`, `registry=`, `dockerfile=`) — masked the real error message | build_orchestrator.py | 5ec55ff |
| F1: model-card image not auto-promoted to `--use-image` for slurm_multi | build_orchestrator.py | 5ec55ff |
| F-perf: `_collect_slurm_multi_results` only looked in `<cwd>/perf.csv` | slurm.py | 5ec55ff |
| Completion marker not job-id namespaced — second `madengine run` of same model tag returned success in ~30 s while the new sbatch was still running | slurm.py | 710b51a |
Out of scope
- Image-side vLLM NixlConnector crash on Llama-3.1-8B disagg under `rocm/pytorch-private:20260407_itej89_vllm_mori_docker` (filed separately as orthogonal).
- Node-health precheck threshold (180 GB memory used / 4 processes flags reservation nodes as "Dirty"; tracked for follow-up).
PR head: 710b51a. Mergeable; blocked on review.
Round-2 Copilot review identified 15 changes outside the slurm_multi
scope that regressed upstream/develop behaviour. Revert each to
origin/develop:
- core/context.py: drop forced host-ROCm-into-container override
- core/docker.py: restore shlex.quote + exact-match container filter
- core/dataprovider.py: restore "unknown data name" guard, drop debug prints
- core/timeout.py: restore Optional[int] None/0 handling
- cli/app.py: restore versioningit-driven --version
- utils/rocm_tool_manager.py: capture primary_stderr before fallback overwrites
- utils/discover_models.py: restore shlex.quote in cp; raise TypeError, not assert
- utils/config_parser.py: fix boundary check (`== or startswith stop + os.sep`)
- orchestration/image_filtering.py: restore proper "NVIDIA <ARCH>" normalization (`split()[1]` returned A100-SXM4-40GB, not A100)
- reporting/update_perf_csv.py: restore configs pop, isinstance instead of type ==
- README.md: restore Primus mention, ROCm path docs, RTL link
Surgical reverts that preserve scope-relevant changes:
cli/utils.py
- restore build_gpu_arch_display() helper + the two BUILD callsites
(success and failed BUILD branches), so multi-arch builds show the
gpu_architecture column instead of N/A
- keep the model-name no-truncation change (no [:20])
- keep the "Workload" column rename and matching format-precision tweak
execution/docker_builder.py
- revert auth/shlex paths to upstream (login_to_registry delegated to
madengine.core.auth, shlex.quote on build args, exact-match container
filter, primus dockercontext detection)
- PRESERVE the slurm_multi-specific block at line 369:
# Update built_models with registry image name for parallel pull
# in slurm_multi
so the registry-image -> env_vars["DOCKER_IMAGE_NAME"] propagation
that the slurm_multi parallel-pull path depends on is unchanged.
slurm_multi infrastructure is untouched: zero bytes diff vs PR HEAD on
deployment/{slurm,slurm_node_selector,base,common,factory,config_loader}.py,
orchestration/{build_orchestrator,run_orchestrator}.py,
execution/{container_runner,container_runner_helpers}.py,
cli/{commands/build,commands/run,validators}.py.
Smoke checks:
DeploymentFactory.available_deployments() = ['k8s','kubernetes','slurm']
normalize_launcher('slurm_multi','slurm') = 'slurm_multi'
normalize_launcher('slurm-multi','slurm') = 'slurm_multi'
Addresses Copilot review comments ROCm#4, ROCm#5, ROCm#6, ROCm#9, ROCm#13, ROCm#14, ROCm#16, ROCm#17, ROCm#18, ROCm#19, ROCm#20, ROCm#21, ROCm#22, ROCm#23, ROCm#24.

Co-authored-by: Cursor <cursoragent@cursor.com>
Six Round-2 review comments on slurm_multi-touching code paths fixed.
run_orchestrator.py:
* Real recursive deep-merge: promote the inline `_deep_merge` from
`_init_runtime_context_from_manifest` to a module-level helper and
use it in `_load_and_merge_manifest` too. Previously the manifest
path still did a shallow `dict.update()`, so a runtime override
touching a single leaf inside e.g. `slurm.node_selector` or
`distributed.primus` wholesale-replaced the inner object and
silently dropped the manifest's other leaves.
(Copilot review ROCm#2)
* Restore manifest's `context.docker_env_vars` into the runtime
context. Manifest-only runs (`madengine run --manifest-file`)
used to load tools / pre_scripts / post_scripts / encapsulate_script
but had stopped restoring `docker_env_vars`, so any env captured
at build time (incl. MAD_SECRETS*, model-specific values) was
dropped before ContainerRunner rendered `--env` flags.
(Copilot review ROCm#12)
* Replace the local `_load_credentials` reimplementation with a
delegation to `madengine.core.auth.load_credentials`, which keeps
the "credential.json must be a JSON object" validation that was
silently lost. Without it, a malformed file (array/string) would
propagate unchanged and surface much later as a cryptic
subscript/attribute error during registry auth.
(Copilot review ROCm#3)
deployment/common.py:
* `configure_multi_node_profiling`: when `rocprofv3` is unavailable
on a multi-node run, only filter out the `rocprof` tool (which is
the one that actually requires rocprofv3) instead of disabling
ALL multi-node tools. `rccl_trace`, `rocblas_trace`, and friends
work independently and must keep running. Returns
`multi_node_no_rocprofv3` in that mixed mode.
(Copilot review ROCm#1)
* `VALID_LAUNCHERS`: add `"megatron"` and `"primus"` alongside the
existing `"megatron-lm"`. The rest of the codebase (slurm.py,
kubernetes.py, the `job.yaml.j2` template, `primus_backend.py`,
and every `examples/**/megatron-lm-*.json` config) uses the bare
`"megatron"` / `"primus"` strings, so without these entries
`normalize_launcher` was silently downgrading them to `"docker"`.
(Copilot review ROCm#10)
deployment/base.py:
* Widen `_parse_node_log_for_perf` regex to accept scientific
notation (`1.23e4`, `4E+5`) for the value, and `\\S+` for the
metric name (covers `tokens/sec`, `samples_per_second`, `tok/s`,
etc.). The narrower `[\\d.]+` previously dropped legitimate
perf lines emitted by training scripts.
(Copilot review ROCm#11)
Smoke checks:
DeploymentFactory.available_deployments() = ['k8s','kubernetes','slurm']
normalize_launcher('megatron','slurm') = 'megatron'
normalize_launcher('primus','kubernetes') = 'primus'
normalize_launcher('slurm_multi','slurm') = 'slurm_multi'
_deep_merge nested-leaf preservation: verified
Co-authored-by: Cursor <cursoragent@cursor.com>
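The widened perf-line parsing described above (scientific-notation values, `\S+` metric names) can be sketched as follows. The `performance:` line shape and helper name are illustrative assumptions, not madengine's exact `_parse_node_log_for_perf` pattern.

```python
import re

# Metric name: any non-space token (covers tokens/sec, tok/s, samples_per_second).
# Value: decimal digits/dots with an optional scientific-notation exponent.
PERF_RE = re.compile(r"performance:\s*(\S+)\s*=\s*([\d.]+(?:[eE][+-]?\d+)?)")


def parse_perf(line: str):
    """Return (metric, value) for a perf line, or None if no match."""
    m = PERF_RE.search(line)
    return (m.group(1), float(m.group(2))) if m else None
```

The old `[\d.]+`-only value pattern stopped at the `e` in `1.23e4` and dropped the exponent; the optional `(?:[eE][+-]?\d+)?` group accepts both `1.23e4` and `4E+5`.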
Three Round-2 Copilot comments on doc-only broken navigation:

docs/deployment.md (review ROCm#7):
- distributed-launchers.md → launchers.md (renamed long ago)
- user-guide.md → usage.md (no user-guide.md exists)

docs/launchers.md (review ROCm#8):
- Drop the K8s `slurm-multi-*.json` example references — those files do not exist under examples/k8s-configs/. slurm_multi is a SLURM-side launcher; K8s disagg flows continue to use the existing examples/k8s-configs/{minimal,basic}/sglang-disagg-*.json configs.
- Keep and verify the three SLURM example paths that actually exist on disk.

docs/installation.md (review ROCm#15):
- user-guide.md → usage.md (no user-guide.md exists)
- how-to-quick-start.md → cli-reference.md (no how-to-quick-start.md exists; usage.md already covers Quick Start and is linked above, so point readers at the CLI reference instead).

All link targets verified to exist on disk:
- docs/launchers.md, docs/usage.md, docs/deployment.md, docs/cli-reference.md
- examples/slurm-configs/minimal/slurm-multi-minimal.json
- examples/slurm-configs/basic/slurm-multi-multi-node.json
- examples/slurm-configs/basic/slurm-multi-custom-split.json

Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
Copilot reviewed 75 out of 76 changed files in this pull request and generated 13 comments.
Comments suppressed due to low confidence (2)
- src/madengine/orchestration/run_orchestrator.py:120 — `Context.__init__` still only accepts `additional_context`, `additional_context_file`, `build_only_mode`, and `detect_local_gpu_arch`. Passing `rocm_path` here raises TypeError, so every local run that initializes the runtime context will fail before execution starts.
- src/madengine/orchestration/run_orchestrator.py:594 — Build manifests still persist `context.docker_env_vars`, `docker_mounts`, and `docker_build_arg`, but this restore block now copies only tools/scripts back into the runtime Context. Execution-only runs from a manifest therefore lose build-time env/mount context before ContainerRunner starts, including values that Context auto-imported into `docker_env_vars` such as `MAD_SECRETS*`.
```diff
 for env_arg in run_env:
-    env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "
+    if not self._ENV_KEY_RE.match(env_arg):
+        raise ValueError(f"Invalid environment variable name: {env_arg!r}")
+    env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "

 # Add context environment variables
 if "docker_env_vars" in self.context.ctx:
     for env_arg in self.context.ctx["docker_env_vars"].keys():
-        env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "
+        if not self._ENV_KEY_RE.match(env_arg):
+            raise ValueError(f"Invalid environment variable name: {env_arg!r}")
+        value = self.context.ctx["docker_env_vars"][env_arg]
+        env_args += f"--env {env_arg}={shlex.quote(str(value))} "
```
Acknowledged — this is a real concern but it is pre-existing in origin/develop, not introduced by this PR.
The f"--env {env_arg}='{str(...)}' " pattern is at the same line in origin/develop (compare git show origin/develop:src/madengine/execution/container_runner.py | sed -n '578,581p'); the --build-on-compute / --use-image work in this PR didn't change this method.
Deferring to a follow-up issue rather than expanding this PR's scope. The right fix is to switch to shlex.quote(value) (and validate the env-var name with the existing _VALID_ENV_NAME_RE style check) for both this docker_env_vars loop and the run_env loop a few lines above. Happy to file the follow-up issue (run_container env-arg shell-escape hardening) and reference it back here.
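The proposed follow-up hardening could look roughly like this. The helper name and regex are illustrative assumptions sketching the idea, not madengine's `run_container` code.

```python
import re
import shlex

# Only classic shell-identifier env names pass; anything else is rejected.
_ENV_KEY_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def render_env_flags(env: dict) -> str:
    """Render --env flags with validated names and shell-quoted values (sketch)."""
    flags = ""
    for name, value in env.items():
        if not _ENV_KEY_RE.match(name):
            raise ValueError(f"Invalid environment variable name: {name!r}")
        flags += f"--env {name}={shlex.quote(str(value))} "
    return flags
```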
```python
if self.additional_context and "docker_env_vars" in self.additional_context:
    merged_count = 0
    for key, value in self.additional_context["docker_env_vars"].items():
        self.context.ctx["docker_env_vars"][key] = value
        merged_count += 1
    if merged_count > 0:
        print(f"ℹ️ Merged {merged_count} environment variables from additional_context")

if self.context and str(self.context.ctx.get("gpu_vendor", "")).upper().find(
    "AMD"
) != -1:
    from madengine.utils.rocm_path_resolver import finalize_container_rocm_path

    # Determine whether the user explicitly supplied ROCM_PATH for the container.
    # If they did (via docker_env_vars.ROCM_PATH in additional_context), the
    # re-merge above already restored it — keep it so finalize uses it directly.
    # If they did not, clear any ROCM_PATH left from a previous model run so
    # finalize always re-resolves for the current docker_image (OCI config →
    # in-image probe → /opt/rocm default).
    user_supplied_rocm_path = (
        str(
            (self.additional_context or {})
            .get("docker_env_vars", {})
            .get("ROCM_PATH", "")
        ).strip()
    )
    if not user_supplied_rocm_path:
        self.context.ctx["docker_env_vars"].pop("ROCM_PATH", None)

    finalize_container_rocm_path(
        self.context.ctx["docker_env_vars"],
        docker_image,
    )
# Merge env_vars from model_info (models.json) into docker_env_vars
if "env_vars" in model_info and model_info["env_vars"]:
    model_env_count = 0
    for key, value in model_info["env_vars"].items():
        self.context.ctx["docker_env_vars"][key] = str(value)
        model_env_count += 1
    if model_env_count > 0:
        print(f"ℹ️ Merged {model_env_count} environment variables from model_info (models.json)")
```
This thread is showing as outdated and is already addressed by Cat B in 9109f52.
core/context.py was reverted to origin/develop, which means:
- No more forced `self.ctx['docker_env_vars']['ROCM_PATH'] = self._rocm_path` at init.
- In-container `ROCM_PATH` is resolved at run time via `finalize_container_rocm_path()` (OCI metadata → in-image probe → `/opt/rocm`), so user/model `docker_env_vars` overrides survive the merge.

The ROCm-path behaviour the comment was concerned about is preserved by upstream's existing implementation, not by this PR.
```python
)

self.rich_console.print(f"[green]✓ Discovered {len(models)} model(s) for tags: {tags}[/green]\n")

# Initialize build-only context for manifest generation
```
This thread is showing as outdated and is already addressed by Cat B in 9109f52.
core/context.py was reverted to origin/develop, so the Context build-only constructor no longer accepts a rocm_path keyword. The synthetic-manifest path that called Context(rocm_path=...) was the source of the TypeError referenced here, and that constructor signature is back to upstream's. No call site in this PR passes rocm_path=... anymore.
```diff
 if self.additional_context:
     # Merge deployment_config
     if "deployment_config" in manifest:
         stored_config = manifest["deployment_config"]
-        # Runtime --additional-context overrides stored config
+        # Runtime --additional-context overrides stored config.
+        # For dict-valued keys, recursively deep-merge so manifest values
```
This thread is showing as outdated and is already addressed in 63a2eae — reposting the link here for clarity.
run_orchestrator._load_and_merge_manifest no longer uses dict(stored_config[key]); merged.update(...). It now calls the module-level _deep_merge(stored_config[key], self.additional_context[key]) helper, which recurses into nested dicts.
Verified locally: an override touching slurm.node_selector.reservation preserves a sibling slurm.node_selector.partition from the manifest instead of wholesale-replacing the inner object. Same helper is used in both the runtime-context init path and the manifest-merge path.
```python
# NOTE: keep this in sync with the launcher branches in
# deployment/{slurm.py,kubernetes.py}, the templates/{slurm,kubernetes}/*.j2
# Jinja conditions, and examples/**/launcher fields. Currently the rest of
# the codebase (and shipped examples) use the bare "megatron" / "primus"
# strings, while "megatron-lm" is the docs-friendly synonym.
VALID_LAUNCHERS = [
    "torchrun",
    "torchtitan",
    "deepspeed",
```
This Round-3 comment is reviewing a pre-fix snapshot — already addressed in 63a2eae.
Current src/madengine/deployment/common.py has both "megatron" and "primus" in VALID_LAUNCHERS (alongside the docs-friendly "megatron-lm"):
```python
VALID_LAUNCHERS = [
    "torchrun", "torchtitan", "deepspeed",
    "megatron", "megatron-lm", "primus",
    "vllm", "sglang", "slurm_multi",
]
```

Verified at HEAD (2200cbd): `normalize_launcher('megatron','slurm')` → `'megatron'`, `normalize_launcher('primus','kubernetes')` → `'primus'`. The example configs (`examples/{slurm,k8s}-configs/**/megatron-lm-*.json`, `examples/k8s-configs/minimal/megatron-lm-*.json`) round-trip correctly without being downgraded to `"docker"`/`"native"`.
```markdown
**Examples** (SLURM only — `slurm_multi` is a SLURM-side launcher; the K8s
deployment of disagg workloads continues to use the existing
`sglang-disagg-*.json` configs):

- SLURM Minimal: `examples/slurm-configs/minimal/slurm-multi-minimal.json`
- SLURM Basic: `examples/slurm-configs/basic/slurm-multi-multi-node.json`
- SLURM Custom: `examples/slurm-configs/basic/slurm-multi-custom-split.json`
```
This Round-3 comment is reviewing a pre-fix snapshot — already addressed in 0156d33.
Current docs/launchers.md (around line 470) only lists the three SLURM-side examples that actually exist on disk:
```markdown
**Examples** (SLURM only — `slurm_multi` is a SLURM-side launcher; the K8s
deployment of disagg workloads continues to use the existing
`sglang-disagg-*.json` configs):

- SLURM Minimal: `examples/slurm-configs/minimal/slurm-multi-minimal.json`
- SLURM Basic: `examples/slurm-configs/basic/slurm-multi-multi-node.json`
- SLURM Custom: `examples/slurm-configs/basic/slurm-multi-custom-split.json`
```
The three K8s slurm-multi-*.json references that didn't exist were dropped. All three SLURM paths verified to exist at HEAD (2200cbd).
```diff
@@ -126,7 +102,7 @@ madengine build --tags your_model --registry your-registry:5000

 # Deploy to SLURM
 madengine run --manifest-file build_manifest.json \
-    --additional-context-file examples/profiling-configs/rocprofv3_multi_node.json
+    --additional-context-file examples/profiling-configs/rocprofv3_multinode.json
```
Addressed in 2200cbd.
Fixed both occurrences (lines 89 and 105). The actual checked-in profiling config is examples/profiling-configs/rocprofv3_multi_node.json (with underscore); the README now references that exact filename, so the copy-paste command no longer fails with a missing-file error.
```python
import os
import tempfile

import pytest
```
Addressed in 2200cbd.
Dropped import pytest. Confirmed there are no pytest.* references anywhere in tests/unit/test_reporting.py (no @pytest.mark.*, no pytest.raises, no pytest.fixture), so the import was genuinely unused. flake8 lint will be clean again.
```diff
-ctx = Context(build_only_mode=True, rocm_path="/opt/rocm")
-assert ctx._rocm_path == "/opt/rocm"
```
Addressed in 2200cbd — file reverted to origin/develop.
With the Cat B revert of core/context.py in 9109f52, this PR no longer adds the Context(rocm_path=...) keyword API; tests for that PR-specific surface belong to whichever future PR re-adds it. Upstream's tests/unit/test_rocm_path.py already exercises ROCmToolManager(rocm_path=...) (still in the codebase), so reverting restores green tests without changing semantics. Net result: this test file is byte-identical to origin/develop.
```diff
-ctx = Context(rocm_path="/my/rocm")
-assert ctx.ctx.get("rocm_path") == "/my/rocm"
-assert ctx.ctx["docker_env_vars"].get("ROCM_PATH") == "/my/rocm"
```
Addressed in 2200cbd (same fix as the sibling thread). The whole test_rocm_path.py file was reverted to origin/develop, so this assertion that ctx['docker_env_vars']['ROCM_PATH'] == ... immediately after construction is gone — the test now matches upstream's behaviour where in-container ROCM_PATH is finalized at run time via finalize_container_rocm_path().
Three Round-3 review comments fixed in this small commit; the remaining Round-3 items are either stale (already fixed in 9109f52 / 63a2eae / 0156d33; Copilot was reviewing a pre-fix snapshot) or pre-existing concerns in upstream/develop that are deferred to a follow-up issue (replied inline).

- examples/profiling-configs/README.md (R3-6): `rocprofv3_multinode.json` -> `rocprofv3_multi_node.json` (lines 89 and 105). The actual checked-in file is `rocprofv3_multi_node.json` (with underscore), so copy-pasting the README's command would fail with a missing-file error.
- tests/unit/test_reporting.py (R3-7): Drop the unused `import pytest` (no `pytest.*` references in the file). The repo's pre-commit runs flake8, so the unused import would fail lint/CI.
- tests/unit/test_rocm_path.py (R3-8 + R3-9): Revert to origin/develop. With the Cat B revert of `core/context.py` (commit 9109f52), the PR no longer adds the `Context(rocm_path=...)` keyword API; tests for that PR-specific API now belong to whichever future PR re-adds it. Upstream's version of this test file already exercises `ROCmToolManager(rocm_path=...)` -- which is still in the codebase -- so reverting restores green tests without changing semantics.

Co-authored-by: Cursor <cursoragent@cursor.com>
Surfaced live: `madengine build --tags pyt_kvcache_transfer_bench
--build-on-compute --registry docker.io/rocm/pytorch-private` failed
its docker build step with:
ERROR: failed to build: invalid tag
"ci-pyt_kvcache_transfer_bench_Dockerfile":
repository name must be lowercase
Root cause: in `_execute_build_on_compute`, `local_image_name` is
assembled as `ci-<model_name>_<dockerfile_basename>`, where
`dockerfile_basename` strips `.Dockerfile` and `.ubuntu.amd` from
`Path(dockerfile_path).name`. When the model card's `dockerfile`
field is just `scripts/<dir>/Dockerfile` (no `.ubuntu.amd.Dockerfile`
suffix), neither `.replace()` matches, so `dockerfile_basename`
stays as `Dockerfile` (capital D) and the assembled tag is rejected
by `docker build -t`.
Fix: `.lower()` the final assembled local image name. This:
* fixes the kvcache_transfer_bench card (`scripts/.../Dockerfile`),
* is a no-op for the four already-passing cards whose dockerfile
paths follow the `*.ubuntu.amd.Dockerfile` convention (the
`_basename` is already lowercase after the suffix strips), and
* is robust against any uppercase character anywhere in
`model_name` or `dockerfile_basename`.
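The lowercasing rule described above can be sketched as a small helper. This is a hypothetical reconstruction for illustration (`local_image_name` is not the actual function name in `_execute_build_on_compute`); it mirrors the described suffix-stripping and the final `.lower()`:

```python
from pathlib import Path

def local_image_name(model_name: str, dockerfile_path: str) -> str:
    """Hypothetical sketch of the local image tag assembly described above."""
    base = Path(dockerfile_path).name
    # Strip the conventional suffixes; a plain "Dockerfile" matches neither.
    base = base.replace(".ubuntu.amd", "").replace(".Dockerfile", "")
    # Docker repository names must be lowercase, so the final .lower()
    # neutralizes the capital D (and any other uppercase character).
    return f"ci-{model_name}_{base}".lower()
```

For `*.ubuntu.amd.Dockerfile` cards both suffixes strip and the `.lower()` is a no-op; for a plain `Dockerfile` the `.lower()` is what makes the tag valid.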
Verified by re-running the failing case to completion:
Local image name: ci-pyt_kvcache_transfer_bench_dockerfile
Image pushed: docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench
[✓] All builds completed successfully
(slurm job 22936, ~18 min full Dockerfile build of mooncake + MORI +
RIXL + nixlbench, 24 layers)
Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
Copilot reviewed 74 out of 75 changed files in this pull request and generated 24 comments.
Comments suppressed due to low confidence (3)
docs/cli-reference.md:396
- The CLI docs still say `--timeout 0` means "no timeout", but `madengine run` no longer translates `0` to an unlimited timeout display-wise and the validation message no longer mentions `0`. Either update the docs to match the current behavior, or restore the explicit `0 => no timeout` handling so help text, docs, and runtime all align.
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| `--tags` | `-t` | TEXT | `[]` | Model tags to run (can specify multiple) |
| `--manifest-file` | `-m` | TEXT | `""` | Build manifest file path (for pre-built images) |
| `--rocm-path` | | TEXT | `None` | ROCm installation root (default: `ROCM_PATH` env or `/opt/rocm`). Use when ROCm is not in `/opt/rocm` (e.g. Rock, pip). |
| `--registry` | `-r` | TEXT | `None` | Docker registry URL |
| `--timeout` | | INT | `-1` | Timeout in seconds (-1=default 7200s, 0=no timeout) |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
| `--additional-context-file` | `-f` | TEXT | `None` | File containing additional context JSON |
pyproject.toml:31
`pytest` is included in the package's runtime dependencies even though it's only used for testing (it's already listed under `[project.optional-dependencies].dev`). Keeping pytest in `dependencies` increases install size and can conflict with downstream environments. Move it back to dev/test extras only unless the library imports pytest at runtime (it currently doesn't).
```toml
dependencies = [
    "pandas",
    "GitPython",
    "jsondiff",
    "sqlalchemy",
    "setuptools-rust",
    "paramiko",
    "tqdm",
    "pytest",
    "typing-extensions",
    "pymongo",
    "toml",
    "typer>=0.9.0",
    "rich>=13.0.0",
    "click>=8.0.0",
    "jinja2>=3.0.0",
    "pyyaml>=6.0",
]
```
src/madengine/execution/container_runner_helpers.py:216
`make_run_log_file_path()` no longer normalizes full image references (e.g. `registry/ns/name:tag`), so the generated log filename can include `:` (and very long sanitized refs), and run-log naming can diverge from build-log naming for CI-style tags. Consider reinstating a normalization helper (strip the digest, collapse `registry/...:ci-*` to just the tag, and replace `:`, `/`, `@` with `_`) to keep filenames stable and filesystem-friendly.
```python
rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
self.context = Context(
    additional_context=context_string,
    build_only_mode=False,
    rocm_path=rocm_path,
)
```
Addressed in 3fc7587.
This is a real latent bug introduced by my Cat B revert: I reverted core/context.py to origin/develop (which removes the rocm_path keyword), but missed that run_orchestrator.py still passes it at lines 138 and 456. Today's slurm_multi tests went through _execute_distributed so didn't hit it, but madengine run against a local target hits _init_runtime_context() here and crashes with TypeError: Context.__init__() got an unexpected keyword argument 'rocm_path'.
Fix in 3fc7587: drop the rocm_path = get_rocm_path(...) line and the rocm_path=rocm_path kwarg at both call sites; drop the now-unused from madengine.core.constants import get_rocm_path import.
Verified at HEAD (3fc7587) via inspect.signature(Context.__init__): params are ['self','additional_context','additional_context_file','build_only_mode','detect_local_gpu_arch'] -- no more rocm_path. End-to-end smoke: pyt_large_ep_benchmark_2n on [008,024] via slurm_multi, 8 min, 28 metric rows.
```python
rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
build_context = Context(
    additional_context=context_string,
    build_only_mode=True,
    rocm_path=rocm_path,
)
```
Addressed in 3fc7587 (same root cause as the sibling thread on line 138). The synthetic-manifest path in _create_manifest_from_local_image was passing the same now-removed rocm_path kwarg. Both call sites are fixed in this commit; from madengine.core.constants import get_rocm_path import is gone too.
```python
from madengine.core.errors import (
    BuildError,
    ConfigurationError,
    ExecutionError,
    create_error_context,
    handle_error,
)
from madengine.core.constants import get_rocm_path
```
Addressed in 3fc7587. Dropped handle_error from the import set in from madengine.core.errors import (...) since the only references in this module were removed in earlier rounds. flake8 F401 should be clean.
```python
# Ensure credential values are strings
username = str(creds["username"])
password = str(creds["password"])

# Perform docker login
login_command = f"echo '{password}' | docker login"

if registry and registry.lower() not in ["docker.io", "dockerhub"]:
    login_command += f" {registry}"

login_command += f" --username {username} --password-stdin"

try:
    self.console.sh(login_command, secret=True)
    self.rich_console.print(f"[green]✅ Successfully logged in to registry: {registry or 'DockerHub'}[/green]")
except Exception as e:
    self.rich_console.print(f"[red]❌ Failed to login to registry {registry}: {e}[/red]")
```
Addressed in 3fc7587.
ContainerRunner.login_to_registry now delegates to madengine.core.auth.login_to_registry (the same helper DockerBuilder.login_to_registry already uses). The previous inline form built
```python
login_command = f"echo '{password}' | docker login ... --password-stdin"
self.console.sh(login_command, secret=True)
```
which -- even though docker reads the password via `--password-stdin` -- still embedded the password in `echo`'s argv, visible through `ps`/`/proc`. The central helper writes the password to docker's stdin via `subprocess.run(..., input=password.encode())`, so it never appears on a process command line.
The ~50 lines of inline credential-shape validation are also gone -- the central helper does the same checks (and was the path Round-2 R2-#18 already moved DockerBuilder onto, just in a sister file).
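The stdin-based login pattern can be sketched as follows. This is an illustrative sketch, not the actual `madengine.core.auth.login_to_registry` implementation; `build_login_command` is a hypothetical helper name:

```python
import subprocess
from typing import List, Optional

def build_login_command(username: str, registry: Optional[str] = None) -> List[str]:
    """Assemble the docker login argv; the password is deliberately absent."""
    cmd = ["docker", "login"]
    if registry and registry.lower() not in ("docker.io", "dockerhub"):
        cmd.append(registry)  # explicit registry host, e.g. ghcr.io
    cmd += ["--username", username, "--password-stdin"]
    return cmd

def docker_login(username: str, password: str, registry: Optional[str] = None) -> None:
    # The password travels over stdin, so it never shows up in argv,
    # `ps` output, or /proc/<pid>/cmdline.
    subprocess.run(build_login_command(username, registry),
                   input=password.encode(), check=True)
```

Because the secret is written to the child's stdin rather than interpolated into a shell string, nothing sensitive is exposed to other processes on the host.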
```diff
 def get_env_arg(self, run_env: typing.Dict) -> str:
     """Get the environment arguments for docker run."""
     env_args = ""

     # Add custom environment variables
     if run_env:
         for env_arg in run_env:
             if not self._ENV_KEY_RE.match(env_arg):
                 raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-            env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "
+            env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "

     # Add context environment variables
     if "docker_env_vars" in self.context.ctx:
         for env_arg in self.context.ctx["docker_env_vars"].keys():
             if not self._ENV_KEY_RE.match(env_arg):
                 raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-            env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "
+            value = self.context.ctx["docker_env_vars"][env_arg]
+            env_args += f"--env {env_arg}={shlex.quote(str(value))} "

     print(f"Env arguments: {env_args}")
     return env_args
```
Addressed in 3fc7587. get_env_arg no longer dumps the full assembled --env KEY='value' ... string into stdout. It now prints only the count:
```python
print(f"Env arguments: {env_args.count('--env')} variables set")
```
Values can include HF tokens, registry passwords, `MAD_SECRETS_*`, etc.; the assembled string is still passed to `docker run` (so the container sees them as expected), but no longer leaks into stdout / CI artifacts.
```diff
 try:
     power_info = amdsmi_interface.amdsmi_get_power_info(processor_handle)
     if isinstance(power_info, dict):
         # If both current and average power are 0, it's likely a secondary die
         current_power = power_info.get('current_socket_power', -1)
         avg_power = power_info.get('average_socket_power', -1)
         if current_power == 0 and avg_power == 0:
             return True
-except:
+except Exception:
     # If we can't get power info, might be secondary die
     return True
```
```diff
 - `sglang-disagg` - Disaggregated SGLang (multi-node)

-See [Distributed Launchers Guide](distributed-launchers.md) for details.
+See [Launchers Guide](launchers.md) for details.
```
```diff
 # Add upload timestamp if not present
 if f"{prefix}_uploaded_at" not in doc:
-    doc[f"{prefix}_uploaded_at"] = datetime.utcnow()
+    doc[f"{prefix}_uploaded_at"] = datetime.now(timezone.utc)

 # Preserve original created_date if present
 if "created_date" not in doc:
-    doc["created_date"] = datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")
+    doc["created_date"] = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
```
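The substitution in the hunk above is the standard replacement for the deprecated naive constructor; a minimal sketch of the difference:

```python
from datetime import datetime, timezone

naive = datetime.utcnow()            # tz-naive; deprecated since Python 3.12
aware = datetime.now(timezone.utc)   # tz-aware drop-in replacement

# strftime formatting is unchanged, so call sites need no other edits:
stamp = aware.strftime("%Y-%m-%d %H:%M:%S")
```

The aware timestamp compares and serializes unambiguously, while the naive one silently depends on an implied UTC convention.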
```diff
@@ -120,6 +122,10 @@ def __init__(self, message: str, context: Optional[ErrorContext] = None, **kwarg
     )

 # Backward compatibility alias
 RuntimeError = ExecutionError
```
```diff
 class TestConfigParser:
     """Test cases for ConfigParser functionality."""

     @pytest.fixture
-    def test_dir(self):
+    def tmp_dir(self):
```
…pend docker.io/)
Surfaced live: `madengine build --build-on-compute --registry rocm/pytorch-private`
failed inside the build sbatch with `docker login rocm` -> dial tcp ...
connect: connection refused. Dockerhub-shorthand without the `docker.io/`
prefix is a routine input shape (matches what Docker CLI accepts for
pull/run); the build-on-compute path silently fed the first segment of
`registry` to `docker login <host>` and got a DNS NXDOMAIN.
`_execute_build_on_compute` now normalizes `--registry` early:
1. Strip whitespace + trailing slash; reject empty after trim.
2. Reject inputs containing whitespace or `@` anywhere (catches typos
like `@bad/foo` BEFORE the auto-prefix step would mask them by
overwriting the first path segment to `docker.io`).
3. If the first path segment doesn't look like a registry host
(no `.`, no `:`, not `localhost`), prepend `docker.io/` and print
a transparent `[dim]auto-prefixing 'docker.io/'[/dim]` notice so
the user can see what the build is actually pushing to.
4. Use the resulting normalized value for everything downstream
(login host extraction, push path construction, manifest entry,
per-registry credential-hint matching).
Effect on real-world inputs (verified end-to-end against
docker.io/rocm/pytorch-private):
rocm/pytorch-private -> docker.io/rocm/pytorch-private (auto)
rocmshared -> docker.io/rocmshared (auto)
docker.io/rocm/pytorch-private -> unchanged
ghcr.io/myorg/myrepo -> unchanged
registry.example.com:5000/myrepo -> unchanged
localhost:5000/myrepo -> unchanged
192.168.1.10:5000/myrepo -> unchanged
@bad/foo -> ConfigurationError (fail-fast,
includes per-registry usage hints)
"" -> ConfigurationError (existing
None-guard, before this change)
Live verification (rocmshared PAT, write+read on rocm/pytorch-private):
Test 1 `rocm/pytorch-private` -> sbatch 22954 ✅ pushed
docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench
Test 2 `@bad/foo` -> ConfigurationError (no sbatch)
"Invalid --registry value: '@bad/foo'.
First segment '@bad' is not a valid
registry host."
Test 3 `docker.io/rocm/pytorch-private` -> sbatch 22956 ✅ pushed (no auto-prefix
message, value unchanged)
Test 4 "" -> caught by existing build CLI guard
Suggestion strings use parens `(...)` rather than `[...]` to avoid
Rich's markdown parser interpreting them as closing tags.
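The normalization steps above can be sketched in a few lines. This is an illustrative reconstruction (`normalize_registry` is a hypothetical helper name, and the real code raises `ConfigurationError` with per-registry hints rather than a bare `ValueError`):

```python
def normalize_registry(registry: str) -> str:
    """Hypothetical sketch of the --registry normalization described above."""
    reg = registry.strip().rstrip("/")
    if not reg:
        raise ValueError("empty --registry value")
    # Reject whitespace and '@' BEFORE auto-prefixing would mask the typo.
    if any(ch.isspace() for ch in reg) or "@" in reg:
        raise ValueError(f"Invalid --registry value: {registry!r}")
    first = reg.split("/", 1)[0]
    # A first segment with no '.', no ':' and not 'localhost' is a Dockerhub
    # namespace (e.g. 'rocm'), not a registry host -> prepend docker.io/.
    if "." not in first and ":" not in first and first != "localhost":
        reg = f"docker.io/{reg}"
    return reg
```

This matches the Docker CLI's own shorthand resolution for pull/run, so `docker login` is handed a real host instead of the first path segment.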
Co-authored-by: Cursor <cursoragent@cursor.com>
Round-4 Copilot review (auto-fired post 2200cbd) identified 24 inline comments. 14 are pre-existing flake8/style nits in `origin/develop`-identical files unchanged by this PR (tests/fixtures/**, scripts/common/tools/**, csv_parser.py, utils/{ops,log_formatting}.py, core/errors.py, database/mongodb.py, etc.); those are out of scope for this PR. This commit addresses the 8 findings that DO touch code we own:

P0 (real latent bugs / known-broken):

src/madengine/orchestration/run_orchestrator.py:
* Drop `Context(rocm_path=...)` kwarg from both call sites (`_init_runtime_context` line 138; `_create_manifest_from_local_image` line 456). Cat B's revert of `core/context.py` to origin/develop removed `rocm_path` from `Context.__init__`, but those two call sites still passed the kwarg. Latent bug: today's slurm_multi tests went through `_execute_distributed` so didn't hit it, but `madengine run` against a local target (`_execute_local`) and the synthetic-manifest path (`MAD_CONTAINER_IMAGE` env) crashed with `TypeError: Context.__init__() got an unexpected keyword argument 'rocm_path'`.
* Drop now-unused `from madengine.core.constants import get_rocm_path` (only callers were the two lines above).
* Drop unused `handle_error` from `madengine.core.errors` import set (flake8 F401).

src/madengine/execution/container_runner.py:
* Widen `perf_pattern` in run-result extraction to mirror the `deployment/base.py:_parse_node_log_for_perf` regex we already widened in 63a2eae:
  before: `r'performance:\s+([0-9][0-9.eE+-]*)\s+([a-zA-Z_][a-zA-Z0-9_]*)'`
  after: `r'performance:\s+([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s+(\S+)'`
  Accepts scientific-notation values and arbitrary metric tokens (`tokens/sec`, `tok/s`, `samples_per_second`, etc.).

docs/deployment.md:
* `[Distributed Launchers Guide](distributed-launchers.md)` -> `(launchers.md)` at line 145. Second occurrence of the broken link the 0156d33 fix missed (it only fixed line 543).
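The widened `perf_pattern` can be exercised directly to show what the old pattern missed:

```python
import re

# Patterns copied from the commit message above.
OLD = re.compile(r'performance:\s+([0-9][0-9.eE+-]*)\s+([a-zA-Z_][a-zA-Z0-9_]*)')
NEW = re.compile(r'performance:\s+([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s+(\S+)')

line = "performance: 1.23e+04 tokens/sec"
old_match = OLD.search(line)   # unit truncated at the '/' character
new_match = NEW.search(line)   # full unit token captured
```

The old metric class `[a-zA-Z_][a-zA-Z0-9_]*` stops at `/`, so a `tokens/sec` unit was silently truncated to `tokens`; `\S+` captures the whole token.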
Yellow-tier hardening (security / leak / shell-quoting):

src/madengine/execution/container_runner.py:
* `login_to_registry` now delegates to `madengine.core.auth.login_to_registry` instead of constructing `f"echo '{password}' | docker login ..."` inline. The previous form put the password in argv (visible via `ps`/`/proc`) even though docker itself read it from stdin. Same delegation pattern DockerBuilder already uses.
* `get_env_arg` no longer prints the assembled `--env KEY='value'` string (which can include HF tokens, registry passwords, `MAD_SECRETS_*`). Print only the count.
* Self-managed launcher path (`_execute_self_managed_launcher`): `cmd = f"bash {script_path} {model_args}"` with `shell=True` no longer string-concats user-supplied `model_args` into the shell command. Use `shlex.quote(script_path)` and `shlex.split(model_args)` + per-arg `shlex.quote`, so embedded `$()`/backticks/`;` become literal arguments to the script instead of host-shell directives.

Verified:
* py_compile clean on both files.
* Import smoke: DeploymentFactory.available_deployments() == ['k8s','kubernetes','slurm']; Context() rejects `rocm_path=` arg; _deep_merge nested-leaf preservation; run_orchestrator no longer references get_rocm_path/handle_error/rocm_path=; container_runner perf_pattern widened, env_args print is count-only, login_to_registry delegates to core.auth, self-managed shlex.quote applied.
* Live: pyt_large_ep_benchmark_2n on [008,024] via slurm_multi completed in 8 min with 28 ✅ Success rows -> perf.csv at /shared_inference/ravgupta/<jobid>/ as expected.

Addresses Copilot review comments R4-1, R4-2, R4-3, R4-4, R4-5, R4-6, R4-7, R4-18.

Co-authored-by: Cursor <cursoragent@cursor.com>
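The self-managed launcher quoting fix can be demonstrated in isolation. This is a sketch under the assumption stated above (`build_host_command` is a hypothetical stand-in for the command assembly inside `_execute_self_managed_launcher`):

```python
import shlex

def build_host_command(script_path: str, model_args: str) -> str:
    """Sketch: quote each user-supplied token individually so shell
    metacharacters become literal arguments to the script."""
    args = [shlex.quote(a) for a in shlex.split(model_args)]
    return " ".join(["bash", shlex.quote(script_path)] + args)

# A hostile model_args string: the $() fragment would have executed
# under the old f-string + shell=True assembly.
cmd = build_host_command("/tmp/run model.sh", "--batch 8 $(touch /tmp/pwned)")
```

After quoting, `$(touch` is a single-quoted literal argument, so the host shell never interprets it as command substitution.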
Pull request overview
Copilot reviewed 74 out of 75 changed files in this pull request and generated 16 comments.
Comments suppressed due to low confidence (2)
tests/unit/test_reporting_superset.py:39
- This file contains whitespace on otherwise blank lines (e.g. between the class docstring and the first fixture). The repo's pre-commit `trailing-whitespace` hook (and flake8 W293) will fail on this; please remove trailing spaces on blank lines.

tests/unit/test_reporting_superset.py:171
- Additional whitespace-only lines in this section will also fail the pre-commit `trailing-whitespace` hook. Please strip the trailing spaces from blank lines here as well.
```toml
dependencies = [
    "pandas",
    "GitPython",
    "jsondiff",
    "sqlalchemy",
    "setuptools-rust",
    "paramiko",
    "tqdm",
    "pytest",
    "typing-extensions",
    "pymongo",
```
Addressed in afbb666: dropped pytest from [project].dependencies. It was already present in [project.optional-dependencies].dev and .all, so no install path that was running tests is affected; only the bare pip install madengine (without [dev]/[all]) no longer pulls a test framework into runtime installs.
```diff
 result = builder.get_build_arg()

-assert "--build-arg ARG1='value1'" in result
-assert "--build-arg ARG2='value2'" in result
+assert f"--build-arg ARG1={shlex.quote('value1')}" in result
+assert f"--build-arg ARG2={shlex.quote('value2')}" in result
```
Addressed in afbb666: 5 build-arg assertions updated to match DockerBuilder.get_build_arg()'s shlex.quote() output. For shell-safe values (e.g. value1), shlex.quote returns the value unchanged, so the assertions are now --build-arg ARG1=value1 (no surrounding quotes). Verified locally: pytest -k build_arg → 4 passed.
```diff
 @pytest.mark.parametrize(
     "clean_test_temp_files", [["perf_test.csv", "perf_test.html"]], indirect=True
 )
 def test_commandline_argument_skip_gpu_arch(
-    self, global_data, clean_test_temp_files
+    self, global_data, clean_test_temp_files, dynamic_skip_gpu_arch_model_dir
 ):
     """
     Test that skip_gpu_arch command-line argument skips GPU architecture check.
+    Uses a dynamic MODEL_DIR with the current GPU arch added to the skip list
+    so the test works on any hardware.
     UPDATED: Now uses python3 -m madengine.cli.app instead of legacy mad.py
     """
     context = generate_additional_context_for_machine()
     output = global_data["console"].sh(
         "cd "
         + BASE_DIR
         + "; "
         + "MODEL_DIR="
-        + MODEL_DIR
+        + dynamic_skip_gpu_arch_model_dir
         + " "
         + f"python3 -m madengine.cli.app run --tags dummy_skip_gpu_arch --live-output --additional-context '{json.dumps(context)}'"
     )
     if "Skipping model" not in output:
         pytest.fail("Enable skipping gpu arch for running model is failed.")
```
Deferred to upstream-flake8 cleanup PR. The line at #L93 is from 04aac390 (Stephen Shao) and is byte-identical to origin/develop; this PR's commit 50e8895 only touched the surrounding lines (88, 99) for whitespace and a fixture-arg rename, not the assertion target itself. Tracked alongside the rest of the Round-4 / Round-5 upstream-style nits in the follow-up cleanup PR (see #86 description, "Round 4" / "Round 5" status sections).
```diff
 if deployment_type == "slurm" and in_slurm_job:
     print(f"🔄 Using fresh pull policy for SLURM compute node (prevents cached layer corruption)")
     # Remove any existing cached image to force fresh pull
     try:
         self.console.sh(f"docker rmi -f {registry_image} 2>/dev/null || true")
         print(f"✓ Removed cached image layers")
-    except:
+    except Exception:
         pass  # It's okay if image doesn't exist
```
Addressed in afbb666: the bare except: at L424 (introduced by this PR's 6f2e11bf) was replaced with except Exception: so KeyboardInterrupt/SystemExit propagate, and flake8 E722 no longer fires on this line.
```diff
 def get_env_arg(self, run_env: typing.Dict) -> str:
     """Get the environment arguments for docker run."""
     env_args = ""

     # Add custom environment variables
     if run_env:
         for env_arg in run_env:
             if not self._ENV_KEY_RE.match(env_arg):
                 raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-            env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "
+            env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "

     # Add context environment variables
     if "docker_env_vars" in self.context.ctx:
         for env_arg in self.context.ctx["docker_env_vars"].keys():
             if not self._ENV_KEY_RE.match(env_arg):
                 raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-            env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "
+            value = self.context.ctx["docker_env_vars"][env_arg]
+            env_args += f"--env {env_arg}={shlex.quote(str(value))} "
```
Addressed in afbb666: get_env_arg now passes each value through shlex.quote(...) instead of single-quote string interpolation. shlex.quote is byte-equivalent for shell-safe values (alphanumeric, _, -, ., :, /, +, =, etc.) and correctly escapes embedded single quotes, whitespace, and other metacharacters.
slurm_multi safety: the self-managed launcher path (_run_self_managed) returns at L1099 before the docker-run command assembly at L1058 that calls get_env_arg, so this change only tightens the regular-Docker run path used by non-slurm_multi workloads. Verified end-to-end with a fresh pyt_large_ep_benchmark_2n slurm_multi smoke.
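The `shlex.quote` behavior described above can be shown with a small standalone sketch (`get_env_args` here is a simplified stand-in for the real method, without the key-name validation):

```python
import shlex

def get_env_args(env: dict) -> str:
    """Simplified sketch of the --env assembly using shlex.quote."""
    return " ".join(f"--env {k}={shlex.quote(str(v))}" for k, v in env.items())

args = get_env_args({
    "PLAIN": "value1",            # shell-safe: passes through unchanged
    "SPACED": "two words",        # gets single-quoted
    "TRICKY": "it's a 'quote'",   # embedded quotes escaped correctly
})
```

Shell-safe values are byte-identical to the old single-quote interpolation minus the quotes; only values containing metacharacters gain escaping, and those were exactly the values the old form broke on.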
```diff
 if timeout < -1:
     console.print(
-        "❌ [red]Timeout must be -1 (default) or a positive integer[/red]"
+        "❌ [red]Timeout must be -1 (default), 0 (no timeout), or a positive integer[/red]"
     )
```
Addressed in afbb666: validation message updated to match the actual accepted shape — "Timeout must be -1 (default), 0 (no timeout), or a positive integer". The --timeout help string already documents 0 for no timeout (line 67) and the validator is if timeout < -1, so 0 was always accepted; only the error string was misleading. No behaviour change.
```diff
 def is_amd_gpu():
     """Check if system has AMD GPU."""
     try:
         import subprocess
         result = subprocess.run(['rocm-smi'], capture_output=True, timeout=5)
         return result.returncode == 0
-    except:
+    except Exception:
         return False
```
Deferred. File is byte-identical to origin/develop in this PR. Tracked alongside the other upstream-style nits in the follow-up cleanup PR.
```diff
 # Clean up Docker containers BEFORE test (ensure clean state)
 try:
     subprocess.run(
         "docker ps -a | grep 'container_ci-dummy' | awk '{print $1}' | xargs -r docker rm -f",
         shell=True,
         capture_output=True,
         timeout=30
     )
-except:
+except Exception:
     pass  # Ignore cleanup errors before test
```
Deferred. File is byte-identical to origin/develop in this PR (no edits from #86). The two bare-except instances in this file are pre-existing; develop's [2.0.1] test-suite cleanup narrowed 5 such cases elsewhere but did not touch these two. Tracked in the upstream-cleanup follow-up PR.
```diff
 # Clean up Docker containers AFTER test (avoid conflicts with next test)
 try:
     subprocess.run(
         "docker ps -a | grep 'container_ci-dummy' | awk '{print $1}' | xargs -r docker rm -f",
         shell=True,
         capture_output=True,
         timeout=30
     )
-except:
+except Exception:
     pass  # Ignore cleanup errors after test
```
Deferred. Same disposition as the L106 thread above — file byte-identical to origin/develop, pre-existing concern, tracked in the upstream-cleanup follow-up PR.
```diff
 def convert_directory_csvs_to_html(
     directory_path: str,
     output_file: str = "run_results.html"
-) -> str:
+) -> Optional[str]:
     """Convert all CSV files in a directory to a single HTML file.

     Args:
```
Deferred. This file is byte-identical to origin/develop in this PR. Tracked alongside the other upstream-style nits in the follow-up cleanup PR.
Status update — ready for review

**What's in here**

Kubernetes deployment work split out into #118 (independent PR).

**Live end-to-end test coverage**

The vLLM-disagg run path was also exercised end-to-end (slurm_multi topology + co-located proxy + port refactor + MoRI-EP gating all working).

**Copilot review status**

Four rounds total, all addressed. The "P0" subset of Round 4 included a real latent crash I'd introduced in the Cat B revert.

**Bugs surfaced and fixed during real-cluster testing**

**Sequencing**

**What I'd appreciate eyes on**

Reviews welcome. cc @gargrahul @Cemberk @Rohan138 @coketaste @leconcio
Three P1 items in PR-introduced code, plus four small cleanups Copilot
flagged on the same lines.
container_runner.py
* get_env_arg: replace single-quote string interpolation
(`--env KEY='value'`) with shlex.quote, so values containing single
quotes / whitespace / shell metacharacters no longer break the
assembled `docker run` command. shlex.quote is byte-equivalent
for safe alphanumeric values (returns the input unchanged) and
only adds escaping when needed. Not on the slurm_multi path
(self-managed launcher short-circuits at L1098 before the docker
docker_options assembly that calls get_env_arg).
* Revert the AMD GPU SMI block (~L1142) to origin/develop's bare-name
form (`amd-smi 2>/dev/null || rocm-smi 2>/dev/null || echo ...`).
The PR-introduced version interpolated the *host* `context.ctx['rocm_path']`
(or `get_rocm_path()`) into absolute `/opt/rocm-X.Y/bin/*` paths and
executed them *inside* the container, which is incorrect when the
container's ROCm install location differs from the host's. Bare
names rely on the container's PATH, which is the upstream behaviour.
* Replace bare `except:` at L424 (PR-introduced) with `except Exception:`
so KeyboardInterrupt / SystemExit propagate (flake8 E722).
* Fix a stale function-name reference in the perf-pattern comment at
L1394: deployment/base.py's function is `_parse_performance_from_log`,
not `_parse_node_log_for_perf`.
* Hoist `import shlex` to the module-level import block (was a local
import inside `_run_self_managed`) and drop the now-unused
`from madengine.core.constants import get_rocm_path`.
cli/commands/run.py
* Update timeout validation message to match the actual accepted
shape ("-1 (default), 0 (no timeout), or a positive integer").
The --timeout help string already documents `0 for no timeout`,
and the validator uses `if timeout < -1`, so 0 was always
accepted; only the error string was misleading.
pyproject.toml
* Drop `pytest` from [project].dependencies (it stays in
[project.optional-dependencies].dev and .all). Keeps test
dependencies out of `pip install madengine` runtime installs.
tests/integration/test_docker_integration.py
* Update 5 build-arg assertions to match `DockerBuilder.get_build_arg()`'s
shlex.quote() output for shell-safe values
(`--build-arg ARG1=value1`, no surrounding quotes).
Verified: `pytest -k build_arg` -> 4 passed.
slurm_multi safety: get_env_arg is reachable only from the regular
docker run path (L1058); the slurm_multi self-managed launcher
returns at L1099 before that. The SMI revert and the bare-`except`
fix are both in code paths used only outside slurm_multi (image
pull / GPU pre-check). The timeout, pyproject and comment-typo
changes have no runtime effect.
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	CHANGELOG.md
Pull request overview
Copilot reviewed 73 out of 74 changed files in this pull request and generated 14 comments.
Comments suppressed due to low confidence (2)
src/madengine/deployment/base.py:236
- `_monitor_until_complete()` never treats CANCELLED (or any non-SUCCESS/FAILED terminal status) as terminal. Since `DeploymentStatus` still includes CANCELLED, this loop can spin forever after a cancellation. Consider returning on CANCELLED as well (and/or treating any non-PENDING/RUNNING status as terminal).
```python
self.console.print("[blue]Monitoring deployment...[/blue]")
while True:
    status = self.monitor(deployment_id)
    if status.status in [DeploymentStatus.SUCCESS, DeploymentStatus.FAILED]:
        return status
    # Still running, wait and check again
    self.console.print(
        f"  Status: {status.status.value} - {status.message}"
    )
    time.sleep(30)  # Check every 30 seconds
```
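One way to implement the suggested fix, sketched with a hypothetical status enum: treat anything that is no longer PENDING/RUNNING as terminal.

```python
from enum import Enum

class DeploymentStatus(Enum):
    # illustrative subset of the real enum
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    CANCELLED = "cancelled"

NON_TERMINAL = {DeploymentStatus.PENDING, DeploymentStatus.RUNNING}

def is_terminal(status: DeploymentStatus) -> bool:
    # CANCELLED (and any future terminal state) now ends the monitor loop
    return status not in NON_TERMINAL
```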
src/madengine/execution/container_runner.py:251
`create_run_details_dict()` indexes `self.context.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_ARCHITECTURE"]` directly. If `docker_env_vars` or the key is missing (e.g., older manifests / non-GPU runs / context init differences), this will raise `KeyError` and prevent perf entry generation. Use `.get()` with a safe fallback (or guard when the nested dict/key is absent).
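The suggested guard, sketched with `.get()` chains (key names from the comment above; the function name and the `"unknown"` fallback are assumptions):

```python
def get_gpu_arch(ctx: dict) -> str:
    # .get() chains avoid KeyError when docker_env_vars or the key is absent;
    # the `or {}` also covers an explicit None value
    return (ctx.get("docker_env_vars") or {}).get(
        "MAD_SYSTEM_GPU_ARCHITECTURE", "unknown"
    )

assert get_gpu_arch({}) == "unknown"
assert get_gpu_arch(
    {"docker_env_vars": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx942"}}
) == "gfx942"
```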
```python
# Merge env_vars from model_info (models.json) into docker_env_vars
if "env_vars" in model_info and model_info["env_vars"]:
    model_env_count = 0
    for key, value in model_info["env_vars"].items():
        self.context.ctx["docker_env_vars"][key] = str(value)
        model_env_count += 1
    if model_env_count > 0:
        print(f"ℹ️ Merged {model_env_count} environment variables from model_info (models.json)")
```
Addressed in bcafdbb: restored the upstream finalize_container_rocm_path() call block (originally introduced on develop in #110, removed by this PR's 6f2e11bf when the model_info env_vars merge block replaced it). The two address different concerns and must coexist — the env_vars merge copies model-card values into docker_env_vars, while finalize normalizes in-container ROCM_PATH via OCI inspect → in-image probe → /opt/rocm default. Block restored verbatim from origin/develop (including the user_supplied_rocm_path short-circuit), placed immediately after the env_vars merge.
slurm_multi safety: the self-managed launcher returns at L1099 before this code path, so this restoration only affects the regular-Docker run path used by non-slurm_multi workloads.
```python
model_docker.sh(f"cp {model_dir}/*_profiler_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {model_dir}/*_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {model_dir}/*_trace.csv . 2>/dev/null || true")
model_docker.sh(f"cp {model_dir}/library_trace.csv . 2>/dev/null || true")
```
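These `cp` commands interpolate `model_dir` into a shell string unquoted. A host-side sketch of the safer pattern, with illustrative temp paths and a plain `subprocess.run` standing in for `model_docker.sh`: quote the directory with `shlex.quote`, and leave the `*.csv` glob outside the quotes so the shell still expands it.

```python
import pathlib
import shlex
import subprocess
import tempfile

# Directory name containing a space, to exercise the quoting
model_dir = tempfile.mkdtemp(prefix="model dir ")
pathlib.Path(model_dir, "x_output.csv").write_text("o")

_md_q = shlex.quote(model_dir)
# The glob stays outside the quotes: the shell expands it inside the quoted dir
result = subprocess.run(
    f"ls {_md_q}/*_output.csv", shell=True, capture_output=True, text=True
)
assert result.returncode == 0
```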
Addressed in bcafdbb: model_dir is now passed through shlex.quote once and reused across the four `cp ... 2>/dev/null || true` calls. The trailing `*.csv` glob is left unquoted intentionally so the in-container shell still expands it inside the (quoted) directory:

```python
_md_q = shlex.quote(model_dir)
model_docker.sh(f"cp {_md_q}/*_profiler_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {_md_q}/*_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {_md_q}/*_trace.csv . 2>/dev/null || true")
model_docker.sh(f"cp {_md_q}/library_trace.csv . 2>/dev/null || true")
```

```diff
 mult_res = model_info.get("multiple_results")
 if mult_res:
     try:
-        model_docker.sh(f"cp {model_dir}/{mult_res} . 2>/dev/null || true")
+        model_docker.sh(
+            _cp_model_dir_file_to_cwd_cmd(model_dir, mult_res)
+        )
     except Exception:
```
Addressed in bcafdbb: both path components are now quoted, so a model-metadata `multiple_results` value containing whitespace or shell metacharacters can no longer break the command or be re-interpreted by the shell:

```python
model_docker.sh(
    f"cp {shlex.quote(model_dir)}/{shlex.quote(mult_res)} . "
    "2>/dev/null || true"
)
```

```python
"""

import json
import os
```
Addressed in bcafdbb: dropped the unused import os (flake8 F401). The module hasn't referenced os.* since the orchestration helpers it used were moved out.
```diff
 context = generate_additional_context_for_machine()
 output = global_data["console"].sh(
     "cd "
     + BASE_DIR
     + "; "
     + "MODEL_DIR="
-    + MODEL_DIR
+    + dynamic_skip_gpu_arch_model_dir
     + " "
     + f"python3 -m madengine.cli.app run --tags dummy_skip_gpu_arch --live-output --additional-context '{json.dumps(context)}'"
 )
 if "Skipping model" not in output:
     pytest.fail("Enable skipping gpu arch for running model is failed.")
```
Deferred. The line at #L93 is from 04aac390 (Stephen Shao) and is byte-identical to the version on origin/develop in this PR. develop's [2.0.1] already lands the hardware-agnostic fix ("E2E tests — hardware-agnostic GPU arch skip"): `test_commandline_argument_skip_gpu_arch` now detects the runner's GPU arch at runtime and injects it into `skip_gpu_arch`, with a new `get_gpu_arch()` helper in `tests/fixtures/utils.py`. Once #86 lands and CI re-runs against develop, that fix is what executes; no #86-side change needed. Tracked alongside the other origin/develop-identical concerns in the upstream-flake8 follow-up cleanup PR (per #86 description, Round 4 / Round 5 / Round 6 status sections).
```diff
 try:
     import subprocess
     result = subprocess.run(['rocm-smi'], capture_output=True, timeout=5)
     return result.returncode == 0
-except:
+except Exception:
     return False
```
Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L41 is from 04aac390). Tracked alongside the other upstream bare-except: cleanup in the follow-up PR.
```diff
 try:
     self.rocmsmi = initRsmiBindings(silent=False)
     if rocm_smi.driverInitialized() is True:
         ret_init = self.rocmsmi.rsmi_init(0)
         if ret_init != 0:
-            exit(ret_init)
+            raise ValueError('ROCm SMI returned %s (the expected value is 0)', ret_init)
     else:
-        exit(0)
+        raise ImportError('Driver not initialized (amdgpu not found in modules)')
     self.rocm6 = True
-except:
+except Exception:
     rocm_smi.initializeRsmi()
```
Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L42 is from da0d524f (Rahul Garg) and predates #86). Tracked in the upstream-cleanup follow-up PR.
```diff
 try:
     power_info = amdsmi_interface.amdsmi_get_power_info(processor_handle)
     if isinstance(power_info, dict):
         # If both current and average power are 0, it's likely a secondary die
         current_power = power_info.get('current_socket_power', -1)
         avg_power = power_info.get('average_socket_power', -1)
         if current_power == 0 and avg_power == 0:
             return True
-except:
+except Exception:
     # If we can't get power info, might be secondary die
     return True
```
Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L157 is from 0f2c3010 (Stephen Shao) and predates #86). Tracked in the upstream-cleanup follow-up PR.
```python
from unittest.mock import MagicMock, patch, Mock
import pytest
import pandas as pd
```
Addressed in bcafdbb: dropped the unused import pandas as pd (flake8 F401). Confirmed there are no pd.* references in the file — the test fixtures use dict / MagicMock only, so the import was dead. The line was added by this PR's 50e8895.
```python
class TestConfigParser:
    """Test cases for ConfigParser functionality."""
```
Addressed in bcafdbb: trailing whitespace stripped on every blank-with-spaces line (107 occurrences in this file, including the one at L31). Pure whitespace cleanup, no functional change; flake8 W293 is now clean for this file.
Round 6 fired automatically on the develop-merge push (8fec192). 5 of the 14 comments are PR-introduced; the remaining 9 are byte-identical to origin/develop and stay on the upstream-flake8 cleanup follow-up list with the rest of R4 / R5 deferrals.

container_runner.py
* Restore the upstream `finalize_container_rocm_path()` call block (originally introduced on develop in PR ROCm#110, removed by this PR's 6f2e11b when the model_info `env_vars` merge block replaced it). The two blocks address different concerns and must coexist: the merge copies model-card env_vars into docker_env_vars (this PR), while finalize normalizes in-container ROCM_PATH via OCI inspect → in-image probe → /opt/rocm default (upstream). Without the finalization, models on TheRock or non-`/opt/rocm` container layouts run with a wrong/empty ROCM_PATH, breaking rocm-smi / amd-smi / hipcc inside the container. The slurm_multi self-managed launcher returns at L1099 before this code path, so this restoration only affects the regular docker run path.
* Quote `model_dir` (and `mult_res`) with shlex.quote on the profiler-/trace-/multi-results `cp` commands run via `model_docker.sh(... shell=True)` (lines 1672-1685 area). The `*.csv` glob is left unquoted so the in-container shell still expands it inside the (quoted) directory. `mult_res` comes straight from model metadata and was previously concatenated raw into the shell command.
deployment/config_loader.py
* Drop unused `import os` (flake8 F401). The module hasn't referenced `os.*` since the orchestration helpers it used were moved out.
tests/unit/test_database_mongodb.py
* Drop unused `import pandas as pd` (flake8 F401). The test file uses dict / Mock fixtures and never reaches into pandas.
tests/unit/test_reporting_superset.py
* Strip trailing whitespace (flake8 W293) on every blank-with-spaces line (107 occurrences). Pure whitespace; no functional change.
slurm_multi safety: all five fixes either restore upstream behaviour on the regular-Docker run path, are flake8-only, or are in test files. None of them change anything reached by the slurm_multi self-managed launcher path; live re-smoke is unnecessary for this commit. Co-authored-by: Cursor <cursoragent@cursor.com>
…lti bash branch

The earlier feat(deployment) commit ported `_run_inside_existing_allocation` verbatim from PR ROCm#86 (used by the slurm_multi bash-in-salloc branch). That method constructs `DeploymentResult(..., skip_monitoring=True)` to signal that the script ran synchronously and the monitor phase should be skipped. Develop's `DeploymentResult` dataclass did not have that field, so the construction raised `TypeError` at runtime, the failure was swallowed by the orchestrator, and `madengine run` exited 0 even when the wrapper script failed.

Two minimal additive edits to deployment/base.py:
* DeploymentResult: add `skip_monitoring: bool = False` field.
* BaseDeployment.execute(): change the monitor guard from `if self.config.monitor:` to `if self.config.monitor and not result.skip_monitoring:` so the bash branch can correctly bypass the SLURM job poll (there is no SLURM job to poll — the script ran inline).

Behavior for non-slurm_multi launchers is unchanged: they never construct DeploymentResult with skip_monitoring=True, the field defaults to False, and the monitor() call still fires as before.

Discovered during R5.1b cluster smoke test of `pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186.

Co-authored-by: Cursor <cursoragent@cursor.com>
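The two additive edits can be sketched on a hypothetical minimal `DeploymentResult` / `BaseDeployment` pair (the `skip_monitoring` field and guard follow the commit message; everything else here is a placeholder):

```python
from dataclasses import dataclass

@dataclass
class DeploymentResult:
    success: bool
    # New additive field; defaults to False so existing launchers are unaffected
    skip_monitoring: bool = False

class BaseDeployment:
    """Placeholder shape; only the monitor guard mirrors the described change."""
    def __init__(self, monitor: bool = True):
        self.monitor = monitor
        self.polled = False

    def deploy(self) -> DeploymentResult:
        return DeploymentResult(success=True)

    def execute(self) -> DeploymentResult:
        result = self.deploy()
        # Guard now also honors skip_monitoring (bash-in-salloc branch)
        if self.monitor and not result.skip_monitoring:
            self.polled = True  # stand-in for the SLURM job poll
        return result

class SelfManagedDeployment(BaseDeployment):
    def deploy(self) -> DeploymentResult:
        # Script ran synchronously inline; there is no SLURM job to poll
        return DeploymentResult(success=True, skip_monitoring=True)
```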
Summary

Adds a `slurm_multi` deployment launcher for self-managed multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer) and the orchestration / build / run / CLI hooks that drive it from `madengine`'s manifest pipeline. Companion model cards and submit scripts that exercise this launcher live in ROCm/MAD-private PR #186.

Scope (75 files vs `develop`)

This PR is now focused strictly on `slurm_multi` and its supporting refactor. The Kubernetes deployment work originally bundled here has been moved to #118 (independent PR against `develop`) and will rebase after this PR lands.

- `deployment/{slurm.py, slurm_node_selector.py, base.py, common.py, config_loader.py, factory.py}`, `templates/slurm/job.sh.j2`
- `orchestration/{build_orchestrator, run_orchestrator, image_filtering}.py` — manifest-driven `--build-on-compute` / `--use-image`, deep-merged `deployment_config`, registry validation
- `execution/{container_runner, container_runner_helpers, docker_builder}.py`
- `cli/{app, commands/run, commands/build, constants, utils, validators}.py`
- `core/`, `utils/`, `reporting/`, `database/mongodb.py`, `scripts/common/`
- `examples/slurm-configs/**`, `docs/launchers.md`, `docs/deployment.md`, `docs/configuration.md`, `tests/{unit,integration,e2e,fixtures}/**`

Net diff: +2,877 / -2,179 across 75 files (down from +3,014 / -2,624 across 87 files at the squash-rebase head, after Round-2 / Round-3 / Round-4 review fixes restored 12 files to `origin/develop` byte-for-byte and tightened 4 more in-scope code paths).

Commit shape
Logical commits on top of `develop`:

1. `deployment: add slurm_multi launcher, node selector, SBATCH template` (8815e24) — SLURM deployment core; absorbs the slurm-side Copilot Round-1 review fixes.
2. `orchestration/execution/cli: wire slurm_multi end-to-end` (6f2e11b) — manifest deep-merge, registry guard, per-registry auth hints, prebuilt-image multi-model fix, subprocess shell trust comment, and the rest of the orchestration/CLI plumbing.
3. `examples/docs/tests: slurm_multi configs, README, and test coverage` (50e8895) — operator-facing docs, examples renamed `sglang-disagg-*` → `slurm-multi-*` (one launcher drives multiple flows), and test additions.
4. `PR #86 follow-up: ErrorContext kwargs, model-card image fallback, perf.csv collection` (5ec55ff) — fixes surfaced during the live SGLang-disagg smoke run (job 22607 / 22615 on `Distributed_Inference_CI`).
5. `slurm: namespace completion marker by SLURM_JOB_ID` (710b51a) — second-run safety: per-deploy completion marker now job-id namespaced.
6. `Revert out-of-scope changes flagged by Copilot Round-2 review` (9109f52) — 11 files reverted to `origin/develop` byte-for-byte; surgical reverts in `cli/utils.py` and `execution/docker_builder.py` keep the slurm_multi-relevant bits (`registry_image` → `DOCKER_IMAGE_NAME` propagation for parallel pull, `build_gpu_arch_display` helper, `Workload` column rename).
7. `Address Copilot Round-2 slurm_multi-related findings` (63a2eae) — 6 real fixes: module-level recursive `_deep_merge` used on the manifest path; `manifest.context.docker_env_vars` restored into runtime context; `RunOrchestrator._load_credentials` delegated to `madengine.core.auth.load_credentials` (object-validates `credential.json`); `configure_multi_node_profiling` only filters `rocprof` when `rocprofv3` is missing (keeps `rccl_trace` / `rocblas_trace`); perf-line regex widened (scientific notation + arbitrary metric tokens); `"megatron"` and `"primus"` added to `VALID_LAUNCHERS`.
8. `docs: fix broken Next-Steps links and example paths` (0156d33) — `distributed-launchers.md` → `launchers.md`; `user-guide.md` → `usage.md`; `how-to-quick-start.md` → `cli-reference.md`; dropped K8s `slurm-multi-*.json` paths that don't exist.
9. `Address Copilot Round-3 small findings` (2200cbd) — `examples/profiling-configs/README.md` `multinode` → `multi_node`; drop unused `import pytest` from `tests/unit/test_reporting.py`; revert `tests/unit/test_rocm_path.py` to `origin/develop` (matches the `Context` revert in 9109f52).
10. `build_orchestrator: lowercase --build-on-compute local image name` (ae1996c) — surfaced live during `--build-on-compute --registry docker.io/rocm/pytorch-private --tags pyt_kvcache_transfer_bench`: `Path(dockerfile_path).name` for `scripts/.../Dockerfile` (no `.ubuntu.amd.Dockerfile` suffix to strip) yielded `Dockerfile`, producing `ci-..._Dockerfile` and a `docker build -t` rejection. `.lower()` the assembled local image name; verified end-to-end (Slurm 22936, image pushed to `docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench`).
11. `build_orchestrator: normalize and validate --registry shape (auto-prepend docker.io/)` (98f1ac5) — surfaced live with `--registry rocm/pytorch-private`: `registry.split("/")[0]` was treated as the docker login host (`docker login rocm` -> DNS NXDOMAIN). `_execute_build_on_compute` now normalizes shorthand `<ns>/<repo>` to `docker.io/<ns>/<repo>` with a transparent `[dim]auto-prefixing 'docker.io/'[/dim]` notice, and rejects malformed input (whitespace, `@`, empty after trim) up front with usage hints for Dockerhub / GHCR / Quay / NGC / self-hosted / localhost. Verified end-to-end across all four input shapes (sbatch 22954/22956 succeeded, `@bad/foo` rejected without sbatch).
12. `Address Copilot Round-4 findings: 5 P0 + 3 hardening` (3fc7587) — 8 fixes covering all in-scope Round-4 items (the other 14 R4 findings are pre-existing flake8/style nits in `origin/develop`-identical files unchanged by this PR):
    - `run_orchestrator.py`: drop the now-unused `Context(rocm_path=...)` kwarg from the `_init_runtime_context` (line 138) and `_create_manifest_from_local_image` (line 456) call sites; drop `from madengine.core.constants import get_rocm_path` and the unused `handle_error` imports.
      Fixes a latent `TypeError` on `madengine run` against a local target after the Cat B `core/context.py` revert.
    - `container_runner.py`: widen `perf_pattern` to mirror the `deployment/base.py:_parse_node_log_for_perf` regex (R2-Distributed solution - new madengine CLI #11) so `tokens/sec`, `tok/s`, `samples_per_second`, `1.23E+4` no longer get silently dropped.
    - `container_runner.py`: `login_to_registry` delegates to `madengine.core.auth.login_to_registry` instead of an inline `f"echo '{password}' | docker login ..."` (the password is no longer visible in `ps` / `/proc` argv even with `--password-stdin`); same delegation `DockerBuilder.login_to_registry` already uses.
    - `container_runner.py`: `get_env_arg` no longer prints the assembled `--env KEY='value'` to stdout (HF tokens, registry passwords, `MAD_SECRETS_*` etc. could leak); print only the count.
    - `container_runner.py`: the self-managed launcher (`_run_self_managed`) now `shlex.quote`s `script_path` and `shlex.split`s + per-arg `shlex.quote`s `model_args`, so embedded shell metacharacters (`$()`, backticks, `;`, ...) become literal arguments to the script rather than host-shell directives.
    - `docs/deployment.md`: second occurrence of `[Distributed Launchers Guide](distributed-launchers.md)` (line 145) → `(launchers.md)` (the 0156d33 commit fixed only line 543).

Copilot review status
All four rounds resolved or replied-to inline:
- Round 1: fixed in d32b06b / 5ec55ff; replies posted under each thread.
- Round 2: fixed in 9109f52 / 63a2eae / 0156d33 and replied-to inline:
  - `slurm.py`: drop duplicate `Optional` import; honour `slurm_config.exclusive`.
  - `slurm_node_selector.py`: cleanup_node merged srun; preserves `--job-name` and `--reservation`.
  - `build_orchestrator.py`: built_models keyed by `model_name`; warn on divergent configs; early registry None guard; per-registry auth hints.
  - `run_orchestrator.py`: real recursive `_deep_merge` for `deployment_config` ↔ `additional_context`; manifest's `context.docker_env_vars` restored; `_load_credentials` delegates to central auth helper.
  - `container_runner.py`: documented `shell=True` trust boundary on inner `subprocess.run`.
  - `deployment/common.py`: `rocprofv3`-fallback only filters `rocprof`; `VALID_LAUNCHERS` covers `"megatron"` / `"primus"`.
  - `deployment/base.py`: perf-line regex accepts scientific notation + arbitrary metric tokens.
  - Out-of-scope files (`core/{context,docker,dataprovider,timeout}.py`, `cli/app.py`, `utils/{rocm_tool_manager,discover_models,config_parser}.py`, `orchestration/image_filtering.py`, `reporting/update_perf_csv.py`, `README.md`) reverted to `origin/develop`.
  - Docs link fixes (`docs/{installation,deployment,launchers}.md`).
- Round 3: fixed in 2200cbd, 7 stale-snapshot replies (already addressed in earlier rounds), 2 deferred to follow-up issues (pre-existing concerns in `origin/develop`: `container_runner.get_env_arg` env-arg shell-escape hardening; `Console.sh` / `subprocess.run(timeout=…)` translation of `0` → `None`).
- Round 4: 8 fixed in 3fc7587 (5 P0 + 3 hardening, see commit 12 above); 14 are pre-existing flake8/style nits in files byte-identical to `origin/develop` (`utils/{ops,log_formatting}.py`, `scripts/common/{tools,pre_scripts}/**`, `tests/fixtures/utils.py`, `tests/integration/test_gpu_management.py`, `tests/unit/test_reporting_superset.py`, `core/errors.py`, `database/mongodb.py`, `deployment/base.py:645`) — not introduced by this PR; deferred to a separate cleanup PR.

Sequencing
- This PR targets `develop`.
- #118 rebases once this PR lands on `develop` and merges.
- `develop` promotes to `main` per repo policy.

Test plan
Unit-level checks:

- `DeploymentFactory.available_deployments()` returns `slurm` (and `k8s`/`kubernetes` from `develop`'s unchanged code, which this PR does not touch).
- `normalize_launcher` round-trip for `slurm_multi` / `slurm-multi` / `megatron` / `primus` / `vllm` / `sglang`.
- `_deep_merge` nested-leaf preservation (verified: an override touching `slurm.node_selector.reservation` keeps a sibling `slurm.node_selector.partition` from the manifest).
- `Context()` rejects the `rocm_path=` kwarg post-Cat-B-revert (verified via `inspect.signature`).
- `--registry` normalization across all input shapes (`rocm/pytorch-private`, `rocmshared`, `docker.io/rocm/pytorch-private`, `ghcr.io/myorg/myrepo`, `localhost:5000/myrepo`, `192.168.1.10:5000/myrepo`, `registry.example.com:5000/myrepo`; `@bad/foo` rejected, empty rejected).

Live on `Distributed_Inference_CI`:

- SGLang-disagg (jobs 22607/22615): 5,977 tok/s (Llama-3.1-8B-Instruct). `perf.csv` written to `/shared_inference/<user>/<jobid>/perf.csv` and discovered by `_collect_slurm_multi_results`.
- `--build-on-compute --registry docker.io/rocm/pytorch-private`: 1,878 tok/s @ CON=8, 3,644 tok/s @ CON=16 (ISL/OSL=1024/1024). Image pushed to `docker.io/rocm/pytorch-private:pyt_sglang_disagg_llama-3.3-70b-fp8`.
- `--build-on-compute --registry`: 28 perf rows (DeepEP / MoRI EP dispatch + combine across FP8/BF16 RDMA/NVL).
- `pyt_large_ep_benchmark_2n` on [008,024] via slurm_multi, 8 min, 28 ✅ Success rows. Confirms the Round-4 changes didn't regress the run path.
- Deferred: vLLM-disagg (`pyt_vllm_disagg_llama3.1-8b` / `pyt_vllm_disagg_deepseek-v3_short`) once the image-side `NIXL_ERR_BACKEND` (orthogonal to this PR; observed on `rocm/pytorch-private:20260407_itej89_vllm_mori_docker`) is resolved.
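The nested-leaf preservation called out in the test plan can be sketched with a generic recursive merge (a sketch, not the actual `_deep_merge` implementation):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; sibling leaves in base survive."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

manifest = {"slurm": {"node_selector": {"partition": "gpu", "reservation": None}}}
override = {"slurm": {"node_selector": {"reservation": "ci-res"}}}
cfg = deep_merge(manifest, override)
# The sibling leaf 'partition' survives an override touching only 'reservation'
assert cfg["slurm"]["node_selector"] == {"partition": "gpu", "reservation": "ci-res"}
```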