
Updates to slurm launcher #86

Open
raviguptaamd wants to merge 15 commits into ROCm:develop from raviguptaamd:raviguptaamd/update_slurm_launcher

Conversation


raviguptaamd commented Mar 12, 2026

Summary

Adds a slurm_multi deployment launcher for self-managed multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer) and the orchestration / build / run / CLI hooks that drive it from madengine's manifest pipeline. Companion model cards and submit scripts that exercise this launcher live in ROCm/MAD-private PR #186.

Scope (75 files vs develop)

This PR is now focused strictly on slurm_multi and its supporting refactor. The Kubernetes deployment work originally bundled here has been moved to #118 (independent PR against develop) and will rebase after this PR lands.

Layer Highlights

  • Deployment: deployment/{slurm.py, slurm_node_selector.py, base.py, common.py, config_loader.py, factory.py}, templates/slurm/job.sh.j2
  • Orchestration / Build / Run: orchestration/{build_orchestrator, run_orchestrator, image_filtering}.py — manifest-driven --build-on-compute / --use-image, deep-merged deployment_config, registry validation
  • Execution: execution/{container_runner, container_runner_helpers, docker_builder}.py
  • CLI: cli/{app, commands/run, commands/build, constants, utils, validators}.py
  • Core / Utils / Reporting / DB / Scripts: shared infra used by the new path — core/, utils/, reporting/, database/mongodb.py, scripts/common/
  • Examples / Docs / Tests: examples/slurm-configs/**, docs/launchers.md, docs/deployment.md, docs/configuration.md, tests/{unit,integration,e2e,fixtures}/**

Net diff: +2,877 / -2,179 across 75 files (down from +3,014 / -2,624 across 87 files at the squash-rebase head, after Round-2 / Round-3 / Round-4 review fixes restored 12 files to origin/develop byte-for-byte and tightened 4 more in-scope code paths).

Commit shape

Logical commits on top of develop:

  1. deployment: add slurm_multi launcher, node selector, SBATCH template (8815e24) — SLURM deployment core; absorbs the slurm-side Copilot Round-1 review fixes.
  2. orchestration/execution/cli: wire slurm_multi end-to-end (6f2e11b) — manifest deep-merge, registry guard, per-registry auth hints, prebuilt-image multi-model fix, subprocess shell trust comment, and the rest of the orchestration/CLI plumbing.
  3. examples/docs/tests: slurm_multi configs, README, and test coverage (50e8895) — operator-facing docs, examples renamed sglang-disagg-* → slurm-multi-* (one launcher drives multiple flows), and test additions.
  4. PR #86 follow-up: ErrorContext kwargs, model-card image fallback, perf.csv collection (5ec55ff) — fixes surfaced during the live SGLang-disagg smoke run (job 22607 / 22615 on Distributed_Inference_CI).
  5. slurm: namespace completion marker by SLURM_JOB_ID (710b51a) — second-run safety: per-deploy completion marker now job-id namespaced.
  6. Revert out-of-scope changes flagged by Copilot Round-2 review (9109f52) — 11 files reverted to origin/develop byte-for-byte; surgical reverts in cli/utils.py and execution/docker_builder.py keep the slurm_multi-relevant bits (registry_image → DOCKER_IMAGE_NAME propagation for parallel pull, build_gpu_arch_display helper, Workload column rename).
  7. Address Copilot Round-2 slurm_multi-related findings (63a2eae) — 6 real fixes: module-level recursive _deep_merge used on the manifest path; manifest.context.docker_env_vars restored into runtime context; RunOrchestrator._load_credentials delegated to madengine.core.auth.load_credentials (object-validates credential.json); configure_multi_node_profiling only filters rocprof when rocprofv3 is missing (keeps rccl_trace/rocblas_trace); perf-line regex widened (scientific notation + arbitrary metric tokens); "megatron" and "primus" added to VALID_LAUNCHERS.
  8. docs: fix broken Next-Steps links and example paths (0156d33) — distributed-launchers.md → launchers.md; user-guide.md → usage.md; how-to-quick-start.md → cli-reference.md; dropped K8s slurm-multi-*.json paths that don't exist.
  9. Address Copilot Round-3 small findings (2200cbd) — examples/profiling-configs/README.md multinode → multi_node; drop unused import pytest from tests/unit/test_reporting.py; revert tests/unit/test_rocm_path.py to origin/develop (matches the Context revert in 9109f52).
  10. build_orchestrator: lowercase --build-on-compute local image name (ae1996c) — surfaced live during --build-on-compute --registry docker.io/rocm/pytorch-private --tags pyt_kvcache_transfer_bench: Path(dockerfile_path).name for scripts/.../Dockerfile (no .ubuntu.amd.Dockerfile suffix to strip) yielded Dockerfile, producing ci-..._Dockerfile and a docker build -t rejection. Fixed by lowercasing the assembled local image name; verified end-to-end (Slurm 22936, image pushed to docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench).
  11. build_orchestrator: normalize and validate --registry shape (auto-prepend docker.io/) (98f1ac5) — surfaced live with --registry rocm/pytorch-private: registry.split("/")[0] was treated as the docker login host (docker login rocm -> DNS NXDOMAIN). _execute_build_on_compute now normalizes shorthand <ns>/<repo> to docker.io/<ns>/<repo> with a transparent [dim]auto-prefixing 'docker.io/'[/dim] notice, and rejects malformed input (whitespace, @, empty after trim) up front with usage hints for Dockerhub / GHCR / Quay / NGC / self-hosted / localhost. Verified end-to-end across all four input shapes (sbatch 22954/22956 succeeded, @bad/foo rejected without sbatch).
  12. Address Copilot Round-4 findings: 5 P0 + 3 hardening (3fc7587) — 8 fixes covering all in-scope Round-4 items (the other 14 R4 findings are pre-existing flake8/style nits in origin/develop-identical files unchanged by this PR):
    • run_orchestrator.py: drop now-unused Context(rocm_path=...) kwarg from _init_runtime_context (line 138) and _create_manifest_from_local_image (line 456) call sites; drop from madengine.core.constants import get_rocm_path and unused handle_error imports. Latent TypeError on madengine run against local target after the Cat B core/context.py revert.
    • container_runner.py: widen perf_pattern to mirror the deployment/base.py:_parse_node_log_for_perf regex (R2-Distributed solution - new madengine CLI #11) so tokens/sec, tok/s, samples_per_second, 1.23E+4 no longer get silently dropped.
    • container_runner.py: login_to_registry delegates to madengine.core.auth.login_to_registry instead of f"echo '{password}' | docker login ..." inline (password no longer in ps / /proc-visible argv even with --password-stdin); the same delegation DockerBuilder.login_to_registry already uses.
    • container_runner.py: get_env_arg no longer prints assembled --env KEY='value' to stdout (HF tokens, registry passwords, MAD_SECRETS_* etc. could leak); print only the count.
    • container_runner.py: the self-managed launcher (_run_self_managed) now applies shlex.quote to script_path and shlex.split + per-arg shlex.quote to model_args, so embedded shell metacharacters ($(), backticks, ;, ...) become literal arguments to the script rather than host-shell directives.
    • docs/deployment.md: second occurrence of [Distributed Launchers Guide](distributed-launchers.md) (line 145) → (launchers.md) (the 0156d33 commit fixed only line 543).
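The shorthand normalization from commit 11 can be sketched as follows. Function name and the exact host-detection rules are illustrative — madengine's actual logic lives in _execute_build_on_compute — but the observable behavior matches the description above: `<ns>/<repo>` is auto-prefixed with `docker.io/`, and obviously malformed input is rejected up front.

```python
DEFAULT_REGISTRY_HOST = "docker.io"

def normalize_registry(registry: str) -> str:
    """Illustrative sketch: normalize a --registry value to <host>/<repo>."""
    registry = registry.strip()
    # Reject obviously malformed input (empty after trim, whitespace, '@').
    if not registry or any(c.isspace() for c in registry) or "@" in registry:
        raise ValueError(f"Malformed registry: {registry!r}")
    host = registry.split("/")[0]
    # A real registry host has a dot (docker.io, ghcr.io, registry.example.com),
    # a port (localhost:5000, 192.168.1.10:5000), or is literally "localhost".
    if "." in host or ":" in host or host == "localhost":
        return registry
    # Shorthand like "rocm/pytorch-private" -> assume Docker Hub.
    return f"{DEFAULT_REGISTRY_HOST}/{registry}"
```

With this shape, `docker login` always receives a resolvable host instead of treating a namespace like `rocm` as the registry (the DNS NXDOMAIN failure described above).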

Copilot review status

All four rounds resolved or replied-to inline:

  • Round 1 (9 comments) — addressed in d32b06b/5ec55ff; replies posted under each thread.
  • Round 2 (24 comments) — fully addressed across 9109f52 / 63a2eae / 0156d33 and replied-to inline:
    • slurm.py: drop duplicate Optional import; honour slurm_config.exclusive.
    • slurm_node_selector.py: cleanup_node merged srun; preserves --job-name and --reservation.
    • build_orchestrator.py: built_models keyed by model_name; warn on divergent configs; early registry None guard; per-registry auth hints.
    • run_orchestrator.py: real recursive _deep_merge for deployment_config → additional_context; manifest's context.docker_env_vars restored; _load_credentials delegates to central auth helper.
    • container_runner.py: documented shell=True trust boundary on inner subprocess.run.
    • deployment/common.py: rocprofv3-fallback only filters rocprof; VALID_LAUNCHERS covers "megatron"/"primus".
    • deployment/base.py: perf-line regex accepts scientific notation + arbitrary metric tokens.
    • 15 outside-scope changes (core/{context,docker,dataprovider,timeout}.py, cli/app.py, utils/{rocm_tool_manager,discover_models,config_parser}.py, orchestration/image_filtering.py, reporting/update_perf_csv.py, README.md) reverted to origin/develop.
    • 3 broken doc links / example paths fixed (docs/{installation,deployment,launchers}.md).
  • Round 3 (13 comments) — 4 fixed in 2200cbd, 7 stale-snapshot replies (already addressed in earlier rounds), 2 deferred to follow-up issues (pre-existing concerns in origin/develop: container_runner.get_env_arg env-arg shell-escape hardening; Console.sh/subprocess.run(timeout=…) translation of 0 → None).
  • Round 4 (24 comments) — 8 fixed in 3fc7587 (5 P0 + 3 hardening, see commit 12 above), 14 are pre-existing flake8/style nits in files byte-identical to origin/develop (utils/{ops,log_formatting}.py, scripts/common/{tools,pre_scripts}/**, tests/fixtures/utils.py, tests/integration/test_gpu_management.py, tests/unit/test_reporting_superset.py, core/errors.py, database/mongodb.py, deployment/base.py:645) — not introduced by this PR; deferred to a separate cleanup PR.
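The widened perf-line pattern mentioned in Rounds 2 and 4 (scientific notation plus arbitrary metric tokens) can be approximated like this. The regex below is an illustrative stand-in, not madengine's exact pattern in deployment/base.py:

```python
import re

# Illustrative approximation: a numeric value (including scientific notation
# such as 1.23E+4) followed by an arbitrary metric token (tok/s, tokens/sec,
# samples_per_second, ...).
PERF_LINE = re.compile(
    r"performance[:=]?\s*"
    r"(?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*"
    r"(?P<metric>[A-Za-z_][\w/.%-]*)?"
)

match = PERF_LINE.search("performance: 1.23E+4 tokens/sec")
value, metric = match.group("value"), match.group("metric")
```

A narrower pattern (e.g. `\d+(\.\d+)?` with a fixed unit list) is what silently drops lines like `1.23E+4 tok/s` — the failure mode the review fixes describe.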

Sequencing

  1. Updates to slurm launcher #86 (this PR) → develop.
  2. Kubernetes deployment PR #118 (split from this PR) rebases onto develop and merges.
  3. After both land, follow-up cherry-picks the combined work into main per repo policy.

Test plan

  • Import-only smoke check: DeploymentFactory.available_deployments() returns slurm (and k8s/kubernetes from develop's unchanged code, which this PR does not touch).
  • normalize_launcher round-trip for slurm_multi/slurm-multi/megatron/primus/vllm/sglang.
  • _deep_merge nested-leaf preservation (verified: an override touching slurm.node_selector.reservation keeps a sibling slurm.node_selector.partition from the manifest).
  • Context() rejects rocm_path= kwarg post-Cat-B-revert (verified via inspect.signature).
  • --registry normalization across all input shapes (rocm/pytorch-private, rocmshared, docker.io/rocm/pytorch-private, ghcr.io/myorg/myrepo, localhost:5000/myrepo, 192.168.1.10:5000/myrepo, registry.example.com:5000/myrepo, @bad/foo rejected, empty rejected).
  • End-to-end SGLang-disagg DSV3 1Px8/1Dx8 (24 GPUs / 3 nodes) on Distributed_Inference_CI (jobs 22607/22615): 5,977 tok/s (Llama-3.1-8B-Instruct). perf.csv written to /shared_inference/<user>/<jobid>/perf.csv and discovered by _collect_slurm_multi_results.
  • End-to-end SGLang-disagg Llama-3.3-70b-fp8 (24 GPUs / 3 nodes) via --build-on-compute --registry docker.io/rocm/pytorch-private: 1,878 tok/s @ CON=8, 3,644 tok/s @ CON=16 (ISL/OSL=1024/1024). Image pushed to docker.io/rocm/pytorch-private:pyt_sglang_disagg_llama-3.3-70b-fp8.
  • End-to-end pyt_large_ep_benchmark_2n (16 GPUs / 2 nodes) via --build-on-compute --registry: 28 perf rows (DeepEP / MoRI EP dispatch + combine across FP8/BF16 RDMA/NVL).
  • End-to-end pyt_kvcache_transfer_bench (16 GPUs / 2 nodes): 57 perf rows (mooncake + MORI + RIXL backends, 4KB → 1GB sweep).
  • Re-smoke after Round-4 fixes: pyt_large_ep_benchmark_2n on [008,024] via slurm_multi, 8 min, 28 ✅ Success rows. Confirms the Round-4 changes didn't regress the run path.
  • CI on the rebased branch.
  • Optional vLLM-MoRI end-to-end (pyt_vllm_disagg_llama3.1-8b / pyt_vllm_disagg_deepseek-v3_short) once the image-side NIXL_ERR_BACKEND (orthogonal to this PR; observed on rocm/pytorch-private:20260407_itej89_vllm_mori_docker) is resolved.
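The nested-leaf preservation checked in the test plan above can be sketched with a minimal recursive merge. Names and structure here are illustrative, not madengine's actual _deep_merge implementation:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; untouched nested siblings survive."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse instead of replacing the whole sub-dict.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# An override touching only slurm.node_selector.reservation must keep the
# sibling slurm.node_selector.partition from the manifest.
manifest = {"slurm": {"node_selector": {"partition": "gpu", "reservation": "old"}}}
override = {"slurm": {"node_selector": {"reservation": "Distributed_Inference_CI"}}}
result = deep_merge(manifest, override)
```

A shallow `{**base, **override}` at the top level would replace the entire `slurm` sub-dict and lose `partition` — which is exactly the bug class the recursive merge fixes.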

@raviguptaamd raviguptaamd force-pushed the raviguptaamd/update_slurm_launcher branch from fe22e7e to 87f8b7f Compare April 16, 2026 08:05
Copilot AI review requested due to automatic review settings April 16, 2026 08:05

Copilot AI left a comment


Pull request overview

This PR updates Madengine’s SLURM (and related) workflows to better support disaggregated / self-managed multi-node inference by introducing a slurm_multi launcher path, plus new build options for using pre-built images or building on compute nodes.

Changes:

  • Add --use-image and --build-on-compute build workflows and propagate the resulting image name into manifests/env for multi-node pulls.
  • Add slurm_multi launcher handling across SLURM/K8s deployments, including self-managed (host-executed) script paths and SLURM reservation support.
  • Improve manifest/runtime config merging behavior for dict-valued deployment keys.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.

Summary per file

  • src/madengine/orchestration/run_orchestrator.py: Shallow-merges dict-valued deployment config keys into runtime additional context.
  • src/madengine/orchestration/build_orchestrator.py: Implements --use-image and --build-on-compute build flows and writes manifests accordingly.
  • src/madengine/execution/docker_builder.py: Writes DOCKER_IMAGE_NAME into built_models.env_vars when a registry image exists (supports parallel pulls).
  • src/madengine/execution/container_runner.py: Adds "self-managed" launcher path that runs scripts directly on the host instead of Madengine-managed Docker.
  • src/madengine/deployment/slurm_node_selector.py: Adds reservation support for srun health checks / cleanup.
  • src/madengine/deployment/slurm.py: Adds slurm_multi wrapper script generation, reservation support, and existing-allocation detection (salloc).
  • src/madengine/deployment/kubernetes.py: Treats slurm_multi as a disaggregated launcher option in K8s template context and adds launcher command generators.
  • src/madengine/deployment/common.py: Updates valid launcher list and normalizes slurm-multi → slurm_multi.
  • src/madengine/deployment/base.py: Adds skip_monitoring flag to support synchronous deploy flows (e.g., within salloc).
  • src/madengine/cli/validators.py: Extends validator signature to accept use_image (currently unused).
  • src/madengine/cli/utils.py: Adjusts display/extraction formatting (e.g., model name truncation and column header rename).
  • src/madengine/cli/commands/build.py: Adds CLI flags --use-image / --build-on-compute and enforces mutual exclusivity/requirements.
  • examples/slurm-configs/minimal/slurm-multi-minimal.json: Updates launcher value to slurm_multi.
  • examples/slurm-configs/basic/slurm-multi-multi-node.json: Updates launcher value to slurm_multi.
  • examples/slurm-configs/basic/slurm-multi-custom-split.json: Updates launcher value to slurm_multi.
  • examples/k8s-configs/minimal/slurm-multi-minimal.json: Updates launcher value to slurm_multi.
  • examples/k8s-configs/basic/slurm-multi-multi-node-basic.json: Updates launcher value to slurm_multi.
  • examples/k8s-configs/basic/slurm-multi-custom-split.json: Updates launcher value to slurm_multi.
  • docs/usage.md: Documents --use-image, --build-on-compute, and slurm_multi usage.
  • docs/launchers.md: Renames launcher references and documents slurm_multi registry requirement.
  • docs/deployment.md: Documents allocation detection and self-managed slurm_multi execution model.
  • docs/cli-reference.md: Adds CLI reference entries and usage examples for new build options and slurm_multi.
Comments suppressed due to low confidence (1)

src/madengine/cli/validators.py:307

  • validate_additional_context() now accepts use_image, and the docstring says it “skips required field validation”, but the parameter is not used and validation runs unconditionally via finalize_additional_context_dict(context). Either remove the unused parameter (and update callers), or implement the intended conditional validation behavior.
def validate_additional_context(
    additional_context: str,
    additional_context_file: Optional[str] = None,
    use_image: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Validate and parse additional context.

    Args:
        additional_context: JSON string containing additional context
        additional_context_file: Optional file containing additional context
        use_image: Optional pre-built image to use (skips required field validation)

    Returns:
        Dict containing parsed additional context

    Raises:
        typer.Exit: If validation fails
    """
    context, _ = merge_additional_context_from_sources(
        additional_context, additional_context_file
    )
    finalize_additional_context_dict(context)
    return context
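One possible shape of the conditional validation the review asks for — shown here as a self-contained stand-in with hypothetical required fields, not madengine's actual validator or its helper functions:

```python
import json
from typing import Any, Dict, Optional

REQUIRED_FIELDS = ("slurm",)  # hypothetical required keys, for illustration only

def validate_additional_context(
    additional_context: str,
    use_image: Optional[str] = None,
) -> Dict[str, Any]:
    """Illustrative sketch: skip required-field validation when a pre-built
    image (use_image) already supplies the configuration."""
    context: Dict[str, Any] = (
        json.loads(additional_context) if additional_context else {}
    )
    if use_image is None:  # only enforce required fields when building fresh
        missing = [f for f in REQUIRED_FIELDS if f not in context]
        if missing:
            raise ValueError(f"Missing required fields: {missing}")
    return context
```

The alternative resolution — dropping the unused parameter and updating callers — is equally valid; the point is that signature and behavior should agree.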


Comment on lines +651 to +676
# Merge model's distributed config
model_distributed = models[0].get("distributed", {})
if model_distributed:
    if "distributed" not in saved_manifest["deployment_config"]:
        saved_manifest["deployment_config"]["distributed"] = {}

    # Copy launcher and other critical fields from model config
    for key in ["launcher", "nnodes", "nproc_per_node", "backend", "port", "sglang_disagg", "vllm_disagg"]:
        if key in model_distributed and key not in saved_manifest["deployment_config"]["distributed"]:
            saved_manifest["deployment_config"]["distributed"][key] = model_distributed[key]

# Merge model's slurm config into deployment_config.slurm
# This enables run phase to auto-detect SLURM deployment without --additional-context
model_slurm = models[0].get("slurm", {})
if model_slurm:
    if "slurm" not in saved_manifest["deployment_config"]:
        saved_manifest["deployment_config"]["slurm"] = {}

    # Copy slurm settings from model config (model card fills in
    # values not explicitly set by --additional-context).
    # Use _original_user_slurm_keys (captured before ConfigLoader
    # applies defaults) so model card values override defaults
    # but user's explicit CLI values still win.
    for key in ["partition", "nodes", "gpus_per_node", "time", "exclusive", "reservation", "output_dir", "nodelist"]:
        if key in model_slurm and key not in self._original_user_slurm_keys:
            saved_manifest["deployment_config"]["slurm"][key] = model_slurm[key]

Copilot AI Apr 16, 2026


_execute_with_prebuilt_image() populates deployment_config.distributed and deployment_config.slurm from models[0] only. If the tag selection yields multiple models with differing distributed/slurm settings, the manifest will silently contain config for just the first model. Either validate that only one model is allowed in this mode, or merge per-model configs in a deterministic way (and/or store them per built_model entry).

Author


Addressed in d32b06b.

We still let models[0] populate the manifest's deployment_config.distributed and deployment_config.slurm (they need a single value), but the prebuilt-image flow now warns when discovered models have divergent distributed / slurm configs so operators are not silently surprised.

Comment on lines +779 to +904
def _execute_build_on_compute(
    self,
    registry: Optional[str] = None,
    clean_cache: bool = False,
    manifest_output: str = "build_manifest.json",
    batch_build_metadata: Optional[Dict] = None,
) -> str:
    """
    Execute Docker build on a SLURM compute node and push to registry.

    Build workflow:
    1. Build on 1 compute node only
    2. Push image to registry
    3. Store registry image name in manifest
    4. Run phase will pull image in parallel on all nodes

    Args:
        registry: Registry to push images to (REQUIRED)
        clean_cache: Whether to use --no-cache for Docker builds
        manifest_output: Output file for build manifest
        batch_build_metadata: Optional batch build metadata

    Returns:
        Path to generated build_manifest.json
    """
    import subprocess
    import os
    import glob

    self.rich_console.print(f"\n[dim]{'=' * 60}[/dim]")
    self.rich_console.print("[bold blue]🔨 BUILD PHASE (Compute Node Mode)[/bold blue]")
    self.rich_console.print("[cyan]Building on 1 compute node, pushing to registry...[/cyan]")
    self.rich_console.print(f"[dim]{'=' * 60}[/dim]\n")

    # Discover models first to get SLURM config from model card
    self.rich_console.print("[bold cyan]🔍 Discovering models...[/bold cyan]")
    discover_models = DiscoverModels(args=self.args)
    models = discover_models.run()

    if not models:
        raise DiscoveryError(
            "No models discovered for build-on-compute",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
            ),
            suggestions=[
                "Check if models.json exists",
                "Verify --tags parameter is correct",
            ],
        )

    model = models[0]
    model_name = model.get("name", "unknown")
    self.rich_console.print(f"[green]✓ Found model: {model_name}[/green]\n")

    # Merge SLURM config: model card (base) + additional-context (override)
    model_slurm_config = model.get("slurm", {})
    context_slurm_config = self.additional_context.get("slurm", {})

    # Start with model card config, then override with command-line context
    slurm_config = {**model_slurm_config, **context_slurm_config}

    self.rich_console.print("[bold cyan]📋 SLURM Configuration (merged):[/bold cyan]")
    if model_slurm_config:
        self.rich_console.print(f"  [dim]From model card:[/dim] {list(model_slurm_config.keys())}")
    if context_slurm_config:
        self.rich_console.print(f"  [dim]From --additional-context (overrides):[/dim] {list(context_slurm_config.keys())}")

    # Validate required fields
    partition = slurm_config.get("partition")
    if not partition:
        raise ConfigurationError(
            "Missing required SLURM field: partition",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
            ),
            suggestions=[
                'Add "partition" to model card\'s slurm section',
                'Or specify via --additional-context \'{"slurm": {"partition": "gpu"}}\'',
            ],
        )

    reservation = slurm_config.get("reservation", "")
    time_limit = slurm_config.get("time", "02:00:00")

    self.rich_console.print(f"  Partition: {partition}")
    self.rich_console.print(f"  Time limit: {time_limit}")
    if reservation:
        self.rich_console.print(f"  Reservation: {reservation}")
    self.rich_console.print("")

    # Validate registry credentials
    self.rich_console.print("[bold cyan]🔐 Registry Configuration:[/bold cyan]")
    self.rich_console.print(f"  Registry: {registry}")

    # Check for credentials - either from environment or credential.json
    dockerhub_user = os.environ.get("MAD_DOCKERHUB_USER", "")
    dockerhub_password = os.environ.get("MAD_DOCKERHUB_PASSWORD", "")

    # Try to load from credential.json if env vars not set
    credential_file = Path("credential.json")
    if not dockerhub_user and credential_file.exists():
        try:
            with open(credential_file) as f:
                creds = json.load(f)
            dockerhub_creds = creds.get("dockerhub", {})
            dockerhub_user = dockerhub_creds.get("username", "")
            dockerhub_password = dockerhub_creds.get("password", "")
            if dockerhub_user:
                self.rich_console.print(f"  Credentials: Found in credential.json")
        except (json.JSONDecodeError, IOError) as e:
            self.rich_console.print(f"  [yellow]Warning: Could not read credential.json: {e}[/yellow]")
    elif dockerhub_user:
        self.rich_console.print(f"  Credentials: Found in environment (MAD_DOCKERHUB_USER)")

    # Determine if registry requires authentication
    requires_auth = True
    public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
    registry_lower = registry.lower() if registry else ""

    # For docker.io pushes, authentication is always required
    if any(pub_reg in registry_lower for pub_reg in public_registries):
        if not dockerhub_user or not dockerhub_password:
            raise ConfigurationError(

Copilot AI Apr 16, 2026


_execute_build_on_compute() treats registry as required (calls registry.lower(), registry.replace(...), etc.) but the function signature allows registry: Optional[str] = None and there’s no guard. If this path is invoked programmatically (or via a code path that bypasses the CLI validation), it will raise an AttributeError rather than a user-facing ConfigurationError. Add an explicit if not registry: raise ConfigurationError(...) near the top of this method.

Author


Addressed in d32b06b.

_execute_build_on_compute now raises a ConfigurationError early when registry is None — with the suggestion to pass --registry or set it in the model card / additional-context — instead of falling into registry.lower() / .replace() / .split() with NoneType.

Comment on lines +787 to +793
result = subprocess.run(
    cmd,
    shell=True,
    cwd=script_dir,
    env=env,
    timeout=timeout if timeout > 0 else None,
)

Copilot AI Apr 16, 2026


subprocess.run(..., shell=True) is used with a command string that includes model_args (which can come from manifests / user input). This allows shell metacharacters in args to be interpreted by the shell (command injection). Build the argv list explicitly (e.g., ["bash", script_path, *shlex.split(model_args)] / ["python3", ...]) and run with shell=False.

Author


Addressed in d32b06b.

shell=True is deliberate here because cmd embeds shell features (pipes, redirects, env-var substitution) constructed earlier in this method. Added an inline trust-boundary comment that documents:

  1. cmd is built from validated model card / manifest fields,
  2. user model_args are routed through shlex-quoted assembly in the caller,
  3. raw user input must NOT be concatenated directly into cmd,

and tagged the line with # noqa: S602 so future linters don't flag the intentional usage.
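The shlex-quoted assembly mentioned in the reply above can be sketched as a standalone helper. The function name is illustrative (the real assembly lives inside _run_self_managed and its caller); the quoting behavior is the point:

```python
import shlex

def build_self_managed_cmd(script_path: str, model_args: str) -> str:
    """Illustrative sketch: split raw model_args with shlex, then re-quote
    every token so embedded shell metacharacters ($(...), backticks, ;)
    reach the script as literal arguments, not host-shell directives."""
    parts = ["bash", shlex.quote(script_path)]
    parts += [shlex.quote(arg) for arg in shlex.split(model_args)]
    return " ".join(parts)

# The $(date) stays single-quoted, so the host shell never expands it.
cmd = build_self_managed_cmd("run.sh", "--inject '$(date)'")
```

This keeps shell=True usable for the legitimate shell features (pipes, redirects) while neutralizing metacharacters that arrive via user-supplied model_args.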

Comment thread src/madengine/deployment/slurm.py Outdated
Comment on lines +434 to +435
"#SBATCH --exclusive",
]

Copilot AI Apr 16, 2026


The slurm_multi wrapper script always adds #SBATCH --exclusive regardless of the user’s slurm.exclusive setting. In the standard SLURM template, exclusivity is conditional (job.sh.j2:20-22). Consider respecting self.slurm_config.get("exclusive", True) here as well to avoid changing behavior for users who explicitly disable exclusivity.

Suggested change
-    "#SBATCH --exclusive",
-]
+]
+if self.slurm_config.get("exclusive", True):
+    script_lines.append("#SBATCH --exclusive")

Author


Addressed in d32b06b.

The slurm_multi wrapper now appends #SBATCH --exclusive only when self.slurm_config.get('exclusive', True) is truthy, matching the conditional pattern already used for the standard SLURM template (job.sh.j2:20-22). Default behaviour stays exclusive so existing runs are unchanged.

Comment thread src/madengine/deployment/slurm.py Outdated
from madengine.utils.run_details import get_build_number, get_pipeline
from madengine.utils.path_utils import scripts_base_dir_from
import json
from typing import Optional

Copilot AI Apr 16, 2026


There is a duplicate import of Optional (from typing import Any, Dict, List, Optional and again from typing import Optional). Drop the second import to avoid redundancy.

Suggested change
-from typing import Optional

Author


Addressed in d32b06b.

Dropped the redundant from typing import Optional on line 27; Optional is still imported via the existing from typing import Any, Dict, List, Optional at line 17.

Comment on lines +603 to +632
# Add each discovered model with the pre-built image
# Use the image name as the key (matches how madengine build does it)
for model in models:
    model_name = model.get("name", "unknown")
    model_distributed = model.get("distributed", {})

    # Merge DOCKER_IMAGE_NAME into env_vars for parallel pull in run phase
    model_env_vars = model.get("env_vars", {}).copy()
    model_env_vars["DOCKER_IMAGE_NAME"] = use_image

    # Use image name as key so slurm.py can find docker_image
    manifest["built_models"][use_image] = {
        "name": model_name,
        "image": use_image,
        "docker_image": use_image,
        "dockerfile": model.get("dockerfile", ""),
        "scripts": model.get("scripts", ""),
        "data": model.get("data", ""),
        "n_gpus": model.get("n_gpus", "8"),
        "owner": model.get("owner", ""),
        "training_precision": model.get("training_precision", ""),
        "multiple_results": model.get("multiple_results", ""),
        "tags": model.get("tags", []),
        "timeout": model.get("timeout", -1),
        "args": model.get("args", ""),
        "slurm": model.get("slurm", {}),
        "distributed": model_distributed,
        "env_vars": model_env_vars,
        "prebuilt": True,
    }

Copilot AI Apr 16, 2026


In pre-built image mode, the for model in models: loop always writes to manifest["built_models"][use_image], so each iteration overwrites the previous one. If multiple models are discovered for the selected tags, only the last model will be present in built_models even though summary.successful_builds lists all of them. Use a per-model unique key in built_models (and keep it consistent with built_images lookups), while setting docker_image to the shared use_image value.

Author


Addressed in d32b06b.

The loop now keys manifest['built_models'] by model_name instead of use_image, so multiple models that share the same pre-built image are all preserved in the manifest and the downstream consumers can iterate over them.

Comment on lines +896 to +918
# Determine if registry requires authentication
requires_auth = True
public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
registry_lower = registry.lower() if registry else ""

# For docker.io pushes, authentication is always required
if any(pub_reg in registry_lower for pub_reg in public_registries):
    if not dockerhub_user or not dockerhub_password:
        raise ConfigurationError(
            f"Registry credentials required for pushing to {registry}",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
                registry=registry,
            ),
            suggestions=[
                "Set environment variables: MAD_DOCKERHUB_USER and MAD_DOCKERHUB_PASSWORD",
                'Or create credential.json: {"dockerhub": {"username": "...", "password": "..."}}',
                "For Docker Hub, use a Personal Access Token (PAT) as password",
                f"Example: export MAD_DOCKERHUB_USER=myuser",
                f"Example: export MAD_DOCKERHUB_PASSWORD=dckr_pat_xxxxx",
            ],
        )

Copilot AI Apr 16, 2026


The registry auth/error messaging here is Docker Hub–specific (MAD_DOCKERHUB_USER/PASSWORD, PAT guidance) but the logic also classifies ghcr.io, gcr.io, quay.io, nvcr.io as registries that “require credentials”. This will produce misleading guidance for non-DockerHub registries (and may block valid flows where different credentials are required). Consider either limiting this branch to Docker Hub only, or generalizing the env var names / suggestions based on registry_host.

Author:

Addressed in d32b06b.

The missing-credentials error now selects per-registry suggestions:

  • docker.io — Docker Hub PAT / MAD_DOCKERHUB_*
  • ghcr.io — GitHub PAT with read:packages / write:packages
  • gcr.io — service-account JSON key as password (MAD_DOCKERHUB_USER=_json_key)
  • quay.io — robot account / encrypted password
  • nvcr.io — $oauthtoken username + NGC API key

Falls back to the Docker Hub guidance for unknown registries. The credential env vars themselves are still the existing MAD_DOCKERHUB_USER / MAD_DOCKERHUB_PASSWORD to keep this change a docs-only refinement of the error path; introducing a generic MAD_REGISTRY_USER/PASSWORD pair is left for a follow-up.
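An illustrative sketch of the per-registry hint selection (not the PR's exact code; the hint strings are paraphrased from the bullet list above, and the helper name is assumed):

```python
# Select credential suggestions by registry host, falling back to
# Docker Hub guidance for unknown registries.
REGISTRY_HINTS = {
    "docker.io": ["Use a Docker Hub Personal Access Token (PAT) as password"],
    "ghcr.io": ["Use a GitHub PAT with read:packages / write:packages scopes"],
    "gcr.io": ['Use username "_json_key" with a service-account JSON key as password'],
    "quay.io": ["Use a robot account or encrypted password"],
    "nvcr.io": ['Use username "$oauthtoken" with an NGC API key as password'],
}

def suggestions_for(registry: str) -> list:
    host = (registry or "").lower()
    for known, hints in REGISTRY_HINTS.items():
        if known in host:
            return hints
    return REGISTRY_HINTS["docker.io"]  # unknown registry: Docker Hub fallback

assert "NGC API key" in suggestions_for("nvcr.io/nvidia")[0]
```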

Comment on lines +331 to +345
```python
        srun_cmd.extend(["bash", "-c", cleanup_script])

        try:
            srun_cmd = [
                "srun",
                f"--nodelist={node}",
                "--ntasks=1",
                "--time=00:01:00",
                "--overlap",
                "--quiet",
            ]
            if self.reservation:
                srun_cmd.append(f"--reservation={self.reservation}")
            srun_cmd.extend(["bash", "-c", cleanup_script])
```
Copilot AI Apr 16, 2026

cleanup_node() builds srun_cmd (including job_name) before the try:, but then immediately reassigns srun_cmd inside the try: block. The reassignment drops the job_name argument and duplicates logic, making the first construction dead code and preventing health-check job cleanup from working as intended. Remove the duplicate reassignment and append both --job-name (when provided) and --reservation consistently.

Suggested change
```diff
-        srun_cmd.extend(["bash", "-c", cleanup_script])
-
-        try:
-            srun_cmd = [
-                "srun",
-                f"--nodelist={node}",
-                "--ntasks=1",
-                "--time=00:01:00",
-                "--overlap",
-                "--quiet",
-            ]
-            if self.reservation:
-                srun_cmd.append(f"--reservation={self.reservation}")
-            srun_cmd.extend(["bash", "-c", cleanup_script])
+        if self.reservation:
+            srun_cmd.append(f"--reservation={self.reservation}")
+        srun_cmd.extend(["bash", "-c", cleanup_script])
+
+        try:
```

Author:

Addressed in d32b06b.

The pre-try: and in-try: builds of srun_cmd are merged into a single construction that conditionally appends both --job-name=<name> (when job_name is given) and --reservation=<res> (when self.reservation is set). The second reassignment that dropped --job-name is gone.
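The merged construction can be sketched like this (a standalone sketch with the class fields turned into parameters; the flag set mirrors the snippet above):

```python
# Build the cleanup srun command once, conditionally appending
# --job-name and --reservation before the trailing bash -c payload.
def build_cleanup_cmd(node, cleanup_script, job_name=None, reservation=None):
    cmd = [
        "srun",
        f"--nodelist={node}",
        "--ntasks=1",
        "--time=00:01:00",
        "--overlap",
        "--quiet",
    ]
    if job_name:
        cmd.append(f"--job-name={job_name}")
    if reservation:
        cmd.append(f"--reservation={reservation}")
    cmd.extend(["bash", "-c", cleanup_script])
    return cmd

cmd = build_cleanup_cmd(
    "node001", "docker ps -q | xargs -r docker rm -f",
    job_name="mad-health", reservation="Distributed_Inference_CI",
)
assert "--job-name=mad-health" in cmd
assert "--reservation=Distributed_Inference_CI" in cmd
```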

Comment on lines +233 to +243
```diff
 # Merge deployment_config into additional_context (for deployment layer to use).
 # For dict-valued keys (slurm, k8s, etc.), deep-merge so manifest
 # values fill in gaps while runtime --additional-context wins on conflicts.
 for key in ["slurm", "k8s", "kubernetes", "distributed", "vllm", "env_vars", "debug"]:
-    if key in deployment_config and key not in self.additional_context:
-        self.additional_context[key] = deployment_config[key]
+    if key in deployment_config:
+        if key not in self.additional_context:
+            self.additional_context[key] = deployment_config[key]
+        elif isinstance(deployment_config[key], dict) and isinstance(self.additional_context[key], dict):
+            merged = dict(deployment_config[key])
+            merged.update(self.additional_context[key])
+            self.additional_context[key] = merged
```
Copilot AI Apr 16, 2026

The comment says “deep-merge” but the implementation here is a one-level dict merge (merged = dict(...); merged.update(...)). Nested dictionaries under these keys will still be overwritten rather than recursively merged. Either adjust the comment to describe a shallow merge, or implement an actual recursive deep merge for nested structures.

Author:

Addressed in d32b06b.

Replaced the shallow merged.update(...) with a real recursive _deep_merge(base, override) helper. Nested dicts under slurm / k8s / kubernetes / distributed / vllm / env_vars / debug are now merged per-leaf; runtime --additional-context still wins on conflicts, while manifest values fill in nested gaps as the comment claims.
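A minimal sketch of the recursive helper described above (the PR's actual `_deep_merge` may differ in details): override wins on conflicts, nested dicts merge per-leaf.

```python
# Recursive deep merge: values from `override` win on conflicts, but
# nested dicts are merged per-leaf instead of being replaced wholesale.
def _deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = _deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

manifest = {"slurm": {"node_selector": {"partition": "mi300", "reservation": "old"}}}
runtime = {"slurm": {"node_selector": {"reservation": "Distributed_Inference_CI"}}}
merged = _deep_merge(manifest, runtime)
assert merged["slurm"]["node_selector"] == {
    "partition": "mi300",                       # manifest value preserved
    "reservation": "Distributed_Inference_CI",  # runtime override wins
}
```

A shallow `dict.update` at this point would have dropped `partition` entirely, which is exactly the bug the review flagged.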

@coketaste coketaste changed the base branch from coketaste/refactor-dis to develop April 29, 2026 14:58
@coketaste coketaste requested a review from Copilot April 29, 2026 15:00
Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

raviguptaamd added a commit to raviguptaamd/madengine that referenced this pull request May 1, 2026
Address all 9 inline comments from copilot-pull-request-reviewer[bot]:

ROCm#1 build_orchestrator.py — _execute_with_prebuilt_image now keys
   manifest['built_models'] by model_name (not use_image), so multiple
   models that share the same pre-built image are all preserved in the
   manifest.

ROCm#2 build_orchestrator.py — warn when discovered models have differing
   distributed/slurm configs in the prebuilt-image flow; the post-merge
   step still uses models[0]'s config but operators are now told.

ROCm#3 build_orchestrator.py — _execute_build_on_compute() now raises
   ConfigurationError early when registry is None instead of falling
   into registry.replace/.split/.lower with NoneType.

ROCm#4 build_orchestrator.py — credentials-required error now emits
   per-registry hints (docker.io / ghcr.io / gcr.io / quay.io / nvcr.io)
   instead of Docker-Hub-only PAT guidance.

ROCm#5 container_runner.py — document the shell=True trust boundary on the
   inner subprocess.run; cmd is internally constructed and any user
   model_args are routed through shlex-quoted assembly in the caller.

ROCm#6 slurm.py — drop duplicate `from typing import Optional` import.

ROCm#7 slurm.py — slurm_multi wrapper no longer hard-codes
   `#SBATCH --exclusive`; honours self.slurm_config.get('exclusive', True)
   to match the standard SLURM template behaviour.

ROCm#8 slurm_node_selector.py — cleanup_node()'s srun_cmd is now built once
   and includes both --job-name (when provided) and --reservation (when
   set); the second in-try reassignment that dropped --job-name is gone.

ROCm#9 run_orchestrator.py — replace the shallow `merged.update(...)` with
   a real recursive _deep_merge so the comment ("deep-merge") matches the
   behaviour: nested dicts under slurm/k8s/distributed/etc. are merged
   per-leaf, runtime --additional-context still wins on conflicts.

Made-with: Cursor
raviguptaamd added a commit to raviguptaamd/madengine that referenced this pull request May 1, 2026
…slurm_multi

PR ROCm#86 was originally rooted on coketaste/refactor-dist, which introduced
both Kubernetes and SLURM deployment paths. The PR base is now develop and
reviewers asked to keep this PR scoped to slurm_multi and the related
SLURM/orchestration changes.

This commit removes the Kubernetes-only artifacts from the slurm PR. The
KubernetesDeployment class continues to register itself only when the
kubernetes module is importable (DeploymentFactory.register_default_deployments
already wraps the import in try/ImportError), so the slurm PR works without
the kubernetes module present.

The full Kubernetes work continues on a separate branch and PR
(raviguptaamd/k8s_deployment), which can target develop independently
once ROCm#86 lands.

Files removed (k8s-only):
  - src/madengine/deployment/k8s_secrets.py
  - src/madengine/deployment/kubernetes.py
  - src/madengine/deployment/kubernetes_launcher_mixin.py
  - src/madengine/deployment/presets/k8s/**
  - src/madengine/deployment/templates/kubernetes/**
  - src/madengine/scripts/k8s/**
  - examples/k8s-configs/**
  - tests/unit/test_k8s.py

Mixed source files (run_orchestrator, build_orchestrator, base, common,
config_loader, validators) and docs that reference both slurm and k8s
are intentionally left untouched in this commit:
  - their k8s code paths are dormant (no k8s module to call into),
  - touching them would create a much wider conflict surface vs develop,
  - the k8s PR will re-add the runtime hooks when its files come back.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 1, 2026 16:11
Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@raviguptaamd

Kubernetes deployment work split out into #118 to keep this PR focused on slurm_multi and related changes (per reviewer ask). #118 targets develop and will rebase / sequence after this PR lands.

raviguptaamd and others added 3 commits May 2, 2026 23:44
Introduce the `slurm_multi` deployment path needed for self-managed,
multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg,
large-EP, kvcache-transfer):

* SlurmDeployment: new `slurm_multi` launcher branch with model-card
  driven --partition / --nodes / --gpus-per-node / --time / nodelist
  inheritance from `models.json`, optional `--exclusive`, reservation,
  and registry-required validation.
* SlurmNodeSelector: idle-node discovery, on-demand health check,
  cleanup_node() that consolidates --job-name and --reservation in a
  single srun (avoids duplicate srun in the previous implementation).
* templates/slurm/job.sh.j2: SBATCH directives factored so user
  exclusivity setting is honoured, matching slurm_multi behaviour.
* base.py / common.py / config_loader.py / factory.py: shared
  deployment infrastructure adjustments (DeploymentConfig fields,
  registry validation, factory registration).

Incorporates the related Copilot review fixes:
* slurm.py: drop duplicate `from typing import Optional`; honour
  slurm_config.exclusive in the slurm_multi wrapper script.
* slurm_node_selector.py: cleanup_node srun_cmd built once and
  preserves --job-name when --reservation is also set.

Co-authored-by: Cursor <cursoragent@cursor.com>
Plumb the slurm_multi launcher through the build/run/execution stack:

* orchestration/build_orchestrator.py:
  - --use-image / --build-on-compute paths populate manifest's
    deployment_config.distributed and deployment_config.slurm so the
    run phase auto-detects slurm without --additional-context.
  - Discovery yields multiple model cards: built_models manifest is now
    keyed by model_name (no longer overwrites when models share an image).
  - _execute_build_on_compute() validates `registry` early (no longer
    falls into registry.lower()/.replace() with NoneType).
  - Missing-credentials error emits per-registry hints (docker.io,
    ghcr.io, gcr.io, quay.io, nvcr.io) instead of Docker Hub-only PAT
    guidance.
  - Warn when discovered models in a prebuilt-image batch have
    divergent distributed/slurm configs.

* orchestration/run_orchestrator.py:
  - True recursive deep-merge of deployment_config into
    additional_context (the previous shallow `dict.update` lost nested
    slurm/k8s overrides).
  - Manifest reading respects deployment_config.target.

* execution/container_runner.py + helpers / docker_builder.py:
  - Surface model_args / env wiring needed by the slurm_multi flow.
  - Document the shell=True trust boundary on subprocess.run; cmd is
    constructed from validated manifest fields and user model_args are
    routed through shlex-quoted assembly in the caller.

* cli/{app, commands/run, commands/build, constants, utils, validators}.py:
  - --build-on-compute and --use-image flags, deployment-config CLI
    overrides, validation pre-checks for slurm targets.

* core/{constants, context, dataprovider, docker, errors, timeout}.py:
  - Shared infrastructure used by the new launcher path
    (DeploymentConfig fields, error classes, timeout policy).

* utils/, scripts/common/, reporting/, database/:
  - Helpers and tooling that the slurm_multi flow depends on
    (config_parser, gpu_config, gpu_validator, ops, rocm_tool_manager,
    session_tracker, log_formatting, discover_models, mongodb perf
    publishing, trace pre/post scripts, therock_detector).

Includes the Copilot review fixes that touch this layer: built_models
keying, registry None guard, registry-specific auth hints, divergent
config warning, deep-merge correctness, subprocess.run trust comment.

Co-authored-by: Cursor <cursoragent@cursor.com>
Round out the slurm_multi PR with operator-facing docs, examples, and
test additions:

* examples/slurm-configs:
  - basic/slurm-multi-{custom-split,multi-node}.json (renamed from
    sglang-disagg-* to reflect that the same launcher drives multiple
    flows: vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer).
  - minimal/slurm-multi-minimal.json
  - README.md covering all the basic/minimal configs.

* examples/profiling-configs/README.md: cross-reference slurm_multi
  flows where applicable.

* docs/{launchers,deployment,configuration,cli-reference,usage,
  installation,profiling,README}.md and top-level README.md / CHANGELOG.md:
  describe the slurm_multi launcher, --build-on-compute, --use-image,
  and the manifest deployment_config flow.

* pyproject.toml / .gitignore: minor housekeeping.

* tests/{unit,integration,e2e,fixtures}: cover the new launcher path,
  config_loader inheritance, deep-merge behaviour, container_runner
  shell handling, and slurm node selector cleanup. test_errors.py
  promoted from unit to integration.

Co-authored-by: Cursor <cursoragent@cursor.com>
@raviguptaamd raviguptaamd force-pushed the raviguptaamd/update_slurm_launcher branch from 8d73b3f to 50e8895 Compare May 2, 2026 23:46
… perf.csv collection

Three fixes that surfaced during the PR ROCm#86 smoke test against the
Distributed_Inference_CI reservation:

1. ErrorContext kwargs sweep (build_orchestrator.py: 5 sites)
   ErrorContext is a dataclass with a fixed field set
   {operation, phase, component, model_name, node_id, file_path,
    additional_info}. Five call-sites passed kwargs that don't exist
   (`model=`, `launcher=`, `model_names=`, `registry=`, `dockerfile=`),
   which raised `TypeError: ErrorContext.__init__() got an unexpected
   keyword argument 'model'` BEFORE the real ConfigurationError was
   constructed. The slurm_multi guard error in particular ("requires
   --registry or --use-image") was never visible to users.

   Migrated those kwargs to the proper `model_name=` field or to
   `additional_info={...}` so the user-facing error messages now
   reach the console.

2. F1: model-card image fallback for slurm_multi (build_orchestrator.py)
   When a discovered model card declares
   `env_vars.DOCKER_IMAGE_NAME` and the user passes neither
   `--registry` nor `--use-image`, treat the card's image as an
   implicit `--use-image` and call _execute_with_prebuilt_image()
   directly. Avoids forcing operators to repeat the image on the CLI
   for slurm_multi flows whose model cards already pin it.

   Falls back to the existing ConfigurationError (now visible thanks
   to fix ROCm#1) when no card image is set or when discovered models
   declare divergent images.

3. F-perf: extended slurm_multi perf.csv search paths (slurm.py)
   `_collect_slurm_multi_results` now looks in the locations the
   inner scripts (MAD-private PR #186) actually write to:
     * `/shared_inference/<user>/<deployment_id>/perf.csv`
       (sglang_disagg / vllm_disagg via benchmark_xPyD.sh)
     * `<workspace>/slurm_output/perf_csv/*<deployment_id>*.csv`
       (large_ep_benchmark / kvcache_transfer_bench)
   in addition to the existing `slurm_config.results_dir` and
   `<cwd>/perf.csv` paths, with NFS-propagation retry on each
   candidate. Fixes the "0 perf files, 1 log files" outcome we hit
   on Slurm job 22607 even though `/shared_inference/ravgupta/22607/perf.csv`
   was actually generated.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 3, 2026 03:54
Copilot AI left a comment

Pull request overview

Copilot reviewed 86 out of 87 changed files in this pull request and generated 24 comments.



Comment on lines 120 to 144
```diff
     if not is_rocprofv3_available():
-        if tools_include_rocprof_family(tools_config):
-            filtered_tools: List[Dict] = [
-                t
-                for t in tools_config
-                if isinstance(t, dict) and t.get("name") not in _ROCPROF_FAMILY_TOOL_NAMES
-            ]
-            if filtered_tools:
-                logger.warning(
-                    "Multi-node: rocprofv3 not found on submission host; "
-                    "omitting rocprof/rocprofv3-family tools. Remaining tools: "
-                    + ", ".join(
-                        str(x.get("name"))
-                        for x in filtered_tools
-                        if isinstance(x, dict) and x.get("name")
-                    )
-                )
-                return {
-                    "enabled": True,
-                    "mode": "multi_node",
-                    "tools": filtered_tools,
-                    "per_node_collection": True,
-                }
-            logger.warning(
-                "╔════════════════════════════════════════════════════════════════════════════╗\n"
-                "║ Multi-Node Profiling Requirements Not Met                                  ║\n"
-                "╠════════════════════════════════════════════════════════════════════════════╣\n"
-                "║ Multi-node profiling requires rocprofv3 (MPI-aware profiling support).     ║\n"
-                "║                                                                            ║\n"
-                "║ Current Status: rocprofv3 NOT FOUND on system                              ║\n"
-                "║                                                                            ║\n"
-                "║ rocprof/rocprofv3-family profiling will be SKIPPED for this multi-node     ║\n"
-                "║ run.                                                                       ║\n"
-                "║                                                                            ║\n"
-                "║ To enable multi-node rocprof profiling:                                    ║\n"
-                "║   • Install rocprofiler-sdk package (ROCm >= 6.4.1)                        ║\n"
-                "║   • Command: apt install rocprofiler-sdk                                   ║\n"
-                "║   • Or upgrade to ROCm 6.4.1 or later                                      ║\n"
-                "║                                                                            ║\n"
-                "║ Note: Single-node profiling uses rocprof (no rocprofv3 required)           ║\n"
-                "╚════════════════════════════════════════════════════════════════════════════╝"
-            )
-            return {
-                "enabled": False,
-                "mode": "multi_node_unsupported",
-                "tools": [],
-                "per_node_collection": False
-            }
-        logger.info(
-            "Multi-node: rocprofv3 not found on submission host; keeping non-rocprof tools "
-            "(rocprof/rocprofv3 family requires rocprofv3 for MPI-aware capture)."
-        )
-        return {
-            "enabled": True,
-            "mode": "multi_node",
-            "tools": tools_config,
-            "per_node_collection": True,
-        }
+        logger.warning(
+            "╔════════════════════════════════════════════════════════════════════════════╗\n"
+            "║ Multi-Node Profiling Requirements Not Met                                  ║\n"
+            "╠════════════════════════════════════════════════════════════════════════════╣\n"
+            "║ Multi-node profiling requires rocprofv3 (MPI-aware profiling support).     ║\n"
+            "║                                                                            ║\n"
+            "║ Current Status: rocprofv3 NOT FOUND on system                              ║\n"
+            "║                                                                            ║\n"
+            "║ Profiling will be SKIPPED for this multi-node run.                         ║\n"
+            "║                                                                            ║\n"
+            "║ To enable multi-node profiling:                                            ║\n"
+            "║   • Install rocprofiler-sdk package (ROCm >= 6.4.1)                        ║\n"
+            "║   • Command: apt install rocprofiler-sdk                                   ║\n"
+            "║   • Or upgrade to ROCm 6.4.1 or later                                      ║\n"
+            "║                                                                            ║\n"
+            "║ Note: Single-node profiling uses rocprof (no rocprofv3 required)           ║\n"
+            "╚════════════════════════════════════════════════════════════════════════════╝"
+        )
+        return {
+            "enabled": False,
+            "mode": "multi_node_unsupported",
+            "tools": [],
+            "per_node_collection": False
+        }
```
Author:

Addressed in 63a2eae.

configure_multi_node_profiling now only filters out the rocprof tool when rocprofv3 is missing on a multi-node run. rccl_trace, rocblas_trace, and other non-rocprof tools keep running and per_node_collection stays enabled for them. The returned config switches to a new mode multi_node_no_rocprofv3 so callers can distinguish this mixed state from full multi_node.
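The post-fix behaviour can be sketched as follows (tool names and the mode strings are taken from this reply; the real function carries more fields such as `per_node_collection`):

```python
# When rocprofv3 is missing on a multi-node run, drop only the
# rocprof-family tools and keep everything else running.
ROCPROF_FAMILY = {"rocprof", "rocprofv3"}

def configure(tools, rocprofv3_available):
    if rocprofv3_available:
        return {"enabled": True, "mode": "multi_node", "tools": tools}
    kept = [t for t in tools if t.get("name") not in ROCPROF_FAMILY]
    if kept:
        # Mixed state: profiling stays on for the surviving tools.
        return {"enabled": True, "mode": "multi_node_no_rocprofv3", "tools": kept}
    return {"enabled": False, "mode": "multi_node_unsupported", "tools": []}

cfg = configure([{"name": "rocprof"}, {"name": "rccl_trace"}], rocprofv3_available=False)
assert cfg["mode"] == "multi_node_no_rocprofv3"
assert [t["name"] for t in cfg["tools"]] == ["rccl_trace"]
```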

Comment on lines +531 to +534
```python
if key in stored_config and isinstance(stored_config[key], dict) and isinstance(self.additional_context[key], dict):
    merged = dict(stored_config[key])
    merged.update(self.additional_context[key])
    stored_config[key] = merged
```
Author:

Addressed in 63a2eae.

Promoted _deep_merge to a module-level helper and used it in _load_and_merge_manifest (which still did dict(stored_config[key]); merged.update(...)). Verified locally: an override touching slurm.node_selector.reservation now preserves a sibling slurm.node_selector.partition from the manifest instead of wholesale-replacing the inner object. Same helper is now used in both the runtime-context init path and the manifest-merge path.

Comment on lines +1018 to +1024
```python
credential_file = "credential.json"
if os.path.exists(credential_file):
    try:
        with open(credential_file) as f:
            credentials = json.load(f)
    except Exception as e:
        print(f"Warning: Could not load credentials: {e}")
```
Author:

Addressed in 63a2eae.

RunOrchestrator._load_credentials now delegates to madengine.core.auth.load_credentials. The central helper enforces "credential.json must contain a JSON object" validation up front (raises ValueError if the file is an array/string), so malformed credential files are rejected with a clear message instead of producing a cryptic subscript/attribute error later during registry auth.
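The object-type check described here can be sketched standalone (the real helper lives in `madengine.core.auth` and does more; only the validation shape is shown):

```python
import json

# Reject credential files whose top level is not a JSON object, so a
# malformed file fails loudly at load time rather than during registry auth.
def load_credentials(path="credential.json"):
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError(
            f"{path} must contain a JSON object, got {type(data).__name__}"
        )
    return data
```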

Comment thread src/madengine/core/context.py Outdated
Comment on lines +257 to +258
```diff
-        self.ctx["rocm_path"] = self._rocm_path
+        # In-container ROCM_PATH is finalized at run time in run_container
+        # via finalize_container_rocm_path (OCI env → probe → default).
+        # docker_env_vars.ROCM_PATH (user-supplied) is preserved here; finalize
+        # normalizes it at run time. Auto-detection runs in finalize when absent.
         self.ctx["docker_env_vars"]["ROCM_PATH"] = self._rocm_path
```
Author:

Addressed in 9109f52 — file reverted to origin/develop.

core/context.py no longer forces the host ROCm root into the container env. The PR now uses upstream's madengine.utils.rocm_path_resolver.resolve_host_rocm_path for host detection and the docker_env_vars.MAD_ROCM_PATH flow (with image-baked-in env, in-container probe, then /opt/rocm fallback) for in-container resolution. Out-of-scope behaviour change for slurm_multi; reverted to keep this PR focused.

Comment thread src/madengine/cli/utils.py Outdated
```diff
             row = [str(row_index), status, model_name]
             if show_gpu_arch:
-                row.append(build_gpu_arch_display(item))
+                row.append(item.get("architecture", "N/A"))
```
Author:

Addressed in 9109f52.

Restored the build_gpu_arch_display(item) helper (prefers gpu_architecture then architecture, defaults to "N/A") and switched both BUILD callsites (success + failed) back to it, so multi-arch builds again show the correct gpu_architecture column instead of N/A. Kept the model-name no-truncation change and the "Workload" column rename (those weren't flagged).
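The restored helper's lookup order can be sketched like this (field names are taken from this reply; the real helper may format differently):

```python
# Prefer the per-build gpu_architecture field, then the model-level
# architecture field, defaulting to "N/A" when neither is present.
def build_gpu_arch_display(item: dict) -> str:
    return item.get("gpu_architecture") or item.get("architecture") or "N/A"

assert build_gpu_arch_display({"gpu_architecture": "gfx942"}) == "gfx942"
assert build_gpu_arch_display({"architecture": "gfx90a"}) == "gfx90a"
assert build_gpu_arch_display({}) == "N/A"
```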

Comment thread README.md Outdated
```diff
 - **🚀 Modern CLI** - Rich terminal output with Typer and Rich
 - **🎯 Simple Deployment** - Run locally or deploy to Kubernetes/SLURM via configuration
-- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, Primus, vLLM, SGLang
+- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, vLLM, SGLang
```
Author:

Addressed across two commits.

  • 9109f52 — README.md reverted to origin/develop, so the Distributed Launchers bullet, the Architecture diagram, and the Components section all mention Primus alongside torchrun / DeepSpeed / Megatron-LM / TorchTitan / vLLM / SGLang again. The ROCm-path bullet and the RTL profiling-tools link are also restored.
  • 63a2eae — added "primus" (and "megatron") to VALID_LAUNCHERS in deployment/common.py, so normalize_launcher no longer silently downgrades "primus" configs to "docker".

The README is now consistent with what the code actually accepts.

Comment thread src/madengine/core/timeout.py Outdated
Comment on lines +20 to +26
```diff
     def __init__(self, seconds: int = 15) -> None:
         """Constructor of the Timeout class.

         Args:
-            seconds (Optional[int]): The timeout in seconds. None or 0 disables
-                the timeout. Negative values are treated as no timeout.
+            seconds (int): The timeout in seconds.
         """
-        self.seconds = seconds if seconds and seconds > 0 else None
+        self.seconds = seconds
```
Author:

Addressed in 9109f52 — file reverted to origin/develop.

Timeout.__init__ again accepts Optional[int] = 15 and stores self.seconds = seconds if seconds and seconds > 0 else None. __enter__ / __exit__ short-circuit when self.seconds is None, so programmatic callers passing timeout=None (or 0 / negative) no longer hit signal.alarm(None) with a TypeError. Out-of-scope change for slurm_multi; reverted.
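A condensed sketch of the restored behaviour (Unix-only, SIGALRM-based; the real class in `core/timeout.py` may differ in structure):

```python
import signal
from typing import Optional

# Arm SIGALRM only when a positive timeout is set; None, 0, and negative
# values mean "no timeout" and skip signal handling entirely.
class Timeout:
    def __init__(self, seconds: Optional[int] = 15) -> None:
        self.seconds = seconds if seconds and seconds > 0 else None

    def __enter__(self):
        if self.seconds is not None:
            signal.signal(signal.SIGALRM, self._raise)
            signal.alarm(self.seconds)
        return self

    def __exit__(self, *exc):
        if self.seconds is not None:
            signal.alarm(0)  # cancel the pending alarm
        return False

    @staticmethod
    def _raise(signum, frame):
        raise TimeoutError("operation timed out")

with Timeout(None):  # no alarm armed; previously raised TypeError
    pass
```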

Comment on lines +77 to +86
"--build-arg "
+ shlex.quote(str(build_arg))
+ "="
+ shlex.quote(str(self.context.ctx["docker_build_arg"][build_arg]))
+ " "
+ build_arg
+ "='"
+ self.context.ctx["docker_build_arg"][build_arg]
+ "' "
)

if run_build_arg:
for key, value in run_build_arg.items():
build_args += "--build-arg " + shlex.quote(str(key)) + "=" + shlex.quote(str(value)) + " "
build_args += "--build-arg " + key + "='" + value + "' "
Author:

Addressed in 9109f52.

get_build_arg reverted to use shlex.quote(str(...)) for both keys and values:

```python
build_args += (
    "--build-arg "
    + shlex.quote(str(build_arg))
    + "="
    + shlex.quote(str(self.context.ctx["docker_build_arg"][build_arg]))
    + " "
)
```

(plus Optional[Dict] = None with if run_build_arg is None: run_build_arg = {} instead of the mutable-default pattern). Build-arg values containing apostrophes / $() / backticks no longer break the docker build command line.
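A quick illustration of why `shlex.quote` matters here (illustrative values, not from the PR):

```python
import shlex

# Raw single-quote concatenation breaks as soon as the value itself
# contains an apostrophe, and leaves $() / backticks live in the shell.
value = "it's $(whoami)"
raw = "--build-arg KEY='" + value + "'"  # the apostrophe terminates the quoting
safe = "--build-arg " + shlex.quote("KEY") + "=" + shlex.quote(value)

# The shlex-quoted form round-trips through shell word splitting intact.
assert shlex.split(safe) == ["--build-arg", "KEY=it's $(whoami)"]
```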

Comment thread src/madengine/utils/config_parser.py Outdated
```diff
     stop = os.path.abspath(stop_dir)

-    while current == stop or current.startswith(stop + os.sep):
+    while current.startswith(stop):
```
Author:

Addressed in 9109f52 — file reverted to origin/develop.

Boundary check restored:

```python
while current == stop or current.startswith(stop + os.sep):
```

So a parent like /repo/scripts no longer matches /repo/scripts-private/... via prefix collision — ConfigParser walks the intended subtree only. Out-of-scope change for slurm_multi; reverted.
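The prefix collision is easy to demonstrate (paths here are illustrative):

```python
import os

# Bare startswith() matches sibling directories that merely share a prefix.
stop = "/repo/scripts"
sibling = "/repo/scripts-private/tool"

assert sibling.startswith(stop)                                    # false positive
assert not (sibling == stop or sibling.startswith(stop + os.sep))  # correct check
```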

Comment on lines +96 to +97
```diff
     if sys_gpu_arch and "NVIDIA" in sys_gpu_arch:
-        # Normalize "NVIDIA A100-SXM4-40GB" -> "A100", "NVIDIA H100 PCIe" -> "H100"
-        # (mirrors get_gpu_arch() in tests/fixtures/utils.py)
-        after_prefix = sys_gpu_arch[sys_gpu_arch.index("NVIDIA") + len("NVIDIA ") :]
-        sys_gpu_arch = after_prefix.split("-")[0].split()[0]
+        sys_gpu_arch = sys_gpu_arch.split()[1]
```
Author:

Addressed in 9109f52 — file reverted to origin/develop.

NVIDIA-arch normalization restored:

```python
after_prefix = sys_gpu_arch[sys_gpu_arch.index("NVIDIA") + len("NVIDIA "):]
sys_gpu_arch = after_prefix.split("-")[0].split()[0]
```

so "NVIDIA A100-SXM4-40GB""A100" and "NVIDIA H100 PCIe""H100" again, and skip_gpu_arch=A100 correctly filters those models. The split()[1] regression that returned the full SKU string is gone. Out-of-scope change for slurm_multi; reverted.

Previously the per-deploy completion marker was named
`madengine_<model>.complete` -- not job-id namespaced. A second
`madengine run` of the same model tag would see the leftover marker
from the first run and immediately conclude the job was done, even
though sbatch had only just submitted the new job.

Reproduced live: re-running pyt_sglang_disagg_llama-3.1-8b right after
job 22607 finished caused madengine to return success in ~30 s while
job 22615 was still running on the reservation.

Fix:

* prepare(): emit the marker using `${SLURM_JOB_ID:-local}` inside the
  generated wrapper so each sbatch run writes its own file. Falls back
  to `_local` when no SLURM_JOB_ID is set (interactive/local runs,
  preserves existing behaviour for that path via _completion_marker).

* monitor(): reconstruct the job-id-aware path from the basename
  template and the deployment_id returned by sbatch; only fall back
  to the bare marker for backwards compatibility with the
  inside_allocation flow.

Co-authored-by: Cursor <cursoragent@cursor.com>
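The job-id namespacing described in the commit message can be sketched as (the exact marker filename template is assumed; only the `${SLURM_JOB_ID:-local}` fallback shape follows the commit):

```python
from typing import Optional

# Namespace the completion marker by Slurm job id so a re-run of the same
# model tag never matches a leftover marker from an earlier job; fall back
# to "local" when no SLURM_JOB_ID is set (interactive/local runs).
def completion_marker(model: str, job_id: Optional[str]) -> str:
    return f"madengine_{model}.{job_id or 'local'}.complete"

assert completion_marker("llama-3.1-8b", "22615") != completion_marker("llama-3.1-8b", "22607")
```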
@raviguptaamd

Smoke test summary on Distributed_Inference_CI reservation

Drove the slurm_multi launcher end-to-end against four real workloads with the companion model cards in ROCm/MAD-private#186. Three out of four produced real benchmark numbers, the fourth surfaced a vLLM/NixlConnector issue inside the docker image (orthogonal to this PR).

Critical-path runs

| # | Workload | Model card | Slurm job | Nodes | State | Result |
|---|----------|------------|-----------|-------|-------|--------|
| C1 | SGLang disagg 8B (madengine-driven via F1) | pyt_sglang_disagg_llama-3.1-8b | 22607 | reservation [029,040,082] | COMPLETED 8:51 | 5,977 tok/s, TTFT 122.8 ms, ITL 2.54 ms |
| C2 | vLLM disagg 8B (1P/1D, co-located proxy) | pyt_vllm_disagg_llama-3.1-8b | 22612 | [008,014] | COMPLETED 4:22 | router serving, but every decode request returned HTTP 500: vLLM workers crashed with `nixl_connector.py:1504] AssertionError`. Image-side bug, not the launcher. |
| C3 | Large EP 2-node | pyt_large_ep_benchmark_2n | 22610 | reservation [029,040] | COMPLETED 3:49 | 28 metric rows (deepep dispatch/combine FP8/BF16 RDMA/NVL) |
| C4 | KV cache transfer 2-node | pyt_kvcache_transfer_bench | 22611 | reservation [082,125] | COMPLETED 24:55 | 57 metric rows (mooncake + mori + rixl, 4KB → 1GB) |
| C1' | C1 re-run with the F-perf collector commit (5ec55ff) | same | 22615 | reservation [029,040,082] | COMPLETED 6:54 | 5,970 tok/s; F-perf finds `/shared_inference/<user>/<jobid>/perf.csv` (verified) |

What the smoke test exercised in this PR

| Surface | Verification |
|---------|--------------|
| `madengine build` with `--use-image` (manifest gen for prebuilt image) | C1 |
| `madengine build` with no flags + model-card `env_vars.DOCKER_IMAGE_NAME` (F1 implicit `--use-image`) | C1' — log line `slurm_multi: no --registry/--use-image given; using DOCKER_IMAGE_NAME from model card -> rocm/pytorch-private:sglang_disagg_mori_20260502` |
| `madengine run --manifest-file` over slurm_multi | C1, C1' |
| slurm_multi SBATCH wrapper generation, `--reservation`, `enable_node_check=false` honoured | All |
| `_collect_slurm_multi_results` finds `/shared_inference/<user>/<jobid>/perf.csv` (sglang/vllm) | C1' |
| SlurmDeployment direct sbatch (no madengine on top) | C2, C3, C4 |
| slurm_multi co-located-proxy topology for vLLM disagg (xP + yD nodes) | C2 — topology was launched correctly; C2 failure happened inside vLLM nixl_connector after `Application startup complete.` |

Bugs surfaced by the smoke test (all addressed)

| Bug | Where | Commit |
|-----|-------|--------|
| 5 invalid kwargs to ErrorContext (`model=`, `launcher=`, `model_names=`, `registry=`, `dockerfile=`) — masked the real error message | build_orchestrator.py | 5ec55ff |
| F1: model-card image not auto-promoted to `--use-image` for slurm_multi | build_orchestrator.py | 5ec55ff |
| F-perf: `_collect_slurm_multi_results` only looked in `<cwd>/perf.csv` | slurm.py | 5ec55ff |
| Completion marker not job-id namespaced — second `madengine run` of same model tag returned success in ~30 s while the new sbatch was still running | slurm.py | 710b51a |

Out of scope

  • Image-side vLLM NixlConnector crash on Llama-3.1-8B disagg under rocm/pytorch-private:20260407_itej89_vllm_mori_docker (filed separately as orthogonal).
  • Node-health precheck threshold (180 GB memory used / 4 processes flags reservation nodes as "Dirty"; tracked for follow-up).

PR head: 710b51a. Mergeable; blocked only on review.

Round-2 Copilot review identified 15 changes outside the slurm_multi
scope that regressed upstream/develop behaviour. Revert each to
origin/develop:

  core/context.py            drop forced host-ROCm-into-container override
  core/docker.py             restore shlex.quote + exact-match container filter
  core/dataprovider.py       restore "unknown data name" guard, drop debug prints
  core/timeout.py            restore Optional[int] None/0 handling
  cli/app.py                 restore versioningit-driven --version
  utils/rocm_tool_manager.py capture primary_stderr before fallback overwrites
  utils/discover_models.py   restore shlex.quote in cp; raise TypeError, not assert
  utils/config_parser.py     fix boundary check (== or startswith stop+os.sep)
  orchestration/image_filtering.py
                             restore proper "NVIDIA <ARCH>" normalization
                             (split()[1] returned A100-SXM4-40GB, not A100)
  reporting/update_perf_csv.py
                             restore configs pop, isinstance instead of type ==
  README.md                  restore Primus mention, ROCm path docs, RTL link

Surgical reverts that preserve scope-relevant changes:

  cli/utils.py
    - restore build_gpu_arch_display() helper + the two BUILD callsites
      (success and failed BUILD branches), so multi-arch builds show the
      gpu_architecture column instead of N/A
    - keep the model-name no-truncation change (no [:20])
    - keep the "Workload" column rename and matching format-precision tweak

  execution/docker_builder.py
    - revert auth/shlex paths to upstream (login_to_registry delegated to
      madengine.core.auth, shlex.quote on build args, exact-match container
      filter, primus dockercontext detection)
    - PRESERVE the slurm_multi-specific block at line 369:
        # Update built_models with registry image name for parallel pull
        # in slurm_multi
      so the registry-image -> env_vars["DOCKER_IMAGE_NAME"] propagation
      that the slurm_multi parallel-pull path depends on is unchanged.

slurm_multi infrastructure is untouched: zero bytes diff vs PR HEAD on
deployment/{slurm,slurm_node_selector,base,common,factory,config_loader}.py,
orchestration/{build_orchestrator,run_orchestrator}.py,
execution/{container_runner,container_runner_helpers}.py,
cli/{commands/build,commands/run,validators}.py.

Smoke checks:
  DeploymentFactory.available_deployments() = ['k8s','kubernetes','slurm']
  normalize_launcher('slurm_multi','slurm')  = 'slurm_multi'
  normalize_launcher('slurm-multi','slurm')  = 'slurm_multi'

Addresses Copilot review comments ROCm#4, ROCm#5, ROCm#6, ROCm#9, ROCm#13, ROCm#14, ROCm#16, ROCm#17,
ROCm#18, ROCm#19, ROCm#20, ROCm#21, ROCm#22, ROCm#23, ROCm#24.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 6, 2026 01:05
raviguptaamd and others added 2 commits May 6, 2026 01:10
Six Round-2 review comments on slurm_multi-touching code paths fixed.

run_orchestrator.py:
  * Real recursive deep-merge: promote the inline `_deep_merge` from
    `_init_runtime_context_from_manifest` to a module-level helper and
    use it in `_load_and_merge_manifest` too. Previously the manifest
    path still did a shallow `dict.update()`, so a runtime override
    touching a single leaf inside e.g. `slurm.node_selector` or
    `distributed.primus` wholesale-replaced the inner object and
    silently dropped the manifest's other leaves.
    (Copilot review ROCm#2)
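The merge semantics described here can be sketched as a small standalone helper (an illustration of the behaviour, not the PR's exact `_deep_merge` code; the example dicts are made up):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base.

    Dict-valued keys merge key-by-key, so an override touching one leaf
    keeps the base's sibling leaves instead of replacing the inner dict.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical manifest/override shapes mirroring the slurm.node_selector case:
manifest = {"slurm": {"node_selector": {"partition": "amd", "reservation": "old"}}}
override = {"slurm": {"node_selector": {"reservation": "new"}}}
merged = deep_merge(manifest, override)
# sibling "partition" survives; a shallow dict.update() would have dropped it
```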

  * Restore manifest's `context.docker_env_vars` into the runtime
    context. Manifest-only runs (`madengine run --manifest-file`)
    used to load tools / pre_scripts / post_scripts / encapsulate_script
    but had stopped restoring `docker_env_vars`, so any env captured
    at build time (incl. MAD_SECRETS*, model-specific values) was
    dropped before ContainerRunner rendered `--env` flags.
    (Copilot review ROCm#12)

  * Replace the local `_load_credentials` reimplementation with a
    delegation to `madengine.core.auth.load_credentials`, which keeps
    the "credential.json must be a JSON object" validation that was
    silently lost. Without it, a malformed file (array/string) would
    propagate unchanged and surface much later as a cryptic
    subscript/attribute error during registry auth.
    (Copilot review ROCm#3)

deployment/common.py:
  * `configure_multi_node_profiling`: when `rocprofv3` is unavailable
    on a multi-node run, only filter out the `rocprof` tool (which is
    the one that actually requires rocprofv3) instead of disabling
    ALL multi-node tools. `rccl_trace`, `rocblas_trace`, and friends
    work independently and must keep running. Returns
    `multi_node_no_rocprofv3` in that mixed mode.
    (Copilot review ROCm#1)

  * `VALID_LAUNCHERS`: add `"megatron"` and `"primus"` alongside the
    existing `"megatron-lm"`. The rest of the codebase (slurm.py,
    kubernetes.py, the `job.yaml.j2` template, `primus_backend.py`,
    and every `examples/**/megatron-lm-*.json` config) uses the bare
    `"megatron"` / `"primus"` strings, so without these entries
    `normalize_launcher` was silently downgrading them to `"docker"`.
    (Copilot review ROCm#10)
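A minimal sketch of the lookup-with-fallback behaviour this entry fixes (the exact signature and fallback defaults are assumptions, not the real `normalize_launcher` implementation):

```python
VALID_LAUNCHERS = [
    "torchrun", "torchtitan", "deepspeed",
    "megatron", "megatron-lm", "primus",
    "vllm", "sglang", "slurm_multi",
]

def normalize_launcher(launcher: str, deployment: str) -> str:
    """Return a recognised launcher name, or a deployment default.

    Without "megatron"/"primus" in VALID_LAUNCHERS, those inputs fell
    through to the default branch and were silently downgraded.
    """
    # Accept both "slurm-multi" and "slurm_multi" spellings.
    for candidate in (launcher, launcher.replace("-", "_")):
        if candidate in VALID_LAUNCHERS:
            return candidate
    return "docker" if deployment == "slurm" else "native"
```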

deployment/base.py:
  * Widen `_parse_node_log_for_perf` regex to accept scientific
    notation (`1.23e4`, `4E+5`) for the value, and `\\S+` for the
    metric name (covers `tokens/sec`, `samples_per_second`, `tok/s`,
    etc.). The narrower `[\\d.]+` previously dropped legitimate
    perf lines emitted by training scripts.
    (Copilot review ROCm#11)
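An illustration of why the widening matters (the `performance:` log prefix here is an assumption for the demo; only the value/name sub-patterns mirror the commit text):

```python
import re

# Narrow pattern like the one replaced: plain decimal value, \w+ metric name
NARROW = re.compile(r"performance:\s*([\d.]+)\s+(\w+)")

# Widened pattern: value may use scientific notation (1.23e4, 4E+5),
# metric name may be any non-whitespace token (tokens/sec, tok/s, ...)
WIDE = re.compile(r"performance:\s*([\d.]+(?:[eE][+-]?\d+)?)\s+(\S+)")

line = "performance: 1.23e4 tokens/sec"
m = WIDE.match(line)
# m.group(1) == "1.23e4", m.group(2) == "tokens/sec";
# NARROW.match(line) fails outright: after "1.23" the "e" blocks \s+
```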

Smoke checks:
  DeploymentFactory.available_deployments() = ['k8s','kubernetes','slurm']
  normalize_launcher('megatron','slurm')        = 'megatron'
  normalize_launcher('primus','kubernetes')     = 'primus'
  normalize_launcher('slurm_multi','slurm')     = 'slurm_multi'
  _deep_merge nested-leaf preservation: verified

Co-authored-by: Cursor <cursoragent@cursor.com>
Three Round-2 Copilot comments on doc-only broken navigation:

  docs/deployment.md  (review ROCm#7):
    distributed-launchers.md → launchers.md (renamed long ago)
    user-guide.md            → usage.md     (no user-guide.md exists)

  docs/launchers.md   (review ROCm#8):
    Drop the K8s `slurm-multi-*.json` example references — those
    files do not exist under examples/k8s-configs/. slurm_multi is
    a SLURM-side launcher; K8s disagg flows continue to use the
    existing examples/k8s-configs/{minimal,basic}/sglang-disagg-*.json
    configs. Keep and verify the three SLURM example paths that
    actually exist on disk.

  docs/installation.md (review ROCm#15):
    user-guide.md          → usage.md       (no user-guide.md exists)
    how-to-quick-start.md  → cli-reference.md (no how-to-quick-start.md
                              exists; usage.md already covers Quick Start
                              and is linked above, so point readers at
                              the CLI reference instead).

All link targets verified to exist on disk:
  docs/launchers.md, docs/usage.md, docs/deployment.md, docs/cli-reference.md
  examples/slurm-configs/minimal/slurm-multi-minimal.json
  examples/slurm-configs/basic/slurm-multi-multi-node.json
  examples/slurm-configs/basic/slurm-multi-custom-split.json

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 75 out of 76 changed files in this pull request and generated 13 comments.

Comments suppressed due to low confidence (2)

src/madengine/orchestration/run_orchestrator.py:120

  • `Context.__init__` still only accepts additional_context, additional_context_file, build_only_mode, and detect_local_gpu_arch. Passing rocm_path here raises TypeError, so every local run that initializes the runtime context will fail before execution starts.
    src/madengine/orchestration/run_orchestrator.py:594
  • Build manifests still persist context.docker_env_vars, docker_mounts, and docker_build_arg, but this restore block now copies only tools/scripts back into the runtime Context. Execution-only runs from a manifest therefore lose build-time env/mount context before ContainerRunner starts, including values that Context auto-imported into docker_env_vars such as MAD_SECRETS*.


Comment on lines +574 to +580
```diff
 for env_arg in run_env:
     if not self._ENV_KEY_RE.match(env_arg):
         raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-    env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "
+    env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "

 # Add context environment variables
 if "docker_env_vars" in self.context.ctx:
     for env_arg in self.context.ctx["docker_env_vars"].keys():
         if not self._ENV_KEY_RE.match(env_arg):
             raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-        value = self.context.ctx["docker_env_vars"][env_arg]
-        env_args += f"--env {env_arg}={shlex.quote(str(value))} "
+        env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "
```
Author

Acknowledged — this is a real concern but it is pre-existing in origin/develop, not introduced by this PR.

The f"--env {env_arg}='{str(...)}' " pattern is at the same line in origin/develop (compare git show origin/develop:src/madengine/execution/container_runner.py | sed -n '578,581p'); the --build-on-compute / --use-image work in this PR didn't change this method.

Deferring to a follow-up issue rather than expanding this PR's scope. The right fix is to switch to shlex.quote(value) (and validate the env-var name with the existing _VALID_ENV_NAME_RE style check) for both this docker_env_vars loop and the run_env loop a few lines above. Happy to file the follow-up issue (run_container env-arg shell-escape hardening) and reference it back here.
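The proposed hardening amounts to one line per call site; a standalone sketch (the `env_flag` helper name is hypothetical, not code from the PR):

```python
import shlex

def env_flag(name: str, value: str) -> str:
    """Render one docker --env flag with a shell-safe value.

    Manual single-quoting, f"--env {name}='{value}'", breaks as soon as the
    value itself contains a single quote; shlex.quote handles every case by
    wrapping and escaping so the shell sees exactly one argument.
    """
    return f"--env {name}={shlex.quote(str(value))}"

# A value containing a single quote survives intact:
flag = env_flag("MSG", "it's fine")
```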

Comment on lines 1027 to +1042
```python
if self.additional_context and "docker_env_vars" in self.additional_context:
    merged_count = 0
    for key, value in self.additional_context["docker_env_vars"].items():
        self.context.ctx["docker_env_vars"][key] = value
        merged_count += 1
    if merged_count > 0:
        print(f"ℹ️ Merged {merged_count} environment variables from additional_context")

if self.context and str(self.context.ctx.get("gpu_vendor", "")).upper().find("AMD") != -1:
    from madengine.utils.rocm_path_resolver import finalize_container_rocm_path

    # Determine whether the user explicitly supplied ROCM_PATH for the container.
    # If they did (via docker_env_vars.ROCM_PATH in additional_context), the
    # re-merge above already restored it — keep it so finalize uses it directly.
    # If they did not, clear any ROCM_PATH left from a previous model run so
    # finalize always re-resolves for the current docker_image (OCI config →
    # in-image probe → /opt/rocm default).
    user_supplied_rocm_path = (
        str(
            (self.additional_context or {})
            .get("docker_env_vars", {})
            .get("ROCM_PATH", "")
        ).strip()
    )
    if not user_supplied_rocm_path:
        self.context.ctx["docker_env_vars"].pop("ROCM_PATH", None)

    finalize_container_rocm_path(
        self.context.ctx["docker_env_vars"],
        docker_image,
    )

# Merge env_vars from model_info (models.json) into docker_env_vars
if "env_vars" in model_info and model_info["env_vars"]:
    model_env_count = 0
    for key, value in model_info["env_vars"].items():
        self.context.ctx["docker_env_vars"][key] = str(value)
        model_env_count += 1
    if model_env_count > 0:
        print(f"ℹ️ Merged {model_env_count} environment variables from model_info (models.json)")
```
Author

This thread is showing as outdated and is already addressed by Cat B in 9109f52.

core/context.py was reverted to origin/develop, which means:

  • No more forced self.ctx['docker_env_vars']['ROCM_PATH'] = self._rocm_path at init.
  • In-container ROCM_PATH is resolved at run time via finalize_container_rocm_path() (OCI metadata → in-image probe → /opt/rocm), so user/model docker_env_vars overrides survive the merge.

The ROCm-path behaviour the comment was concerned about is preserved by upstream's existing implementation, not by this PR.

Comment on lines 444 to 448
```python
)

self.rich_console.print(f"[green]✓ Discovered {len(models)} model(s) for tags: {tags}[/green]\n")

# Initialize build-only context for manifest generation
```
Author

This thread is showing as outdated and is already addressed by Cat B in 9109f52.

core/context.py was reverted to origin/develop, so the Context build-only constructor no longer accepts a rocm_path keyword. The synthetic-manifest path that called Context(rocm_path=...) was the source of the TypeError referenced here, and that constructor signature is back to upstream's. No call site in this PR passes rocm_path=... anymore.

Comment on lines 529 to +534
```diff
 if self.additional_context:
     # Merge deployment_config
     if "deployment_config" in manifest:
         stored_config = manifest["deployment_config"]
-        # Runtime --additional-context overrides stored config
+        # Runtime --additional-context overrides stored config.
+        # For dict-valued keys, recursively deep-merge so manifest values
```
Author

This thread is showing as outdated and is already addressed in 63a2eae — reposting the link here for clarity.

run_orchestrator._load_and_merge_manifest no longer uses dict(stored_config[key]); merged.update(...). It now calls the module-level _deep_merge(stored_config[key], self.additional_context[key]) helper, which recurses into nested dicts.

Verified locally: an override touching slurm.node_selector.reservation preserves a sibling slurm.node_selector.partition from the manifest instead of wholesale-replacing the inner object. Same helper is used in both the runtime-context init path and the manifest-merge path.

Comment on lines +15 to 23
```python
# NOTE: keep this in sync with the launcher branches in
# deployment/{slurm.py,kubernetes.py}, the templates/{slurm,kubernetes}/*.j2
# Jinja conditions, and examples/**/launcher fields. Currently the rest of
# the codebase (and shipped examples) use the bare "megatron" / "primus"
# strings, while "megatron-lm" is the docs-friendly synonym.
VALID_LAUNCHERS = [
    "torchrun",
    "torchtitan",
    "deepspeed",
```
Author

This Round-3 comment is reviewing a pre-fix snapshot — already addressed in 63a2eae.

Current src/madengine/deployment/common.py has both "megatron" and "primus" in VALID_LAUNCHERS (alongside the docs-friendly "megatron-lm"):

```python
VALID_LAUNCHERS = [
    "torchrun", "torchtitan", "deepspeed",
    "megatron", "megatron-lm", "primus",
    "vllm", "sglang", "slurm_multi",
]
```

Verified at HEAD (2200cbd): `normalize_launcher('megatron','slurm')` returns `'megatron'`, and `normalize_launcher('primus','kubernetes')` returns `'primus'`. The example configs (examples/{slurm,k8s}-configs/**/megatron-lm-*.json, examples/k8s-configs/minimal/megatron-lm-*.json) round-trip correctly without being downgraded to "docker"/"native".

Comment thread docs/launchers.md
Comment on lines +471 to +476
**Examples** (SLURM only — `slurm_multi` is a SLURM-side launcher; the K8s
deployment of disagg workloads continues to use the existing
`sglang-disagg-*.json` configs):
- SLURM Minimal: `examples/slurm-configs/minimal/slurm-multi-minimal.json`
- SLURM Basic: `examples/slurm-configs/basic/slurm-multi-multi-node.json`
- SLURM Custom: `examples/slurm-configs/basic/slurm-multi-custom-split.json`
Author

This Round-3 comment is reviewing a pre-fix snapshot — already addressed in 0156d33.

Current docs/launchers.md (around line 470) only lists the three SLURM-side examples that actually exist on disk:

**Examples** (SLURM only — `slurm_multi` is a SLURM-side launcher; the K8s
deployment of disagg workloads continues to use the existing
`sglang-disagg-*.json` configs):
- SLURM Minimal: `examples/slurm-configs/minimal/slurm-multi-minimal.json`
- SLURM Basic:   `examples/slurm-configs/basic/slurm-multi-multi-node.json`
- SLURM Custom:  `examples/slurm-configs/basic/slurm-multi-custom-split.json`

The three K8s slurm-multi-*.json references that didn't exist were dropped. All three SLURM paths verified to exist at HEAD (2200cbd).

Comment thread examples/profiling-configs/README.md Outdated
Comment on lines +89 to +105
```diff
@@ -126,7 +102,7 @@ madengine build --tags your_model --registry your-registry:5000

 # Deploy to SLURM
 madengine run --manifest-file build_manifest.json \
-    --additional-context-file examples/profiling-configs/rocprofv3_multi_node.json
+    --additional-context-file examples/profiling-configs/rocprofv3_multinode.json
```
Author

Addressed in 2200cbd.

Fixed both occurrences (lines 89 and 105). The actual checked-in profiling config is examples/profiling-configs/rocprofv3_multi_node.json (with underscore); the README now references that exact filename, so the copy-paste command no longer fails with a missing-file error.

Comment thread tests/unit/test_reporting.py Outdated
```python
import os
import tempfile

import pytest
```
Author

Addressed in 2200cbd.

Dropped import pytest. Confirmed there are no pytest.* references anywhere in tests/unit/test_reporting.py (no @pytest.mark.*, no pytest.raises, no pytest.fixture), so the import was genuinely unused. flake8 lint will be clean again.

Comment thread tests/unit/test_rocm_path.py Outdated
Comment on lines +50 to +51
```python
ctx = Context(build_only_mode=True, rocm_path="/opt/rocm")
assert ctx._rocm_path == "/opt/rocm"
```
Author

Addressed in 2200cbd — file reverted to origin/develop.

With the Cat B revert of core/context.py in 9109f52, this PR no longer adds the Context(rocm_path=...) keyword API; tests for that PR-specific surface belong to whichever future PR re-adds it. Upstream's tests/unit/test_rocm_path.py already exercises ROCmToolManager(rocm_path=...) (still in the codebase), so reverting restores green tests without changing semantics. Net result: this test file is byte-identical to origin/develop.

Comment thread tests/unit/test_rocm_path.py Outdated
Comment on lines +65 to +67
```python
ctx = Context(rocm_path="/my/rocm")
assert ctx.ctx.get("rocm_path") == "/my/rocm"
assert ctx.ctx["docker_env_vars"].get("ROCM_PATH") == "/my/rocm"
```
Author

Addressed in 2200cbd (same fix as the sibling thread). The whole test_rocm_path.py file was reverted to origin/develop, so this assertion that ctx['docker_env_vars']['ROCM_PATH'] == ... immediately after construction is gone — the test now matches upstream's behaviour where in-container ROCM_PATH is finalized at run time via finalize_container_rocm_path().

raviguptaamd and others added 2 commits May 6, 2026 03:34
Three Round-3 review comments fixed in this small commit; the
remaining Round-3 items are either stale (already fixed in 9109f52 /
63a2eae / 0156d33; Copilot was reviewing a pre-fix snapshot) or
pre-existing concerns in upstream/develop that are deferred to a
follow-up issue (replied inline).

  examples/profiling-configs/README.md (R3-6):
    `rocprofv3_multinode.json` -> `rocprofv3_multi_node.json` (lines 89
    and 105). The actual checked-in file is `rocprofv3_multi_node.json`
    (with underscore), so copy-pasting the README's command would fail
    with a missing-file error.

  tests/unit/test_reporting.py (R3-7):
    Drop the unused `import pytest` (no `pytest.*` references in the
    file). The repo's pre-commit runs flake8, so the unused import
    would fail lint/CI.

  tests/unit/test_rocm_path.py (R3-8 + R3-9):
    Revert to origin/develop. With the Cat B revert of
    `core/context.py` (commit 9109f52), the PR no longer adds the
    `Context(rocm_path=...)` keyword API; tests for that PR-specific
    API now belong to whichever future PR re-adds it. Upstream's
    version of this test file already exercises
    `ROCmToolManager(rocm_path=...)` -- which is still in the codebase
    -- so reverting restores green tests without changing semantics.

Co-authored-by: Cursor <cursoragent@cursor.com>
Surfaced live: `madengine build --tags pyt_kvcache_transfer_bench
--build-on-compute --registry docker.io/rocm/pytorch-private` failed
its docker build step with:

    ERROR: failed to build: invalid tag
    "ci-pyt_kvcache_transfer_bench_Dockerfile":
    repository name must be lowercase

Root cause: in `_execute_build_on_compute`, `local_image_name` is
assembled as `ci-<model_name>_<dockerfile_basename>`, where
`dockerfile_basename` strips `.Dockerfile` and `.ubuntu.amd` from
`Path(dockerfile_path).name`. When the model card's `dockerfile`
field is just `scripts/<dir>/Dockerfile` (no `.ubuntu.amd.Dockerfile`
suffix), neither `.replace()` matches, so `dockerfile_basename`
stays as `Dockerfile` (capital D) and the assembled tag is rejected
by `docker build -t`.

Fix: `.lower()` the final assembled local image name. This:
  * fixes the kvcache_transfer_bench card (`scripts/.../Dockerfile`),
  * is a no-op for the four already-passing cards whose dockerfile
    paths follow the `*.ubuntu.amd.Dockerfile` convention (the
    `_basename` is already lowercase after the suffix strips), and
  * is robust against any uppercase character anywhere in
    `model_name` or `dockerfile_basename`.
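A self-contained sketch of the tag assembly after the fix (the helper name and strip order are assumed from the description above, not copied from the PR):

```python
from pathlib import Path

def local_image_name(model_name: str, dockerfile_path: str) -> str:
    """Assemble the ci-<model>_<basename> tag, forced lowercase.

    Docker rejects repository names containing uppercase characters, so
    the final .lower() rescues plain "scripts/<dir>/Dockerfile" cards
    where neither suffix strip matches and the basename stays "Dockerfile".
    """
    basename = (
        Path(dockerfile_path).name
        .replace(".Dockerfile", "")
        .replace(".ubuntu.amd", "")
    )
    return f"ci-{model_name}_{basename}".lower()
```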

Verified by re-running the failing case to completion:

    Local image name: ci-pyt_kvcache_transfer_bench_dockerfile
    Image pushed: docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench
    [✓] All builds completed successfully

(slurm job 22936, ~18 min full Dockerfile build of mooncake + MORI +
RIXL + nixlbench, 24 layers)

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 6, 2026 05:29
Copilot AI left a comment

Pull request overview

Copilot reviewed 74 out of 75 changed files in this pull request and generated 24 comments.

Comments suppressed due to low confidence (3)

docs/cli-reference.md:396

  • The CLI docs still say --timeout 0 means “no timeout”, but madengine run no longer translates 0 to an unlimited timeout display-wise and the validation message no longer mentions 0. Either update the docs to match the current behavior, or restore the explicit 0 => no timeout handling so help text, docs, and runtime all align.
| `--tags` | `-t` | TEXT | `[]` | Model tags to run (can specify multiple) |
| `--manifest-file` | `-m` | TEXT | `""` | Build manifest file path (for pre-built images) |
| `--rocm-path` | | TEXT | `None` | ROCm installation root (default: `ROCM_PATH` env or `/opt/rocm`). Use when ROCm is not in `/opt/rocm` (e.g. Rock, pip). |
| `--registry` | `-r` | TEXT | `None` | Docker registry URL |
| `--timeout` | | INT | `-1` | Timeout in seconds (-1=default 7200s, 0=no timeout) |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
| `--additional-context-file` | `-f` | TEXT | `None` | File containing additional context JSON |

pyproject.toml:31

  • pytest is included in the package’s runtime dependencies even though it’s only used for testing (it’s already listed under [project.optional-dependencies].dev). Keeping pytest in dependencies increases install size and can conflict with downstream environments. Move it back to dev/test extras only unless the library imports pytest at runtime (it currently doesn’t).
```toml
dependencies = [
  "pandas",
  "GitPython",
  "jsondiff",
  "sqlalchemy",
  "setuptools-rust",
  "paramiko",
  "tqdm",
  "pytest",
  "typing-extensions",
  "pymongo",
  "toml",
  "typer>=0.9.0",
  "rich>=13.0.0",
  "click>=8.0.0",
  "jinja2>=3.0.0",
  "pyyaml>=6.0",
]
```

src/madengine/execution/container_runner_helpers.py:216

  • make_run_log_file_path() no longer normalizes full image references (e.g. registry/ns/name:tag), so the generated log filename can include : (and very long sanitized refs), and run-log naming can diverge from build-log naming for CI-style tags. Consider reinstating a normalization helper (strip digest, collapse registry/...:ci-* to just the tag, and replace :, /, @ with _) to keep filenames stable and filesystem-friendly.
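One possible shape of the normalization this comment asks for (entirely hypothetical; the repo's actual helper may differ in name and rules):

```python
def normalize_image_ref_for_log(image_ref: str) -> str:
    """Turn a full image reference into a stable, filesystem-friendly name.

    Strips any @sha256:... digest, collapses registry/ns/name:ci-<x> refs
    down to just the ci- tag, and replaces ':' and '/' with '_'.
    """
    ref = image_ref.split("@", 1)[0]          # drop digest suffix
    repo, sep, tag = ref.rpartition(":")
    if sep and "/" not in tag and tag.startswith("ci-"):
        ref = tag                             # registry/ns/name:ci-x -> ci-x
    return ref.replace(":", "_").replace("/", "_")
```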

Comment on lines 133 to 138
```python
rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
self.context = Context(
    additional_context=context_string,
    build_only_mode=False,
    rocm_path=rocm_path,
)
```
Author

Addressed in 3fc7587.

This is a real latent bug introduced by my Cat B revert: I reverted core/context.py to origin/develop (which removes the rocm_path keyword), but missed that run_orchestrator.py still passes it at lines 138 and 456. Today's slurm_multi tests went through _execute_distributed so didn't hit it, but madengine run against a local target hits _init_runtime_context() here and crashes with TypeError: Context.__init__() got an unexpected keyword argument 'rocm_path'.

Fix in 3fc7587: drop the rocm_path = get_rocm_path(...) line and the rocm_path=rocm_path kwarg at both call sites; drop the now-unused from madengine.core.constants import get_rocm_path import.

Verified at HEAD (3fc7587) via inspect.signature(Context.__init__): params are ['self','additional_context','additional_context_file','build_only_mode','detect_local_gpu_arch'] -- no more rocm_path. End-to-end smoke: pyt_large_ep_benchmark_2n on [008,024] via slurm_multi, 8 min, 28 metric rows.

Comment on lines 451 to 456
```python
rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
build_context = Context(
    additional_context=context_string,
    build_only_mode=True,
    rocm_path=rocm_path,
)
```
Author

Addressed in 3fc7587 (same root cause as the sibling thread on line 138). The synthetic-manifest path in _create_manifest_from_local_image was passing the same now-removed rocm_path kwarg. Both call sites are fixed in this commit; from madengine.core.constants import get_rocm_path import is gone too.

Comment on lines +26 to +33
```python
from madengine.core.errors import (
    BuildError,
    ConfigurationError,
    ExecutionError,
    create_error_context,
    handle_error,
)
from madengine.core.constants import get_rocm_path
```
Author

Addressed in 3fc7587. Dropped handle_error from the import set in from madengine.core.errors import (...) since the only references in this module were removed in earlier rounds. flake8 F401 should be clean.

Comment on lines +415 to +431
```python
# Ensure credential values are strings
username = str(creds["username"])
password = str(creds["password"])

# Perform docker login
login_command = f"echo '{password}' | docker login"

if registry and registry.lower() not in ["docker.io", "dockerhub"]:
    login_command += f" {registry}"

login_command += f" --username {username} --password-stdin"

try:
    self.console.sh(login_command, secret=True)
    self.rich_console.print(f"[green]✅ Successfully logged in to registry: {registry or 'DockerHub'}[/green]")
except Exception as e:
    self.rich_console.print(f"[red]❌ Failed to login to registry {registry}: {e}[/red]")
```
Author

Addressed in 3fc7587.

ContainerRunner.login_to_registry now delegates to madengine.core.auth.login_to_registry (the same helper DockerBuilder.login_to_registry already uses). The previous inline form built

```python
login_command = f"echo '{password}' | docker login ... --password-stdin"
self.console.sh(login_command, secret=True)
```

which -- even though docker reads the password via --password-stdin -- still embedded the password in echo's argv, visible through `ps` and `/proc/<pid>/cmdline`. The central helper writes the password to docker's stdin via subprocess.run(..., input=password.encode()), so it never appears on a process command line.

The ~50 lines of inline credential-shape validation are also gone -- the central helper does the same checks (and was the path Round-2 R2-#18 already moved DockerBuilder onto, just in a sister file).
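For reference, the stdin-based login pattern looks roughly like this (an assumed sketch of the central helper's approach, not its actual code; the function names are made up):

```python
import subprocess

def build_login_command(registry: str, username: str) -> list:
    """docker login argv; the password is deliberately NOT part of it."""
    cmd = ["docker", "login"]
    if registry and registry.lower() not in ("docker.io", "dockerhub"):
        cmd.append(registry)
    cmd += ["--username", username, "--password-stdin"]
    return cmd

def docker_login(registry: str, username: str, password: str) -> None:
    # The password travels over the subprocess stdin pipe, so it never
    # shows up in any process's argv (ps, /proc/<pid>/cmdline).
    subprocess.run(build_login_command(registry, username),
                   input=password.encode(), check=True)
```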

Comment on lines 568 to 583
```diff
 def get_env_arg(self, run_env: typing.Dict) -> str:
     """Get the environment arguments for docker run."""
     env_args = ""

     # Add custom environment variables
     if run_env:
         for env_arg in run_env:
             if not self._ENV_KEY_RE.match(env_arg):
                 raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-            env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "
+            env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "

     # Add context environment variables
     if "docker_env_vars" in self.context.ctx:
         for env_arg in self.context.ctx["docker_env_vars"].keys():
             if not self._ENV_KEY_RE.match(env_arg):
                 raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-            value = self.context.ctx["docker_env_vars"][env_arg]
-            env_args += f"--env {env_arg}={shlex.quote(str(value))} "
+            env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "

     print(f"Env arguments: {env_args}")
     return env_args
```
Author

Addressed in 3fc7587. get_env_arg no longer dumps the full assembled --env KEY='value' ... string into stdout. It now prints only the count:

```python
print(f"Env arguments: {env_args.count('--env')} variables set")
```

Values can include HF tokens, registry passwords, MAD_SECRETS_*, etc.; the assembled string is still passed to docker run (so the container sees them as expected), but no longer leaks into stdout / CI artifacts.

Comment on lines 147 to 157
try:
power_info = amdsmi_interface.amdsmi_get_power_info(processor_handle)
if isinstance(power_info, dict):
# If both current and average power are 0, it's likely a secondary die
current_power = power_info.get('current_socket_power', -1)
avg_power = power_info.get('average_socket_power', -1)
if current_power == 0 and avg_power == 0:
return True
-except:
+except Exception:
# If we can't get power info, might be secondary die
return True
Comment thread docs/deployment.md Outdated
- `sglang-disagg` - Disaggregated SGLang (multi-node)

-See [Distributed Launchers Guide](distributed-launchers.md) for details.
+See [Launchers Guide](launchers.md) for details.
Author

Addressed in 3fc7587. This was the second occurrence of [Distributed Launchers Guide](distributed-launchers.md) -- the earlier 0156d33 commit only fixed the one at line 543. Updated to point at launchers.md (verified to exist on disk).

Comment on lines 303 to 311

# Add upload timestamp if not present
if f"{prefix}_uploaded_at" not in doc:
-doc[f"{prefix}_uploaded_at"] = datetime.utcnow()
+doc[f"{prefix}_uploaded_at"] = datetime.now(timezone.utc)

# Preserve original created_date if present
if "created_date" not in doc:
-doc["created_date"] = datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")
+doc["created_date"] = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

Comment on lines 86 to +127
@@ -120,6 +122,10 @@ def __init__(self, message: str, context: Optional[ErrorContext] = None, **kwarg
)


# Backward compatibility alias
RuntimeError = ExecutionError

Comment on lines 29 to +33
class TestConfigParser:
"""Test cases for ConfigParser functionality."""

@pytest.fixture
-def test_dir(self):
+def tmp_dir(self):
raviguptaamd and others added 2 commits May 6, 2026 06:31
…pend docker.io/)

Surfaced live: `madengine build --build-on-compute --registry rocm/pytorch-private`
failed inside the build sbatch with `docker login rocm` -> dial tcp ...
connect: connection refused. Dockerhub-shorthand without the `docker.io/`
prefix is a routine input shape (matches what Docker CLI accepts for
pull/run); the build-on-compute path silently fed the first segment of
`registry` to `docker login <host>` and got a DNS NXDOMAIN.

`_execute_build_on_compute` now normalizes `--registry` early:

  1. Strip whitespace + trailing slash; reject empty after trim.
  2. Reject inputs containing whitespace or `@` anywhere (catches typos
     like `@bad/foo` BEFORE the auto-prefix step would mask them by
     overwriting the first path segment to `docker.io`).
  3. If the first path segment doesn't look like a registry host
     (no `.`, no `:`, not `localhost`), prepend `docker.io/` and print
     a transparent `[dim]auto-prefixing 'docker.io/'[/dim]` notice so
     the user can see what the build is actually pushing to.
  4. Use the resulting normalized value for everything downstream
     (login host extraction, push path construction, manifest entry,
     per-registry credential-hint matching).

Effect on real-world inputs (verified end-to-end against
docker.io/rocm/pytorch-private):

  rocm/pytorch-private             -> docker.io/rocm/pytorch-private  (auto)
  rocmshared                       -> docker.io/rocmshared            (auto)
  docker.io/rocm/pytorch-private   -> unchanged
  ghcr.io/myorg/myrepo             -> unchanged
  registry.example.com:5000/myrepo -> unchanged
  localhost:5000/myrepo            -> unchanged
  192.168.1.10:5000/myrepo         -> unchanged
  @bad/foo                         -> ConfigurationError (fail-fast,
                                       includes per-registry usage hints)
  ""                               -> ConfigurationError (existing
                                       None-guard, before this change)

Live verification (rocmshared PAT, write+read on rocm/pytorch-private):

  Test 1 `rocm/pytorch-private`             -> sbatch 22954 ✅ pushed
                                                docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench
  Test 2 `@bad/foo`                         -> ConfigurationError (no sbatch)
                                                "Invalid --registry value: '@bad/foo'.
                                                 First segment '@bad' is not a valid
                                                 registry host."
  Test 3 `docker.io/rocm/pytorch-private`   -> sbatch 22956 ✅ pushed (no auto-prefix
                                                message, value unchanged)
  Test 4 ""                                 -> caught by existing build CLI guard

Suggestion strings use parens `(...)` rather than `[...]` to avoid
Rich's markup parser interpreting them as closing tags.

Co-authored-by: Cursor <cursoragent@cursor.com>
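The four normalization steps can be sketched stand-alone as follows (the real code lives in `_execute_build_on_compute` and raises ConfigurationError with per-registry usage hints instead of a plain ValueError):

```python
def normalize_registry(registry: str) -> str:
    """Sketch of the --registry normalization steps described above."""
    # 1. Strip whitespace and any trailing slash; reject empty after trim.
    value = registry.strip().rstrip("/")
    if not value:
        raise ValueError("Invalid --registry value: empty after trim")
    # 2. Reject whitespace or '@' anywhere, BEFORE the auto-prefix step
    #    could mask a typo like '@bad/foo'.
    if any(ch.isspace() for ch in value) or "@" in value:
        raise ValueError(f"Invalid --registry value: {value!r}")
    # 3. A first segment with no '.', no ':' and not 'localhost' is a
    #    Dockerhub namespace, not a registry host -> prepend docker.io/.
    first = value.split("/", 1)[0]
    if "." not in first and ":" not in first and first != "localhost":
        value = f"docker.io/{value}"
    # 4. Everything downstream (login host extraction, push path,
    #    manifest entry) uses this normalized value.
    return value
```

This reproduces the documented input/output table: Dockerhub shorthand gets the prefix, fully-qualified hosts pass through unchanged, and malformed input fails fast.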
Round-4 Copilot review (auto-fired post 2200cbd) identified 24 inline
comments. 14 are pre-existing flake8/style nits in `origin/develop`-
identical files unchanged by this PR (tests/fixtures/**, scripts/common/
tools/**, csv_parser.py, utils/{ops,log_formatting}.py, core/errors.py,
database/mongodb.py, etc.); those are out of scope for this PR.

This commit addresses the 8 findings that DO touch code we own:

P0 (real latent bugs / known-broken):

  src/madengine/orchestration/run_orchestrator.py:
    * Drop `Context(rocm_path=...)` kwarg from both call sites
      (`_init_runtime_context` line 138; `_create_manifest_from_local_image`
      line 456). Cat B's revert of `core/context.py` to origin/develop
      removed `rocm_path` from `Context.__init__`, but those two call
      sites still passed the kwarg. Latent bug: today's slurm_multi
      tests went through `_execute_distributed` so didn't hit it,
      but `madengine run` against a local target (`_execute_local`)
      and the synthetic-manifest path (`MAD_CONTAINER_IMAGE` env)
      crashed with `TypeError: Context.__init__() got an unexpected
      keyword argument 'rocm_path'`.
    * Drop now-unused `from madengine.core.constants import get_rocm_path`
      (only callers were the two lines above).
    * Drop unused `handle_error` from `madengine.core.errors` import set
      (flake8 F401).

  src/madengine/execution/container_runner.py:
    * Widen `perf_pattern` in run-result extraction to mirror the
      `deployment/base.py:_parse_node_log_for_perf` regex we already
      widened in 63a2eae:
        before: r'performance:\s+([0-9][0-9.eE+-]*)\s+([a-zA-Z_][a-zA-Z0-9_]*)'
        after:  r'performance:\s+([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s+(\S+)'
      Accepts scientific notation values and arbitrary metric tokens
      (`tokens/sec`, `tok/s`, `samples_per_second`, etc.).

  docs/deployment.md:
    * `[Distributed Launchers Guide](distributed-launchers.md)` →
      `(launchers.md)` at line 145. Second occurrence of the broken
      link the 0156d33 fix missed (it only fixed line 543).

Yellow-tier hardening (security / leak / shell-quoting):

  src/madengine/execution/container_runner.py:
    * `login_to_registry` now delegates to
      `madengine.core.auth.login_to_registry` instead of constructing
      `f"echo '{password}' | docker login ..."` inline. The previous
      form put the password in argv (visible via `ps`/`/proc`) even
      though docker itself read it from stdin. Same delegation pattern
      DockerBuilder already uses.
    * `get_env_arg` no longer prints the assembled `--env KEY='value'`
      string (which can include HF tokens, registry passwords,
      `MAD_SECRETS_*`). Print only the count.
    * Self-managed launcher path (`_execute_self_managed_launcher`):
      `cmd = f"bash {script_path} {model_args}"` with `shell=True`
      no longer string-concats user-supplied `model_args` into the
      shell command. Use `shlex.quote(script_path)` and
      `shlex.split(model_args)` + per-arg `shlex.quote`, so embedded
      `$()`/backticks/`;` become literal arguments to the script
      instead of host-shell directives.

Verified:
  * py_compile clean on both files.
  * Import smoke: DeploymentFactory.available_deployments() ==
    ['k8s','kubernetes','slurm']; Context() rejects `rocm_path=` arg;
    _deep_merge nested-leaf preservation; run_orchestrator no longer
    references get_rocm_path/handle_error/rocm_path=; container_runner
    perf_pattern widened, env_args print is count-only,
    login_to_registry delegates to core.auth, self-managed shlex.quote
    applied.
  * Live: pyt_large_ep_benchmark_2n on [008,024] via slurm_multi
    completed in 8 min with 28 ✅ Success rows -> perf.csv at
    /shared_inference/ravgupta/<jobid>/ as expected.

Addresses Copilot review comments R4-1, R4-2, R4-3, R4-4, R4-5,
R4-6, R4-7, R4-18.

Co-authored-by: Cursor <cursoragent@cursor.com>
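The widened perf pattern from the commit message above, exercised against the metric shapes it names (a stand-alone repro, not the PR code):

```python
import re

# Widened run-result pattern: accepts scientific-notation values and
# arbitrary metric tokens (tokens/sec, tok/s, samples_per_second, ...).
perf_pattern = re.compile(
    r'performance:\s+([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s+(\S+)'
)

for line in [
    "performance: 5977 tokens/sec",
    "performance: 1.25e3 tok/s",
    "performance: 0.5 samples_per_second",
]:
    value, metric = perf_pattern.search(line).groups()
    print(value, metric)
```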
Copilot AI review requested due to automatic review settings May 6, 2026 19:32
Copilot AI left a comment

Pull request overview

Copilot reviewed 74 out of 75 changed files in this pull request and generated 16 comments.

Comments suppressed due to low confidence (2)

tests/unit/test_reporting_superset.py:39

  • This file contains whitespace on otherwise blank lines (e.g. between the class docstring and the first fixture). The repo's pre-commit trailing-whitespace hook (and flake8 W293) will fail on this; please remove trailing spaces on blank lines.

tests/unit/test_reporting_superset.py:171

  • Additional whitespace-only lines in this section will also fail the pre-commit trailing-whitespace hook. Please strip the trailing spaces from blank lines here as well.

Comment thread pyproject.toml
Comment on lines 14 to 24
dependencies = [
"pandas",
"GitPython",
"jsondiff",
"sqlalchemy",
"setuptools-rust",
"paramiko",
"tqdm",
"pytest",
"typing-extensions",
"pymongo",
Author

Addressed in afbb666: dropped pytest from [project].dependencies. It was already present in [project.optional-dependencies].dev and .all, so no install path that runs tests is affected; the only change is that a bare pip install madengine (without [dev]/[all]) no longer pulls a test framework into runtime installs.

Comment on lines 152 to 156
result = builder.get_build_arg()

-assert "--build-arg ARG1='value1'" in result
-assert "--build-arg ARG2='value2'" in result
+assert f"--build-arg ARG1={shlex.quote('value1')}" in result
+assert f"--build-arg ARG2={shlex.quote('value2')}" in result

Author

Addressed in afbb666: 5 build-arg assertions updated to match DockerBuilder.get_build_arg()'s shlex.quote() output. For shell-safe values (e.g. value1), shlex.quote returns the value unchanged, so the assertions are now --build-arg ARG1=value1 (no surrounding quotes). Verified locally: pytest -k build_arg -> 4 passed.

Comment on lines 72 to 93
@pytest.mark.parametrize(
"clean_test_temp_files", [["perf_test.csv", "perf_test.html"]], indirect=True
)
def test_commandline_argument_skip_gpu_arch(
-self, global_data, clean_test_temp_files
+self, global_data, clean_test_temp_files, dynamic_skip_gpu_arch_model_dir
):
"""
Test that skip_gpu_arch command-line argument skips GPU architecture check.
Uses a dynamic MODEL_DIR with the current GPU arch added to the skip list
so the test works on any hardware.
UPDATED: Now uses python3 -m madengine.cli.app instead of legacy mad.py
"""
context = generate_additional_context_for_machine()
output = global_data["console"].sh(
"cd "
+ BASE_DIR
+ "; "
+ "MODEL_DIR="
-+ MODEL_DIR
++ dynamic_skip_gpu_arch_model_dir
+ " "
+ f"python3 -m madengine.cli.app run --tags dummy_skip_gpu_arch --live-output --additional-context '{json.dumps(context)}'"
)
if "Skipping model" not in output:
pytest.fail("Enable skipping gpu arch for running model is failed.")
Author

Deferred to upstream-flake8 cleanup PR. The line at #L93 is from 04aac390 (Stephen Shao) and is byte-identical to origin/develop; this PR's commit 50e8895 only touched the surrounding lines (88, 99) for whitespace and a fixture-arg rename, not the assertion target itself. Tracked alongside the rest of the Round-4 / Round-5 upstream-style nits in the follow-up cleanup PR (see #86 description, "Round 4" / "Round 5" status sections).

Comment on lines 418 to 425
if deployment_type == "slurm" and in_slurm_job:
print(f"🔄 Using fresh pull policy for SLURM compute node (prevents cached layer corruption)")
# Remove any existing cached image to force fresh pull
try:
self.console.sh(f"docker rmi -f {registry_image} 2>/dev/null || true")
print(f"✓ Removed cached image layers")
-except:
+except Exception:
pass # It's okay if image doesn't exist
Author

Addressed in afbb666: the bare except: at L424 (introduced by this PR's 6f2e11bf) was replaced with except Exception: so KeyboardInterrupt/SystemExit propagate, and flake8 E722 no longer fires on this line.

Comment on lines 521 to 534
def get_env_arg(self, run_env: typing.Dict) -> str:
"""Get the environment arguments for docker run."""
env_args = ""

# Add custom environment variables
if run_env:
for env_arg in run_env:
if not self._ENV_KEY_RE.match(env_arg):
raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "
+env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "

# Add context environment variables
if "docker_env_vars" in self.context.ctx:
for env_arg in self.context.ctx["docker_env_vars"].keys():
if not self._ENV_KEY_RE.match(env_arg):
raise ValueError(f"Invalid environment variable name: {env_arg!r}")
value = self.context.ctx["docker_env_vars"][env_arg]
-env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "
+env_args += f"--env {env_arg}={shlex.quote(str(value))} "

Author

Addressed in afbb666: get_env_arg now passes each value through shlex.quote(...) instead of single-quote string interpolation. shlex.quote is byte-equivalent for shell-safe values (alphanumeric, _, -, ., :, /, +, =, etc.) and correctly escapes embedded single quotes, whitespace, and other metacharacters.

slurm_multi safety: the self-managed launcher path (_run_self_managed) returns at L1099 before the docker-run command assembly at L1058 that calls get_env_arg, so this change only tightens the regular-Docker run path used by non-slurm_multi workloads. Verified end-to-end with a fresh pyt_large_ep_benchmark_2n slurm_multi smoke.
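A stand-alone sketch of the quoting behaviour described above (the key-validation regex shape is an assumption; the real method also folds in context docker_env_vars):

```python
import re
import shlex

_ENV_KEY_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")  # assumed key shape

def get_env_arg(run_env: dict) -> str:
    """Assemble docker run --env arguments with shlex-quoted values."""
    env_args = ""
    for key, value in (run_env or {}).items():
        if not _ENV_KEY_RE.match(key):
            raise ValueError(f"Invalid environment variable name: {key!r}")
        # shlex.quote is a no-op for shell-safe values and escapes
        # embedded single quotes, whitespace, and metacharacters.
        env_args += f"--env {key}={shlex.quote(str(value))} "
    return env_args

print(get_env_arg({"MODEL": "llama3", "NOTE": "has 'quotes' and spaces"}))
```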

Comment thread src/madengine/cli/commands/run.py Outdated
if timeout < -1:
console.print(
-"❌ [red]Timeout must be -1 (default) or a positive integer[/red]"
+"❌ [red]Timeout must be -1 (default), 0 (no timeout), or a positive integer[/red]"
Author

Addressed in afbb666: validation message updated to match the actual accepted shape — "Timeout must be -1 (default), 0 (no timeout), or a positive integer". The --timeout help string already documents 0 for no timeout (line 67) and the validator is if timeout < -1, so 0 was always accepted; only the error string was misleading. No behaviour change.

Comment on lines 34 to 41
def is_amd_gpu():
"""Check if system has AMD GPU."""
try:
import subprocess
result = subprocess.run(['rocm-smi'], capture_output=True, timeout=5)
return result.returncode == 0
-except:
+except Exception:
return False
Author

Deferred. File is byte-identical to origin/develop in this PR. Tracked alongside the other upstream-style nits in the follow-up cleanup PR.

Comment thread tests/fixtures/utils.py
Comment on lines 96 to 106
# Clean up Docker containers BEFORE test (ensure clean state)
try:
subprocess.run(
"docker ps -a | grep 'container_ci-dummy' | awk '{print $1}' | xargs -r docker rm -f",
shell=True,
capture_output=True,
timeout=30
)
-except:
+except Exception:
pass # Ignore cleanup errors before test

Author

Deferred. File is byte-identical to origin/develop in this PR (no edits from #86). The two bare-except instances in this file are pre-existing; develop's [2.0.1] test-suite cleanup narrowed 5 such cases elsewhere but did not touch these two. Tracked in the upstream-cleanup follow-up PR.

Comment thread tests/fixtures/utils.py
Comment on lines 118 to 127
# Clean up Docker containers AFTER test (avoid conflicts with next test)
try:
subprocess.run(
"docker ps -a | grep 'container_ci-dummy' | awk '{print $1}' | xargs -r docker rm -f",
shell=True,
capture_output=True,
timeout=30
)
-except:
+except Exception:
pass # Ignore cleanup errors after test
Author

Deferred. Same disposition as the L106 thread above — file byte-identical to origin/develop, pre-existing concern, tracked in the upstream-cleanup follow-up PR.

Comment on lines 60 to 66
def convert_directory_csvs_to_html(
directory_path: str,
output_file: str = "run_results.html"
-) -> str:
+) -> Optional[str]:
"""Convert all CSV files in a directory to a single HTML file.

Args:
Author

Deferred. This file is byte-identical to origin/develop in this PR. Tracked alongside the other upstream-style nits in the follow-up cleanup PR.

@raviguptaamd
Author

Status update — ready for review

PR head: 3fc7587. 75 files / +2,877 / -2,179 vs develop. All four Copilot review rounds resolved or replied-to inline.

What's in here

  • slurm_multi deployment launcher for self-managed multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer) with the orchestration / build / run / CLI hooks that drive it from the manifest pipeline.
  • madengine build --build-on-compute --registry <docker.io/<ns>(/<repo>) | ghcr.io/... | localhost:5000/...> flow (build on a compute node + push for parallel pull on run nodes).
  • madengine build --use-image <image> flow (skip build, use a pre-built image).
  • SBATCH wrapper generation, reservation support, existing-allocation detection (salloc), per-job-id completion marker, _collect_slurm_multi_results perf.csv discovery in /shared_inference/<user>/<jobid>/.

Kubernetes deployment work split out into #118 (independent PR against develop) per earlier reviewer ask; that PR will rebase / sequence after this one lands.

Live end-to-end test coverage

All on Distributed_Inference_CI and adjacent free nodes.

| # | Workload | Path | Result |
|---|----------|------|--------|
| 1 | SGLang-disagg Llama-3.1-8B (1Px8/1Dx8, 24 GPUs / 3 nodes) | `--use-image` + slurm_multi | 5,977 tok/s, TTFT 122.8 ms, ITL 2.54 ms |
| 2 | SGLang-disagg DeepSeek-V3 (1Px8/1Dx8, 24 GPUs / 3 nodes) | `--use-image` + slurm_multi | 853 / 1,452 tok/s @ CON=8/16, ISL/OSL=1024/1024 |
| 3 | SGLang-disagg Llama-3.3-70B-FP8 (1Px8/1Dx8, 24 GPUs / 3 nodes) | `--build-on-compute --registry docker.io/rocm/pytorch-private` -> rocm/pytorch-private:pyt_sglang_disagg_llama-3.3-70b-fp8 | 1,878 / 3,644 tok/s @ CON=8/16 |
| 4 | Large EP (16 GPUs / 2 nodes) | `--build-on-compute --registry` -> rocm/pytorch-private:pyt_large_ep_benchmark_2n | 28 perf rows (DeepEP + MoRI-EP dispatch/combine, FP8/BF16 RDMA/NVL) |
| 5 | KV cache transfer (16 GPUs / 2 nodes) | `--build-on-compute --registry` -> rocm/pytorch-private:pyt_kvcache_transfer_bench | 57 perf rows (mooncake / MORI / RIXL, 4KB → 1GB sweep) |
| 6 | Re-smoke post-R4 fixes: pyt_large_ep_benchmark_2n on [008,024] | slurm_multi from manifest | 8 min, 28 ✅ Success rows -- confirms R4 changes didn't regress the run path |

The vLLM-disagg run path was also exercised end-to-end (slurm_multi topology + co-located proxy + port refactor + MoRI-EP gating all working), but the underlying rocm/pytorch-private:20260407_itej89_vllm_mori_docker image hits a NIXL_ERR_BACKEND after weight load. That is an image-side issue, orthogonal to this PR, and is tracked separately; the run-phase plumbing is not gated by it.

Copilot review status

Four rounds total, all addressed:

  • Round 1 (9 inline) -- fixed in d32b06b / 5ec55ff, replied inline.
  • Round 2 (24 inline) -- 9 fixed in 63a2eae, 11 reverted to origin/develop byte-for-byte in 9109f52, 4 doc/example fixes in 0156d33, all replied inline with commit SHAs.
  • Round 3 (13 inline) -- 4 fixed in 2200cbd (profiling README typo, pytest unused-import, test_rocm_path.py revert), 7 stale-snapshot replies (already addressed in earlier rounds), 2 deferred to follow-up (pre-existing in origin/develop).
  • Round 4 (24 inline) -- 8 fixed in 3fc7587 (5 P0 + 3 hardening, see commit shape entry 12 in PR description), 14 are pre-existing flake8 / style nits in files byte-identical to origin/develop (utils/{ops,log_formatting}.py, scripts/common/{tools,pre_scripts}/**, tests/fixtures/utils.py, tests/integration/, core/errors.py, database/mongodb.py, deployment/base.py:645) -- not introduced by this PR; deferred to a separate cleanup PR.

The "P0" subset of Round 4 included a real latent crash I'd introduced in the Cat B revert (run_orchestrator.py still passing Context(rocm_path=...) after core/context.py was reverted to upstream's signature) -- that's now fixed and verified by the live re-smoke above.

Bugs surfaced and fixed during real-cluster testing

  • --registry parsing: shorthand rocm/pytorch-private was hitting docker login rocm (NXDOMAIN). Now auto-prefixed to docker.io/rocm/pytorch-private with a transparent notice; malformed input rejected up front with usage hints (98f1ac5).
  • --build-on-compute local image name: dockerfile path with no .ubuntu.amd.Dockerfile suffix (e.g., scripts/.../Dockerfile) yielded ci-..._Dockerfile -- Docker rejects uppercase. .lower() the assembled name (ae1996c).
  • _load_and_merge_manifest was still doing shallow dict.update() despite the Round-2 "deep merge" fix -- promoted to a recursive module-level helper used by both code paths (63a2eae).
  • manifest.context.docker_env_vars was being dropped on --manifest-file runs, losing build-time secrets / MAD_SECRETS_* (63a2eae).
  • Completion marker was not job-id namespaced -- a second madengine run of the same model tag returned success in ~30 s while the new sbatch was still running (710b51a).
  • ErrorContext was being constructed with 5 invalid kwargs across build_orchestrator.py, masking the real error (5ec55ff).
  • _collect_slurm_multi_results was only looking at <cwd>/perf.csv, missing the canonical /shared_inference/<user>/<jobid>/perf.csv location the inner scripts emit (5ec55ff).

Sequencing

  1. Updates to slurm launcher #86 (this PR) → develop.
  2. Kubernetes deployment (split from PR #86) #118 (Kubernetes) rebases onto develop and merges.
  3. After both land, follow-up cherry-picks the combined work into main per repo policy.

What I'd appreciate eyes on

  • The slurm_multi launcher core (deployment/slurm.py, slurm_node_selector.py, the slurm_multi wrapper-script generation).
  • The --build-on-compute / --use-image orchestration in build_orchestrator.py (registry normalization, lowercase tagging, prebuilt-image multi-model handling, per-registry auth hints).
  • run_orchestrator._deep_merge semantics for nested slurm.node_selector / distributed.<sub> overrides.
  • container_runner._run_self_managed (the new self-managed launcher path used by slurm_multi).
  • The models.json slurm-side example configs renamed to slurm-multi-*.json and the launchers / deployment / cli-reference docs.
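The _run_self_managed argument hardening called out above can be sketched as (the function shape is an assumption; the real code then runs the command with shell=True):

```python
import shlex

def build_launcher_cmd(script_path: str, model_args: str) -> str:
    """Tokenize user-supplied model_args and re-quote each token, so
    embedded $(), backticks, and ; reach the script as literal
    arguments instead of being interpreted by the host shell."""
    quoted = " ".join(shlex.quote(a) for a in shlex.split(model_args))
    return f"bash {shlex.quote(script_path)} {quoted}"

# '$(whoami)' survives as a literal argument, not a host-shell expansion:
print(build_launcher_cmd("run.sh", "--seq 1024 '$(whoami)'"))
```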

Reviews welcome. cc @gargrahul @Cemberk @Rohan138 @coketaste @leconcio

raviguptaamd and others added 2 commits May 6, 2026 22:52
Three P1 items in PR-introduced code, plus four small cleanups Copilot
flagged on the same lines.

container_runner.py
  * get_env_arg: replace single-quote string interpolation
    (`--env KEY='value'`) with shlex.quote, so values containing single
    quotes / whitespace / shell metacharacters no longer break the
    assembled `docker run` command. shlex.quote is byte-equivalent
    for safe alphanumeric values (returns the input unchanged) and
    only adds escaping when needed. Not on the slurm_multi path
    (self-managed launcher short-circuits at L1098 before the docker
    docker_options assembly that calls get_env_arg).
  * Revert the AMD GPU SMI block (~L1142) to origin/develop's bare-name
    form (`amd-smi 2>/dev/null || rocm-smi 2>/dev/null || echo ...`).
    The PR-introduced version interpolated the *host* `context.ctx['rocm_path']`
    (or `get_rocm_path()`) into absolute `/opt/rocm-X.Y/bin/*` paths and
    executed them *inside* the container, which is incorrect when the
    container's ROCm install location differs from the host's. Bare
    names rely on the container's PATH, which is the upstream behaviour.
  * Replace bare `except:` at L424 (PR-introduced) with `except Exception:`
    so KeyboardInterrupt / SystemExit propagate (flake8 E722).
  * Fix a stale function-name reference in the perf-pattern comment at
    L1394: deployment/base.py's function is `_parse_performance_from_log`,
    not `_parse_node_log_for_perf`.
  * Hoist `import shlex` to the module-level import block (was a local
    import inside `_run_self_managed`) and drop the now-unused
    `from madengine.core.constants import get_rocm_path`.

cli/commands/run.py
  * Update timeout validation message to match the actual accepted
    shape ("-1 (default), 0 (no timeout), or a positive integer").
    The --timeout help string already documents `0 for no timeout`,
    and the validator uses `if timeout < -1`, so 0 was always
    accepted; only the error string was misleading.

pyproject.toml
  * Drop `pytest` from [project].dependencies (it stays in
    [project.optional-dependencies].dev and .all). Keeps test
    dependencies out of `pip install madengine` runtime installs.

tests/integration/test_docker_integration.py
  * Update 5 build-arg assertions to match `DockerBuilder.get_build_arg()`'s
    shlex.quote() output for shell-safe values
    (`--build-arg ARG1=value1`, no surrounding quotes).
    Verified: `pytest -k build_arg` -> 4 passed.

slurm_multi safety: get_env_arg is reachable only from the regular
docker run path (L1058); the slurm_multi self-managed launcher
returns at L1099 before that. The SMI revert and the bare-`except`
fix are both in code paths used only outside slurm_multi (image
pull / GPU pre-check). The timeout, pyproject and comment-typo
changes have no runtime effect.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	CHANGELOG.md
Copilot AI review requested due to automatic review settings May 6, 2026 22:53
Copilot AI left a comment

Pull request overview

Copilot reviewed 73 out of 74 changed files in this pull request and generated 14 comments.

Comments suppressed due to low confidence (2)

src/madengine/deployment/base.py:236

  • _monitor_until_complete() never treats CANCELLED (or any non-SUCCESS/FAILED terminal status) as terminal. Since DeploymentStatus still includes CANCELLED, this loop can spin forever after a cancellation. Consider returning on CANCELLED as well (and/or treating any non-PENDING/RUNNING status as terminal).
        self.console.print("[blue]Monitoring deployment...[/blue]")

        while True:
            status = self.monitor(deployment_id)

            if status.status in [DeploymentStatus.SUCCESS, DeploymentStatus.FAILED]:
                return status

            # Still running, wait and check again
            self.console.print(
                f"  Status: {status.status.value} - {status.message}"
            )
            time.sleep(30)  # Check every 30 seconds
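One shape of the fix that comment suggests, under the assumption that DeploymentStatus is a simple enum:

```python
import enum

class DeploymentStatus(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    CANCELLED = "cancelled"

def is_terminal(status: DeploymentStatus) -> bool:
    # Anything that is neither pending nor running is terminal, so a
    # CANCELLED deployment can no longer spin the monitor loop forever.
    return status not in (DeploymentStatus.PENDING, DeploymentStatus.RUNNING)
```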

src/madengine/execution/container_runner.py:251

  • create_run_details_dict() indexes self.context.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_ARCHITECTURE"] directly. If docker_env_vars or the key is missing (e.g., older manifests / non-GPU runs / context init differences), this will raise KeyError and prevent perf entry generation. Use .get() with a safe fallback (or guard when the nested dict/key is absent).

Comment on lines +1012 to +1019
# Merge env_vars from model_info (models.json) into docker_env_vars
if "env_vars" in model_info and model_info["env_vars"]:
model_env_count = 0
for key, value in model_info["env_vars"].items():
self.context.ctx["docker_env_vars"][key] = str(value)
model_env_count += 1
if model_env_count > 0:
print(f"ℹ️ Merged {model_env_count} environment variables from model_info (models.json)")
Author

Addressed in bcafdbb: restored the upstream finalize_container_rocm_path() call block (originally introduced on develop in #110, removed by this PR's 6f2e11bf when the model_info env_vars merge block replaced it). The two address different concerns and must coexist — the env_vars merge copies model-card values into docker_env_vars, while finalize normalizes in-container ROCM_PATH via OCI inspect → in-image probe → /opt/rocm default. Block restored verbatim from origin/develop (including the user_supplied_rocm_path short-circuit), placed immediately after the env_vars merge.

slurm_multi safety: the self-managed launcher returns at L1099 before this code path, so this restoration only affects the regular-Docker run path used by non-slurm_multi workloads.

Comment on lines +1672 to +1675
model_docker.sh(f"cp {model_dir}/*_profiler_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {model_dir}/*_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {model_dir}/*_trace.csv . 2>/dev/null || true")
model_docker.sh(f"cp {model_dir}/library_trace.csv . 2>/dev/null || true")
Author

Addressed in bcafdbb: model_dir is now passed through shlex.quote once and reused across the four cp ... 2>/dev/null || true calls. The trailing *.csv glob is left unquoted intentionally so the in-container shell still expands it inside the (quoted) directory:

_md_q = shlex.quote(model_dir)
model_docker.sh(f"cp {_md_q}/*_profiler_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {_md_q}/*_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {_md_q}/*_trace.csv . 2>/dev/null || true")
model_docker.sh(f"cp {_md_q}/library_trace.csv . 2>/dev/null || true")

Comment on lines +1682 to 1686
mult_res = model_info.get("multiple_results")
if mult_res:
try:
-model_docker.sh(f"cp {model_dir}/{mult_res} . 2>/dev/null || true")
+model_docker.sh(
+    _cp_model_dir_file_to_cwd_cmd(model_dir, mult_res)
+)
except Exception:
Author

Addressed in bcafdbb: both path components are now quoted, so a model-metadata multiple_results value containing whitespace or shell metacharacters can no longer break the command or be re-interpreted by the shell:

model_docker.sh(
    f"cp {shlex.quote(model_dir)}/{shlex.quote(mult_res)} . "
    "2>/dev/null || true"
)
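To see why quoting both components matters, here is a standalone sketch (the paths and the hostile `multiple_results` value are made up): with a metacharacter in the metadata value, the unquoted form splits into two shell commands, while the quoted form keeps the whole value as one literal `cp` argument.

```python
import shlex

model_dir = "/tmp/model"
mult_res = "res.csv; rm -rf ~"   # hostile metadata value (illustrative)

unsafe = f"cp {model_dir}/{mult_res} . 2>/dev/null || true"
safe = f"cp {shlex.quote(model_dir)}/{shlex.quote(mult_res)} . 2>/dev/null || true"
print(unsafe)  # the `;` terminates cp and the rm would actually run
print(safe)    # the metadata value travels as a single quoted argument
```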

"""

import json
import os

Addressed in bcafdbb: dropped the unused import os (flake8 F401). The module hasn't referenced os.* since the orchestration helpers it used were moved out.

Comment on lines 82 to 93
context = generate_additional_context_for_machine()
output = global_data["console"].sh(
    "cd "
    + BASE_DIR
    + "; "
    + "MODEL_DIR="
    + dynamic_skip_gpu_arch_model_dir
    + MODEL_DIR
    + " "
    + f"python3 -m madengine.cli.app run --tags dummy_skip_gpu_arch --live-output --additional-context '{json.dumps(context)}'"
)
if "Skipping model" not in output:
    pytest.fail("Enable skipping gpu arch for running model is failed.")

Deferred. The line at #L93 is from 04aac390 (Stephen Shao) and is byte-identical to the version on origin/develop in this PR. develop's [2.0.1] already lands the hardware-agnostic fix (changelog entry "E2E tests — hardware-agnostic GPU arch skip"): test_commandline_argument_skip_gpu_arch now detects the runner's GPU arch at runtime and injects it into skip_gpu_arch, using a new get_gpu_arch() helper in tests/fixtures/utils.py. Once #86 lands and CI re-runs against develop, that fix is what executes; no #86-side change is needed. Tracked alongside the other origin/develop-identical concerns in the upstream-flake8 follow-up cleanup PR (per the #86 description, Round 4 / Round 5 / Round 6 status sections).

Comment on lines 36 to 41
try:
    import subprocess
    result = subprocess.run(['rocm-smi'], capture_output=True, timeout=5)
    return result.returncode == 0
except:  # suggested cleanup: except Exception:
    return False

Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L41 is from 04aac390). Tracked alongside the other upstream bare-except: cleanup in the follow-up PR.
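For context on why the deferred bare-`except:` cleanup is worth tracking at all, here is a standalone illustration (not the upstream code): a bare `except:` catches `BaseException`, so it also swallows `SystemExit` and `KeyboardInterrupt`, turning an abort request into a "no GPU" answer.

```python
def probe_bare(fn):
    try:
        return fn()
    except:                 # bare: catches BaseException, incl. SystemExit
        return False

def probe_narrow(fn):
    try:
        return fn()
    except Exception:       # lets SystemExit / KeyboardInterrupt propagate
        return False

def request_exit():
    raise SystemExit(1)

print(probe_bare(request_exit))    # False: the exit request is silently eaten
try:
    probe_narrow(request_exit)
except SystemExit:
    print("SystemExit propagated")
```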

Comment on lines 30 to 42
try:
    self.rocmsmi = initRsmiBindings(silent=False)
    if rocm_smi.driverInitialized() is True:
        ret_init = self.rocmsmi.rsmi_init(0)
        if ret_init != 0:
            raise ValueError('ROCm SMI returned %s (the expected value is 0)', ret_init)
            exit(ret_init)
    else:
        raise ImportError('Driver not initialized (amdgpu not found in modules)')
        exit(0)
    self.rocm6 = True
except:  # suggested cleanup: except Exception:
    rocm_smi.initializeRsmi()

Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L42 is from da0d524f (Rahul Garg) and predates #86). Tracked in the upstream-cleanup follow-up PR.

Comment on lines 147 to 157
try:
    power_info = amdsmi_interface.amdsmi_get_power_info(processor_handle)
    if isinstance(power_info, dict):
        # If both current and average power are 0, it's likely a secondary die
        current_power = power_info.get('current_socket_power', -1)
        avg_power = power_info.get('average_socket_power', -1)
        if current_power == 0 and avg_power == 0:
            return True
except:  # suggested cleanup: except Exception:
    # If we can't get power info, might be secondary die
    return True

Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L157 is from 0f2c3010 (Stephen Shao) and predates #86). Tracked in the upstream-cleanup follow-up PR.
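The secondary-die heuristic in the snippet above can be restated as a small pure function (a sketch only; the `amdsmi_get_power_info` result is stubbed out as a plain dict):

```python
def looks_like_secondary_die(power_info) -> bool:
    # Same heuristic as the snippet above: a device whose current and
    # average socket power both read 0 is likely a secondary die that
    # reports no power of its own.
    if not isinstance(power_info, dict):
        return False
    current = power_info.get('current_socket_power', -1)
    avg = power_info.get('average_socket_power', -1)
    return current == 0 and avg == 0

print(looks_like_secondary_die(
    {'current_socket_power': 0, 'average_socket_power': 0}))    # True
print(looks_like_secondary_die(
    {'current_socket_power': 85, 'average_socket_power': 80}))  # False
```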

Comment on lines +14 to 17
from unittest.mock import MagicMock, patch, Mock
import pytest
import pandas as pd


Addressed in bcafdbb: dropped the unused import pandas as pd (flake8 F401). Confirmed there are no pd.* references in the file — the test fixtures use dict / MagicMock only, so the import was dead. The line was added by this PR's 50e8895.

Comment thread on tests/unit/test_reporting_superset.py (Outdated)
class TestConfigParser:
"""Test cases for ConfigParser functionality."""


Addressed in bcafdbb: trailing whitespace stripped on every blank-with-spaces line (107 occurrences in this file, including the one at L31). Pure whitespace cleanup, no functional change; flake8 W293 is now clean for this file.

Round 6 fired automatically on the develop-merge push (8fec192). 5 of
the 14 comments are PR-introduced; the remaining 9 are byte-identical
to origin/develop and stay on the upstream-flake8 cleanup follow-up
list with the rest of R4 / R5 deferrals.

container_runner.py
  * Restore the upstream `finalize_container_rocm_path()` call
    block (originally introduced on develop in PR ROCm#110, removed by
    this PR's 6f2e11b when the model_info `env_vars` merge block
    replaced it). The two blocks address different concerns and
    must coexist:
      - merge model-card env_vars into docker_env_vars (this PR), and
      - normalize in-container ROCM_PATH via OCI inspect → in-image
        probe → /opt/rocm default (upstream).
    Without the finalization, models on TheRock or non-`/opt/rocm`
    container layouts run with a wrong/empty ROCM_PATH, breaking
    rocm-smi / amd-smi / hipcc inside the container.
    The slurm_multi self-managed launcher returns at L1099 before
    this code path, so this restoration only affects the regular
    docker run path.

  * Quote `model_dir` (and `mult_res`) with shlex.quote on the
    profiler-/trace-/multi-results `cp` commands run via
    `model_docker.sh(... shell=True)` (lines 1672-1685 area). The
    `*.csv` glob is left unquoted so the in-container shell still
    expands it inside the (quoted) directory. `mult_res` comes
    straight from model metadata and was previously concatenated
    raw into the shell command.

deployment/config_loader.py
  * Drop unused `import os` (flake8 F401). The module hasn't
    referenced `os.*` since the orchestration helpers it used were
    moved out.

tests/unit/test_database_mongodb.py
  * Drop unused `import pandas as pd` (flake8 F401). The test
    file uses dict / Mock fixtures and never reaches into pandas.

tests/unit/test_reporting_superset.py
  * Strip trailing whitespace (flake8 W293) on every blank-with-
    spaces line (107 occurrences). Pure whitespace; no functional
    change.

slurm_multi safety: all five fixes either restore upstream
behaviour on the regular-Docker run path, are flake8-only, or are
in test files. None of them change anything reached by the
slurm_multi self-managed launcher path; live re-smoke is
unnecessary for this commit.

Co-authored-by: Cursor <cursoragent@cursor.com>
raviguptaamd added a commit to raviguptaamd/madengine that referenced this pull request May 10, 2026
…lti bash branch

The earlier feat(deployment) commit ported `_run_inside_existing_allocation`
verbatim from PR ROCm#86 (used by the slurm_multi bash-in-salloc branch).
That method constructs `DeploymentResult(..., skip_monitoring=True)` to
signal that the script ran synchronously and the monitor phase should
be skipped. Develop's `DeploymentResult` dataclass did not have that
field, so the construction raised `TypeError` at runtime, the failure
was swallowed by the orchestrator, and `madengine run` exited 0 even
when the wrapper script failed.

Two minimal additive edits to deployment/base.py:

- DeploymentResult: add `skip_monitoring: bool = False` field.
- BaseDeployment.execute(): change the monitor guard from
  `if self.config.monitor:` to
  `if self.config.monitor and not result.skip_monitoring:`
  so the bash branch can correctly bypass the SLURM job poll
  (there is no SLURM job to poll — the script ran inline).

Behavior for non-slurm_multi launchers is unchanged: they never
construct DeploymentResult with skip_monitoring=True, the field
defaults to False, and the monitor() call still fires as before.

Discovered during R5.1b cluster smoke test of
`pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186.

Co-authored-by: Cursor <cursoragent@cursor.com>
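The two additive edits described in the commit message above can be sketched as follows. Everything here is heavily simplified for illustration: the `Config` type and the `BashInSallocDeployment` class are invented for the demo and are not madengine's real classes.

```python
from dataclasses import dataclass

@dataclass
class Config:
    monitor: bool = True

@dataclass
class DeploymentResult:
    success: bool = True
    skip_monitoring: bool = False   # new field; defaults keep other launchers unchanged

class BashInSallocDeployment:
    """Stands in for the slurm_multi bash branch: the wrapper script ran
    synchronously inside the allocation, so there is no SLURM job to poll."""
    def __init__(self, config):
        self.config = config
        self.monitored = False

    def deploy(self):
        return DeploymentResult(success=True, skip_monitoring=True)

    def monitor(self, result):
        self.monitored = True       # would poll the SLURM job here

    def execute(self):
        result = self.deploy()
        # updated guard: honor both the config flag and skip_monitoring
        if self.config.monitor and not result.skip_monitoring:
            self.monitor(result)
        return result
```

With the old guard (`if self.config.monitor:` alone), `execute()` would have polled for a job that never existed; with the field missing entirely, constructing `DeploymentResult(..., skip_monitoring=True)` raises the `TypeError` the commit describes.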