4 changes: 1 addition & 3 deletions .gitignore
@@ -140,8 +140,6 @@ k8s_manifests/
k8s_results/
rocprof_output/
rpd_output/
rocm_trace_lite_output/
slurm_results/
MagicMock/
.madengine_session_start
run_directory/
.madengine_session_start
2 changes: 1 addition & 1 deletion docs/README.md
@@ -41,7 +41,7 @@ The architecture diagram (Orchestration, Infrastructure, and Launcher layers) is
2. **Model Discovery** - Find and validate models from MAD package
3. **Orchestration** - BuildOrchestrator & RunOrchestrator manage workflows
4. **Execution Targets** - Local Docker, Kubernetes Jobs, or SLURM Jobs
5. **Distributed Launchers** - Training (torchrun, DeepSpeed, Megatron-LM, TorchTitan, Primus) and Inference (vLLM, SGLang)
5. **Distributed Launchers** - Training (torchrun, DeepSpeed, TorchTitan, Megatron-LM) and Inference (vLLM, SGLang)
6. **Performance Output** - CSV/JSON results with metrics
7. **Post-Processing** - Report generation (HTML/Email) and database upload (MongoDB)

192 changes: 183 additions & 9 deletions docs/cli-reference.md
@@ -98,6 +98,8 @@ madengine build [OPTIONS]
| `--target-archs` | `-a` | TEXT | `[]` | Target GPU architectures (e.g., gfx908,gfx90a,gfx942) |
| `--registry` | `-r` | TEXT | `None` | Docker registry to push images to |
| `--batch-manifest` | | TEXT | `None` | Input batch.json file for batch build mode |
| `--use-image` | | TEXT | `None` | Skip Docker build, use pre-built image. Omit value to auto-detect from model's `DOCKER_IMAGE_NAME` |
| `--build-on-compute` | | FLAG | `False` | Build Docker images on SLURM compute node instead of login node |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
| `--additional-context-file` | `-f` | TEXT | `None` | File containing additional context JSON |
| `--clean-docker-cache` | | FLAG | `False` | Rebuild images without using cache |
@@ -142,6 +144,16 @@ madengine build --tags model \

# Real-time output with verbose logging
madengine build --tags model --live-output --verbose

# Use pre-built image (skip Docker build)
madengine build --tags sglang_disagg \
--use-image lmsysorg/sglang:v0.5.5.post3-rocm700-mi30x \
--additional-context-file slurm-config.json

# Build on SLURM compute node instead of login node
madengine build --tags model \
--build-on-compute \
--additional-context-file slurm-config.json
```

**Default Values:**
@@ -193,6 +205,172 @@ When using `--batch-manifest`, provide a JSON file with selective build configur

See [Batch Build Guide](batch-build.md) for details.

**Pre-built Image Mode (`--use-image`):**

Skip Docker build and use an existing image from a registry or local Docker cache:

```bash
# Auto-detect image from model card's DOCKER_IMAGE_NAME env var
madengine build --tags sglang_disagg \
--use-image \
--additional-context-file config.json

# Explicitly specify image from Docker Hub
madengine build --tags sglang_disagg \
--use-image lmsysorg/sglang:v0.5.5.post3-rocm700-mi30x \
--additional-context-file config.json

# Use image from NGC
madengine build --tags model \
--use-image nvcr.io/nvidia/pytorch:24.01-py3

# Use locally cached image
madengine build --tags model \
--use-image my-local-image:latest
```

**Image Resolution Priority:**
1. If `--use-image <name>` is specified, use that image
2. If `--use-image` (no value), auto-detect from model card's `DOCKER_IMAGE_NAME` env var
3. If no image is found in the model card, the build fails with an error and suggestions for specifying one
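The priority above can be sketched as a small shell function (`resolve_image` is a hypothetical name for illustration, not part of madengine):

```shell
# Hypothetical sketch of the --use-image resolution order (not madengine's actual code).
# $1: value passed to --use-image, or "" if the flag was given without a value
# $2: DOCKER_IMAGE_NAME from the model card, or "" if absent
resolve_image() {
  if [ -n "$1" ]; then
    echo "$1"                                 # 1. explicit --use-image <name>
  elif [ -n "$2" ]; then
    echo "$2"                                 # 2. auto-detect from model card
  else
    echo "no image found in model card" >&2   # 3. error with suggestions
    return 1
  fi
}
```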

**Multiple Models Warning:**
When using auto-detection with multiple models that have different `DOCKER_IMAGE_NAME` values, the first model's image is used and a warning is printed.

**Mutual Exclusivity:**
- `--use-image` cannot be used with `--registry` (push requires local build)
- `--use-image` cannot be used with `--build-on-compute` (skip build vs. build on compute)

**When to use `--use-image`:**
- Using official framework images (SGLang, vLLM, etc.)
- Image is pre-cached on compute nodes
- Testing without rebuilding
- CI/CD pipelines with external images

The generated manifest marks the image with `"prebuilt": true` and `"build_time": 0`.
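As a sketch, the relevant manifest fields might look like this (only `prebuilt` and `build_time` are confirmed above; the other field names are illustrative, so check a generated `build_manifest.json` for the exact schema):

```json
{
  "image": "lmsysorg/sglang:v0.5.5.post3-rocm700-mi30x",
  "prebuilt": true,
  "build_time": 0
}
```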

**Relationship with `MAD_CONTAINER_IMAGE`:**

`--use-image` and `MAD_CONTAINER_IMAGE` both allow using pre-built images, but operate at different phases:

| | `--use-image` | `MAD_CONTAINER_IMAGE` |
|---|---|---|
| Phase | Build (`madengine build`) | Run (`madengine run`) |
| Output | Generates a manifest file | Creates a synthetic manifest at runtime |
| Workflow | Two-step: `build` then `run` | Single-step: `run` only |
| Use case | slurm_multi, CI pipelines, reproducible manifests | Quick local testing, ad-hoc runs |

They are complementary. Use `--use-image` when you want a persistent manifest that can be shared or re-run. Use `MAD_CONTAINER_IMAGE` when you want a quick single-command run without generating a manifest.

**Build on Compute Node (`--build-on-compute`):**

Build Docker images on a SLURM compute node, push them to a registry, and pull them in parallel during the run phase:

```bash
# Build on compute node and push to registry (--registry REQUIRED)
madengine build --tags model \
--build-on-compute \
--registry docker.io/myorg \
--additional-context-file slurm-config.json
```

**Required:** `--registry` must be specified with `--build-on-compute`.

**SLURM Config Priority:**
1. Model card's `slurm` section (base configuration)
2. `--additional-context` overrides (command line takes precedence)

If the model card already has `slurm` config, you only need to provide missing or override values:

```bash
# Model card has partition/time, just override reservation
madengine build --tags model \
--build-on-compute \
--registry docker.io/myorg \
--additional-context '{"slurm": {"reservation": "my-res"}}'
```
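For reference, a minimal `slurm-config.json` for `--additional-context-file` might look like the following (the `slurm` keys shown are the ones discussed here; all values are placeholders for your cluster):

```json
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "slurm": {
    "partition": "gpu",
    "time": "01:00:00",
    "reservation": "my-res"
  }
}
```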

**When to use `--build-on-compute`:**
- Login node has limited disk space or resources
- Build requires GPU access (e.g., AOT compilation)
- Login node policies prohibit heavy workloads
- Distributing images to many compute nodes (build once, pull everywhere)

**How it works:**

*Build Phase:*
1. Discovers model and merges SLURM config (model card + additional-context)
2. Submits build job to **1 compute node** via `sbatch --wait`
3. Builds Docker image on that node
4. Pushes image to registry
5. Generates manifest with registry image name

*Run Phase:*
1. Detects `built_on_compute: true` in manifest
2. Pulls image **in parallel on ALL nodes** via `srun docker pull`
3. Executes model script

**Inside existing SLURM allocation:**

If you're already inside an `salloc` allocation, `--build-on-compute` uses `srun` directly instead of submitting a new job.

**Error Messages:**

If required SLURM fields are missing, specific errors are shown:
- Missing `partition`: "Add partition to model card's slurm section or via --additional-context"

---

**Multi-Node SLURM Launcher (`slurm_multi`):**

Models using the `slurm_multi` launcher (for multi-node distributed inference) **require** either `--registry` or `--use-image`:

```bash
# Option 1: Build and push to registry
madengine build --tags sglang_model \
--registry docker.io/myorg \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Option 2: Use pre-built image from registry
madengine build --tags sglang_model \
--use-image docker.io/myorg/sglang:latest

# Option 3: Build on compute and push
madengine build --tags sglang_model \
--build-on-compute \
--registry docker.io/myorg \
--additional-context-file config.json
```

**Why this requirement?**

Multi-node SLURM jobs run on multiple compute nodes. Each node needs access to the Docker image:
- Local builds only exist on the login/build node
- Compute nodes cannot access locally built images
- Registry images enable parallel `docker pull` on all nodes

**Parallel Image Pull:**

During `madengine run`, images from a registry are automatically pulled in parallel on all allocated nodes:

```bash
srun --nodes=$SLURM_NNODES --ntasks=$SLURM_NNODES docker pull <image>
```

This ensures fast, consistent image availability across the cluster.

**Re-using Images:**

For subsequent runs with the same image, use `--use-image` to skip building:

```bash
# First run: build and push
madengine build --tags model --registry docker.io/myorg

# Subsequent runs: use pre-built image
madengine build --tags model --use-image docker.io/myorg/model:latest
```

---

### `run` - Execute Models
@@ -211,6 +389,7 @@ madengine run [OPTIONS]
|--------|-------|------|---------|-------------|
| `--tags` | `-t` | TEXT | `[]` | Model tags to run (can specify multiple) |
| `--manifest-file` | `-m` | TEXT | `""` | Build manifest file path (for pre-built images) |
| `--rocm-path` | | TEXT | `None` | ROCm installation root (default: `ROCM_PATH` env or `/opt/rocm`). Use when ROCm is not in `/opt/rocm` (e.g. TheRock, pip). |
| `--registry` | `-r` | TEXT | `None` | Docker registry URL |
| `--timeout` | | INT | `-1` | Timeout in seconds (-1=default 7200s, 0=no timeout) |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
@@ -239,13 +418,9 @@ madengine run [OPTIONS]
madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Custom host ROCm path (when ROCm is not in /opt/rocm, e.g. TheRock or pip install)
madengine run --tags dummy \
--additional-context '{"MAD_ROCM_PATH": "/path/to/rocm", "gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Custom in-container ROCm path (independent from host)
madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU", "docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}'
# Custom ROCm path (when ROCm is not in /opt/rocm, e.g. TheRock or pip install)
madengine run --tags dummy --rocm-path /path/to/rocm \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Run with pre-built images (manifest-based)
madengine run --manifest-file build_manifest.json
@@ -620,8 +795,7 @@ madengine recognizes these environment variables:
| Variable | Description | Default |
|----------|-------------|---------|
| `MODEL_DIR` | Path to MAD package directory | Auto-detected |
| `ROCM_PATH` | **Host** ROCm installation root (fallback when `MAD_ROCM_PATH` is not set in additional context and auto-detect is disabled or finds nothing). In-container `ROCM_PATH` for Docker is not taken from this variable; set `docker_env_vars.MAD_ROCM_PATH` in additional context instead. | `/opt/rocm` |
| `MAD_AUTO_ROCM_PATH` | Set to `0` to disable **host** auto-detect (`ROCM_PATH` then `/opt/rocm` on the host). | (default: scan on) |
| `ROCM_PATH` | ROCm installation root (used when `--rocm-path` not set) | `/opt/rocm` |
| `MAD_VERBOSE_CONFIG` | Enable verbose configuration logging | `false` |
| `MAD_DOCKERHUB_USER` | Docker Hub username | None |
| `MAD_DOCKERHUB_PASSWORD` | Docker Hub password/token | None |
18 changes: 4 additions & 14 deletions docs/configuration.md
@@ -131,22 +131,12 @@ Disabling the scan does **not** change performance metric extraction from the lo

### ROCm path (run only)

**Host** (where `madengine` runs validation): by default, the ROCm root is **auto-detected** (traditional `/opt/rocm`, [TheRock](https://github.com/ROCm/TheRock) `rocm-sdk` / manifest layout, or `ROCM_PATH`-like env hints). Set `MAD_AUTO_ROCM_PATH=0` to skip auto and use only legacy resolution (`ROCM_PATH` then `/opt/rocm`).
When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so that GPU detection and the container environment use the correct paths. Use the **run** command option or an environment variable (not JSON context):

**Overrides** (recommended for CI):
- **CLI:** `madengine run --rocm-path /path/to/rocm ...`
- **Environment:** `export ROCM_PATH=/path/to/rocm`

- **Additional context (host):** top-level `"MAD_ROCM_PATH": "/path/to/host/rocm"` — controls where madengine looks for host GPU tools (`rocminfo`, `amd-smi`, etc.).
- **Additional context (container):** `"docker_env_vars": { "MAD_ROCM_PATH": "/path/inside/image" }` — sets the in-container `ROCM_PATH` for Docker runs. If omitted, at `run` time madengine uses the image OCI `Env` (`ROCM_PATH` / `ROCM_HOME`) if present, then an in-container probe, then defaults to `/opt/rocm`. The host-resolved path is **not** mirrored into the container.

These two keys are independent, allowing host and container to use different ROCm installations without confusion.

Precedence (host): top-level `MAD_ROCM_PATH` → auto-detect (unless disabled) → `ROCM_PATH` → `/opt/rocm`.

Precedence (container, **local Docker `run`**, **AMD**): `docker_env_vars.MAD_ROCM_PATH` (maps to `ROCM_PATH` for the workload) or explicit `ROCM_PATH` in `docker_env_vars` → image OCI `Env` (`ROCM_PATH` / `ROCM_HOME`) → in-image probe → default `/opt/rocm` with a warning. Implemented in `ContainerRunner.run_container` after the run image is resolved.

This applies to the run phase; build uses build-only context (no GPU detection) but still honors `MAD_ROCM_PATH` in context when set.

At the start of each container run, a **Run Phase Environment** table is printed showing host vs container installation type (`apt install` or `therock`), ROCm/CUDA root, and version side-by-side. See [Run phase environment table](usage.md#run-phase-environment-table).
Resolution order: `--rocm-path` → `ROCM_PATH` → `/opt/rocm`. This applies only to the run phase; build does not perform GPU detection.
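The resolution order can be sketched as a shell function (illustrative only; `resolve_rocm_root` is a hypothetical name, not madengine code):

```shell
# Hypothetical sketch of the ROCm root resolution order.
# $1: value of --rocm-path, or "" if the flag was not given
resolve_rocm_root() {
  if [ -n "$1" ]; then
    echo "$1"                 # 1. --rocm-path flag
  elif [ -n "${ROCM_PATH:-}" ]; then
    echo "$ROCM_PATH"         # 2. ROCM_PATH environment variable
  else
    echo /opt/rocm            # 3. default
  fi
}
```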

## Build Configuration
