24 commits
5cb4e27
docs: add design spec for config-driven YAML system
coketaste May 2, 2026
d1b5867
docs: add implementation plan for config-driven YAML system
coketaste May 2, 2026
e80c744
feat(config): add hydra-core and omegaconf dependencies
coketaste May 3, 2026
fb9bd60
feat(config): add root config.yaml and default config groups
coketaste May 3, 2026
491fe9b
feat(config): add append-only config groups (profile, env, tools, dat…
coketaste May 3, 2026
f4ea647
chore: exclude configs/build/ from gitignore build/ pattern
coketaste May 3, 2026
23ea925
feat(config): implement HydraConfigLoader with Compose API
coketaste May 3, 2026
28e36e2
feat(config): implement ConfigTranslator key mapping
coketaste May 3, 2026
456e887
feat(config): implement ConfigValidator with cross-field checks
coketaste May 3, 2026
cb3f781
test(config): add integration tests for load_config pipeline
coketaste May 3, 2026
2527146
feat(config): integrate --config into run command
coketaste May 3, 2026
705b8f3
feat(config): integrate --config into build command
coketaste May 3, 2026
bca5dcf
refactor(config): extract deep_merge to shared utility
coketaste May 3, 2026
6dd1427
style: apply formatting fixes from pre-commit hooks
coketaste May 3, 2026
0d26093
fix(config): resolve flake8 and mypy lint issues in new config files
coketaste May 3, 2026
d859d95
feat(config): integrate --config into run command with mutual exclusion
coketaste May 3, 2026
0d86518
fix(config): add mutual exclusion note to --config help text in run c…
coketaste May 3, 2026
9addca5
feat(config): integrate --config into build command with mutual exclu…
coketaste May 3, 2026
1d110cf
chore: apply develop branch refactor to feature branch
coketaste May 3, 2026
e9c4a3d
feat(config): add --config to refactored run and build commands with …
coketaste May 3, 2026
dd2f8cd
fix(config): exclude empty lists/dicts from translated context
coketaste May 3, 2026
c9bf997
docs: add --config YAML configuration to README and docs
coketaste May 3, 2026
f5b3fb8
fix(fixtures): set dummy_torchrun n_gpus to 4 for accurate multi-GPU …
coketaste May 3, 2026
5e549d6
docs: add examples/configs examples and fix configuration reference
coketaste May 3, 2026
3 changes: 2 additions & 1 deletion .gitignore
@@ -25,6 +25,7 @@ Thumbs.db
# Distribution / packaging
.Python
build/
!src/madengine/configs/build/
develop-eggs/
dist/
downloads/
@@ -144,4 +145,4 @@ rocm_trace_lite_output/
slurm_results/
MagicMock/
.madengine_session_start
run_directory/
run_directory/
121 changes: 118 additions & 3 deletions README.md
@@ -29,6 +29,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
- [Performance Profiling](#-performance-profiling)
- [Reporting and Database](#-reporting-and-database)
- [Installation](#-installation)
- [YAML Configuration (`--config`)](#-yaml-configuration-config)
- [Tips & Best Practices](#-tips--best-practices)
- [Log error pattern scan](#log-error-pattern-scan)
- [Exit codes and CI](#exit-codes-and-ci)
@@ -39,6 +40,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
## ✨ Key Features

- **🚀 Modern CLI** - Rich terminal output with Typer and Rich
- **📝 YAML Config** - Composable [Hydra-based YAML configs](#-yaml-configuration-config) with config groups, hardware profiles, and CLI overrides — alternative to `--additional-context` JSON
- **🎯 Simple Deployment** - Run locally or deploy to Kubernetes/SLURM via configuration
- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, Primus, vLLM, SGLang
- **🐳 Container-Native** - Docker-based execution with GPU support (ROCm, CUDA)
@@ -64,12 +66,16 @@ madengine discover --tags dummy
# Run locally (full workflow: discover/build/run as configured by the model)
madengine run --tags dummy

# Or with explicit configuration
# Or with explicit JSON configuration
madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Or with YAML config (Hydra-based, composable)
madengine run --tags dummy --config scheduler=slurm --config launcher=torchrun
madengine run --config my_job.yaml
```

> **Note**: For build operations, `gpu_vendor` defaults to `AMD` and `guest_os` defaults to `UBUNTU` if not specified. For production deployments or non-AMD/Ubuntu environments, explicitly specify these values.
> **Note**: `--config` is mutually exclusive with `--additional-context` / `--additional-context-file`. For build operations, `gpu_vendor` defaults to `AMD` and `guest_os` defaults to `UBUNTU` if not specified.

If auto-detection does not find your **host** ROCm root, set top-level `MAD_ROCM_PATH` in `--additional-context`. For a different ROCm root **inside the container**, set `docker_env_vars.MAD_ROCM_PATH` in additional context. If you omit it, madengine derives in-container `ROCM_PATH` when running Docker (from the image's baked-in env, then an in-container probe, then `/opt/rocm` — it does **not** copy the host path). You can also set `ROCM_PATH` / `MAD_AUTO_ROCM_PATH=0` for **host** behavior as documented in [docs/configuration.md](docs/configuration.md):

@@ -127,7 +133,7 @@ For detailed command options, see the **[CLI Command Reference](docs/cli-referen
| [Usage Guide](docs/usage.md) | Commands, workflows, and examples ([`--skip-model-run`](docs/usage.md#skip-model-run-after-build)) |
| **[CLI Reference](docs/cli-reference.md)** | **Detailed command options and examples** |
| [Deployment](docs/deployment.md) | Kubernetes and SLURM deployment |
| [Configuration](docs/configuration.md) | Advanced options; [run log error pattern scan](docs/configuration.md#run-phase-log-error-pattern-scan) |
| [Configuration](docs/configuration.md) | Advanced options; [YAML config (`--config`)](docs/configuration.md#yaml-configuration-config); [run log error pattern scan](docs/configuration.md#run-phase-log-error-pattern-scan) |
| [Batch Build](docs/batch-build.md) | Selective builds for CI/CD |
| [Launchers](docs/launchers.md) | Distributed training frameworks |
| [Profiling](docs/profiling.md) | Performance analysis tools |
@@ -565,6 +571,115 @@ cd madengine && pip install -e ".[dev]"

See [Installation Guide](docs/installation.md) for detailed instructions.

## 📝 YAML Configuration (`--config`)

The `--config` flag provides a composable, Hydra-based YAML alternative to `--additional-context` JSON strings. It is available on both the `run` and `build` commands and can be repeated to combine a YAML file with individual overrides.

> **Note**: `--config` is **mutually exclusive** with `--additional-context` and `--additional-context-file`. Using them together produces an error.
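
A minimal sketch of how such a mutual-exclusion check could look. The helper name `check_config_exclusive` is hypothetical and is not taken from the madengine source. It assumes that the documented `"{}"` default of `--additional-context` counts as "not provided":

```python
from typing import Optional, Sequence


def check_config_exclusive(
    config: Optional[Sequence[str]],
    additional_context: str = "{}",
    additional_context_file: Optional[str] = None,
) -> None:
    """Raise ValueError if --config is combined with a JSON context option.

    Hypothetical helper for illustration only.
    """
    # "{}" is the documented default for --additional-context,
    # so it is treated here as "option not given" (an assumption).
    json_given = (
        additional_context.strip() not in ("", "{}")
        or bool(additional_context_file)
    )
    if config and json_given:
        raise ValueError(
            "--config cannot be combined with --additional-context "
            "or --additional-context-file"
        )


# Either style on its own passes silently.
check_config_exclusive(["scheduler=slurm"])
check_config_exclusive(None, '{"gpu_vendor": "AMD"}')
```

The error type and message that madengine actually raises may differ. The sketch only illustrates the rule stated in the note.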

### Basic Usage

```bash
# Use a config group override
madengine run --tags dummy --config scheduler=slurm

# Combine multiple overrides
madengine run --tags dummy \
--config scheduler=slurm \
--config launcher=torchrun \
--config distributed.nnodes=4

# Use a user YAML file
madengine run --config my_job.yaml

# User YAML file with overrides
madengine run --config my_job.yaml --config distributed.nnodes=8

# Append optional config groups with '+' prefix
madengine run --tags dummy \
--config +profile=mi300x_8gpu \
--config +env=nccl_debug \
--config +tools=rocprofv3_lightweight
```
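
Because `--config` accepts both YAML files and Hydra-style overrides, something has to tell the two apart. The sketch below shows one plausible rule, assuming that a value ending in `.yaml` or `.yml` and containing no `=` is a file and that everything else is an override. The function `split_config_args` is hypothetical, and the real loader may classify values differently:

```python
from typing import Iterable, List, Tuple


def split_config_args(values: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split repeated --config values into (yaml_files, overrides).

    Hypothetical helper: the classification rule is an assumption.
    """
    yaml_files: List[str] = []
    overrides: List[str] = []
    for value in values:
        # A bare path ending in .yaml/.yml is treated as a user YAML file.
        if value.lower().endswith((".yaml", ".yml")) and "=" not in value:
            yaml_files.append(value)
        else:
            # key=value, +group=option, and similar Hydra override forms.
            overrides.append(value)
    return yaml_files, overrides


files, overrides = split_config_args(
    ["my_job.yaml", "distributed.nnodes=8", "+profile=mi300x_8gpu"]
)
print(files)      # ['my_job.yaml']
print(overrides)  # ['distributed.nnodes=8', '+profile=mi300x_8gpu']
```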

### Config Groups

madengine ships with pre-built config groups that compose together:

| Group | Default | Options | Description |
|-------|---------|---------|-------------|
| `platform` | `docker` | docker, bare_metal, singularity, podman | Execution platform |
| `scheduler` | `local` | local, slurm, k8s | Job scheduler |
| `hardware` | `amd` | amd, nvidia, cpu | GPU vendor and runtime settings |
| `launcher` | `none` | none, torchrun, deepspeed, megatron, torchtitan, vllm, sglang, sglang_disagg, primus, native | Distributed launcher |
| `+profile` | *(none)* | mi300x_8gpu, mi300x_single, mi250x_4gpu, h100_8gpu, a100_8gpu | Hardware profiles (append-only) |
| `+env` | *(none)* | nccl_debug, nccl_tuned, infiniband, miopen_defaults | Environment presets (append-only) |
| `+tools` | *(none)* | rocprofv3_lightweight, rocprofv3_comprehensive, power_profiler, vram_profiler, rocm_trace_lite | Profiling tools (append-only) |
| `+data` | *(none)* | local, s3, minio, nas | Data source config (append-only) |
| `+build` | *(none)* | default, ci, multi_arch | Build presets (append-only) |

Groups with a `+` prefix are append-only: they are not loaded by default and must be added explicitly with `--config +group=option`.

### User YAML Files

Create a YAML file for your job and pass it via `--config`:

```yaml
# my_job.yaml
model:
  tags: [dummy]
  timeout: 3600

debug: true

env_vars:
  MY_VAR: test_value
  NCCL_DEBUG: INFO

distributed:
  enabled: true
  launcher: torchrun
  nnodes: 2
  nproc_per_node: 4

slurm:
  partition: gpu
  time: "02:00:00"
```

```bash
madengine run --config my_job.yaml
```

User YAML values are merged on top of the base config and the config group selections, so they take precedence over both.
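
The merge described above can be pictured as a recursive dictionary merge. The PR's commit history mentions a shared `deep_merge` utility. The version below is an independent sketch and not the madengine implementation. It assumes nested dicts are merged key by key, while scalars and lists from the user file replace the base value outright:

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with `override` merged on top of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the override replace the base value.
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative values only; not the shipped defaults.
base = {
    "distributed": {"launcher": "none", "nnodes": 1, "nproc_per_node": 8},
    "debug": False,
}
user = {"distributed": {"launcher": "torchrun", "nnodes": 2}}

print(deep_merge(base, user))
# {'distributed': {'launcher': 'torchrun', 'nnodes': 2, 'nproc_per_node': 8}, 'debug': False}
```

Keys that the user file does not mention (`nproc_per_node` and `debug` here) keep their base values, which is what makes small job files practical.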

### Examples

```bash
# SLURM multi-node with torchrun
madengine run --tags model \
--config scheduler=slurm \
--config launcher=torchrun \
--config distributed.nnodes=4

# MI300x 8-GPU profile with NCCL debug
madengine run --tags model \
--config +profile=mi300x_8gpu \
--config +env=nccl_debug

# AMD hardware with ROCm profiling (rocprofv3 is a ROCm tool)
madengine run --tags model \
--config hardware=amd \
--config +tools=rocprofv3_lightweight

# Build with CI preset
madengine build --tags model \
--config +build=ci \
--registry docker.io/myorg
```

See [Configuration Guide](docs/configuration.md#yaml-configuration-config) for full details, and [`examples/configs/`](examples/configs/) for annotated templates and ready-to-run demo files.

## 💡 Tips & Best Practices

### General Usage
41 changes: 40 additions & 1 deletion docs/cli-reference.md
@@ -100,6 +100,7 @@ madengine build [OPTIONS]
| `--batch-manifest` | | TEXT | `None` | Input batch.json file for batch build mode |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
| `--additional-context-file` | `-f` | TEXT | `None` | File containing additional context JSON |
| `--config` | | TEXT | `None` | YAML config file and/or Hydra overrides (repeatable). Mutually exclusive with `--additional-context` / `--additional-context-file`. See [Configuration — YAML config](configuration.md#yaml-configuration-config). |
| `--clean-docker-cache` | | FLAG | `False` | Rebuild images without using cache |
| `--manifest-output` | `-m` | TEXT | `build_manifest.json` | Output file for build manifest |
| `--summary-output` | `-s` | TEXT | `None` | Output file for build summary JSON |
@@ -142,6 +143,12 @@ madengine build --tags model \

# Real-time output with verbose logging
madengine build --tags model --live-output --verbose

# Build with YAML config (mutually exclusive with --additional-context)
madengine build --tags model --config +build=ci --registry docker.io/myorg

# Build with user YAML file
madengine build --config my_build.yaml --registry docker.io/myorg
```

**Default Values:**
@@ -215,6 +222,7 @@ madengine run [OPTIONS]
| `--timeout` | | INT | `-1` | Timeout in seconds (-1=default 7200s, 0=no timeout) |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
| `--additional-context-file` | `-f` | TEXT | `None` | File containing additional context JSON |
| `--config` | | TEXT | `None` | YAML config file and/or Hydra overrides (repeatable). Mutually exclusive with `--additional-context` / `--additional-context-file`. See [Configuration — YAML config](configuration.md#yaml-configuration-config). |
| `--keep-alive` | | FLAG | `False` | Keep Docker containers alive after run |
| `--keep-model-dir` | | FLAG | `False` | Keep model directory after run |
| `--clean-docker-cache` | | FLAG | `False` | Rebuild images without using cache (full workflow) |
@@ -326,9 +334,23 @@ madengine run --tags model --output my_perf_results.csv
# Clean up intermediate perf files after run
madengine run --tags model --cleanup-perf

# Using configuration file
# Using JSON configuration file
madengine run --tags model \
--additional-context-file k8s-config.json

# Using YAML config (mutually exclusive with --additional-context)
madengine run --tags model \
--config scheduler=slurm \
--config launcher=torchrun \
--config distributed.nnodes=4

# YAML config with hardware profile
madengine run --tags model \
--config +profile=mi300x_8gpu \
--config +env=nccl_debug

# User YAML file with overrides
madengine run --config my_job.yaml --config distributed.nnodes=8
```

**Execution Modes:**
@@ -601,6 +623,23 @@ For complex configurations, use JSON files with `--additional-context-file`:

To run on specific nodes, add `"nodelist": "node01,node02"` to the `slurm` section. When set, the job runs only on those nodes and node health preflight is skipped. See [examples/slurm-configs/basic/03-multi-node-basic-nodelist.json](../examples/slurm-configs/basic/03-multi-node-basic-nodelist.json).

### YAML Configuration (`--config`)

As an alternative to JSON, use `--config` with composable Hydra-based YAML:

```bash
# Config group overrides
madengine run --tags model --config scheduler=slurm --config launcher=torchrun

# User YAML file
madengine run --config my_job.yaml

# Append-only groups (profiles, tools, env presets)
madengine run --tags model --config +profile=mi300x_8gpu --config +env=nccl_debug
```

`--config` is **mutually exclusive** with `--additional-context` / `--additional-context-file`. See [Configuration Guide — YAML Configuration](configuration.md#yaml-configuration-config) for config groups, user YAML format, and full examples.

### Run phase: log error pattern scan (optional)

These keys apply to **local Docker runs** when madengine post-processes the run log. Use them when substring matches cause false `FAILURE` status (for example benign `RuntimeError:` lines). Full details: [Configuration — Run phase: log error pattern scan](configuration.md#run-phase-log-error-pattern-scan).