Deploy madengine workloads to Kubernetes or SLURM clusters for distributed execution.
madengine supports two deployment backends:
- Kubernetes - Cloud-native container orchestration
- SLURM - HPC cluster job scheduling
Deployment is configured via --additional-context and happens automatically during the run phase.
┌─────────────────────────────────────────────┐
│ 1. Build Phase (Local or CI/CD)             │
│    madengine build --tags model             │
│    → Creates Docker image                   │
│    → Pushes to registry                     │
│    → Generates build_manifest.json          │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│ 2. Deploy Phase (Run with Context)          │
│    madengine run                            │
│      --manifest-file build_manifest.json    │
│      --additional-context '{"deploy":...}'  │
│    → Detects deployment target              │
│    → Creates K8s Job or SLURM script        │
│    → Submits and monitors execution         │
└─────────────────────────────────────────────┘
- Kubernetes cluster with GPU support
- GPU device plugin installed (AMD or NVIDIA)
- Kubeconfig configured (`~/.kube/config` or in-cluster)
- Docker registry accessible from cluster
{
"k8s": {
"gpu_count": 1
}
}
This automatically applies intelligent defaults for namespace, resources, image pull policy, etc.
# 1. Build image
madengine build --tags my_model \
--registry my-registry.io \
--additional-context-file k8s-config.json
# 2. Deploy to Kubernetes
madengine run \
--manifest-file build_manifest.json \
--timeout 3600
The deployment target is automatically detected from the `k8s` key in the config.
k8s-config.json:
{
"k8s": {
"gpu_count": 2,
"namespace": "ml-team",
"gpu_vendor": "AMD",
"memory": "32Gi",
"cpu": "16",
"service_account": "madengine-sa",
"image_pull_policy": "Always"
}
}
Configuration Priority:
- User config (`--additional-context-file`)
- Profile presets (single-gpu/multi-gpu)
- GPU vendor presets (AMD/NVIDIA)
- Base defaults
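This precedence can be pictured as a right-biased deep merge of the layers, lowest priority first. The sketch below only illustrates the override order, not how madengine computes it; the three preset files are hypothetical placeholders, and `user-config.json` stands in for the file passed via `--additional-context-file`.

```bash
# Illustration only: higher-priority layers override lower ones key by key.
# base_defaults.json, vendor_presets.json, profile_presets.json are hypothetical files.
jq -s '.[0] * .[1] * .[2] * .[3]' \
  base_defaults.json vendor_presets.json profile_presets.json user-config.json
```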
See examples/k8s-configs/ for complete examples.
By default (`k8s.secrets.strategy: from_local_credentials`), `madengine run` creates Kubernetes Secrets from a local `credential.json` when present: Docker Hub pull credentials (when configured) and an opaque Secret for runtime use. In that case, credentials are not embedded in the ConfigMap. For GitOps workflows or clusters without client-side files, set the strategy to `existing` (or omit it) and set `k8s.secrets.image_pull_secret_names` / `k8s.secrets.runtime_secret_name` as needed. See Configuration and examples/k8s-configs/README.md.
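For the GitOps case, a minimal context file might look like the sketch below. The Secret names are placeholders for Secrets you have already created in the cluster, and the exact JSON shape (for example, whether the pull-secret field is a list) should be checked against examples/k8s-configs/README.md.

```bash
# Sketch only: reuse pre-created Secrets instead of generating them from a local credential.json.
# "my-registry-pull-secret" and "madengine-runtime-secret" are placeholder names.
cat > k8s-config.json <<'EOF'
{
  "k8s": {
    "gpu_count": 1,
    "secrets": {
      "strategy": "existing",
      "image_pull_secret_names": ["my-registry-pull-secret"],
      "runtime_secret_name": "madengine-runtime-secret"
    }
  }
}
EOF
```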
With "debug": true in additional context, madengine run writes rendered manifests under ./k8s_manifests (or the path you configure). To lint those YAML files against the Kubernetes OpenAPI schema, install kubeconform and run from the repository root:
./tests/scripts/k8s_validate_manifests.sh ./k8s_manifestsThe script exits successfully if kubeconform is missing (skip) or if validation passes.
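For example, a run that dumps manifests for linting might look like this sketch, assuming the `debug` flag sits at the top level of the additional context (as described above) and reusing the minimal `gpu_count` config from the quick start:

```bash
# Render manifests to ./k8s_manifests, then lint them with the helper script.
madengine run \
  --manifest-file build_manifest.json \
  --additional-context '{"k8s": {"gpu_count": 1}, "debug": true}'
./tests/scripts/k8s_validate_manifests.sh ./k8s_manifests
```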
For distributed training across multiple nodes:
{
"k8s": {
"gpu_count": 8
},
"distributed": {
"launcher": "torchrun",
"nnodes": 2,
"nproc_per_node": 4
}
}
This creates:
- Kubernetes Indexed Job with 2 completions
- Headless service for pod discovery
- Automatic rank assignment via `JOB_COMPLETION_INDEX`
- `MAD_MULTI_NODE_RUNNER` environment variable with the torchrun command
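As a rough sketch (not madengine's actual generated entrypoint), each pod in the Indexed Job ends up running something along these lines; `MASTER_ADDR`/`MASTER_PORT` stand in for the rendezvous endpoint provided via the headless service, and `train.py` is a placeholder script.

```bash
# Hypothetical per-pod launch for the 2-node x 4-GPU example above.
# JOB_COMPLETION_INDEX is set by the Indexed Job; the rendezvous endpoint is illustrative.
torchrun \
  --nnodes=2 \
  --nproc_per_node=4 \
  --node_rank="${JOB_COMPLETION_INDEX}" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
  train.py
```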
Supported Launchers:
- `torchrun` - PyTorch DDP/FSDP
- `deepspeed` - ZeRO optimization
- `megatron` - Megatron-LM training
- `torchtitan` - LLM pre-training
- `primus` - Primus unified pretrain (Megatron / TorchTitan / MaxText YAML)
- `vllm` - LLM inference
- `sglang` - Structured generation
- `sglang-disagg` - Disaggregated SGLang (multi-node)
See Launchers Guide for details.
# Check job status
kubectl get jobs -n your-namespace
# View pod logs
kubectl logs -f job/madengine-job-xxx -n your-namespace
# Check pod status
kubectl get pods -n your-namespace
Finished Jobs are not removed unless you set `k8s.ttl_seconds_after_finished` to a positive number of seconds; the Job manifest then includes `ttlSecondsAfterFinished` so the control plane can garbage-collect the Job after it finishes. The deploy step may still delete Secrets it created when cleaning up a failed or cancelled deploy; see the runtime logs for details.
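To opt in to automatic cleanup, the TTL can be set in the context file. A minimal sketch (value in seconds; only the key name is taken from the note above, the placement is illustrative):

```bash
# Garbage-collect the Job one hour after it finishes.
cat > k8s-config.json <<'EOF'
{
  "k8s": {
    "gpu_count": 1,
    "ttl_seconds_after_finished": 3600
  }
}
EOF
```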
Manual cleanup:
kubectl delete job madengine-job-xxx -n your-namespace
- Access to SLURM login node
- SLURM commands available (`sbatch`, `squeue`, `scontrol`)
- Shared filesystem for MAD package and results
- Module system or container runtime (Singularity/Apptainer)
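A quick way to sanity-check the login node before deploying; the shared path below is an example, not a requirement:

```bash
# Confirm the scheduler CLI is available and note the shared filesystem you plan to use.
command -v sbatch squeue scontrol
sinfo --version
ls /shared/workspace   # example shared path; substitute your site's filesystem
```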
slurm-config.json:
{
"slurm": {
"partition": "gpu",
"gpus_per_node": 4,
"time": "02:00:00",
"account": "my_account"
}
}
# 1. Build image (on build node or locally)
madengine build --tags my_model \
--registry my-registry.io \
--additional-context-file slurm-config.json
# 2. SSH to SLURM login node
ssh user@hpc-login.example.com
# 3. Deploy to SLURM
cd /shared/workspace
madengine run \
--manifest-file build_manifest.json \
--timeout 7200
The deployment target is automatically detected from the `slurm` key in the config.
slurm-config.json:
{
"slurm": {
"partition": "gpu",
"account": "research_group",
"qos": "normal",
"gpus_per_node": 8,
"nodes": 1,
"time": "24:00:00",
"mail_user": "user@example.com",
"mail_type": "ALL"
}
}
Common SLURM Options:
- `partition`: SLURM partition name
- `account`: Billing account
- `qos`: Quality of Service
- `gpus_per_node`: Number of GPUs per node
- `nodes`: Number of nodes (for multi-node)
- `nodelist`: Comma-separated node names to run on (e.g. `"node01,node02"`); when set, the job runs only on these nodes and the node health preflight is skipped (sketch below)
- `time`: Wall time limit (HH:MM:SS)
- `mem`: Memory per node (e.g., "64G")
- `mail_user`: Email for job notifications
- `mail_type`: Notification types (BEGIN, END, FAIL, ALL)
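For example, pinning a run to specific nodes (and thereby skipping the node health preflight) could look like this sketch; the node names are placeholders:

```bash
# Sketch: restrict the job to two named nodes.
cat > slurm-config.json <<'EOF'
{
  "slurm": {
    "partition": "gpu",
    "gpus_per_node": 8,
    "time": "04:00:00",
    "nodelist": "node01,node02"
  }
}
EOF
```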
See examples/slurm-configs/ for complete examples.
For distributed training across SLURM nodes:
{
"slurm": {
"partition": "gpu",
"nodes": 4,
"gpus_per_node": 8,
"time": "48:00:00"
},
"distributed": {
"launcher": "torchrun",
"nnodes": 4,
"nproc_per_node": 8
}
}
SLURM automatically provides:
- Node list via `$SLURM_JOB_NODELIST`
- Master address detection
- Network interface configuration
- Rank assignment via `$SLURM_PROCID`
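As a rough illustration of how these variables fit together, the launch step amounts to something along the lines of the sketch below; the real batch script is produced by madengine, and the rendezvous port and `train.py` are placeholders.

```bash
# Hypothetical launch step for the 4-node x 8-GPU example above, one srun task per node.
# SLURM_PROCID is expanded inside each task (hence the single-quoted command string).
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR
srun --ntasks-per-node=1 bash -c 'torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --node_rank="$SLURM_PROCID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:29500" \
  train.py'
```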
# Check job queue
squeue -u $USER
# Monitor job progress
squeue -j <job_id>
# View job details
scontrol show job <job_id>
# Check output logs
tail -f slurm-<job_id>.out
# Cancel job
scancel <job_id>
# Cancel all your jobs
scancel -u $USER
| Feature | Kubernetes | SLURM |
|---|---|---|
| Environment | Cloud, on-premise | HPC clusters |
| Orchestration | Automatic | Job scheduler |
| Dependencies | Python library (kubernetes) | CLI commands only |
| Multi-node Setup | Headless service + DNS | SLURM env vars |
| Resource Management | Declarative (YAML) | Batch script |
| Best For | Cloud deployments, microservices | Academic HPC, supercomputers |
Single-GPU development (Kubernetes):
{
"k8s": {
"gpu_count": 1,
"namespace": "dev"
}
}
Multi-GPU, single node (Kubernetes):
{
"k8s": {
"gpu_count": 4,
"memory": "64Gi",
"cpu": "32"
},
"distributed": {
"launcher": "torchrun",
"nnodes": 1,
"nproc_per_node": 4
}
}
Multi-node training (Kubernetes):
{
"k8s": {
"gpu_count": 8,
"namespace": "ml-training"
},
"distributed": {
"launcher": "torchtitan",
"nnodes": 4,
"nproc_per_node": 8
}
}
Single-node SLURM job:
{
"slurm": {
"partition": "gpu",
"gpus_per_node": 8,
"time": "12:00:00"
}
}
Multi-node SLURM training with DeepSpeed:
{
"slurm": {
"partition": "gpu",
"nodes": 8,
"gpus_per_node": 8,
"time": "72:00:00",
"account": "research_proj"
},
"distributed": {
"launcher": "deepspeed",
"nnodes": 8,
"nproc_per_node": 8
}
}
Image Pull Failures:
# Check image exists
docker pull <registry>/<image>:<tag>
# Verify image pull secrets
kubectl get secrets -n your-namespace
# Check pod events
kubectl describe pod <pod-name> -n your-namespace
Node Reported as FAILED but Pod Succeeded:
In multi-node jobs, madengine may report a node as FAILED even though Kubernetes shows the pod as Succeeded. This occurs when the kubelet on the node becomes unreachable after the job completes, preventing madengine from collecting stdout logs (and therefore parsing performance metrics).
To verify:
# Check actual pod status — if Succeeded, the workload ran fine
kubectl describe pod <pod-name> | grep Status
# Check the node's kubelet health
kubectl get nodes
kubectl describe node <node-name> | grep -A5 Conditions
PVC artifacts are still collected in this scenario. Only the API-based pod log retrieval fails, which means performance metrics for that node will be missing from the results table.
Resource Issues:
# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
# Check GPU availability
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.capacity.'amd\.com/gpu'
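On clusters using the NVIDIA device plugin, the equivalent check targets the `nvidia.com/gpu` resource:

```bash
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.capacity.'nvidia\.com/gpu'
```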
Job Pending:
# Check reason
squeue -j <job_id> -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
# Check partition status
sinfo -p gpu
Out of Resources:
# Check available resources
sinfo -o "%P %.5a %.10l %.6D %.6t %N"
# Adjust resource requests in config
- Use minimal configs with intelligent defaults
- Specify resource limits to prevent over-allocation
- Use appropriate namespaces for isolation
- Configure image pull policies based on registry location
- Monitor pod resource usage with `kubectl top`
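For example (requires the metrics-server add-on in the cluster):

```bash
# Live CPU/memory usage for the pods madengine created in your namespace.
kubectl top pods -n your-namespace
```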
- Start with conservative time limits
- Use appropriate QoS for priority
- Monitor job efficiency with `seff <job_id>`
- Use shared filesystem for input/output
- Test with single node before scaling
- Launchers Guide - Distributed training and inference launchers
- K8s Examples - Complete Kubernetes configurations
- SLURM Examples - Complete SLURM configurations
- Usage Guide - General usage instructions