Deploy madengine workloads to Kubernetes or SLURM clusters for distributed execution.
madengine supports two deployment backends:
- Kubernetes - Cloud-native container orchestration
- SLURM - HPC cluster job scheduling
Deployment is configured via --additional-context and happens automatically during the run phase.
┌─────────────────────────────────────────────┐
│ 1. Build Phase (Local or CI/CD)             │
│    madengine build --tags model             │
│    → Creates Docker image                   │
│    → Pushes to registry                     │
│    → Generates build_manifest.json          │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│ 2. Deploy Phase (Run with Context)          │
│    madengine run                            │
│      --manifest-file build_manifest.json    │
│      --additional-context '{"deploy":...}'  │
│    → Detects deployment target              │
│    → Creates K8s Job or SLURM script        │
│    → Submits and monitors execution         │
└─────────────────────────────────────────────┘
- Kubernetes cluster with GPU support
- GPU device plugin installed (AMD or NVIDIA)
- Kubeconfig configured (`~/.kube/config` or in-cluster)
- Docker registry accessible from cluster
{
"k8s": {
"gpu_count": 1
}
}
This automatically applies intelligent defaults for namespace, resources, image pull policy, etc.
# 1. Build image
madengine build --tags my_model \
--registry my-registry.io \
--additional-context-file k8s-config.json
# 2. Deploy to Kubernetes
madengine run \
--manifest-file build_manifest.json \
--timeout 3600
The deployment target is automatically detected from the `k8s` key in the config.
k8s-config.json:
{
"k8s": {
"gpu_count": 2,
"namespace": "ml-team",
"gpu_vendor": "AMD",
"memory": "32Gi",
"cpu": "16",
"service_account": "madengine-sa",
"image_pull_policy": "Always"
}
}
Configuration Priority:
- User config (`--additional-context-file`)
- Profile presets (single-gpu/multi-gpu)
- GPU vendor presets (AMD/NVIDIA)
- Base defaults
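This precedence can be pictured as a right-biased deep merge of the layers, lowest priority first. The sketch below only illustrates the override order, not how madengine computes it; the three preset files are hypothetical placeholders, and `user-config.json` stands in for the file passed via `--additional-context-file`.

```bash
# Illustration only: higher-priority layers override lower ones key by key.
# base_defaults.json, vendor_presets.json, profile_presets.json are hypothetical files.
jq -s '.[0] * .[1] * .[2] * .[3]' \
  base_defaults.json vendor_presets.json profile_presets.json user-config.json
```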
See examples/k8s-configs/ for complete examples.
By default (`k8s.secrets.strategy: from_local_credentials`), `madengine run` creates Kubernetes Secrets from a local `credential.json` when present: Docker Hub pull credentials (when configured) and an opaque Secret for runtime use. In that case, credentials are not embedded in the ConfigMap. For GitOps workflows or clusters without client-side files, set the strategy to `existing` (or omit it) and set `k8s.secrets.image_pull_secret_names` / `k8s.secrets.runtime_secret_name` as needed. See Configuration and examples/k8s-configs/README.md.
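For the GitOps case, a minimal context file might look like the sketch below. The Secret names are placeholders for Secrets you have already created in the cluster, and the exact JSON shape (for example, whether the pull-secret field is a list) should be checked against examples/k8s-configs/README.md.

```bash
# Sketch only: reuse pre-created Secrets instead of generating them from a local credential.json.
# "my-registry-pull-secret" and "madengine-runtime-secret" are placeholder names.
cat > k8s-config.json <<'EOF'
{
  "k8s": {
    "gpu_count": 1,
    "secrets": {
      "strategy": "existing",
      "image_pull_secret_names": ["my-registry-pull-secret"],
      "runtime_secret_name": "madengine-runtime-secret"
    }
  }
}
EOF
```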
With "debug": true in additional context, madengine run writes rendered manifests under ./k8s_manifests (or the path you configure). To lint those YAML files against the Kubernetes OpenAPI schema, install kubeconform and run from the repository root:
./tests/scripts/k8s_validate_manifests.sh ./k8s_manifestsThe script exits successfully if kubeconform is missing (skip) or if validation passes.
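For example, a run that dumps manifests for linting might look like this sketch, assuming the `debug` flag sits at the top level of the additional context (as described above) and reusing the minimal `gpu_count` config from the quick start:

```bash
# Render manifests to ./k8s_manifests, then lint them with the helper script.
madengine run \
  --manifest-file build_manifest.json \
  --additional-context '{"k8s": {"gpu_count": 1}, "debug": true}'
./tests/scripts/k8s_validate_manifests.sh ./k8s_manifests
```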
For distributed training across multiple nodes:
{
"k8s": {
"gpu_count": 8
},
"distributed": {
"launcher": "torchrun",
"nnodes": 2,
"nproc_per_node": 4
}
}
This creates:
- Kubernetes Indexed Job with 2 completions
- Headless service for pod discovery
- Automatic rank assignment via `JOB_COMPLETION_INDEX`
- `MAD_MULTI_NODE_RUNNER` environment variable with the torchrun command
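As a rough sketch (not madengine's actual generated entrypoint), each pod in the Indexed Job ends up running something along these lines; `MASTER_ADDR`/`MASTER_PORT` stand in for the rendezvous endpoint provided via the headless service, and `train.py` is a placeholder script.

```bash
# Hypothetical per-pod launch for the 2-node x 4-GPU example above.
# JOB_COMPLETION_INDEX is set by the Indexed Job; the rendezvous endpoint is illustrative.
torchrun \
  --nnodes=2 \
  --nproc_per_node=4 \
  --node_rank="${JOB_COMPLETION_INDEX}" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
  train.py
```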
Supported Launchers:
- `torchrun` - PyTorch DDP/FSDP
- `deepspeed` - ZeRO optimization
- `megatron` - Megatron-LM training
- `torchtitan` - LLM pre-training
- `primus` - Primus unified pretrain (Megatron / TorchTitan / MaxText YAML)
- `vllm` - LLM inference
- `sglang` - Structured generation
- `sglang-disagg` - Disaggregated SGLang (multi-node)
See Launchers Guide for details.
# Check job status
kubectl get jobs -n your-namespace
# View pod logs
kubectl logs -f job/madengine-job-xxx -n your-namespace
# Check pod status
kubectl get pods -n your-namespace
Finished Jobs are not removed unless you set `k8s.ttl_seconds_after_finished` to a positive number of seconds; the Job manifest then includes `ttlSecondsAfterFinished` so the control plane can garbage-collect the Job after it finishes. The deploy step may still delete Secrets it created when cleaning up a failed or cancelled deploy; see the runtime logs for details.
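To opt in to automatic cleanup, the TTL can be set in the context file. A minimal sketch (value in seconds; only the key name is taken from the note above, the placement is illustrative):

```bash
# Garbage-collect the Job one hour after it finishes.
cat > k8s-config.json <<'EOF'
{
  "k8s": {
    "gpu_count": 1,
    "ttl_seconds_after_finished": 3600
  }
}
EOF
```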
Manual cleanup:
kubectl delete job madengine-job-xxx -n your-namespace
- Access to SLURM login node
- SLURM commands available (`sbatch`, `squeue`, `scontrol`)
- Shared filesystem for MAD package and results
- Module system or container runtime (Singularity/Apptainer)
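A quick way to sanity-check the login node before deploying; the shared path below is an example, not a requirement:

```bash
# Confirm the scheduler CLI is available and note the shared filesystem you plan to use.
command -v sbatch squeue scontrol
sinfo --version
ls /shared/workspace   # example shared path; substitute your site's filesystem
```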
slurm-config.json:
{
"slurm": {
"partition": "gpu",
"gpus_per_node": 4,
"time": "02:00:00",
"account": "my_account"
}
}
# 1. Build image (on build node or locally)
madengine build --tags my_model \
--registry my-registry.io \
--additional-context-file slurm-config.json
# 2. SSH to SLURM login node
ssh user@hpc-login.example.com
# 3. Deploy to SLURM
cd /shared/workspace
madengine run \
--manifest-file build_manifest.json \
--timeout 7200
The deployment target is automatically detected from the `slurm` key in the config.
slurm-config.json:
{
"slurm": {
"partition": "gpu",
"account": "research_group",
"qos": "normal",
"gpus_per_node": 8,
"nodes": 1,
"time": "24:00:00",
"mail_user": "user@example.com",
"mail_type": "ALL"
}
}
Common SLURM Options:
- `partition`: SLURM partition name
- `account`: Billing account
- `qos`: Quality of Service
- `gpus_per_node`: Number of GPUs per node
- `nodes`: Number of nodes (for multi-node)
- `nodelist`: Comma-separated node names to run on (e.g. `"node01,node02"`); when set, the job runs only on these nodes and the node health preflight is skipped (sketch below)
- `time`: Wall time limit (HH:MM:SS)
- `mem`: Memory per node (e.g., "64G")
- `mail_user`: Email for job notifications
- `mail_type`: Notification types (BEGIN, END, FAIL, ALL)
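For example, pinning a run to specific nodes (and thereby skipping the node health preflight) could look like this sketch; the node names are placeholders:

```bash
# Sketch: restrict the job to two named nodes.
cat > slurm-config.json <<'EOF'
{
  "slurm": {
    "partition": "gpu",
    "gpus_per_node": 8,
    "time": "04:00:00",
    "nodelist": "node01,node02"
  }
}
EOF
```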
See examples/slurm-configs/ for complete examples.
For distributed training across SLURM nodes:
{
"slurm": {
"partition": "gpu",
"nodes": 4,
"gpus_per_node": 8,
"time": "48:00:00"
},
"distributed": {
"launcher": "torchrun",
"nnodes": 4,
"nproc_per_node": 8
}
}
SLURM automatically provides:
- Node list via `$SLURM_JOB_NODELIST`
- Master address detection
- Network interface configuration
- Rank assignment via `$SLURM_PROCID`
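As a rough illustration of how these variables fit together, the launch step amounts to something along the lines of the sketch below; the real batch script is produced by madengine, and the rendezvous port and `train.py` are placeholders.

```bash
# Hypothetical launch step for the 4-node x 8-GPU example above, one srun task per node.
# SLURM_PROCID is expanded inside each task (hence the single-quoted command string).
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR
srun --ntasks-per-node=1 bash -c 'torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --node_rank="$SLURM_PROCID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:29500" \
  train.py'
```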
# Check job queue
squeue -u $USER
# Monitor job progress
squeue -j <job_id>
# View job details
scontrol show job <job_id>
# Check output logs
tail -f slurm-<job_id>.out
# Cancel job
scancel <job_id>
# Cancel all your jobs
scancel -u $USER
| Feature | Kubernetes | SLURM |
|---|---|---|
| Environment | Cloud, on-premise | HPC clusters |
| Orchestration | Automatic | Job scheduler |
| Dependencies | Python library (kubernetes) | CLI commands only |
| Multi-node Setup | Headless service + DNS | SLURM env vars |
| Resource Management | Declarative (YAML) | Batch script |
| Best For | Cloud deployments, microservices | Academic HPC, supercomputers |
Single-GPU development (Kubernetes):
{
"k8s": {
"gpu_count": 1,
"namespace": "dev"
}
}
Multi-GPU, single node (Kubernetes):
{
"k8s": {
"gpu_count": 4,
"memory": "64Gi",
"cpu": "32"
},
"distributed": {
"launcher": "torchrun",
"nnodes": 1,
"nproc_per_node": 4
}
}
Multi-node training (Kubernetes):
{
"k8s": {
"gpu_count": 8,
"namespace": "ml-training"
},
"distributed": {
"launcher": "torchtitan",
"nnodes": 4,
"nproc_per_node": 8
}
}
Single-node SLURM job:
{
"slurm": {
"partition": "gpu",
"gpus_per_node": 8,
"time": "12:00:00"
}
}
Multi-node SLURM training with DeepSpeed:
{
"slurm": {
"partition": "gpu",
"nodes": 8,
"gpus_per_node": 8,
"time": "72:00:00",
"account": "research_proj"
},
"distributed": {
"launcher": "deepspeed",
"nnodes": 8,
"nproc_per_node": 8
}
}
Image Pull Failures:
# Check image exists
docker pull <registry>/<image>:<tag>
# Verify image pull secrets
kubectl get secrets -n your-namespace
# Check pod events
kubectl describe pod <pod-name> -n your-namespace
Node Reported as FAILED but Pod Succeeded:
In multi-node jobs, madengine may report a node as FAILED even though Kubernetes shows the pod as Succeeded. This occurs when the kubelet on the node becomes unreachable after the job completes, preventing madengine from collecting stdout logs (and therefore parsing performance metrics).
To verify:
# Check actual pod status — if Succeeded, the workload ran fine
kubectl describe pod <pod-name> | grep Status
# Check the node's kubelet health
kubectl get nodes
kubectl describe node <node-name> | grep -A5 Conditions
PVC artifacts are still collected in this scenario. Only the API-based pod log retrieval fails, which means performance metrics for that node will be missing from the results table.
Resource Issues:
# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
# Check GPU availability
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.capacity.'amd\.com/gpu'
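On clusters using the NVIDIA device plugin, the equivalent check targets the `nvidia.com/gpu` resource:

```bash
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.capacity.'nvidia\.com/gpu'
```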
Job Pending:
# Check reason
squeue -j <job_id> -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
# Check partition status
sinfo -p gpu
Out of Resources:
# Check available resources
sinfo -o "%P %.5a %.10l %.6D %.6t %N"
# Adjust resource requests in config
- Use minimal configs with intelligent defaults
- Specify resource limits to prevent over-allocation
- Use appropriate namespaces for isolation
- Configure image pull policies based on registry location
- Monitor pod resource usage with `kubectl top`
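For example (requires the metrics-server add-on in the cluster):

```bash
# Live CPU/memory usage for the pods madengine created in your namespace.
kubectl top pods -n your-namespace
```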
- Start with conservative time limits
- Use appropriate QoS for priority
- Monitor job efficiency with `seff <job_id>`
- Use shared filesystem for input/output
- Test with single node before scaling
- Launchers Guide - Distributed training and inference launchers
- K8s Examples - Complete Kubernetes configurations
- SLURM Examples - Complete SLURM configurations
- Usage Guide - General usage instructions