---
id: verify-hami
title: Verify HAMi (Quick Start)
sidebar_label: Verify HAMi
---

# Verify HAMi (Quick Start)

This guide provides a rapid, end-to-end setup to verify that GPU workloads run correctly in a Kubernetes cluster with HAMi.

What "working" actually means: a successful HAMi setup goes beyond just running pods or a successful Helm installation. It means the GPU is accessible inside a container, Kubernetes correctly advertises the GPU resources, and vGPU isolation (like memory limits) behaves predictably.

## Step 0: Configure Node Container Runtime (If not already done)
HAMi requires the `nvidia-container-toolkit` to be installed and set as the default low-level runtime on all your GPU nodes.

### 1. Install nvidia-container-toolkit (Debian/Ubuntu example)
```
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
  | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```

### 2. Configure your runtime
* For containerd: Edit `/etc/containerd/config.toml` to set the default runtime name to `"nvidia"` and the runtime binary to `/usr/bin/nvidia-container-runtime` (illustrative excerpts are shown below).
  * Restart: `sudo systemctl daemon-reload && sudo systemctl restart containerd`
* For Docker: Edit `/etc/docker/daemon.json` to set `"default-runtime": "nvidia"` (see the excerpt below).
  * Restart: `sudo systemctl daemon-reload && sudo systemctl restart docker`

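For reference, here are minimal sketches of what those settings typically look like. Exact section names and file layout can vary between containerd and Docker versions, so treat these as illustrations rather than drop-in configs.

`/etc/containerd/config.toml` (containerd with the CRI plugin, config version 2):

```
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```

`/etc/docker/daemon.json`:

```
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```
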
## Step 1: Validate the Native GPU Stack (Crucial Pre-flight Check)
Before installing HAMi, you must prove that Kubernetes can natively access the GPU.

This step validates your GPU stack independently of HAMi.

### 1. Deploy a native test pod
```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```
This pod requests a full GPU through the native `nvidia.com/gpu` resource and runs `nvidia-smi` once. If it cannot even be scheduled, do NOT continue; fix your base GPU setup first.

### 2. Verify execution
```
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-test --timeout=60s
kubectl logs cuda-test
```
Expected: the standard `nvidia-smi` output (driver version, CUDA version, and the GPU table). Do not proceed to the HAMi installation if this fails.

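If the test pod never reaches `Succeeded`, describing it usually reveals why (for example, `Insufficient nvidia.com/gpu` or an image pull problem):

```
kubectl describe pod cuda-test
```
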
## Step 2: Install HAMi
Once the baseline is verified, label your GPU node so the HAMi scheduler can manage it, and deploy via Helm.

### 1. Label the node
```
kubectl label nodes $(hostname) gpu=on --overwrite
```
Note: `$(hostname)` only works when you run this on the GPU node itself and the node name matches its hostname; otherwise substitute the actual node name from `kubectl get nodes`.

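To confirm the label landed on the expected node (the `-L` flag adds a column showing the `gpu` label value):

```
kubectl get nodes -L gpu
```
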
### 2. Deploy using Helm
```
helm repo add hami-charts https://project-hami.github.io/HAMi/
helm install hami hami-charts/hami -n kube-system
```

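If your cluster's Kubernetes version differs from the chart default, the HAMi documentation shows pinning the bundled kube-scheduler image to your server version. The value name below is taken from the upstream examples and may change between chart releases, so confirm it with `helm show values hami-charts/hami` before relying on it:

```
# Sketch: pin the scheduler image tag to your cluster version (verify the value name against the chart)
helm install hami hami-charts/hami \
  --set scheduler.kubeScheduler.imageTag=v1.29.0 \
  -n kube-system
```
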
### 3. Verify components
```
kubectl get pods -n kube-system | grep hami
```
Expected: the HAMi scheduler and device-plugin pods (named `hami-scheduler` and `hami-device-plugin` or `vgpu-device-plugin`, depending on the chart version) should be in the `Running` state.

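Optionally, check that the node now advertises HAMi-managed GPU resources. The reported `nvidia.com/gpu` count may be larger than the number of physical GPUs, because HAMi splits each device into multiple shareable slots by default:

```
kubectl get node $(hostname) -o jsonpath='{.status.allocatable}' | grep -i nvidia
```
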
## Step 3: Launch and Verify a vGPU Task
Let's prove HAMi is enforcing fractional resource limits (vGPU). The pod below requests one vGPU slot and a 10240 MiB device-memory cap via the `nvidia.com/gpumem` resource.

### 1. Submit a vGPU demo task
```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 10240
EOF
```

### 2. Verify resource control inside the container
```
kubectl wait --for=condition=Ready pod/gpu-pod --timeout=60s
kubectl exec -it gpu-pod -- nvidia-smi
```
Expected: you will see the `[HAMI-core Msg...]` initialization lines, and the `nvidia-smi` table will report a total memory of `10240MiB` rather than the card's full capacity, proving that vGPU isolation is active.

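To see sharing in action, here is a minimal sketch (pod names are illustrative; it assumes a single physical GPU with the default device-split settings and enough free device memory): two pods, each requesting a fraction of the device memory, can be scheduled onto the same physical GPU.

```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod-1
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1        # one vGPU slot
          nvidia.com/gpumem: 4000  # ~4 GiB device-memory cap
---
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod-2
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 4000
EOF
```

On a single-GPU node both pods should be schedulable onto the same card, and `kubectl exec <pod> -- nvidia-smi` in each should report roughly the 4000MiB cap. Delete them with `kubectl delete pod shared-gpu-pod-1 shared-gpu-pod-2` when you are done.
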
## Troubleshooting Order
If you encounter issues, follow this sequence:
1. Hardware/Drivers: Run `nvidia-smi` directly on the host.
2. Container Runtime: Ensure `sudo ctr run` or `docker run` can start a GPU container outside K8s (see the example after this list).
3. Stale Plugins: Remove conflicting device plugins: `kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system --ignore-not-found`.
4. Node Resources: Verify K8s sees the GPU: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | grep -i nvidia`.
5. Scheduler Layer: Check HAMi logs: `kubectl logs -n kube-system -l app=hami-scheduler`.
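
A minimal host-level runtime check for step 2, assuming Docker with the NVIDIA runtime configured (the image tag is just an example):

```
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```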

## Cleanup
```
kubectl delete pod cuda-test gpu-pod --ignore-not-found
```