
Commit 76552b6

docs: add quick start guide to verify HAMi in Kubernetes
Signed-off-by: Mesut Oezdil <mesut.oezdil@adfinis.com>
1 parent c5310b1 commit 76552b6

1 file changed: docs/get-started/verify-hami.md (123 additions, 0 deletions)

---
id: verify-hami
title: Verify HAMi (Quick Start)
sidebar_label: Verify HAMi
---

# Verify HAMi (Quick Start)

This guide provides a rapid, end-to-end setup to verify that GPU workloads run correctly in a Kubernetes cluster with HAMi.

What "working" actually means: a successful HAMi setup goes beyond running pods or a clean Helm install. It means the GPU is accessible inside a container, Kubernetes correctly advertises the GPU resources, and vGPU isolation (such as memory limits) behaves predictably.

## Step 0: Configure the Node Container Runtime (if not already done)

HAMi requires the `nvidia-container-toolkit` to be installed and set as the default low-level runtime on all of your GPU nodes.

### 1. Install nvidia-container-toolkit (Debian/Ubuntu example)

```
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
  | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```

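To sanity-check the installation on the node before touching Kubernetes, you can query the tooling that ships with the toolkit (`nvidia-ctk` and `nvidia-container-cli` are present on a standard install; exact output varies by version):

```
nvidia-ctk --version
nvidia-container-cli info
```
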
### 2. Configure your runtime

* For containerd: edit `/etc/containerd/config.toml` to set the default runtime name to `"nvidia"` and the binary name to `"/usr/bin/nvidia-container-runtime"` (see the example snippet below).
  * Restart: `sudo systemctl daemon-reload && sudo systemctl restart containerd`
* For Docker: edit `/etc/docker/daemon.json` to set `"default-runtime": "nvidia"` (see the example snippet below).
  * Restart: `sudo systemctl daemon-reload && sudo systemctl restart docker`

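On a typical setup, the relevant parts of those files end up looking roughly like the snippets below. Key names and layout vary with your containerd and Docker versions, so treat both as illustrative sketches rather than drop-in files.

`/etc/containerd/config.toml` (config version 2 layout):

```
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

`/etc/docker/daemon.json`:

```
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```
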
## Step 1: Validate the Native GPU Stack (Crucial Pre-flight Check)

Before installing HAMi, you must prove that Kubernetes can natively access the GPU. This step validates your GPU stack independently of HAMi.

### 1. Deploy a native test pod

```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```

Expected: the pod runs to completion and its logs show valid `nvidia-smi` output. If this fails, do NOT continue; fix your GPU setup first.

### 2. Verify execution

```
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-test --timeout=60s
kubectl logs cuda-test
```

Note: you must see the standard `nvidia-smi` output. Do not proceed if this fails.

## Step 2: Install HAMi

Once the baseline is verified, label your GPU node so the HAMi scheduler can manage it, then deploy HAMi via Helm.

### 1. Label the node

```
kubectl label nodes $(hostname) gpu=on --overwrite
```

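Note that `$(hostname)` assumes you are running `kubectl` on the GPU node itself and that the Kubernetes node name matches the hostname; otherwise substitute the name reported by `kubectl get nodes`. Either way, you can confirm the label took effect:

```
kubectl get nodes -L gpu
```
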
### 2. Deploy using Helm

```
helm repo add hami-charts https://project-hami.github.io/HAMi/
helm install hami hami-charts/hami -n kube-system
```

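If you prefer to pin a specific release rather than whatever is latest, standard Helm commands let you list and select a chart version; `<chart-version>` below is a placeholder:

```
helm repo update
helm search repo hami-charts/hami --versions
helm install hami hami-charts/hami -n kube-system --version <chart-version>
```
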
### 3. Verify components

```
kubectl get pods -n kube-system | grep hami
```

Expected: both the `hami-scheduler` and `vgpu-device-plugin` pods should be in the `Running` state.

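Optionally, confirm that the node now advertises GPU resources to the scheduler. The exact resource names and counts depend on your HAMi configuration (device splitting, for example, can multiply the advertised `nvidia.com/gpu` count), so treat the output as informational:

```
kubectl get node $(hostname) -o jsonpath='{.status.allocatable}'
```
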
## Step 3: Launch and Verify a vGPU Task

Let's prove HAMi is enforcing fractional resource limits (vGPU).

### 1. Submit a vGPU demo task

```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 10240
EOF
```

The `nvidia.com/gpumem` limit is expressed in MiB, so this pod requests one GPU with a 10240 MiB device-memory cap.

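Before checking isolation, it can be useful to confirm where the pod landed; this is plain kubectl, nothing HAMi-specific:

```
kubectl get pod gpu-pod -o wide
```
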
### 2. Verify resource control inside the container

```
kubectl wait --for=condition=Ready pod/gpu-pod --timeout=60s
kubectl exec -it gpu-pod -- nvidia-smi
```

Expected: you will see the `[HAMI-core Msg...]` initialization lines, and the `nvidia-smi` table will show exactly `10240MiB` of total memory, proving vGPU isolation is active.

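For a tighter check than reading the table, you can query the value directly; the expected figure simply mirrors the `nvidia.com/gpumem` limit set on the demo pod:

```
kubectl exec gpu-pod -- nvidia-smi --query-gpu=memory.total --format=csv
# should report the 10240 MiB cap set above
```
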
## Troubleshooting Order

If you encounter issues, work through this sequence:

1. Hardware/drivers: run `nvidia-smi` directly on the host.
2. Container runtime: ensure `sudo ctr run` or `docker run` can reach the GPU outside Kubernetes (see the example after this list).
3. Stale plugins: remove conflicting device plugins: `kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system --ignore-not-found`.
4. Node resources: verify Kubernetes sees the GPU: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | grep -i nvidia`.
5. Scheduler layer: check the HAMi scheduler logs: `kubectl logs -n kube-system -l app=hami-scheduler`.

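As an illustration of item 2, a host-level smoke test with Docker might look like the line below (it reuses the CUDA image from Step 1; adapt it to `ctr` if you run containerd without Docker):

```
sudo docker run --rm --gpus all nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```
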
## Cleanup

```
kubectl delete pod cuda-test gpu-pod --ignore-not-found
```
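
If you also want to undo the installation itself rather than just the test pods, the steps in this guide reverse cleanly; the commands below assume the release name, namespace, and node label used above:

```
helm uninstall hami -n kube-system
kubectl label nodes $(hostname) gpu-
```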
