[Workload]: gpu-compute #164

### Workload Name

gpu-compute

### Workload Description

GPU stress workload that runs sustained compute operations on a passthrough or vGPU device inside a KubeVirt VM. Produces continuous GPU utilization, memory bandwidth pressure, and thermal load to validate the GPU operator, device plugin, and driver stack on OpenShift Virtualization.

NVIDIA is a major Red Hat partner, and GPU passthrough is a growing use case for OpenShift Virtualization: AI/ML inference, VDI (Virtual Desktop Infrastructure), HPC, and video transcoding all run in VMs with GPU access. The GPU device plugin allocates GPUs to VM pods, the NVIDIA GPU Operator manages drivers, and vGPU (NVIDIA vGPU software, formerly GRID) enables GPU sharing across VMs. This workload validates the entire stack under sustained load, not just device attachment.

No existing virtwork workload exercises GPU hardware. CPU, memory, disk, and network all go through KubeVirt's standard resource model, whereas GPU passthrough requires PCI device assignment, IOMMU configuration, and driver installation inside the VM. That is a separate resource path that needs its own validation signal.

### Tooling and Packages

- Tool: `gpu-burn` (CUDA stress test) for NVIDIA GPUs, or `hashcat` (GPU-accelerated hash cracking) as a portable alternative
- RPM packages:
  - NVIDIA driver + CUDA toolkit (installed via cloud-init or pre-baked in a golden image)
  - `gpu-burn`: built from source (small C/CUDA program) or pre-compiled binary
  - Alternative: `hashcat` (available in EPEL, supports CUDA and OpenCL)
- systemd service command:
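  - gpu-burn: `gpu_burn <seconds>` (stresses all detected GPUs for the given run time). Note that `-d` selects double precision rather than a device or a duration, and the upstream README does not document `0` as "no timeout", so indefinite operation should be verified against the build in use or approximated with a long duration plus `Restart=always`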
  - hashcat: `hashcat -b --force` (benchmark mode, exercises all detected GPUs)
- Configurable parameters:
  - `gpu-tool`: which stress tool to use (default: `gpu-burn`)
  - `gpu-memory-percent`: percentage of GPU memory to use (default: 90)
  - `gpu-duration`: burn duration per cycle in seconds (default: 0 = indefinite)
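  - `gpu-precision`: float precision, one of `single`, `double`, `half` (default: `single`). gpu-burn itself exposes only `-d` (doubles) and `-tc` (Tensor Cores), so `half` needs a defined mapping or a different tool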
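
As a rough illustration of how these parameters could translate into a gpu-burn command line, here is a minimal sketch. The helper name `build_gpu_burn_args` is hypothetical. The flags (`-m N%`, `-d`, `-tc`) follow the upstream gpu-burn README. Two mappings are assumptions rather than anything gpu-burn defines: `half` is mapped to `-tc` as the closest available mode, and a duration of `0` is turned into a one-day cycle that `Restart=always` would repeat.

```bash
# Hypothetical helper: turn the workload parameters into gpu_burn arguments.
# Usage: build_gpu_burn_args <gpu-memory-percent> <gpu-duration> <gpu-precision>
build_gpu_burn_args() {
  local mem_percent="${1:-90}"
  local duration="${2:-0}"
  local precision="${3:-single}"
  local args

  # Accept only plain non-negative integers for the numeric parameters.
  case "$mem_percent" in
    ''|*[!0-9]*) echo "invalid gpu-memory-percent: $mem_percent" >&2; return 1 ;;
  esac
  case "$duration" in
    ''|*[!0-9]*) echo "invalid gpu-duration: $duration" >&2; return 1 ;;
  esac
  if [ "$mem_percent" -lt 1 ] || [ "$mem_percent" -gt 100 ]; then
    echo "gpu-memory-percent out of range: $mem_percent" >&2
    return 1
  fi

  # Assumption: 0 ("indefinite") becomes a one-day cycle; systemd restarts it.
  if [ "$duration" -eq 0 ]; then
    duration=86400
  fi

  args="-m ${mem_percent}%"
  case "$precision" in
    single) ;;                      # gpu_burn default, no flag needed
    double) args="$args -d" ;;      # -d = use doubles
    half)   args="$args -tc" ;;     # assumption: Tensor Core mode, not true FP16
    *) echo "invalid gpu-precision: $precision" >&2; return 1 ;;
  esac

  printf '%s\n' "$args $duration"
}
```

For example, `/opt/gpu-burn/gpu_burn $(build_gpu_burn_args 90 3600 double)` would expand to `/opt/gpu-burn/gpu_burn -m 90% -d 3600`.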

### VM Count Model

Single VM (like cpu, memory, disk)

### Required Resources

- [ ] Persistent storage (DataVolume)
- [ ] Kubernetes Service (for inter-VM communication)
- [ ] Kubernetes Secret (for credentials or config)
- [x] Additional CPU/memory beyond defaults
- [x] GPU or special device passthrough

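The VM spec must request the GPU through `spec.domain.devices.gpus`, naming the device-plugin resource in `deviceName`; KubeVirt uses this field for both passthrough and vGPU devices. A plain resource limit such as `nvidia.com/gpu: 1` is how containers request GPUs, so whether it applies to VM pods should be confirmed on the target cluster. This requires: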
- NVIDIA GPU Operator installed on the cluster
- GPU device plugin configured for passthrough or vGPU
- IOMMU enabled on the host nodes
- Sufficient GPU capacity on the cluster
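
For illustration, the `gpus` stanza that goes under `spec.domain.devices` could be rendered by a small helper like the one below. This is a sketch only: `render_gpu_devices` is a hypothetical name, and `nvidia.com/EXAMPLE_GPU` (used in the example afterwards) is a placeholder, since the real resource name depends on how the cluster exposes the device.

```bash
# Hypothetical helper: print a KubeVirt "gpus" list with <count> entries that
# all reference the same device-plugin resource name.
# Usage: render_gpu_devices <deviceName> [count]
render_gpu_devices() {
  local device_name="${1:-}"
  local count="${2:-1}"
  local i=1

  if [ -z "$device_name" ]; then
    echo "device name required" >&2
    return 1
  fi
  # Count must be a positive integer without leading zeros.
  case "$count" in
    ''|*[!0-9]*|0*) echo "count must be a positive integer" >&2; return 1 ;;
  esac

  printf 'gpus:\n'
  while [ "$i" -le "$count" ]; do
    printf '  - name: gpu%s\n    deviceName: %s\n' "$i" "$device_name"
    i=$((i + 1))
  done
}
```

Calling `render_gpu_devices nvidia.com/EXAMPLE_GPU 2` prints a two-entry list, which is also how a VM would ask for more than one GPU.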

### Cloud-Init Details

```yaml
packages:
  - pciutils
  - gcc
  - gcc-c++  # provides g++, which gpu-burn's Makefile is expected to invoke
  - make
  - git
write_files:
  - path: /usr/local/bin/virtwork-gpu-setup.sh
    permissions: '0755'
    content: |
      #!/bin/bash
      set -euo pipefail

      # Detect GPU
      if ! lspci | grep -qi nvidia; then
        echo "ERROR: No NVIDIA GPU detected" >&2
        exit 1
      fi

      # Install NVIDIA driver and CUDA toolkit
      # This assumes a RHEL/Fedora base with access to NVIDIA repos
      dnf config-manager --add-repo \
        https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
      dnf install -y cuda-toolkit nvidia-driver

      # Build gpu-burn (skip the clone if a previous run already fetched it)
      if [ ! -d /opt/gpu-burn/.git ]; then
        git clone https://github.com/wilicc/gpu-burn.git /opt/gpu-burn
      fi
      cd /opt/gpu-burn
      make

      # Verify GPU is functional
      nvidia-smi
  - path: /etc/systemd/system/virtwork-gpu-compute.service
    content: |
      [Unit]
      Description=Virtwork GPU compute stress workload
      After=multi-user.target nvidia-persistenced.service
      Wants=nvidia-persistenced.service
      [Service]
      Type=simple
      WorkingDirectory=/opt/gpu-burn
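      # gpu_burn takes the run time in seconds as its positional argument.
      # Upstream does not document 0 as "no timeout", so confirm this value
      # against the build in use; Restart=always starts the next cycle on exit.
      ExecStart=/opt/gpu-burn/gpu_burn 0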
      Restart=always
      RestartSec=10
      [Install]
      WantedBy=multi-user.target
runcmd:
  - /usr/local/bin/virtwork-gpu-setup.sh
  - systemctl enable --now virtwork-gpu-compute.service
```

### Use Case

- **NVIDIA partnership validation:** NVIDIA is a strategic Red Hat partner. Validating GPU passthrough and vGPU under sustained compute load on OpenShift Virtualization is critical for joint solutions. gpu-burn generates the sustained utilization, memory-bandwidth, and thermal load that NVIDIA and Red Hat engineering can use to verify the stack.
- **GPU Operator validation:** The NVIDIA GPU Operator manages driver lifecycle, device plugin, and monitoring on OpenShift. Sustained GPU load from VMs validates that the operator correctly manages driver state, handles VM restarts (driver re-initialization), and reports accurate GPU metrics via DCGM (Data Center GPU Manager) exporter.
- **AI/ML platform partners (Run:ai, Determined AI, Domino Data Lab):** Partners building GPU scheduling and orchestration platforms need sustained GPU workloads in VMs to validate their scheduler's GPU utilization tracking, fair-share allocation, and preemption behavior in a KubeVirt environment.
- **VDI partners (Citrix, VMware Horizon on OpenShift):** Virtual desktop workloads use vGPU for graphics acceleration. Partners need sustained GPU load to validate vGPU performance isolation, memory partitioning, and frame buffer management when multiple VMs share a physical GPU.
- **Monitoring partners:** GPU metrics (utilization %, memory usage, temperature, power draw, ECC errors) are exposed via `nvidia-smi` and DCGM exporter. Monitoring partners need a VM producing sustained GPU signals to validate their dashboards and alerting — idle GPU metrics are not useful for validation.
- **HPC partners:** High-performance computing workloads in VMs (molecular dynamics, CFD, financial modeling) require sustained GPU compute. Partners need to validate that PCI passthrough delivers near-bare-metal GPU performance, and this workload provides the sustained load to measure overhead.

### Additional Context

- **Cluster prerequisites are strict:** This workload only functions on clusters with GPU hardware, NVIDIA GPU Operator, and device plugin configured for VM passthrough. The implementation should detect missing prerequisites and fail with a clear, actionable error rather than creating a VM that can't start.
- **VM spec GPU configuration:** KubeVirt attaches both kinds of GPU through `spec.domain.devices.gpus[]`, and the resource named in `deviceName` selects the model:
  1. **PCI passthrough**: `deviceName` names a whole-GPU resource; dedicated GPU per VM, near-bare-metal performance
  2. **vGPU**: `deviceName` names a mediated-device (vGPU profile) resource; shared GPU via NVIDIA vGPU software (formerly GRID), multiple VMs per physical GPU. `virtualGPUOptions` only tunes optional display settings for a vGPU.
  The workload should support both, with the GPU resource request configurable via CLI/YAML.
- **Golden image approach (recommended):** Installing NVIDIA drivers and CUDA toolkit via cloud-init takes 10-20 minutes and requires internet access for repo downloads. A golden image with drivers pre-installed (`--container-disk-image`) reduces boot-to-workload time to under 2 minutes. Document both approaches, recommend golden image for production validation.
- **gpu-burn vs hashcat:**
  - `gpu-burn`: purpose-built CUDA stress test, exercises matrix multiplication (SGEMM/DGEMM), configurable precision, verifies results for correctness (detects hardware errors). Preferred for validation.
  - `hashcat`: widely available, supports CUDA and OpenCL (works with AMD GPUs too), benchmark mode exercises multiple GPU kernels. Better portability but less focused stress profile.
- **Monitoring integration:** `nvidia-smi --query-gpu=utilization.gpu,utilization.memory,temperature.gpu,power.draw --format=csv -l 5` provides per-GPU metrics every 5 seconds. DCGM exporter (deployed by GPU Operator) exposes these as Prometheus metrics — pairs naturally with the `metrics-emitter` workload's ServiceMonitor pattern.
- **Multi-GPU scaling:** On nodes with multiple GPUs, the VM can request more than one by listing several entries under `spec.domain.devices.gpus`. gpu-burn automatically detects and stresses all available GPUs. With vGPU, the chosen profile gives each VM a fixed slice of a physical GPU.
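- **Error detection:** gpu-burn checks the results of its repeated matrix multiplications for consistency on the GPU and counts mismatches. If the GPU produces incorrect results (hardware error, driver bug, overheating), gpu-burn reports the errors and marks the GPU as faulty in its final summary. With `Restart=always`, systemd starts the next cycle, and the reports remain visible in `journalctl`, which makes them a useful signal for reliability validation. Whether the process also exits non-zero should be confirmed for the gpu-burn version in use.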
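
To make the monitoring point concrete, a validation script could read `nvidia-smi` CSV rows and flag any GPU that is not under load. The sketch below assumes rows in the order `index, utilization.gpu, temperature.gpu, power.draw` with no header or units. The helper name `check_gpu_load` and the default threshold of 90% are hypothetical choices, and a non-numeric utilization value is treated as 0.

```bash
# Hypothetical helper: read nvidia-smi CSV rows on stdin, print one status
# line per GPU, and return non-zero if any GPU is below the utilization
# threshold or if no rows were read at all.
# Usage: ... | check_gpu_load [min-utilization-percent]
check_gpu_load() {
  local min_util="${1:-90}"

  awk -F', *' -v min="$min_util" '
    NF >= 2 {
      seen = 1
      # Field 1 is the GPU index, field 2 is utilization.gpu in percent.
      status = ($2 + 0 >= min + 0) ? "busy" : "idle"
      if (status == "idle") bad = 1
      printf "gpu=%s util=%s%% status=%s\n", $1, $2, status
    }
    END { exit ((bad || !seen) ? 1 : 0) }
  '
}
```

Inside the VM this could be fed with `nvidia-smi --query-gpu=index,utilization.gpu,temperature.gpu,power.draw --format=csv,noheader,nounits | check_gpu_load 90`.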
