Workload Name
gpu-compute
Workload Description
GPU stress workload that runs sustained compute operations on a passthrough or vGPU device inside a KubeVirt VM. Produces continuous GPU utilization, memory bandwidth pressure, and thermal load to validate the GPU operator, device plugin, and driver stack on OpenShift Virtualization.
NVIDIA is a major Red Hat partner and GPU passthrough is a growing use case for OpenShift Virtualization — AI/ML inference, VDI (Virtual Desktop Infrastructure), HPC, and video transcoding all run in VMs with GPU access. The GPU device plugin allocates GPUs to VM pods, the NVIDIA GPU Operator manages drivers, and vGPU (via NVIDIA GRID) enables GPU sharing across VMs. This workload validates the entire stack under sustained load, not just device attachment.
No existing virtwork workload exercises GPU hardware. CPU, memory, disk, and network are all host-side resources managed by KubeVirt's standard resource model. GPU passthrough requires PCI device assignment, IOMMU configuration, and driver installation inside the VM — a completely separate resource path that needs its own validation signal.
Tooling and Packages
- Tool: gpu-burn (CUDA stress test) for NVIDIA GPUs, or hashcat (GPU-accelerated hash cracking) as a portable alternative
- RPM packages:
  - NVIDIA driver + CUDA toolkit (installed via cloud-init or pre-baked in a golden image)
  - gpu-burn: built from source (small C/CUDA program) or pre-compiled binary
  - Alternative: hashcat (available in EPEL, supports CUDA and OpenCL)
- systemd service command:
  - gpu-burn: gpu_burn 0 (stresses all GPUs; the run time is a positional argument in seconds, and -d selects double precision rather than a duration. Confirm that the gpu-burn build in use treats 0 as "no timeout"; if it does not, pass a long duration and let Restart=always cycle the burn)
  - hashcat: hashcat -b --force (benchmark mode, exercises all detected GPUs)
- Configurable parameters:
  - gpu-tool: which stress tool to use (default: gpu-burn)
  - gpu-memory-percent: percentage of GPU memory to use (default: 90)
  - gpu-duration: burn duration per cycle in seconds (default: 0 = indefinite)
  - gpu-precision: float precision, one of single, double, half (default: single)
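The parameters above could be mapped onto a gpu_burn command line roughly as follows. This is a minimal sketch, not the virtwork implementation: the function name build_gpu_burn_args is hypothetical, and the flag meanings (-m N% for memory, -d for double precision, positional run time) are taken from the upstream gpu-burn README. The sketch rejects half because it assumes gpu_burn has no dedicated half-precision flag; that assumption should be checked against the gpu-burn version actually shipped.

```shell
#!/bin/bash
# Hypothetical helper: translate the proposed virtwork parameters into
# gpu_burn arguments. Flag meanings follow the upstream gpu-burn README:
#   -m N%  use N% of the available GPU memory
#   -d     use double precision
#   TIME   positional run time in seconds
build_gpu_burn_args() {
    local mem_percent="$1" duration="$2" precision="$3"
    local args="-m ${mem_percent}%"

    case "$precision" in
        single) ;;                      # gpu_burn default, no flag needed
        double) args="$args -d" ;;
        *)
            # Assumption: no gpu_burn flag is known for this precision.
            echo "unsupported gpu-precision: $precision" >&2
            return 1
            ;;
    esac

    echo "$args $duration"
}
```

For example, build_gpu_burn_args 90 3600 double prints the arguments for a one-hour double-precision burn using 90% of GPU memory.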
VM Count Model
Single VM (like cpu, memory, disk)
Required Resources
The VM spec must request GPU resources via spec.domain.devices.gpus (for passthrough) or resource limits (nvidia.com/gpu: 1). This requires:
- NVIDIA GPU Operator installed on the cluster
- GPU device plugin configured for passthrough or vGPU
- IOMMU enabled on the host nodes
- Sufficient GPU capacity on the cluster
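The capacity requirement can be checked before the VM is created. The sketch below is one possible approach under stated assumptions: the function names are hypothetical, and the input is expected as tab-separated "node name, allocatable GPU count" lines, with an empty count for nodes that do not advertise the resource.

```shell
#!/bin/bash
# Sum allocatable GPUs from lines of "<node-name><TAB><count>" on stdin.
# Nodes without the resource produce an empty count, which counts as 0.
total_allocatable_gpus() {
    awk -F'\t' '{ total += ($2 == "" ? 0 : $2) } END { print total + 0 }'
}

# Fail with an actionable message when fewer GPUs exist than requested.
check_gpu_capacity() {
    local requested="$1" available
    available="$(total_allocatable_gpus)"
    if [ "$available" -lt "$requested" ]; then
        echo "ERROR: requested $requested GPU(s) but cluster advertises $available;" \
             "check the GPU Operator and device plugin configuration" >&2
        return 1
    fi
    echo "OK: $available GPU(s) allocatable"
}
```

The input could come from something like oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' piped into check_gpu_capacity 1. The resource name nvidia.com/gpu is taken from the text above; clusters configured for VM passthrough or vGPU may advertise a different, device-specific resource name, so treat it as a placeholder.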
Cloud-Init Details
packages:
  - pciutils
  - gcc
  - make
  - git
write_files:
  - path: /usr/local/bin/virtwork-gpu-setup.sh
    permissions: '0755'
    content: |
      #!/bin/bash
      set -euo pipefail
      # Detect GPU
      if ! lspci | grep -qi nvidia; then
        echo "ERROR: No NVIDIA GPU detected" >&2
        exit 1
      fi
      # Install NVIDIA driver and CUDA toolkit
      # This assumes a RHEL/Fedora base with access to NVIDIA repos
      dnf config-manager --add-repo \
        https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
      dnf install -y cuda-toolkit nvidia-driver
      # Build gpu-burn
      git clone https://github.com/wilicc/gpu-burn.git /opt/gpu-burn
      cd /opt/gpu-burn
      make
      # Verify GPU is functional
      nvidia-smi
  - path: /etc/systemd/system/virtwork-gpu-compute.service
    content: |
      [Unit]
      Description=Virtwork GPU compute stress workload
      After=multi-user.target nvidia-persistenced.service
      Wants=nvidia-persistenced.service
      [Service]
      Type=simple
      WorkingDirectory=/opt/gpu-burn
      ExecStart=/opt/gpu-burn/gpu_burn 0
      Restart=always
      RestartSec=10
      [Install]
      WantedBy=multi-user.target
runcmd:
  - /usr/local/bin/virtwork-gpu-setup.sh
  - systemctl enable --now virtwork-gpu-compute.service
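The setup script calls nvidia-smi once, immediately after installing the driver. If the kernel module is not ready at that moment (an assumption, not something the proposal reports), that single call would abort the script under set -e. A small retry helper is one way to tolerate this; the name retry and the attempt and delay values are illustrative only.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between failures. Returns 0 on the first success.
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "ERROR: '$*' failed after $attempts attempt(s)" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# In the setup script this could replace the bare nvidia-smi call:
#   retry 12 5 nvidia-smi
```

Because the failing command runs as the condition of until, the helper is compatible with set -euo pipefail: only the final return 1 causes the script to exit.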
Use Case
- NVIDIA partnership validation: NVIDIA is a strategic Red Hat partner. Validating GPU passthrough and vGPU under sustained compute load on OpenShift Virtualization is critical for joint solutions. gpu-burn produces GPU utilization, memory bandwidth, and thermal data that NVIDIA and Red Hat engineering use to verify the stack.
- GPU Operator validation: The NVIDIA GPU Operator manages driver lifecycle, device plugin, and monitoring on OpenShift. Sustained GPU load from VMs validates that the operator correctly manages driver state, handles VM restarts (driver re-initialization), and reports accurate GPU metrics via DCGM (Data Center GPU Manager) exporter.
- AI/ML platform partners (Run:ai, Determined AI, Domino Data Lab): Partners building GPU scheduling and orchestration platforms need sustained GPU workloads in VMs to validate their scheduler's GPU utilization tracking, fair-share allocation, and preemption behavior in a KubeVirt environment.
- VDI partners (Citrix, VMware Horizon on OpenShift): Virtual desktop workloads use vGPU for graphics acceleration. Partners need sustained GPU load to validate vGPU performance isolation, memory partitioning, and frame buffer management when multiple VMs share a physical GPU.
- Monitoring partners: GPU metrics (utilization %, memory usage, temperature, power draw, ECC errors) are exposed via nvidia-smi and DCGM exporter. Monitoring partners need a VM producing sustained GPU signals to validate their dashboards and alerting; idle GPU metrics are not useful for validation.
- HPC partners: High-performance computing workloads in VMs (molecular dynamics, CFD, financial modeling) require sustained GPU compute. Partners need to validate that PCI passthrough delivers near-bare-metal GPU performance, and this workload provides the sustained load to measure overhead.
Additional Context
- Cluster prerequisites are strict: This workload only functions on clusters with GPU hardware, NVIDIA GPU Operator, and device plugin configured for VM passthrough. The implementation should detect missing prerequisites and fail with a clear, actionable error rather than creating a VM that can't start.
- VM spec GPU configuration: KubeVirt supports two GPU attachment models:
  - PCI passthrough (spec.domain.devices.gpus[].deviceName): dedicated GPU per VM, near-bare-metal performance
  - vGPU (spec.domain.devices.gpus[].virtualGPUOptions): shared GPU via NVIDIA GRID, multiple VMs per physical GPU
  The workload should support both, with the GPU resource request configurable via CLI/YAML.
- Golden image approach (recommended): Installing NVIDIA drivers and CUDA toolkit via cloud-init takes 10-20 minutes and requires internet access for repo downloads. A golden image with drivers pre-installed (--container-disk-image) reduces boot-to-workload time to under 2 minutes. Document both approaches, and recommend the golden image for production validation.
- gpu-burn vs hashcat:
  - gpu-burn: purpose-built CUDA stress test, exercises matrix multiplication (SGEMM/DGEMM), configurable precision, verifies results for correctness (detects hardware errors). Preferred for validation.
  - hashcat: widely available, supports CUDA and OpenCL (works with AMD GPUs too), benchmark mode exercises multiple GPU kernels. Better portability but less focused stress profile.
- Monitoring integration: nvidia-smi --query-gpu=utilization.gpu,utilization.memory,temperature.gpu,power.draw --format=csv -l 5 provides per-GPU metrics every 5 seconds. DCGM exporter (deployed by GPU Operator) exposes these as Prometheus metrics, which pairs naturally with the metrics-emitter workload's ServiceMonitor pattern.
- Multi-GPU scaling: On nodes with multiple GPUs, the VM can request more than one (nvidia.com/gpu: 2). gpu-burn automatically detects and stresses all available GPUs. vGPU configurations can assign fractional GPU resources.
- Error detection: gpu-burn verifies computation results against CPU reference values. If the GPU produces incorrect results (hardware error, driver bug, thermal throttling), gpu-burn reports the error and exits non-zero. systemd's Restart=always restarts it, and the error is visible in journalctl, which is a useful signal for reliability validation.
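The monitoring signal described above can also be turned into a simple pass/fail check that the workload is actually loading every GPU. The following is a sketch under assumptions: the function name all_gpus_busy and the 90% threshold are illustrative, and the input is expected to be one integer utilization value per line, one line per GPU.

```shell
#!/bin/bash
# Read utilization.gpu values (one integer per line, one line per GPU)
# from stdin and succeed only if every GPU is at or above the threshold.
# An empty input (no GPUs reported) is treated as a failure.
all_gpus_busy() {
    local threshold="$1"
    awk -v min="$threshold" '
        { seen++; if ($1 + 0 < min) idle++ }
        END { if (seen > 0 && idle == 0) exit 0; exit 1 }
    '
}
```

Inside the VM, the input could come from nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits piped into all_gpus_busy 90. A single sample can catch a GPU between burn cycles, so a real check would likely sample several times before declaring the workload idle.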