Workload Name
disk-snapshot
Workload Description
A variant of the existing disk workload that runs fio against a data disk while a sidecar script periodically triggers CSI VolumeSnapshots, validating I/O consistency and data integrity during snapshot operations. Produces the same sustained mixed I/O signals as the disk workload (IOPS, throughput, latency) plus snapshot-specific signals: snapshot creation latency, snapshot size, and — critically — whether the snapshot captures a crash-consistent or application-consistent state while I/O is actively in flight.
This is the scenario every storage partner dreads and every storage partner must prove works: "What happens when you snapshot a disk that is actively being written to?" The existing disk workload generates fio I/O, and a partner can manually trigger snapshots during a run, but there's no automated, reproducible way to validate snapshot-under-load behavior. This workload makes that test continuous and auditable.
The sidecar script runs inside the VM (not as a Kubernetes controller) and uses kubectl (or oc) to create VolumeSnapshot CRs against the VM's data disk PVC at configurable intervals. After each snapshot, it optionally verifies the snapshot by restoring it to a temporary PVC and running a filesystem check.
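The optional restore step could start from a PVC whose dataSource points at the snapshot. The sketch below only prints such a manifest; the function name, the "-verify" name suffix, the extra label value, and the storage class and size arguments are illustrative assumptions, not part of virtwork. Creating the PVC would also require PVC permissions beyond the snapshot RBAC described in this proposal.

```bash
# Hypothetical helper: print a PVC manifest that restores a VolumeSnapshot.
# Arguments: snapshot name, namespace, storage class, requested size.
render_restore_pvc() {
  local snapshot="$1" namespace="$2" storage_class="$3" size="$4"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${snapshot}-verify
  namespace: ${namespace}
  labels:
    app.kubernetes.io/managed-by: virtwork
    virtwork/component: disk-snapshot-verify
spec:
  storageClassName: ${storage_class}
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: ${snapshot}
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size}
EOF
}
```

A possible invocation, assuming kubectl is authenticated: render_restore_pvc "$SNAP_NAME" "$NAMESPACE" my-storage-class 10Gi | kubectl create -f -. The requested size must be at least the size of the snapshot's source volume.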
Tooling and Packages
I/O workload (identical to existing disk workload):
- Tool: fio with mixed random/sequential profiles
- RPM packages: fio
- systemd service: virtwork-disk.service (reused from existing disk workload)
Snapshot trigger sidecar:
- Tool: kubectl (or oc) CLI inside the VM, authenticated via a projected ServiceAccount token
- RPM packages: kubernetes-client or direct binary download of kubectl
- systemd service: virtwork-disk-snapshot.service
- Command loop:
while true; do
kubectl create -f /etc/virtwork/snapshot-template.yaml
sleep $SNAPSHOT_INTERVAL
done
- Configurable parameters:
snapshot-interval: seconds between snapshot triggers (default: 300 — every 5 minutes)
snapshot-class: VolumeSnapshotClass name (required — depends on storage backend)
snapshot-retain: number of snapshots to retain before deleting oldest (default: 5)
snapshot-verify: whether to restore and fsck each snapshot (default: false — expensive)
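The parameters above could be checked before the sidecar loop starts, so that a bad value or a missing VolumeSnapshotClass fails fast rather than on the first snapshot. This is a minimal sketch: the function name and argument order are hypothetical, and it assumes kubectl is on PATH and authenticated with permission to read the cluster-scoped VolumeSnapshotClass.

```bash
# Hypothetical pre-flight check for the sidecar parameters.
# Arguments: interval, snapshot class, retain count, verify flag.
# Returns 0 when everything is usable, 1 otherwise (message on stderr).
validate_snapshot_params() {
  local interval="$1" class="$2" retain="$3" verify="$4"

  # snapshot-interval and snapshot-retain must be positive integers.
  if ! [[ "$interval" =~ ^[1-9][0-9]*$ ]]; then
    echo "ERROR: snapshot-interval must be a positive integer, got '$interval'" >&2
    return 1
  fi
  if ! [[ "$retain" =~ ^[1-9][0-9]*$ ]]; then
    echo "ERROR: snapshot-retain must be a positive integer, got '$retain'" >&2
    return 1
  fi

  # snapshot-verify is a boolean switch.
  case "$verify" in
    true|false) ;;
    *) echo "ERROR: snapshot-verify must be true or false, got '$verify'" >&2
       return 1 ;;
  esac

  # snapshot-class has no default and must exist on the cluster.
  if [[ -z "$class" ]]; then
    echo "ERROR: snapshot-class is required but was not specified" >&2
    return 1
  fi
  if ! kubectl get volumesnapshotclass "$class" >/dev/null 2>&1; then
    echo "ERROR: VolumeSnapshotClass '$class' not found on the cluster" >&2
    return 1
  fi
}
```

For example, validate_snapshot_params 300 my-snapclass 5 false returns 0 only if my-snapclass exists on the cluster.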
VM Count Model
Single VM (like cpu, memory, disk)
Required Resources
The DataVolume is the same as the existing disk workload: a CDI-provisioned persistent disk for fio. The Secret holds a ServiceAccount token (or kubeconfig) that grants the VM's snapshot sidecar permission to manage VolumeSnapshot CRs. Additional RBAC is needed: a ServiceAccount with create/delete/get/list/watch on volumesnapshots.snapshot.storage.k8s.io in the workload namespace (the sidecar lists and deletes snapshots when pruning, and kubectl wait watches each new snapshot until it is ready).
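The RBAC objects described here might look like the manifests printed by the sketch below. The function name and the default resource name virtwork-snapshot-sidecar are assumptions for illustration, not names virtwork defines. The watch verb is included on the assumption that the sidecar uses kubectl wait.

```bash
# Hypothetical helper: print ServiceAccount, Role, and RoleBinding manifests
# for the snapshot sidecar. Arguments: namespace, optional resource name.
render_snapshot_rbac() {
  local namespace="$1"
  local name="${2:-virtwork-snapshot-sidecar}"
  cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${name}
  namespace: ${namespace}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${name}
  namespace: ${namespace}
rules:
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots"]
    verbs: ["create", "delete", "get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${name}
  namespace: ${namespace}
subjects:
  - kind: ServiceAccount
    name: ${name}
    namespace: ${namespace}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ${name}
EOF
}
```

With cluster credentials available, the output could be applied with render_snapshot_rbac my-namespace | kubectl apply -f -.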
Cloud-Init Details
packages:
  - fio
write_files:
  - path: /etc/virtwork/snapshot-template.yaml
    content: |
      apiVersion: snapshot.storage.k8s.io/v1
      kind: VolumeSnapshot
      metadata:
        generateName: virtwork-disk-snap-
        namespace: <namespace>
        labels:
          app.kubernetes.io/managed-by: virtwork
          virtwork/component: disk-snapshot
      spec:
        volumeSnapshotClassName: <snapshot-class>
        source:
          persistentVolumeClaimName: <pvc-name>
  - path: /usr/local/bin/virtwork-snapshot-sidecar.sh
    permissions: '0755'
    content: |
      #!/bin/bash
      set -euo pipefail
      INTERVAL="${SNAPSHOT_INTERVAL:-300}"
      RETAIN="${SNAPSHOT_RETAIN:-5}"
      NAMESPACE="${SNAPSHOT_NAMESPACE:-<namespace>}"
      KUBECONFIG="${KUBECONFIG:-/etc/virtwork/kubeconfig}"
      export KUBECONFIG
      while true; do
        echo "=== Creating snapshot at $(date -Iseconds) ==="
        # Create the snapshot. The template uses generateName, which
        # "kubectl apply" rejects, so use "kubectl create" and capture the
        # generated name from its output instead of looking it up afterwards.
        SNAP_NAME=$(kubectl create -f /etc/virtwork/snapshot-template.yaml \
          -o jsonpath='{.metadata.name}')
        # Wait for snapshot to be ready
        if kubectl wait volumesnapshot/"$SNAP_NAME" -n "$NAMESPACE" \
            --for=jsonpath='{.status.readyToUse}'=true \
            --timeout=120s 2>/dev/null; then
          echo "Snapshot $SNAP_NAME ready"
        else
          echo "WARNING: Snapshot $SNAP_NAME not ready within timeout"
        fi
        # Prune old snapshots beyond retention limit
        SNAP_COUNT=$(kubectl get volumesnapshot -n "$NAMESPACE" \
          -l virtwork/component=disk-snapshot \
          --no-headers 2>/dev/null | wc -l)
        if [ "$SNAP_COUNT" -gt "$RETAIN" ]; then
          DELETE_COUNT=$((SNAP_COUNT - RETAIN))
          # Oldest snapshots come first, so the head of the list is deleted.
          kubectl get volumesnapshot -n "$NAMESPACE" \
            -l virtwork/component=disk-snapshot \
            --sort-by=.metadata.creationTimestamp \
            -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
            head -n "$DELETE_COUNT" | \
            xargs -I{} kubectl delete volumesnapshot/{} -n "$NAMESPACE"
          echo "Pruned $DELETE_COUNT old snapshots"
        fi
        sleep "$INTERVAL"
      done
  - path: /etc/systemd/system/virtwork-disk-snapshot.service
    content: |
      [Unit]
      Description=Virtwork disk snapshot trigger sidecar
      After=network-online.target virtwork-disk.service
      Wants=network-online.target virtwork-disk.service
      [Service]
      Type=simple
      Environment=SNAPSHOT_INTERVAL=300
      Environment=SNAPSHOT_RETAIN=5
      Environment=SNAPSHOT_NAMESPACE=<namespace>
      ExecStart=/usr/local/bin/virtwork-snapshot-sidecar.sh
      Restart=always
      RestartSec=30
      [Install]
      WantedBy=multi-user.target
runcmd:
  - curl -Lo /usr/local/bin/kubectl "https://dl.k8s.io/release/$(curl -sL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
  - chmod +x /usr/local/bin/kubectl
  - systemctl enable --now virtwork-disk.service
  - systemctl enable --now virtwork-disk-snapshot.service
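The retention step in the sidecar script can be reasoned about separately from the cluster. The sketch below isolates the selection logic as a pure function so it can be tested without kubectl; the function name is hypothetical, and it assumes the input is already sorted oldest first, as --sort-by=.metadata.creationTimestamp would produce.

```bash
# Hypothetical helper: read snapshot names (oldest first, one per line) on
# stdin and print the names that exceed the retention limit given as $1.
select_snapshots_to_prune() {
  local retain="$1"
  local -a names=()
  local name
  while IFS= read -r name; do
    # Skip blank lines so trailing newlines do not count as snapshots.
    if [[ -n "$name" ]]; then
      names+=("$name")
    fi
  done
  local excess=$(( ${#names[@]} - retain ))
  if (( excess > 0 )); then
    # The first "excess" entries are the oldest snapshots.
    printf '%s\n' "${names[@]:0:excess}"
  fi
}
```

With four snapshots a, b, c, d (oldest first) and a retention limit of 2, the function prints a and b; with fewer snapshots than the limit it prints nothing.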
Use Case
- Storage partners (CSI driver vendors — NetApp, Pure Storage, Dell, HPE, Red Hat ODF): The primary audience. Every CSI driver must handle VolumeSnapshot correctly while I/O is in flight. The critical questions: Does the snapshot complete without errors? Does it capture a consistent state? Does fio I/O latency spike during the snapshot? Does the snapshot size grow linearly with data written? This workload answers all four continuously and automatically.
- Backup/DR partners (Velero, Trilio, Kasten): These products use CSI VolumeSnapshots as the backup primitive. They need to validate that snapshot-based backups taken during active I/O are restorable and consistent. The --snapshot-verify option (restore + fsck) provides the restore validation that backup products depend on.
- Data protection / CDP partners (Zerto, Portworx): Continuous data protection products need to understand snapshot overhead and I/O impact at various snapshot frequencies. Tuning snapshot-interval from 300s down to 30s stresses the CDP pipeline and reveals performance cliffs.
- OpenShift platform validation: Validates that the CSI snapshot controller, external-snapshotter, and the storage backend work correctly together under sustained I/O from a KubeVirt VM — a less commonly tested path than pod-based snapshots.
Additional Context
- Relationship to existing disk workload: This workload is a superset of disk. The fio configuration is identical: the same systemd service, same I/O profiles, same data disk setup. The snapshot sidecar is an additional systemd service that runs alongside fio. Implementation options:
  1. New workload type (disk-snapshot): cleaner separation, explicit opt-in.
  2. Flag on existing disk workload (--enable-snapshots): less code duplication, but muddies the single responsibility of the disk workload.
  Option 1 is recommended because it follows the existing pattern where each workload type is a distinct, self-contained unit.
- RBAC requirements: The snapshot sidecar needs a ServiceAccount with permissions to manage VolumeSnapshot CRs. This is additional RBAC beyond what virtwork currently creates. The implementation should:
- Create a ServiceAccount in the workload namespace
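- Create a Role with create/delete/get/list/watch on volumesnapshots (watch is needed by kubectl wait). volumesnapshotcontents is cluster-scoped, so a namespaced Role cannot grant access to it; the sidecar script does not read it, and a ClusterRole would be needed if that changes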
- Create a RoleBinding
- Project the ServiceAccount token into the VM (or generate a kubeconfig Secret)
- VolumeSnapshotClass is required: Unlike other workloads that work with any storage class, this workload requires a VolumeSnapshotClass to be configured on the cluster. The implementation should fail fast with a clear error if snapshot-class is not specified or doesn't exist.
- Snapshot verification (optional): The --snapshot-verify mode restores each snapshot to a temporary PVC, mounts it (or attaches it to a temporary pod), runs fsck, and deletes the temporary PVC. This is expensive (it creates an extra PVC per snapshot cycle) but is the strongest consistency check the workload offers. Default to off; document as available for thorough validation.
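- I/O impact measurement: The most valuable metric is fio's latency during the snapshot window vs. outside it. fio's --output-format=json reports latency statistics aggregated over the run (or over each --status-interval report); for per-second samples, enable latency logging with --write_lat_log and --log_avg_msec=1000. Partners can correlate latency spikes with snapshot creation timestamps to quantify snapshot overhead.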
- Composability: This workload pairs with chaos-disk (snapshot during disk-full conditions) and chaos-network (snapshot when storage network is degraded) for comprehensive storage resilience testing.
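The latency correlation mentioned under I/O impact measurement could be approximated with a short awk pass over a fio latency log. This is a sketch under several assumptions: the log comes from fio 3.x with --write_lat_log and --log_avg_msec=1000, each line starts with "time_ms, latency_ns", the snapshot window has already been converted from wall-clock time to milliseconds since the fio job started, and reads and writes are averaged together. The function name is hypothetical.

```bash
# Hypothetical helper: compare mean latency inside and outside a snapshot
# window. Arguments: latency log path, window start (ms), window end (ms).
snapshot_latency_summary() {
  local log="$1" start_ms="$2" end_ms="$3"
  awk -F',' -v s="$start_ms" -v e="$end_ms" '
    {
      t = $1 + 0      # sample time in ms since job start
      lat = $2 + 0    # averaged latency in ns for that interval
      if (t >= s && t <= e) { win_sum += lat; win_n++ }
      else                  { rest_sum += lat; rest_n++ }
    }
    END {
      printf "window_mean_ns=%d rest_mean_ns=%d\n",
        (win_n ? win_sum / win_n : 0), (rest_n ? rest_sum / rest_n : 0)
    }' "$log"
}
```

A large gap between window_mean_ns and rest_mean_ns suggests the snapshot is affecting foreground I/O; repeating the comparison for each snapshot in a run shows whether the overhead is stable.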