[Workload]: disk-snapshot #167

Description

@mrhillsman

Workload Name

disk-snapshot

Workload Description

A variant of the existing disk workload that runs fio against a data disk while a sidecar script periodically triggers CSI VolumeSnapshots, validating I/O consistency and data integrity during snapshot operations. Produces the same sustained mixed I/O signals as the disk workload (IOPS, throughput, latency) plus snapshot-specific signals: snapshot creation latency, snapshot size, and — critically — whether the snapshot captures a crash-consistent or application-consistent state while I/O is actively in flight.

This is the scenario every storage partner dreads and every storage partner must prove works: "What happens when you snapshot a disk that is actively being written to?" The existing disk workload generates fio I/O, and a partner can manually trigger snapshots during a run, but there's no automated, reproducible way to validate snapshot-under-load behavior. This workload makes that test continuous and auditable.

The sidecar script runs inside the VM (not as a Kubernetes controller) and uses kubectl (or oc) to create VolumeSnapshot CRs against the VM's data disk PVC at configurable intervals. After each snapshot, it can optionally verify the snapshot by restoring it to a temporary PVC and running a filesystem check.
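
The restore step of the optional verification could look roughly like the following. This is a minimal sketch, not the proposed implementation: the helper name render_restore_pvc, the "-verify" name suffix, the virtwork/component label value, and the ReadWriteOnce access mode are all assumptions made for illustration. It only prints a PVC manifest whose dataSource points at an existing VolumeSnapshot.

```shell
#!/bin/bash
# Hypothetical helper: print a PVC manifest that restores from a VolumeSnapshot.
# Args: <snapshot-name> <namespace> <storage-class> <size>
# The "-verify" suffix, label value and access mode are illustrative assumptions.
render_restore_pvc() {
  local snap_name="$1" namespace="$2" storage_class="$3" size="$4"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${snap_name}-verify
  namespace: ${namespace}
  labels:
    app.kubernetes.io/managed-by: virtwork
    virtwork/component: disk-snapshot-verify
spec:
  storageClassName: ${storage_class}
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: ${snap_name}
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size}
EOF
}
```

The output can be piped to kubectl create -f - once a snapshot is ready. The requested size must be at least the snapshot's restore size, and a block-mode data disk would also need volumeMode: Block, which this sketch leaves out.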

Tooling and Packages

I/O workload (identical to existing disk workload):

  • Tool: fio with mixed random/sequential profiles
  • RPM packages: fio
  • systemd service: virtwork-disk.service (reused from existing disk workload)

Snapshot trigger sidecar:

  • Tool: kubectl (or oc) CLI inside the VM, authenticated via a projected ServiceAccount token
  • RPM packages: kubernetes-client or direct binary download of kubectl
  • systemd service: virtwork-disk-snapshot.service
  • Command loop:
    while true; do
      kubectl create -f /etc/virtwork/snapshot-template.yaml
      sleep $SNAPSHOT_INTERVAL
    done
  • Configurable parameters:
    • snapshot-interval: seconds between snapshot triggers (default: 300 — every 5 minutes)
    • snapshot-class: VolumeSnapshotClass name (required — depends on storage backend)
    • snapshot-retain: number of snapshots to retain before deleting oldest (default: 5)
    • snapshot-verify: whether to restore and fsck each snapshot (default: false — expensive)
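
The parameters above could be checked once before the loop starts, so a missing snapshot-class fails immediately instead of on the first kubectl call. A minimal sketch, assuming a hypothetical helper named validate_snapshot_config that receives the three values as arguments:

```shell
#!/bin/bash
# Hypothetical validation helper for the sidecar's parameters.
# Args: <snapshot-interval> <snapshot-retain> <snapshot-class>
# Returns 0 when all values are usable; otherwise prints a message to stderr
# and returns 1.
validate_snapshot_config() {
  local interval="$1" retain="$2" snapshot_class="$3"

  # snapshot-class has no default, so an empty value is a hard error.
  if [ -z "$snapshot_class" ]; then
    echo "error: snapshot-class is required" >&2
    return 1
  fi

  # Interval and retention must consist of digits only.
  case "$interval" in
    ''|*[!0-9]*) echo "error: snapshot-interval must be a positive integer" >&2; return 1 ;;
  esac
  case "$retain" in
    ''|*[!0-9]*) echo "error: snapshot-retain must be a positive integer" >&2; return 1 ;;
  esac

  # Zero would mean a busy loop or deleting every snapshot.
  if [ "$interval" -lt 1 ] || [ "$retain" -lt 1 ]; then
    echo "error: snapshot-interval and snapshot-retain must be at least 1" >&2
    return 1
  fi
  return 0
}
```

This only validates the values themselves. Checking that the class exists on the cluster would be a separate step (for example kubectl get volumesnapshotclass with the given name), and since VolumeSnapshotClass is cluster-scoped that check needs cluster-level read permission.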

VM Count Model

Single VM (like cpu, memory, disk)

Required Resources

  • Persistent storage (DataVolume)
  • Kubernetes Service (for inter-VM communication)
  • Kubernetes Secret (for credentials or config)
  • Additional CPU/memory beyond defaults
  • GPU or special device passthrough

The DataVolume is the same as in the existing disk workload: a CDI-provisioned persistent disk for fio. The Secret holds a ServiceAccount token (or kubeconfig) that grants the VM's snapshot sidecar permission to manage VolumeSnapshot CRs. Additional RBAC is needed: a ServiceAccount with create/delete/get/list/watch on volumesnapshots.snapshot.storage.k8s.io in the workload namespace (get and watch are needed by the kubectl wait call in the sidecar script).
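
The RBAC objects described above could be rendered as follows. This is a sketch under assumptions: the helper name render_snapshot_rbac and the default object name virtwork-snapshot are hypothetical, and the verb list includes get and watch because kubectl wait reads and watches the object it waits on.

```shell
#!/bin/bash
# Hypothetical helper: print ServiceAccount, Role and RoleBinding manifests
# for the snapshot sidecar.
# Args: <namespace> [name]  (the default name is an illustrative assumption)
render_snapshot_rbac() {
  local namespace="$1" name="${2:-virtwork-snapshot}"
  cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${name}
  namespace: ${namespace}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${name}
  namespace: ${namespace}
rules:
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots"]
    verbs: ["create", "delete", "get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${name}
  namespace: ${namespace}
subjects:
  - kind: ServiceAccount
    name: ${name}
    namespace: ${namespace}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ${name}
EOF
}
```

Piping the output to kubectl apply -f - works here because every object has a fixed name. The Role is limited to the namespaced volumesnapshots resource, which is all the sidecar script touches.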

Cloud-Init Details

packages:
  - fio
write_files:
  - path: /etc/virtwork/snapshot-template.yaml
    content: |
      apiVersion: snapshot.storage.k8s.io/v1
      kind: VolumeSnapshot
      metadata:
        generateName: virtwork-disk-snap-
        namespace: <namespace>
        labels:
          app.kubernetes.io/managed-by: virtwork
          virtwork/component: disk-snapshot
      spec:
        volumeSnapshotClassName: <snapshot-class>
        source:
          persistentVolumeClaimName: <pvc-name>
  - path: /usr/local/bin/virtwork-snapshot-sidecar.sh
    permissions: '0755'
    content: |
      #!/bin/bash
      set -euo pipefail
      INTERVAL="${SNAPSHOT_INTERVAL:-300}"
      RETAIN="${SNAPSHOT_RETAIN:-5}"
      NAMESPACE="${SNAPSHOT_NAMESPACE:-<namespace>}"
      KUBECONFIG="${KUBECONFIG:-/etc/virtwork/kubeconfig}"

      export KUBECONFIG

      while true; do
        echo "=== Creating snapshot at $(date -Iseconds) ==="

        # Create the snapshot. The template uses generateName, which
        # kubectl apply rejects, so use kubectl create and capture the
        # generated name directly from its output.
        SNAP_NAME=$(kubectl create -f /etc/virtwork/snapshot-template.yaml \
          -o jsonpath='{.metadata.name}')

        # Wait for snapshot to be ready

        kubectl wait volumesnapshot/"$SNAP_NAME" -n "$NAMESPACE" \
          --for=jsonpath='{.status.readyToUse}'=true \
          --timeout=120s 2>/dev/null && \
          echo "Snapshot $SNAP_NAME ready" || \
          echo "WARNING: Snapshot $SNAP_NAME not ready within timeout"

        # Prune old snapshots beyond retention limit
        SNAP_COUNT=$(kubectl get volumesnapshot -n "$NAMESPACE" \
          -l virtwork/component=disk-snapshot \
          --no-headers 2>/dev/null | wc -l)

        if [ "$SNAP_COUNT" -gt "$RETAIN" ]; then
          DELETE_COUNT=$((SNAP_COUNT - RETAIN))
          kubectl get volumesnapshot -n "$NAMESPACE" \
            -l virtwork/component=disk-snapshot \
            --sort-by=.metadata.creationTimestamp \
            -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
            head -n "$DELETE_COUNT" | \
            xargs -I{} kubectl delete volumesnapshot/{} -n "$NAMESPACE"
          echo "Pruned $DELETE_COUNT old snapshots"
        fi

        sleep "$INTERVAL"
      done
  - path: /etc/systemd/system/virtwork-disk-snapshot.service
    content: |
      [Unit]
      Description=Virtwork disk snapshot trigger sidecar
      After=network-online.target virtwork-disk.service
      Wants=network-online.target virtwork-disk.service
      [Service]
      Type=simple
      Environment=SNAPSHOT_INTERVAL=300
      Environment=SNAPSHOT_RETAIN=5
      Environment=SNAPSHOT_NAMESPACE=<namespace>
      ExecStart=/usr/local/bin/virtwork-snapshot-sidecar.sh
      Restart=always
      RestartSec=30
      [Install]
      WantedBy=multi-user.target
runcmd:
  - curl -Lo /usr/local/bin/kubectl "https://dl.k8s.io/release/$(curl -sL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
  - chmod +x /usr/local/bin/kubectl
  - systemctl enable --now virtwork-disk.service
  - systemctl enable --now virtwork-disk-snapshot.service
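
The sidecar script above logs when a snapshot becomes ready but does not record how long that took, which is one of the signals this workload is meant to produce. One possible way to capture it, as a sketch: wrap the wait call in a timing helper. The name log_duration and the log line layout are assumptions, and the nanosecond timestamp relies on GNU date.

```shell
#!/bin/bash
# Hypothetical helper: run a command and print one JSON log line with its
# start time, duration in milliseconds and exit code.
# Args: <label> <command> [args...]
# Relies on GNU date for nanosecond resolution (%N).
log_duration() {
  local label="$1"; shift
  local start_ns end_ns rc=0
  start_ns=$(date +%s%N)
  # Keep going on failure so the duration of a timed-out wait is logged too.
  "$@" || rc=$?
  end_ns=$(date +%s%N)
  printf '{"event":"%s","start_epoch":%d,"duration_ms":%d,"exit_code":%d}\n' \
    "$label" "$((start_ns / 1000000000))" \
    "$(((end_ns - start_ns) / 1000000))" "$rc"
  return "$rc"
}
```

In the loop this would wrap the existing wait, for example log_duration snapshot_ready kubectl wait volumesnapshot/"$SNAP_NAME" -n "$NAMESPACE" --for=jsonpath='{.status.readyToUse}'=true --timeout=120s. The value measures time spent waiting from the sidecar's point of view, which approximates snapshot creation latency rather than reporting the backend's own figure.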

Use Case

  • Storage partners (CSI driver vendors — NetApp, Pure Storage, Dell, HPE, Red Hat ODF): The primary audience. Every CSI driver must handle VolumeSnapshot correctly while I/O is in flight. The critical questions: Does the snapshot complete without errors? Does it capture a consistent state? Does fio I/O latency spike during the snapshot? Does the snapshot size grow linearly with data written? This workload answers all four continuously and automatically.
  • Backup/DR partners (Velero, Trilio, Kasten): These products use CSI VolumeSnapshots as the backup primitive. They need to validate that snapshot-based backups taken during active I/O are restorable and consistent. The --snapshot-verify option (restore + fsck) provides the restore validation that backup products depend on.
  • Data protection / CDP partners (Zerto, Portworx): Continuous data protection products need to understand snapshot overhead and I/O impact at various snapshot frequencies. Tuning snapshot-interval from 300s down to 30s stresses the CDP pipeline and reveals performance cliffs.
  • OpenShift platform validation: Validates that the CSI snapshot controller, external-snapshotter, and the storage backend work correctly together under sustained I/O from a KubeVirt VM — a less commonly tested path than pod-based snapshots.

Additional Context

  • Relationship to existing disk workload: This workload is a superset of disk. The fio configuration is identical — the same systemd service, same I/O profiles, same data disk setup. The snapshot sidecar is an additional systemd service that runs alongside fio. Implementation options:
    1. New workload type (disk-snapshot): cleaner separation, explicit opt-in.
    2. Flag on existing disk workload (--enable-snapshots): less code duplication, but muddies the single-responsibility of the disk workload.
    Option 1 is recommended: it follows the existing pattern where each workload type is a distinct, self-contained unit.
  • RBAC requirements: The snapshot sidecar needs a ServiceAccount with permissions to manage VolumeSnapshot CRs. This is additional RBAC beyond what virtwork currently creates. The implementation should:
    1. Create a ServiceAccount in the workload namespace
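    2. Create a Role with create/delete/get/list/watch on volumesnapshots. (volumesnapshotcontents are cluster-scoped, so a namespaced Role cannot grant access to them, and the sidecar script does not touch them.)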
    3. Create a RoleBinding
    4. Project the ServiceAccount token into the VM (or generate a kubeconfig Secret)
  • VolumeSnapshotClass is required: Unlike other workloads that work with any storage class, this workload requires a VolumeSnapshotClass to be configured on the cluster. The implementation should fail fast with a clear error if snapshot-class is not specified or doesn't exist.
  • Snapshot verification (optional): The --snapshot-verify mode restores each snapshot to a temporary PVC, mounts it (or attaches to a temporary pod), runs fsck, and deletes the temporary PVC. This is expensive (creates an extra PVC per snapshot cycle) but provides the strongest consistency guarantee. Default to off; document as available for thorough validation.
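  • I/O impact measurement: The most valuable metric is fio's latency during the snapshot window vs outside it. fio's --output-format=json reports aggregate latency statistics for the whole run, so per-second data needs fio's latency logs as well (--write_lat_log together with --log_avg_msec=1000). Partners can correlate latency spikes in those logs with snapshot creation timestamps to quantify snapshot overhead.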
  • Composability: This workload pairs with chaos-disk (snapshot during disk-full conditions) and chaos-network (snapshot when storage network is degraded) for comprehensive storage resilience testing.
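
For the I/O impact measurement item above, the correlation could be sketched as follows. Assumptions: fio was run with latency logging averaged per second (--write_lat_log with --log_avg_msec=1000), each log line starts with "time_msec, latency_nsec" where the time is relative to job start, and the sidecar has recorded each snapshot's start and end as epoch seconds in a separate file. The helper name snapshot_latency_split and the windows file format are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: split fio latency samples into "inside a snapshot
# window" and "outside", and print the mean of each group.
# Args: <fio_lat_log> <windows_file> <job_start_epoch>
# windows_file: one "start_epoch end_epoch" pair per line, in seconds.
# fio_lat_log:  lines of the form "time_msec, latency_nsec, ..." (assumed).
snapshot_latency_split() {
  local lat_log="$1" windows="$2" job_start="$3"
  awk -v job_start="$job_start" '
    BEGIN { n = 0 }
    # First file: remember every snapshot window.
    FILENAME == ARGV[1] { ws[n] = $1 + 0; we[n] = $2 + 0; n++; next }
    # Second file: classify each latency sample by absolute time.
    {
      gsub(/,/, "", $1); gsub(/,/, "", $2)
      t = job_start + $1 / 1000
      inside = 0
      for (i = 0; i < n; i++)
        if (t >= ws[i] && t <= we[i]) { inside = 1; break }
      if (inside) { in_sum += $2; in_n++ } else { out_sum += $2; out_n++ }
    }
    END {
      printf "inside_mean_ns=%d samples=%d\n", (in_n ? in_sum / in_n : 0), in_n
      printf "outside_mean_ns=%d samples=%d\n", (out_n ? out_sum / out_n : 0), out_n
    }
  ' "$windows" "$lat_log"
}
```

A large gap between the two means points at snapshot overhead on the I/O path. Comparing means is the simplest possible view; percentiles per window would show tail latency better but need more than a few lines of awk.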

Metadata

Labels

  • good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
  • workload-request: Request for a new workload type
  • workload/variant: Enhancement or extension of an existing workload type.