[Workload]: disk-snapshot

### Workload Name

disk-snapshot

### Workload Description

A variant of the existing `disk` workload that runs fio against a data disk while a sidecar script periodically triggers CSI VolumeSnapshots, validating I/O consistency and data integrity during snapshot operations. It produces the same sustained mixed I/O signals as the `disk` workload (IOPS, throughput, latency) plus snapshot-specific signals: snapshot creation latency, snapshot size, and, critically, whether the snapshot captures a crash-consistent or application-consistent state while I/O is actively in flight.

This is the scenario every storage partner dreads and every storage partner must prove works: "What happens when you snapshot a disk that is actively being written to?" The existing `disk` workload generates fio I/O, and a partner can manually trigger snapshots during a run, but there's no automated, reproducible way to validate snapshot-under-load behavior. This workload makes that test continuous and auditable.

The sidecar script runs inside the VM (not as a Kubernetes controller) and uses `kubectl` (or `oc`) to create VolumeSnapshot CRs against the VM's data disk PVC at configurable intervals. After each snapshot, it can optionally verify the snapshot by restoring it to a temporary PVC and running a filesystem check.

### Tooling and Packages

**I/O workload (identical to existing `disk` workload):**
- Tool: `fio` with mixed random/sequential profiles
- RPM packages: `fio`
- systemd service: `virtwork-disk.service` (reused from existing `disk` workload)

**Snapshot trigger sidecar:**
- Tool: `kubectl` (or `oc`) CLI inside the VM, authenticated via a projected ServiceAccount token
- RPM packages: `kubernetes-client` or direct binary download of `kubectl`
- systemd service: `virtwork-disk-snapshot.service`
- Command loop:
  ```bash
  while true; do
    # `create` (not `apply`) is needed because the template uses generateName
    kubectl create -f /etc/virtwork/snapshot-template.yaml
    sleep "${SNAPSHOT_INTERVAL:-300}"
  done
  ```
- Configurable parameters:
  - `snapshot-interval`: seconds between snapshot triggers (default: 300 — every 5 minutes)
  - `snapshot-class`: VolumeSnapshotClass name (required — depends on storage backend)
  - `snapshot-retain`: number of snapshots to retain before deleting oldest (default: 5)
  - `snapshot-verify`: whether to restore and fsck each snapshot (default: false — expensive)
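
The parameters above could be checked once before the snapshot loop starts, so that a bad value fails immediately rather than after the first interval. The following is a minimal sketch, not part of the proposal: the function names are hypothetical, and passing `snapshot-class` and `snapshot-verify` as plain string arguments is an assumption.

```bash
#!/bin/bash
# Hypothetical helpers: validate sidecar parameters before the loop starts.
# Only the parameter names and defaults come from the list above; the
# function names and argument order are illustrative assumptions.

is_positive_int() {
  # True when $1 is a base-10 integer >= 1 with no sign or leading zero.
  [[ "${1:-}" =~ ^[1-9][0-9]*$ ]]
}

validate_snapshot_params() {
  local interval="$1" retain="$2" class="$3" verify="$4"

  if ! is_positive_int "$interval"; then
    echo "snapshot-interval must be a positive integer, got '$interval'" >&2
    return 1
  fi
  if ! is_positive_int "$retain"; then
    echo "snapshot-retain must be a positive integer, got '$retain'" >&2
    return 1
  fi
  if [[ -z "$class" ]]; then
    echo "snapshot-class is required" >&2
    return 1
  fi
  if [[ "$verify" != "true" && "$verify" != "false" ]]; then
    echo "snapshot-verify must be 'true' or 'false', got '$verify'" >&2
    return 1
  fi
}
```

A caller might run `validate_snapshot_params "$SNAPSHOT_INTERVAL" "$SNAPSHOT_RETAIN" "$SNAPSHOT_CLASS" "$SNAPSHOT_VERIFY" || exit 1` at the top of the sidecar script. Checking that the named VolumeSnapshotClass actually exists on the cluster is a separate step that this sketch does not cover.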

### VM Count Model

Single VM (like cpu, memory, disk)

### Required Resources

- [x] Persistent storage (DataVolume)
- [ ] Kubernetes Service (for inter-VM communication)
- [x] Kubernetes Secret (for credentials or config)
- [ ] Additional CPU/memory beyond defaults
- [ ] GPU or special device passthrough

The DataVolume is the same as in the existing `disk` workload: a CDI-provisioned persistent disk for fio. The Secret holds a ServiceAccount token (or kubeconfig) that grants the VM's snapshot sidecar permission to create VolumeSnapshot CRs. Additional RBAC is needed: a ServiceAccount with `create`/`delete`/`get`/`list`/`watch` on `volumesnapshots.snapshot.storage.k8s.io` in the workload namespace (`get` and `watch` are what `kubectl wait` relies on).
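
As an illustration of the RBAC described above, the sketch below prints a ServiceAccount, Role, and RoleBinding for a given namespace. The object name `virtwork-snapshotter` and the function name are hypothetical. The verb list includes `get` and `watch` on the assumption that the sidecar uses `kubectl wait`, which needs to read and watch the object.

```bash
#!/bin/bash
# Sketch: render the namespaced RBAC objects the snapshot sidecar would need.
# "virtwork-snapshotter" is a hypothetical default name, not an existing
# virtwork resource.

render_snapshot_rbac() {
  local namespace="$1" name="${2:-virtwork-snapshotter}"
  cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${name}
  namespace: ${namespace}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${name}
  namespace: ${namespace}
rules:
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots"]
    verbs: ["create", "delete", "get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${name}
  namespace: ${namespace}
subjects:
  - kind: ServiceAccount
    name: ${name}
    namespace: ${namespace}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ${name}
EOF
}
```

The output can be piped to the cluster, for example `render_snapshot_rbac my-namespace | kubectl apply -f -`. How the resulting token reaches the VM (Secret, kubeconfig, or another mechanism) is left open here, as in the proposal.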

### Cloud-Init Details

```yaml
packages:
  - fio
write_files:
  - path: /etc/virtwork/snapshot-template.yaml
    content: |
      apiVersion: snapshot.storage.k8s.io/v1
      kind: VolumeSnapshot
      metadata:
        generateName: virtwork-disk-snap-
        namespace: <namespace>
        labels:
          app.kubernetes.io/managed-by: virtwork
          virtwork/component: disk-snapshot
      spec:
        volumeSnapshotClassName: <snapshot-class>
        source:
          persistentVolumeClaimName: <pvc-name>
  - path: /usr/local/bin/virtwork-snapshot-sidecar.sh
    permissions: '0755'
    content: |
      #!/bin/bash
      set -euo pipefail
      INTERVAL="${SNAPSHOT_INTERVAL:-300}"
      RETAIN="${SNAPSHOT_RETAIN:-5}"
      NAMESPACE="${SNAPSHOT_NAMESPACE:-<namespace>}"
      KUBECONFIG="${KUBECONFIG:-/etc/virtwork/kubeconfig}"

      export KUBECONFIG

      while true; do
        echo "=== Creating snapshot at $(date -Iseconds) ==="

        # Create the snapshot and capture its generated name.
        # `kubectl create` is required: `kubectl apply` rejects
        # manifests that rely on generateName.
        SNAP_NAME=$(kubectl create \
          -f /etc/virtwork/snapshot-template.yaml \
          -o jsonpath='{.metadata.name}')

        # Wait for the snapshot to become ready
        if kubectl wait volumesnapshot/"$SNAP_NAME" -n "$NAMESPACE" \
          --for=jsonpath='{.status.readyToUse}'=true \
          --timeout=120s 2>/dev/null; then
          echo "Snapshot $SNAP_NAME ready"
        else
          echo "WARNING: Snapshot $SNAP_NAME not ready within timeout"
        fi

        # Prune old snapshots beyond retention limit
        SNAP_COUNT=$(kubectl get volumesnapshot -n "$NAMESPACE" \
          -l virtwork/component=disk-snapshot \
          --no-headers 2>/dev/null | wc -l)

        if [ "$SNAP_COUNT" -gt "$RETAIN" ]; then
          DELETE_COUNT=$((SNAP_COUNT - RETAIN))
          kubectl get volumesnapshot -n "$NAMESPACE" \
            -l virtwork/component=disk-snapshot \
            --sort-by=.metadata.creationTimestamp \
            -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
            head -n "$DELETE_COUNT" | \
            xargs -I{} kubectl delete volumesnapshot/{} -n "$NAMESPACE"
          echo "Pruned $DELETE_COUNT old snapshots"
        fi

        sleep "$INTERVAL"
      done
  - path: /etc/systemd/system/virtwork-disk-snapshot.service
    content: |
      [Unit]
      Description=Virtwork disk snapshot trigger sidecar
      After=network-online.target virtwork-disk.service
      Wants=network-online.target virtwork-disk.service
      [Service]
      Type=simple
      Environment=SNAPSHOT_INTERVAL=300
      Environment=SNAPSHOT_RETAIN=5
      Environment=SNAPSHOT_NAMESPACE=<namespace>
      ExecStart=/usr/local/bin/virtwork-snapshot-sidecar.sh
      Restart=always
      RestartSec=30
      [Install]
      WantedBy=multi-user.target
runcmd:
  - curl -Lo /usr/local/bin/kubectl "https://dl.k8s.io/release/$(curl -sL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
  - chmod +x /usr/local/bin/kubectl
  - systemctl enable --now virtwork-disk.service
  - systemctl enable --now virtwork-disk-snapshot.service
```

### Use Case

- **Storage partners (CSI driver vendors — NetApp, Pure Storage, Dell, HPE, Red Hat ODF):** The primary audience. Every CSI driver must handle VolumeSnapshot correctly while I/O is in flight. The critical questions: Does the snapshot complete without errors? Does it capture a consistent state? Does fio I/O latency spike during the snapshot? Does the snapshot size grow linearly with data written? This workload answers all four continuously and automatically.
- **Backup/DR partners (Velero, Trilio, Kasten):** These products use CSI VolumeSnapshots as the backup primitive. They need to validate that snapshot-based backups taken during active I/O are restorable and consistent. The `--snapshot-verify` option (restore + fsck) provides the restore validation that backup products depend on.
- **Data protection / CDP partners (Zerto, Portworx):** Continuous data protection products need to understand snapshot overhead and I/O impact at various snapshot frequencies. Tuning `snapshot-interval` from 300s down to 30s stresses the CDP pipeline and reveals performance cliffs.
- **OpenShift platform validation:** Validates that the snapshot controller, the CSI driver's `csi-snapshotter` sidecar (both from the external-snapshotter project), and the storage backend work correctly together under sustained I/O from a KubeVirt VM, a less commonly tested path than pod-based snapshots.
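
The restore step that the backup/DR use case depends on comes down to creating a PVC whose `dataSource` points at the VolumeSnapshot. A minimal sketch of that manifest follows; the function name, the `verify-` name prefix, and the `disk-snapshot-verify` label value are hypothetical, and the access mode is an assumption that may not match every backend.

```bash
#!/bin/bash
# Sketch: render a temporary PVC restored from a VolumeSnapshot.
# Names and labels here are illustrative assumptions, not part of virtwork.

render_restore_pvc() {
  local snapshot="$1" namespace="$2" storage_class="$3" size="$4"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: verify-${snapshot}
  namespace: ${namespace}
  labels:
    app.kubernetes.io/managed-by: virtwork
    virtwork/component: disk-snapshot-verify
spec:
  storageClassName: ${storage_class}
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size}
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: ${snapshot}
EOF
}
```

For example, `render_restore_pvc "$SNAP_NAME" "$NAMESPACE" "$STORAGE_CLASS" 10Gi | kubectl create -f -` would request the restore, where `STORAGE_CLASS` is a placeholder. The requested size has to be at least the snapshot's reported `status.restoreSize`, and creating PVCs needs RBAC beyond the VolumeSnapshot permissions listed earlier.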

### Additional Context

- **Relationship to existing `disk` workload:** This workload is a *superset* of `disk`. The fio configuration is identical — the same systemd service, same I/O profiles, same data disk setup. The snapshot sidecar is an additional systemd service that runs alongside fio. Implementation options:
  1. **New workload type** (`disk-snapshot`): cleaner separation, explicit opt-in.
  2. **Flag on existing `disk` workload** (`--enable-snapshots`): less code duplication, but muddies the single-responsibility of the `disk` workload.
  Option 1 is recommended — it follows the existing pattern where each workload type is a distinct, self-contained unit.
- **RBAC requirements:** The snapshot sidecar needs a ServiceAccount with permissions to manage VolumeSnapshot CRs. This is additional RBAC beyond what virtwork currently creates. The implementation should:
  1. Create a ServiceAccount in the workload namespace
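  2. Create a Role with `create`/`delete`/`get`/`list`/`watch` on `volumesnapshots`. Note that `volumesnapshotcontents` is cluster-scoped, so it cannot be granted through a namespaced Role; the sidecar as proposed does not read it, and a ClusterRole would be needed if that changes.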
  3. Create a RoleBinding
  4. Project the ServiceAccount token into the VM (or generate a kubeconfig Secret)
- **VolumeSnapshotClass is required:** Unlike other workloads that work with any storage class, this workload requires a VolumeSnapshotClass to be configured on the cluster. The implementation should fail fast with a clear error if `snapshot-class` is not specified or doesn't exist.
- **Snapshot verification (optional):** The `--snapshot-verify` mode restores each snapshot to a temporary PVC, mounts it (or attaches to a temporary pod), runs `fsck`, and deletes the temporary PVC. This is expensive (creates an extra PVC per snapshot cycle) but provides the strongest consistency guarantee. Default to off; document as available for thorough validation.
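- **I/O impact measurement:** The most valuable metric is fio's latency during the snapshot window compared with latency outside it. fio's `--output-format=json` reports aggregate statistics for the run; for time-resolved data, fio can also write latency logs (`--write_lat_log`, with `--log_avg_msec=1000` for one-second averages). Partners can correlate latency spikes in those logs with snapshot creation timestamps to quantify snapshot overhead.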
- **Composability:** This workload pairs with `chaos-disk` (snapshot during disk-full conditions) and `chaos-network` (snapshot when storage network is degraded) for comprehensive storage resilience testing.
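
The latency correlation described under I/O impact measurement could be approximated with a small helper like the one below. This is a sketch under assumptions: it expects a fio latency log (as written by `--write_lat_log`) whose first two comma-separated fields are the time since job start in milliseconds and the latency value, taken here to be nanoseconds. The function name and the idea of passing the snapshot window as millisecond offsets are hypothetical.

```bash
#!/bin/bash
# Sketch: mean latency inside vs outside one snapshot window.
# Assumed log line format: "time_ms, latency_ns, direction, blocksize, offset"
# Only the first two fields are used.

lat_inside_outside() {
  local log="$1" start_ms="$2" end_ms="$3"
  awk -F',' -v s="$start_ms" -v e="$end_ms" '
    {
      t = $1 + 0      # time since job start, in ms
      lat = $2 + 0    # latency sample
      if (t >= s && t <= e) { in_sum += lat;  in_n++ }
      else                  { out_sum += lat; out_n++ }
    }
    END {
      # Report 0 when a bucket has no samples, to avoid dividing by zero.
      printf "inside_mean_ns=%d outside_mean_ns=%d\n",
        (in_n  ? in_sum  / in_n  : 0),
        (out_n ? out_sum / out_n : 0)
    }' "$log"
}
```

The window bounds would come from the sidecar's own timestamps (when the snapshot was created and when it became ready), converted to offsets from the fio job's start time. A large gap between the two means suggests that the snapshot is affecting foreground I/O; percentiles would be more informative than means but are omitted to keep the sketch short.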
