[Workload]: chaos-migration #168

### Workload Name

chaos-migration

### Workload Description

Chaos workload that triggers live migrations on other virtwork-managed VMs at random intervals, testing partner product resilience to KubeVirt live migration events. Runs inside a dedicated "controller" VM and uses `virtctl` to initiate migrations on target VMs selected by label, producing a continuous stream of migration events — start, in-progress, completed, or failed — across the VM fleet.

This is a chaos variant in the same family as `chaos-disk`, `chaos-network`, and `chaos-process`, but operates at the *platform level* rather than inside a single VM. The existing chaos workloads inject faults within a VM (disk fill, network disruption, process kill). `chaos-migration` injects a platform-level disruption — live migration — that affects the VM's network identity, storage attachment, and node placement. Every partner product that runs alongside KubeVirt VMs must handle migrations correctly: monitoring agents must maintain metric continuity, storage drivers must handle disk re-attachment, network products must re-route traffic, and backup products must handle VMs moving between nodes.

Live migration is not a rare event on OpenShift — it's triggered by node drain (`oc adm drain`), cluster upgrades, resource rebalancing, and descheduler policies. Partners cannot claim KubeVirt support without validating migration resilience.

### Tooling and Packages

- Tool: `virtctl` CLI (KubeVirt client) for triggering migrations, `kubectl` for label-based VM discovery
- RPM packages: none — `virtctl` and `kubectl` are single binary downloads
- systemd service command: `/usr/local/bin/virtwork-chaos-migration.sh`
- Behavior loop:
  1. Discover virtwork-managed VMs in the target namespace via label selector (`app.kubernetes.io/managed-by=virtwork`)
  2. Exclude self (the chaos-migration VM) from targets
  3. Select a random target VM
  4. Trigger migration: `virtctl migrate <vm-name> -n <namespace>`
  5. Wait for migration to complete (or timeout)
  6. Log result (success, failure, duration)
  7. Sleep for configurable interval (randomized within a range)
  8. Repeat
- Configurable parameters:
  - `migration-interval-min`: minimum seconds between migrations (default: 120)
  - `migration-interval-max`: maximum seconds between migrations (default: 600)
  - `migration-timeout`: seconds to wait for a migration to complete (default: 300)
  - `migration-target-labels`: additional label selector to filter target VMs (default: all virtwork VMs)
  - `migration-exclude-components`: workload components to exclude from migration (default: none — but `chaos-migration` always excludes itself)
  - `migration-dry-run`: log which VM would be migrated without triggering (default: false)
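
The randomized sleep in step 7 only works if the two interval settings form a valid range. A minimal sketch of a guard that could run before the loop starts is shown below; `validate_interval_range` is a hypothetical helper, not an existing virtwork function, and `random_interval` is rewritten here to take its bounds as arguments.

```shell
#!/bin/bash
# Sketch: validate the interval settings, then draw a randomized sleep.
# validate_interval_range is a hypothetical helper, not part of virtwork.

validate_interval_range() {
  local min="$1" max="$2"
  # Both values must be plain non-negative integers (no leading zeros,
  # which bash arithmetic would otherwise read as octal).
  [[ "$min" =~ ^(0|[1-9][0-9]*)$ && "$max" =~ ^(0|[1-9][0-9]*)$ ]] || return 1
  # The range must not be inverted; min == max means a fixed interval.
  (( min <= max )) || return 1
  # $RANDOM tops out at 32767, so a wider range would never be fully covered.
  (( max - min <= 32767 )) || return 1
}

random_interval() {
  local min="$1" max="$2"
  # Uniform-ish draw on [min, max]; modulo bias is ignored for simplicity.
  echo $(( RANDOM % (max - min + 1) + min ))
}
```

With the defaults, `validate_interval_range 120 600 && random_interval 120 600` prints a value between 120 and 600 inclusive. An inverted range such as `600 120` is rejected before it can produce a negative or zero modulus.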

### VM Count Model

Single VM (like cpu, memory, disk)

This workload is a "controller" VM — it doesn't generate resource pressure itself, it orchestrates disruptions on other VMs. It requires at least one other virtwork workload running in the same namespace to have migration targets. Intended to be deployed alongside other workloads: `virtwork run --workloads cpu,disk,database,chaos-migration`.

### Required Resources

- [ ] Persistent storage (DataVolume)
- [ ] Kubernetes Service (for inter-VM communication)
- [x] Kubernetes Secret (for credentials or config)
- [ ] Additional CPU/memory beyond defaults
- [ ] GPU or special device passthrough

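The Secret holds a kubeconfig or ServiceAccount token with permissions to `get`/`list` VirtualMachines and VirtualMachineInstances, and to `create`, `get`, `list`, and `watch` VirtualMachineInstanceMigrations in the target namespace. The read verbs on migrations are needed because the script looks up the migration object with `kubectl get vmim` and follows it with `kubectl wait`. The controller VM needs minimal resources (1 CPU, 512Mi memory) because it runs a shell script rather than generating load.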
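
As a rough illustration of these permissions, the function below prints a namespaced Role and RoleBinding. The resource name `virtwork-chaos-migration` is a hypothetical default, and the `get`, `list`, and `watch` verbs on migrations are an assumption drawn from the script's `kubectl get vmim` and `kubectl wait` calls rather than something virtwork creates today.

```shell
#!/bin/bash
# Sketch: print a namespaced Role and RoleBinding for the controller VM.
# The default name "virtwork-chaos-migration" is hypothetical.

render_rbac() {
  local ns="$1" name="${2:-virtwork-chaos-migration}"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${name}
  namespace: ${ns}
rules:
  # Read access for target discovery.
  - apiGroups: ["kubevirt.io"]
    resources: ["virtualmachines", "virtualmachineinstances"]
    verbs: ["get", "list"]
  # Create migrations and read them back while waiting for completion.
  - apiGroups: ["kubevirt.io"]
    resources: ["virtualmachineinstancemigrations"]
    verbs: ["create", "get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${name}
  namespace: ${ns}
subjects:
  - kind: ServiceAccount
    name: ${name}
    namespace: ${ns}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ${name}
EOF
}
```

The output can be reviewed first and then applied with `render_rbac <namespace> | kubectl apply -f -`. The ServiceAccount itself is not created here and would need to exist under the same name.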

### Cloud-Init Details

```yaml
write_files:
  - path: /usr/local/bin/virtwork-chaos-migration.sh
    permissions: '0755'
    content: |
      #!/bin/bash
      set -euo pipefail
      NAMESPACE="${MIGRATION_NAMESPACE:-<namespace>}"
      INTERVAL_MIN="${MIGRATION_INTERVAL_MIN:-120}"
      INTERVAL_MAX="${MIGRATION_INTERVAL_MAX:-600}"
      TIMEOUT="${MIGRATION_TIMEOUT:-300}"
      EXCLUDE="${MIGRATION_EXCLUDE_SELF:-<self-vm-name>}"
      KUBECONFIG="${KUBECONFIG:-/etc/virtwork/kubeconfig}"
      DRY_RUN="${MIGRATION_DRY_RUN:-false}"

      export KUBECONFIG

      random_interval() {
        echo $(( RANDOM % (INTERVAL_MAX - INTERVAL_MIN + 1) + INTERVAL_MIN ))
      }

      while true; do
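        # Discover running VMIs (exclude self). The phase filter runs in the
        # jsonpath expression because field selectors on custom resources are
        # generally limited to metadata.name and metadata.namespace.
        # "|| true" keeps "set -e" from ending the loop when the API call
        # fails or grep filters out every line (grep then exits 1).
        TARGETS=$(kubectl get vmi -n "$NAMESPACE" \
          -l app.kubernetes.io/managed-by=virtwork \
          -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.metadata.name}{"\n"}{end}' | \
          grep -vxF "$EXCLUDE" | grep -v "chaos-migration" || true)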

        if [ -z "$TARGETS" ]; then
          echo "$(date -Iseconds) No migration targets found, waiting..."
          sleep 60
          continue
        fi

        # Select random target
        TARGET=$(echo "$TARGETS" | shuf -n1)
        WAIT=$(random_interval)

        if [ "$DRY_RUN" = "true" ]; then
          echo "$(date -Iseconds) DRY-RUN: would migrate $TARGET (next in ${WAIT}s)"
          sleep "$WAIT"
          continue
        fi

        echo "$(date -Iseconds) Triggering migration of $TARGET"
        START_TIME=$(date +%s)

        # Trigger migration
        if virtctl migrate "$TARGET" -n "$NAMESPACE" 2>&1; then
          # Wait for migration to complete
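          # Newest migration object for this VMI. Filtering on spec.vmiName
          # avoids picking up a migration that something else started.
          MIGRATION_NAME=$(kubectl get vmim -n "$NAMESPACE" \
            --sort-by=.metadata.creationTimestamp \
            -o jsonpath="{range .items[?(@.spec.vmiName==\"$TARGET\")]}{.metadata.name}{\"\n\"}{end}" \
            2>/dev/null | tail -n1 || true)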

          if [ -n "$MIGRATION_NAME" ]; then
            kubectl wait vmim/"$MIGRATION_NAME" -n "$NAMESPACE" \
              --for=jsonpath='{.status.phase}'=Succeeded \
              --timeout="${TIMEOUT}s" 2>/dev/null && \
              RESULT="succeeded" || RESULT="timeout-or-failed"
          else
            RESULT="no-migration-object"
          fi

          END_TIME=$(date +%s)
          DURATION=$((END_TIME - START_TIME))
          echo "$(date -Iseconds) Migration of $TARGET: $RESULT (${DURATION}s)"
        else
          echo "$(date -Iseconds) Migration trigger failed for $TARGET"
        fi

        echo "$(date -Iseconds) Next migration in ${WAIT}s"
        sleep "$WAIT"
      done
  - path: /etc/systemd/system/virtwork-chaos-migration.service
    content: |
      [Unit]
      Description=Virtwork chaos migration controller
      After=network-online.target
      Wants=network-online.target
      [Service]
      Type=simple
      Environment=MIGRATION_NAMESPACE=<namespace>
      Environment=MIGRATION_INTERVAL_MIN=120
      Environment=MIGRATION_INTERVAL_MAX=600
      Environment=MIGRATION_TIMEOUT=300
      Environment=MIGRATION_EXCLUDE_SELF=<self-vm-name>
      ExecStart=/usr/local/bin/virtwork-chaos-migration.sh
      Restart=always
      RestartSec=30
      [Install]
      WantedBy=multi-user.target
runcmd:
  # -f makes curl fail on HTTP errors instead of saving an error page as the binary
  - curl -fLo /usr/local/bin/virtctl https://github.com/kubevirt/kubevirt/releases/download/v1.7.0/virtctl-v1.7.0-linux-amd64
  - chmod +x /usr/local/bin/virtctl
  - curl -fLo /usr/local/bin/kubectl "https://dl.k8s.io/release/$(curl -fsL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
  - chmod +x /usr/local/bin/kubectl
  - systemctl enable --now virtwork-chaos-migration.service
```
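
The script's `kubectl wait` call only returns early on `Succeeded`, so a migration that fails right away still blocks until the timeout and is logged as `timeout-or-failed`. One possible alternative, sketched below, polls the migration phase and tells the three outcomes apart. `get_phase` is a hypothetical wrapper introduced here so the loop can be exercised without a cluster; the five-second poll interval is likewise an assumption.

```shell
#!/bin/bash
# Sketch: poll a migration until it reaches a terminal phase or times out.
# get_phase is a hypothetical seam around kubectl, not an existing helper.

get_phase() {
  local name="$1" ns="$2"
  kubectl get vmim "$name" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null || true
}

wait_for_migration() {
  local name="$1" ns="$2" timeout="$3" poll="${4:-5}"
  local deadline=$(( $(date +%s) + timeout )) phase
  while (( $(date +%s) < deadline )); do
    phase=$(get_phase "$name" "$ns")
    case "$phase" in
      Succeeded) echo "succeeded"; return 0 ;;
      Failed)    echo "failed";    return 1 ;;
    esac
    # Any other phase (or an empty reply) means the migration is still in flight.
    sleep "$poll"
  done
  echo "timeout"
  return 2
}
```

Inside the loop this could replace the `kubectl wait` block with `RESULT=$(wait_for_migration "$MIGRATION_NAME" "$NAMESPACE" "$TIMEOUT") || true`; the `|| true` matters because the script runs under `set -e` and the function returns non-zero for failures and timeouts.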

### Use Case

- **Monitoring/Observability partners (Datadog, Dynatrace, Sysdig, Prometheus ecosystem):** The #1 validation scenario. When a VM migrates, its metrics should show continuity — no gaps, no duplicate data points, correct node/host attribution before and after migration. Most monitoring agents track the host/node the VM runs on; migration changes this. Partners need repeated, unpredictable migrations to validate that their agents handle the topology change correctly. A single manual migration proves it works once; `chaos-migration` proves it works continuously.
- **Storage partners (CSI drivers):** During live migration the destination virt-launcher pod attaches the VM's volumes while the source pod still holds them, so KubeVirt generally requires shared ReadWriteMany (RWX) storage for PVC-backed disks. VMIs backed by ReadWriteOnce (RWO) PVCs are normally reported as not live-migratable, so they make poor targets and their migrations are expected to be rejected. Partners need sustained migrations under active I/O (from `disk` or `database` workloads) to validate that their CSI driver handles the concurrent attach and the handoff without data loss or corruption.
- **Network partners (CNI, OVN-Kubernetes, SDN):** Migration moves the VM to a new virt-launcher pod on a different node. On the default pod network this usually means a new pod IP, while secondary networks may carry the same IP and MAC address across nodes. In either case, in-flight connections may be disrupted. Network partners need repeated migrations to validate that their CNI correctly handles IP re-assignment, that Services continue routing to the migrated VM, and that network policies remain enforced post-migration.
- **Backup/DR partners:** A VM that migrates during a backup window may cause backup failures (PVC attachment changes, VSS freeze/thaw disrupted). Partners need to validate that their backup product handles concurrent migration events gracefully — either pausing the backup, retrying, or completing without corruption.
- **Platform engineering / SRE tooling:** Validates that cluster operations (node drain, upgrade, rebalance) work correctly when VMs are running realistic workloads. The randomized migration interval simulates the unpredictable timing of real operational events.

### Additional Context

- **Relationship to existing chaos workloads:** This follows the same pattern as `chaos-disk`, `chaos-network`, and `chaos-process`:
  - `chaos-disk`: injects disk-level faults *inside* a VM
  - `chaos-network`: injects network faults *inside* a VM
  - `chaos-process`: kills processes *inside* a VM
  - `chaos-migration`: triggers platform-level disruption *from* a VM to *other* VMs
  The key difference is that `chaos-migration` operates externally on other VMs rather than internally on itself. This requires cluster API access (kubeconfig/ServiceAccount) that the other chaos workloads don't need.
- **RBAC requirements:** The chaos-migration VM needs a ServiceAccount with:
  - `get`, `list` on `virtualmachines.kubevirt.io` and `virtualmachineinstances.kubevirt.io`
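  - `create`, `get`, `list`, `watch` on `virtualmachineinstancemigrations.kubevirt.io` (the script reads migration objects back with `kubectl get vmim` and `kubectl wait`)
  - Possibly `update` on the `virtualmachines/migrate` subresource in `subresources.kubevirt.io`, depending on how the pinned `virtctl` version implements `virtctl migrate`; this should be verified against the target KubeVirt release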
  - Scoped to the workload namespace via Role + RoleBinding
  This is additional RBAC beyond what virtwork currently creates. The implementation should create these resources as part of the workload's `RequiresService()` / resource setup path.
- **Self-exclusion:** The chaos-migration VM must never migrate itself — migrating the controller disrupts the migration schedule. The script excludes its own VM name and any VM with "chaos-migration" in the name from the target list.
- **Multi-node requirement:** Live migration requires at least two schedulable nodes. On single-node clusters, migration will fail. The implementation should detect single-node clusters and either skip with a warning or fail fast with a clear message.
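- **Migration frequency tuning:** The default range (120-600s, randomized) bounds the rate between 6 and 30 migrations per hour across the fleet and averages about 10 per hour (mean interval 360s). The real rate is somewhat lower because the sleep starts only after each migration wait finishes. For aggressive testing, set `migration-interval-min=30` and `migration-interval-max=60` to get near-continuous migrations. For background chaos, use the defaults.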
- **Composability:** This workload is designed to run alongside other workloads — it's the "chaos layer" on top of the "signal layer." The most valuable deployment pattern:
  ```
  virtwork run --workloads cpu,memory,disk,database,network,chaos-migration
  ```
  This produces sustained workload signals *and* continuous migration disruptions, validating that partner products maintain signal accuracy through platform events.
- **Audit integration:** Migration events (target VM, start time, result, duration) logged to stdout/journald are captured in the VM's journal. Consider also recording migration events in virtwork's audit database via a future audit API — but stdout logging is sufficient for initial implementation.
- **virtctl version pinning:** The cloud-init downloads a specific virtctl version. This should match the cluster's KubeVirt version. Consider making the virtctl version configurable or detecting it from the cluster.
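
For the migration frequency tuning note above, the expected rate can be estimated from the interval range. This is a back-of-the-envelope sketch: `migrations_per_hour` is a hypothetical helper, and `avg_duration` (the average time the loop spends waiting on each migration) is an assumed input that would have to be measured on a real cluster.

```shell
#!/bin/bash
# Sketch: estimate migrations per hour for a given interval range.
# avg_duration is an assumed average wait per migration, in seconds.

migrations_per_hour() {
  local min="$1" max="$2" avg_duration="${3:-0}"
  # Mean of a uniform draw on [min, max], plus the time spent waiting
  # on the migration itself before the sleep begins.
  local cycle=$(( (min + max) / 2 + avg_duration ))
  (( cycle > 0 )) || { echo "cycle length must be positive" >&2; return 1; }
  # Integer division, so the result is rounded down.
  echo $(( 3600 / cycle ))
}
```

With the defaults, `migrations_per_hour 120 600` gives 10, and assuming migrations take 30 seconds on average, `migrations_per_hour 120 600 30` gives 9. The aggressive setting `migrations_per_hour 30 60` gives 80 before migration time is counted.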
