[Workload]: chaos-migration #168

Description

@mrhillsman

Workload Name

chaos-migration

Workload Description

Chaos workload that triggers live migrations on other virtwork-managed VMs at random intervals, testing partner product resilience to KubeVirt live migration events. Runs inside a dedicated "controller" VM and uses virtctl to initiate migrations on target VMs selected by label, producing a continuous stream of migration events — start, in-progress, completed, or failed — across the VM fleet.

This is a chaos variant in the same family as chaos-disk, chaos-network, and chaos-process, but operates at the platform level rather than inside a single VM. The existing chaos workloads inject faults within a VM (disk fill, network disruption, process kill). chaos-migration injects a platform-level disruption — live migration — that affects the VM's network identity, storage attachment, and node placement. Every partner product that runs alongside KubeVirt VMs must handle migrations correctly: monitoring agents must maintain metric continuity, storage drivers must handle disk re-attachment, network products must re-route traffic, and backup products must handle VMs moving between nodes.

Live migration is not a rare event on OpenShift — it's triggered by node drain (oc adm drain), cluster upgrades, resource rebalancing, and descheduler policies. Partners cannot claim KubeVirt support without validating migration resilience.

Tooling and Packages

  • Tool: virtctl CLI (KubeVirt client) for triggering migrations, kubectl for label-based VM discovery
  • RPM packages: none — virtctl and kubectl are single binary downloads
  • systemd service command: /usr/local/bin/virtwork-chaos-migration.sh
  • Behavior loop:
    1. Discover virtwork-managed VMs in the target namespace via label selector (app.kubernetes.io/managed-by=virtwork)
    2. Exclude self (the chaos-migration VM) from targets
    3. Select a random target VM
    4. Trigger migration: virtctl migrate <vm-name> -n <namespace>
    5. Wait for migration to complete (or timeout)
    6. Log result (success, failure, duration)
    7. Sleep for configurable interval (randomized within a range)
    8. Repeat
  • Configurable parameters:
    • migration-interval-min: minimum seconds between migrations (default: 120)
    • migration-interval-max: maximum seconds between migrations (default: 600)
    • migration-timeout: seconds to wait for a migration to complete (default: 300)
    • migration-target-labels: additional label selector to filter target VMs (default: all virtwork VMs)
    • migration-exclude-components: workload components to exclude from migration (default: none — but chaos-migration always excludes itself)
    • migration-dry-run: log which VM would be migrated without triggering (default: false)
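
The two interval parameters only make sense when the minimum is a positive integer that does not exceed the maximum. A minimal sketch of that check and of the randomized sleep, assuming GNU shuf is available in the guest (the script in this issue already relies on it). The helper names is_pos_int, validate_interval, and pick_interval are hypothetical and not existing virtwork functions:

```shell
#!/bin/sh
# Hypothetical helpers for the interval parameters; not part of virtwork.

# Succeeds only for a positive decimal integer without leading zeros.
is_pos_int() {
  case "$1" in
    ''|0*|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Succeeds only if both bounds are positive integers and min <= max.
validate_interval() {
  is_pos_int "$1" && is_pos_int "$2" && [ "$1" -le "$2" ]
}

# Prints one random integer in [min, max], inclusive (GNU shuf assumed).
pick_interval() {
  shuf -i "$1-$2" -n 1
}

MIN="${MIGRATION_INTERVAL_MIN:-120}"
MAX="${MIGRATION_INTERVAL_MAX:-600}"

if validate_interval "$MIN" "$MAX"; then
  echo "next migration in $(pick_interval "$MIN" "$MAX")s"
else
  echo "invalid interval range: min=$MIN max=$MAX" >&2
fi
```

Rejecting a bad range up front avoids a division by zero or a negative sleep in the loop. Using shuf instead of $RANDOM also removes the 0..32767 ceiling on the range width.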

VM Count Model

Single VM (like cpu, memory, disk)

This workload is a "controller" VM — it doesn't generate resource pressure itself, it orchestrates disruptions on other VMs. It requires at least one other virtwork workload running in the same namespace to have migration targets. Intended to be deployed alongside other workloads: virtwork run --workloads cpu,disk,database,chaos-migration.

Required Resources

  • Persistent storage (DataVolume)
  • Kubernetes Service (for inter-VM communication)
  • Kubernetes Secret (for credentials or config)
  • Additional CPU/memory beyond defaults
  • GPU or special device passthrough

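The Secret holds a kubeconfig or ServiceAccount token with permissions to get/list VirtualMachines, get/list VirtualMachineInstances, and create VirtualMachineInstanceMigrations in the target namespace. Because the script also looks up the migration object and runs kubectl wait on it, the same identity needs get/list/watch on VirtualMachineInstanceMigrations. The controller VM needs minimal resources (1 CPU, 512Mi memory) because it runs shell scripts rather than generating load.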
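
The permissions described above could be expressed as a namespaced Role and RoleBinding. A minimal sketch that renders such a manifest, with assumptions: the Role name chaos-migration, the function name render_rbac, and the example namespace are hypothetical, and the get/list/watch verbs on migrations are included because the script reads and waits on the migration object. It does not cover any extra subresource permission a given virtctl release may need:

```shell
#!/bin/sh
# Hypothetical helper: prints a Role + RoleBinding for the controller VM's
# ServiceAccount. $1 = namespace, $2 = ServiceAccount name.
render_rbac() {
  ns="$1"
  sa="$2"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-migration
  namespace: ${ns}
rules:
  - apiGroups: ["kubevirt.io"]
    resources: ["virtualmachines", "virtualmachineinstances"]
    verbs: ["get", "list"]
  - apiGroups: ["kubevirt.io"]
    resources: ["virtualmachineinstancemigrations"]
    verbs: ["create", "get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-migration
  namespace: ${ns}
subjects:
  - kind: ServiceAccount
    name: ${sa}
    namespace: ${ns}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-migration
EOF
}

# Example rendering with placeholder values.
render_rbac "example-namespace" "chaos-migration"
```

Piping the output to kubectl apply -f - would create both objects. In virtwork itself the equivalent resources would presumably be created by the workload's resource setup path rather than by a shell helper.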

Cloud-Init Details

write_files:
  - path: /usr/local/bin/virtwork-chaos-migration.sh
    permissions: '0755'
    content: |
      #!/bin/bash
      set -euo pipefail
      NAMESPACE="${MIGRATION_NAMESPACE:-<namespace>}"
      INTERVAL_MIN="${MIGRATION_INTERVAL_MIN:-120}"
      INTERVAL_MAX="${MIGRATION_INTERVAL_MAX:-600}"
      TIMEOUT="${MIGRATION_TIMEOUT:-300}"
      EXCLUDE="${MIGRATION_EXCLUDE_SELF:-<self-vm-name>}"
      KUBECONFIG="${KUBECONFIG:-/etc/virtwork/kubeconfig}"
      DRY_RUN="${MIGRATION_DRY_RUN:-false}"

      export KUBECONFIG

      random_interval() {
        echo $(( RANDOM % (INTERVAL_MAX - INTERVAL_MIN + 1) + INTERVAL_MIN ))
      }

      while true; do
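        # Discover running VMIs (exclude self). The phase filter lives in the
        # jsonpath because field selectors on custom resources are generally
        # limited to metadata.name/namespace. "|| true" keeps an empty result
        # from aborting the script under "set -euo pipefail".
        TARGETS=$(kubectl get vmi -n "$NAMESPACE" \
          -l app.kubernetes.io/managed-by=virtwork \
          -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.metadata.name}{"\n"}{end}' | \
          grep -v "^${EXCLUDE}$" | grep -v "chaos-migration" || true)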

        if [ -z "$TARGETS" ]; then
          echo "$(date -Iseconds) No migration targets found, waiting..."
          sleep 60
          continue
        fi

        # Select random target
        TARGET=$(echo "$TARGETS" | shuf -n1)
        WAIT=$(random_interval)

        if [ "$DRY_RUN" = "true" ]; then
          echo "$(date -Iseconds) DRY-RUN: would migrate $TARGET (next in ${WAIT}s)"
          sleep "$WAIT"
          continue
        fi

        echo "$(date -Iseconds) Triggering migration of $TARGET"
        START_TIME=$(date +%s)

        # Trigger migration
        if virtctl migrate "$TARGET" -n "$NAMESPACE" 2>&1; then
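          # Wait for migration to complete. This picks the newest migration
          # object in the namespace and assumes it belongs to $TARGET.
          # "|| true" lets a failed lookup fall through to the
          # no-migration-object branch instead of exiting under "set -e".
          MIGRATION_NAME=$(kubectl get vmim -n "$NAMESPACE" \
            --sort-by=.metadata.creationTimestamp \
            -o jsonpath='{.items[-1].metadata.name}' 2>/dev/null || true)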

          if [ -n "$MIGRATION_NAME" ]; then
            kubectl wait vmim/"$MIGRATION_NAME" -n "$NAMESPACE" \
              --for=jsonpath='{.status.phase}'=Succeeded \
              --timeout="${TIMEOUT}s" 2>/dev/null && \
              RESULT="succeeded" || RESULT="timeout-or-failed"
          else
            RESULT="no-migration-object"
          fi

          END_TIME=$(date +%s)
          DURATION=$((END_TIME - START_TIME))
          echo "$(date -Iseconds) Migration of $TARGET: $RESULT (${DURATION}s)"
        else
          echo "$(date -Iseconds) Migration trigger failed for $TARGET"
        fi

        echo "$(date -Iseconds) Next migration in ${WAIT}s"
        sleep "$WAIT"
      done
  - path: /etc/systemd/system/virtwork-chaos-migration.service
    content: |
      [Unit]
      Description=Virtwork chaos migration controller
      After=network-online.target
      Wants=network-online.target
      [Service]
      Type=simple
      Environment=MIGRATION_NAMESPACE=<namespace>
      Environment=MIGRATION_INTERVAL_MIN=120
      Environment=MIGRATION_INTERVAL_MAX=600
      Environment=MIGRATION_TIMEOUT=300
      Environment=MIGRATION_EXCLUDE_SELF=<self-vm-name>
      ExecStart=/usr/local/bin/virtwork-chaos-migration.sh
      Restart=always
      RestartSec=30
      [Install]
      WantedBy=multi-user.target
runcmd:
  - curl -fLo /usr/local/bin/virtctl https://github.com/kubevirt/kubevirt/releases/download/v1.7.0/virtctl-v1.7.0-linux-amd64
  - chmod +x /usr/local/bin/virtctl
  - curl -fLo /usr/local/bin/kubectl "https://dl.k8s.io/release/$(curl -fsL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
  - chmod +x /usr/local/bin/kubectl
  - systemctl enable --now virtwork-chaos-migration.service
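
The script above hard-codes self-exclusion but does not yet read migration-exclude-components. A minimal sketch of how that filter might look as a pure text filter over the discovered VM names. It assumes, as the script's own chaos-migration match does, that a VM's name contains its component name; filter_targets is a hypothetical helper, and matching on a component label would likely be more robust:

```shell
#!/bin/sh
# Hypothetical helper: reads candidate VM names on stdin, one per line, and
# prints the names that remain eligible for migration.
#   $1 = this VM's own name
#   $2 = comma-separated component names to exclude (may be empty)
# Assumes component names are plain strings without glob characters.
filter_targets() {
  self="$1"
  excluded="$2"
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    [ "$name" = "$self" ] && continue
    # The controller never migrates itself or another controller.
    case "$name" in *chaos-migration*) continue ;; esac
    skip=0
    old_ifs=$IFS
    IFS=,
    for comp in $excluded; do
      [ -n "$comp" ] || continue
      case "$name" in *"$comp"*) skip=1 ;; esac
    done
    IFS=$old_ifs
    [ "$skip" -eq 0 ] && printf '%s\n' "$name"
  done
  return 0
}

# Offline demonstration with made-up VM names.
printf 'virtwork-cpu-0\nvirtwork-database-0\nvirtwork-disk-0\n' |
  filter_targets "virtwork-chaos-migration-0" "database"
```

In the loop this would replace the two grep -v stages, with the kubectl name listing piped into filter_targets. Because the function always returns 0, an empty result does not abort a script running under set -euo pipefail.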

Use Case

  • Monitoring/Observability partners (Datadog, Dynatrace, Sysdig, Prometheus ecosystem): The number-one validation scenario. When a VM migrates, its metrics should show continuity: no gaps, no duplicate data points, and correct node/host attribution before and after migration. Most monitoring agents track the host/node the VM runs on; migration changes this. Partners need repeated, unpredictable migrations to validate that their agents handle the topology change correctly. A single manual migration proves it works once; chaos-migration proves it works continuously.
  • Storage partners (CSI drivers): Live migration starts the VM on the destination node while the source is still running, so both nodes need access to the VM's disks during the handoff. KubeVirt generally requires ReadWriteMany (RWX) or otherwise shared volumes for this; a VM backed by a ReadWriteOnce (RWO) PVC is typically reported as not live-migratable, so the migration request is rejected rather than handed off. Partners need sustained migrations under active I/O (from disk or database workloads) to validate that their CSI driver handles concurrent attachment and the final handoff without data loss or corruption.
  • Network partners (CNI, OVN-Kubernetes, SDN): Migration moves the VM's network identity (pod IP, MAC address) to a new node. In-flight connections may be disrupted. Network partners need repeated migrations to validate that their CNI correctly handles IP re-assignment, that Services continue routing to the migrated VM, and that network policies remain enforced post-migration.
  • Backup/DR partners: A VM that migrates during a backup window may cause backup failures (PVC attachment changes, VSS freeze/thaw disrupted). Partners need to validate that their backup product handles concurrent migration events gracefully — either pausing the backup, retrying, or completing without corruption.
  • Platform engineering / SRE tooling: Validates that cluster operations (node drain, upgrade, rebalance) work correctly when VMs are running realistic workloads. The randomized migration interval simulates the unpredictable timing of real operational events.

Additional Context

  • Relationship to existing chaos workloads: This follows the same pattern as chaos-disk, chaos-network, and chaos-process:
    • chaos-disk: injects disk-level faults inside a VM
    • chaos-network: injects network faults inside a VM
    • chaos-process: kills processes inside a VM
    • chaos-migration: triggers platform-level disruption from a VM to other VMs
      The key difference is that chaos-migration operates externally on other VMs rather than internally on itself. This requires cluster API access (kubeconfig/ServiceAccount) that the other chaos workloads don't need.
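  • RBAC requirements: The chaos-migration VM needs a ServiceAccount with:
    • get, list on virtualmachines.kubevirt.io and virtualmachineinstances.kubevirt.io
    • create on virtualmachineinstancemigrations.kubevirt.io, plus get, list, watch because the script looks up the migration object and runs kubectl wait on it
    • Scoped to the workload namespace via Role + RoleBinding
      This is additional RBAC beyond what virtwork currently creates. The implementation should create these resources as part of the workload's RequiresService() / resource setup path. Depending on the virtctl release, virtctl migrate may go through the virtualmachines/migrate subresource in subresources.kubevirt.io, which would additionally need update on that subresource; this should be verified against the pinned virtctl version.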
  • Self-exclusion: The chaos-migration VM must never migrate itself — migrating the controller disrupts the migration schedule. The script excludes its own VM name and any VM with "chaos-migration" in the name from the target list.
  • Multi-node requirement: Live migration requires at least two schedulable nodes. On single-node clusters, migration will fail. The implementation should detect single-node clusters and either skip with a warning or fail fast with a clear message.
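  • Migration frequency tuning: The default range (120-600s, randomized) bounds the rate at 6-30 migrations per hour across the fleet. The mean sleep is 360s, so expect roughly 10 per hour, and somewhat fewer in practice because the loop waits for each migration to finish before it sleeps. For aggressive testing, set migration-interval-min=30 and migration-interval-max=60 to get near-continuous migrations. For background chaos, use the defaults.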
  • Composability: This workload is designed to run alongside other workloads — it's the "chaos layer" on top of the "signal layer." The most valuable deployment pattern:
    virtwork run --workloads cpu,memory,disk,database,network,chaos-migration
    
    This produces sustained workload signals and continuous migration disruptions, validating that partner products maintain signal accuracy through platform events.
  • Audit integration: Migration events (target VM, start time, result, duration) logged to stdout/journald are captured in the VM's journal. Consider also recording migration events in virtwork's audit database via a future audit API — but stdout logging is sufficient for initial implementation.
  • virtctl version pinning: The cloud-init downloads a specific virtctl version. This should match the cluster's KubeVirt version. Consider making the virtctl version configurable or detecting it from the cluster.
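
For the multi-node requirement above, a minimal sketch of a fail-fast check. It only accounts for cordoned nodes (spec.unschedulable); taints and KubeVirt's own node readiness are not considered, so it is a lower bar than what the scheduler actually enforces. count_schedulable and require_multi_node are hypothetical helper names:

```shell
#!/bin/sh
# Hypothetical helpers for detecting single-node clusters.

# Reads "<node-name><TAB><unschedulable>" lines on stdin and prints how many
# nodes are not cordoned. On a cluster the input could come from:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.unschedulable}{"\n"}{end}'
count_schedulable() {
  awk -F '\t' '$1 != "" && $2 != "true" { n++ } END { print n + 0 }'
}

# Returns non-zero with a clear message when fewer than two nodes are usable.
require_multi_node() {
  if [ "$1" -lt 2 ]; then
    echo "chaos-migration: found $1 schedulable node(s); live migration needs at least 2" >&2
    return 1
  fi
}

# Offline demonstration: node-b is cordoned, so this prints 2.
printf 'node-a\t\nnode-b\ttrue\nnode-c\t\n' | count_schedulable
```

Running the check once at startup, before the migration loop, would let the service log a single clear error instead of a stream of failed migrations. Listing nodes is a cluster-scoped permission that the namespaced Role described earlier does not grant, so the check may fit better in virtwork's own deployment path than inside the controller VM.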

Metadata

Labels

  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/awaiting-more-evidence: Lowest priority. Possibly useful, but not yet enough support to actually get it done.
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
  • workload-request: Request for a new workload type
  • workload/tier-2: High impact, introduces new patterns or requires domain knowledge.
