Chaos workload that triggers live migrations on other virtwork-managed VMs at random intervals, testing partner product resilience to KubeVirt live migration events. Runs inside a dedicated "controller" VM and uses virtctl to initiate migrations on target VMs selected by label, producing a continuous stream of migration events — start, in-progress, completed, or failed — across the VM fleet.
This is a chaos variant in the same family as chaos-disk, chaos-network, and chaos-process, but operates at the platform level rather than inside a single VM. The existing chaos workloads inject faults within a VM (disk fill, network disruption, process kill). chaos-migration injects a platform-level disruption — live migration — that affects the VM's network identity, storage attachment, and node placement. Every partner product that runs alongside KubeVirt VMs must handle migrations correctly: monitoring agents must maintain metric continuity, storage drivers must handle disk re-attachment, network products must re-route traffic, and backup products must handle VMs moving between nodes.
Live migration is not a rare event on OpenShift — it's triggered by node drain (oc adm drain), cluster upgrades, resource rebalancing, and descheduler policies. Partners cannot claim KubeVirt support without validating migration resilience.
Tooling and Packages
Tool: virtctl CLI (KubeVirt client) for triggering migrations, kubectl for label-based VM discovery
RPM packages: none — virtctl and kubectl are single binary downloads
systemd service command: /usr/local/bin/virtwork-chaos-migration.sh
Behavior loop:
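Discover virtwork-managed VMs in the target namespace via label selector (app.kubernetes.io/managed-by=virtwork)
Exclude self (the chaos-migration VM) from targets
Select one target at random and trigger the migration with virtctl migrate <vm-name> -n <namespace>
Wait for the migration to succeed, fail, or time out, then log the target, result, and duration
Sleep for a configurable interval (randomized within the min/max range)
Repeat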
Configurable parameters:
migration-interval-min: minimum seconds between migrations (default: 120)
migration-interval-max: maximum seconds between migrations (default: 600)
migration-timeout: seconds to wait for a migration to complete (default: 300)
migration-target-labels: additional label selector to filter target VMs (default: all virtwork VMs)
migration-exclude-components: workload components to exclude from migration (default: none — but chaos-migration always excludes itself)
migration-dry-run: log which VM would be migrated without triggering (default: false)
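The interval parameters above only make sense when the minimum does not exceed the maximum, and the modulo arithmetic in the script misbehaves otherwise. A minimal sketch of a validated picker follows. The function name `pick_interval` is hypothetical, and it uses awk instead of the script's `$RANDOM` so that it also runs under a plain POSIX shell.

```shell
# Hypothetical helper: validate the interval bounds, then pick a wait time.
# Prints a whole number of seconds in [min, max]; returns 1 on bad input.
pick_interval() {
  min="$1"; max="$2"
  for v in "$min" "$max"; do
    case "$v" in
      ''|*[!0-9]*)
        echo "interval bounds must be non-negative integers" >&2
        return 1 ;;
    esac
  done
  if [ "$min" -gt "$max" ]; then
    echo "migration-interval-min ($min) exceeds migration-interval-max ($max)" >&2
    return 1
  fi
  # rand() is in [0, 1), so the result stays within the inclusive range.
  # srand() seeds from the clock, which is enough for waits of seconds or more.
  awk -v min="$min" -v max="$max" \
    'BEGIN { srand(); print min + int(rand() * (max - min + 1)) }'
}

# Example with the documented defaults
pick_interval 120 600
```

Rejecting bad bounds at startup gives a clear error in the journal instead of a restart loop caused by an arithmetic failure.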
VM Count Model
Single VM (like cpu, memory, disk)
This workload is a "controller" VM — it doesn't generate resource pressure itself, it orchestrates disruptions on other VMs. It requires at least one other virtwork workload running in the same namespace to have migration targets. Intended to be deployed alongside other workloads: virtwork run --workloads cpu,disk,database,chaos-migration.
Required Resources
Persistent storage (DataVolume)
Kubernetes Service (for inter-VM communication)
Kubernetes Secret (for credentials or config)
Additional CPU/memory beyond defaults
GPU or special device passthrough
The Secret holds a kubeconfig or ServiceAccount token with permissions to get/list VirtualMachines, get/list VirtualMachineInstances, and create VirtualMachineInstanceMigrations in the target namespace. The controller VM needs minimal resources (1 CPU, 512Mi memory) — it's running shell scripts, not generating load.
Cloud-Init Details
write_files:
- path: /usr/local/bin/virtwork-chaos-migration.sh
  permissions: '0755'
  content: |
    #!/bin/bash
    set -euo pipefail

    NAMESPACE="${MIGRATION_NAMESPACE:-<namespace>}"
    INTERVAL_MIN="${MIGRATION_INTERVAL_MIN:-120}"
    INTERVAL_MAX="${MIGRATION_INTERVAL_MAX:-600}"
    TIMEOUT="${MIGRATION_TIMEOUT:-300}"
    EXCLUDE="${MIGRATION_EXCLUDE_SELF:-<self-vm-name>}"
    KUBECONFIG="${KUBECONFIG:-/etc/virtwork/kubeconfig}"
    DRY_RUN="${MIGRATION_DRY_RUN:-false}"
    export KUBECONFIG

    # Random wait in [INTERVAL_MIN, INTERVAL_MAX] seconds
    random_interval() {
      echo $(( RANDOM % (INTERVAL_MAX - INTERVAL_MIN + 1) + INTERVAL_MIN ))
    }

    while true; do
      # Discover running VMIs (exclude self). The phase filter runs client-side
      # so it does not depend on CRD field-selector support. "|| true" keeps an
      # empty result (grep exits 1) or a failed lookup from aborting under set -e.
      TARGETS=$(kubectl get vmi -n "$NAMESPACE" \
        -l app.kubernetes.io/managed-by=virtwork \
        -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.metadata.name}{"\n"}{end}' | \
        grep -vxF -- "$EXCLUDE" | grep -vF -- "chaos-migration" || true)

      if [ -z "$TARGETS" ]; then
        echo "$(date -Iseconds) No migration targets found, waiting..."
        sleep 60
        continue
      fi

      # Select random target
      TARGET=$(echo "$TARGETS" | shuf -n1)
      WAIT=$(random_interval)

      if [ "$DRY_RUN" = "true" ]; then
        echo "$(date -Iseconds) DRY-RUN: would migrate $TARGET (next in ${WAIT}s)"
        sleep "$WAIT"
        continue
      fi

      echo "$(date -Iseconds) Triggering migration of $TARGET"
      START_TIME=$(date +%s)

      # Trigger migration
      if virtctl migrate "$TARGET" -n "$NAMESPACE" 2>&1; then
        # Find the newest migration object for this target, not just the
        # newest one in the namespace
        MIGRATION_NAME=$(kubectl get vmim -n "$NAMESPACE" \
          --sort-by=.metadata.creationTimestamp \
          -o jsonpath='{range .items[?(@.spec.vmiName=="'"$TARGET"'")]}{.metadata.name}{"\n"}{end}' \
          2>/dev/null | tail -n1 || true)

        if [ -n "$MIGRATION_NAME" ]; then
          # Wait for migration to complete
          kubectl wait vmim/"$MIGRATION_NAME" -n "$NAMESPACE" \
            --for=jsonpath='{.status.phase}'=Succeeded \
            --timeout="${TIMEOUT}s" 2>/dev/null && \
            RESULT="succeeded" || RESULT="timeout-or-failed"
        else
          RESULT="no-migration-object"
        fi

        END_TIME=$(date +%s)
        DURATION=$((END_TIME - START_TIME))
        echo "$(date -Iseconds) Migration of $TARGET: $RESULT (${DURATION}s)"
      else
        echo "$(date -Iseconds) Migration trigger failed for $TARGET"
      fi

      echo "$(date -Iseconds) Next migration in ${WAIT}s"
      sleep "$WAIT"
    done
- path: /etc/systemd/system/virtwork-chaos-migration.service
  content: |
    [Unit]
    Description=Virtwork chaos migration controller
    After=network-online.target
    Wants=network-online.target

    [Service]
    Type=simple
    Environment=MIGRATION_NAMESPACE=<namespace>
    Environment=MIGRATION_INTERVAL_MIN=120
    Environment=MIGRATION_INTERVAL_MAX=600
    Environment=MIGRATION_TIMEOUT=300
    Environment=MIGRATION_EXCLUDE_SELF=<self-vm-name>
    ExecStart=/usr/local/bin/virtwork-chaos-migration.sh
    Restart=always
    RestartSec=30

    [Install]
    WantedBy=multi-user.target
runcmd:
- curl -Lo /usr/local/bin/virtctl https://github.com/kubevirt/kubevirt/releases/download/v1.7.0/virtctl-v1.7.0-linux-amd64
- chmod +x /usr/local/bin/virtctl
- curl -Lo /usr/local/bin/kubectl "https://dl.k8s.io/release/$(curl -sL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
- chmod +x /usr/local/bin/kubectl
- systemctl enable --now virtwork-chaos-migration.service
Use Case
Monitoring/Observability partners (Datadog, Dynatrace, Sysdig, Prometheus ecosystem): The #1 validation scenario. When a VM migrates, its metrics should show continuity: no gaps, no duplicate data points, correct node/host attribution before and after migration. Most monitoring agents track the host/node the VM runs on; migration changes this. Partners need repeated, unpredictable migrations to validate that their agents handle the topology change correctly. A single manual migration proves it works once; chaos-migration proves it works continuously.
Storage partners (CSI drivers): Live migration involves detaching and re-attaching the VM's storage on the destination node. For ReadWriteOnce (RWO) PVCs, this is a complex handoff. For ReadWriteMany (RWX) or shared storage, migration should be transparent. Partners need sustained migrations under active I/O (from disk or database workloads) to validate that their CSI driver handles the storage handoff without data loss or corruption.
Network partners (CNI, OVN-Kubernetes, SDN): Migration moves the VM's network identity (pod IP, MAC address) to a new node. In-flight connections may be disrupted. Network partners need repeated migrations to validate that their CNI correctly handles IP re-assignment, that Services continue routing to the migrated VM, and that network policies remain enforced post-migration.
Backup/DR partners: A VM that migrates during a backup window may cause backup failures (PVC attachment changes, VSS freeze/thaw disrupted). Partners need to validate that their backup product handles concurrent migration events gracefully — either pausing the backup, retrying, or completing without corruption.
Platform engineering / SRE tooling: Validates that cluster operations (node drain, upgrade, rebalance) work correctly when VMs are running realistic workloads. The randomized migration interval simulates the unpredictable timing of real operational events.
Additional Context
Relationship to existing chaos workloads: This follows the same pattern as chaos-disk, chaos-network, and chaos-process:
chaos-disk: injects disk-level faults inside a VM
chaos-network: injects network faults inside a VM
chaos-process: kills processes inside a VM
chaos-migration: triggers platform-level disruption from a VM to other VMs
The key difference is that chaos-migration operates externally on other VMs rather than internally on itself. This requires cluster API access (kubeconfig/ServiceAccount) that the other chaos workloads don't need.
RBAC requirements: The chaos-migration VM needs a ServiceAccount with:
get, list on virtualmachines.kubevirt.io and virtualmachineinstances.kubevirt.io
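create on virtualmachineinstancemigrations.kubevirt.io, plus get, list, and watch on the same resource, because the script lists migration objects and runs kubectl wait against them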
Scoped to the workload namespace via Role + RoleBinding
This is additional RBAC beyond what virtwork currently creates. The implementation should create these resources as part of the workload's RequiresService() / resource setup path.
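The RBAC described above could be rendered as a Role and RoleBinding along these lines. This is a sketch under assumptions: the object name `virtwork-chaos-migration`, the function `render_rbac`, and the example ServiceAccount name are all hypothetical. The get, list, and watch verbs on migrations are included because the script lists migration objects and calls `kubectl wait` on them. Depending on the virtctl version, `virtctl migrate` may go through the `virtualmachines/migrate` subresource in the `subresources.kubevirt.io` API group, which would need its own rule; verify that against the pinned version.

```shell
# Sketch: print a Role and RoleBinding for the controller's ServiceAccount.
# All object names here are hypothetical placeholders.
render_rbac() {
  ns="$1"; sa="$2"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: virtwork-chaos-migration
  namespace: ${ns}
rules:
- apiGroups: ["kubevirt.io"]
  resources: ["virtualmachines", "virtualmachineinstances"]
  verbs: ["get", "list"]
- apiGroups: ["kubevirt.io"]
  resources: ["virtualmachineinstancemigrations"]
  verbs: ["create", "get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: virtwork-chaos-migration
  namespace: ${ns}
subjects:
- kind: ServiceAccount
  name: ${sa}
  namespace: ${ns}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: virtwork-chaos-migration
EOF
}

# Print the manifest; pipe it to "kubectl apply -f -" to create the objects.
render_rbac my-namespace chaos-migration-sa
```

Keeping both objects namespaced matches the requirement that the controller can only act on VMs in the workload namespace.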
Self-exclusion: The chaos-migration VM must never migrate itself — migrating the controller disrupts the migration schedule. The script excludes its own VM name and any VM with "chaos-migration" in the name from the target list.
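The self-exclusion rule can be sketched as a small filter over candidate names. The function name `filter_targets` is hypothetical. One detail worth showing: grep exits non-zero when nothing survives the filter, which would abort a script running under `set -euo pipefail` unless that status is absorbed.

```shell
# Sketch: drop the controller's own name and any name containing
# "chaos-migration". Reads candidate names on stdin, one per line.
filter_targets() {
  self="$1"
  # -x matches the whole line, -F treats the name as a literal string.
  # "|| true" absorbs grep's exit status 1 when no candidates remain.
  { grep -vxF -- "$self" | grep -vF -- "chaos-migration"; } || true
}

# Only disk-vm remains: cpu-vm is "self" here, and the third name
# contains "chaos-migration".
printf 'cpu-vm\ndisk-vm\nvirtwork-chaos-migration-0\n' | filter_targets cpu-vm
```

An empty result is then a normal outcome that the caller can test for, rather than an error.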
Multi-node requirement: Live migration requires at least two schedulable nodes. On single-node clusters, migration will fail. The implementation should detect single-node clusters and either skip with a warning or fail fast with a clear message.
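A fail-fast check for the multi-node requirement might look like the sketch below. The helper name `require_multi_node` is hypothetical. It assumes the node list comes from the `kubevirt.io/schedulable=true` node label and that the caller has cluster-scoped permission to list nodes, which the namespaced Role described in this issue does not grant, so the check may belong in virtwork itself rather than in the controller VM.

```shell
# Sketch: fail fast when fewer than two nodes can host VMs.
# Reads node names on stdin, one per line.
require_multi_node() {
  # grep -c prints the count but exits 1 when it is zero.
  count=$(grep -c . || true)
  if [ "$count" -lt 2 ]; then
    echo "live migration needs at least 2 schedulable nodes, found $count" >&2
    return 1
  fi
  echo "found $count schedulable nodes"
}

# Intended use (needs permission to list nodes):
#   kubectl get nodes -l kubevirt.io/schedulable=true -o name | require_multi_node
printf 'node/worker-0\nnode/worker-1\n' | require_multi_node
```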
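Migration frequency tuning: The script migrates one VM at a time, so the interval sets the fleet-wide rate. The default range (120-600s, randomized) yields between 6 and 30 migrations per hour at the extremes and about 10 per hour on average (mean interval 360s), slightly fewer once the time spent waiting for each migration is counted. For aggressive testing, set migration-interval-min=30 and migration-interval-max=60 to get near-continuous migrations (at most roughly 60-120 per hour). For background chaos, use the defaults.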
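The rate arithmetic behind these settings can be checked with a few lines of shell. This is a back-of-the-envelope sketch that ignores the time each migration takes, so real rates will be somewhat lower. The helper name `per_hour` is hypothetical.

```shell
# Sketch: migrations per hour for a given interval in seconds,
# ignoring the duration of the migration itself.
per_hour() { echo $(( 3600 / $1 )); }

mean=$(( (120 + 600) / 2 ))   # mean of the default range: 360s
echo "defaults: $(per_hour 600) to $(per_hour 120) per hour, about $(per_hour "$mean") on average"
echo "aggressive: $(per_hour 60) to $(per_hour 30) per hour"
```

With the defaults this prints 6 to 30 per hour and about 10 on average; with the aggressive settings, 60 to 120 per hour.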
Composability: This workload is designed to run alongside other workloads — it's the "chaos layer" on top of the "signal layer." The most valuable deployment pattern:
virtwork run --workloads cpu,memory,disk,database,network,chaos-migration
This produces sustained workload signals and continuous migration disruptions, validating that partner products maintain signal accuracy through platform events.
Audit integration: Migration events (target VM, start time, result, duration) logged to stdout/journald are captured in the VM's journal. Consider also recording migration events in virtwork's audit database via a future audit API — but stdout logging is sufficient for initial implementation.
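If the journal is the only record at first, a fixed key=value layout would make the events easier to extract later. The sketch below is one possible shape; the function name `log_event` and the field names are hypothetical and not part of virtwork.

```shell
# Sketch: one parseable line per migration event.
# Arguments: timestamp, target VM, result, duration in seconds.
log_event() {
  printf '%s event=migration target=%s result=%s duration_s=%s\n' \
    "$1" "$2" "$3" "$4"
}

log_event "$(date -u +%Y-%m-%dT%H:%M:%SZ)" disk-vm succeeded 42
```

Lines in this form can be pulled from the controller VM with `journalctl -u virtwork-chaos-migration.service` and filtered on `event=migration`.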
virtctl version pinning: The cloud-init downloads a specific virtctl version. This should match the cluster's KubeVirt version. Consider making the virtctl version configurable or detecting it from the cluster.
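Making the version configurable could be as simple as building the download URL from a variable. The sketch below reuses the URL pattern from the cloud-init above; the `VIRTCTL_VERSION` variable and the `virtctl_url` helper are hypothetical, and detecting the version from the cluster is left out.

```shell
# Sketch: build the virtctl download URL from a version and architecture.
# The URL pattern is taken from the runcmd entry in the cloud-init.
virtctl_url() {
  version="$1"; arch="${2:-amd64}"
  echo "https://github.com/kubevirt/kubevirt/releases/download/${version}/virtctl-${version}-linux-${arch}"
}

# Falls back to the currently pinned release when VIRTCTL_VERSION is unset.
virtctl_url "${VIRTCTL_VERSION:-v1.7.0}"
```

The runcmd step would then download from `$(virtctl_url "$VIRTCTL_VERSION")` instead of a hardcoded address.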
Workload Name
chaos-migration