[Workload]: migration-stress #159

Description

@mrhillsman

Workload Name

migration-stress

Workload Description

High memory-dirty-rate workload purpose-built to stress KubeVirt live migration. Continuously writes to memory pages at a configurable rate, forcing the live migration pre-copy phase to iterate more rounds before convergence. Produces a sustained, measurable memory dirtying pattern that makes the VM actively hard to migrate — the exact scenario that storage, networking, and monitoring partners need to validate their products against during migration events.

This is uniquely KubeVirt/CNV — no equivalent exists in generic VM workload tooling. Live migration is a first-class operation in OpenShift Virtualization (triggered by node drain, maintenance, rebalancing), and partner products must continue functioning correctly while VMs migrate. The existing memory workload uses stress-ng to allocate and hold memory at 80% pressure — it tests memory capacity, not memory mutation rate. migration-stress targets a completely different axis: how fast pages are being dirtied, which directly determines migration difficulty and duration.

Tooling and Packages

  • Tool: stress-ng with the vm stressor in aggressive write mode, or a custom mmap/write loop
  • RPM packages: stress-ng
  • systemd service command: stress-ng --vm 2 --vm-bytes 75% --vm-method write64 --vm-keep --aggressive
    • --vm 2: two VM stressor workers
    • --vm-bytes 75%: each worker maps 75% of available memory
    • --vm-method write64: write 64-bit values (high dirty rate)
    • --vm-keep: keep mappings (re-dirty same pages rather than remap)
    • --aggressive: maximize throughput
  • Alternative: custom C/shell program using mmap(MAP_PRIVATE|MAP_ANONYMOUS) + sequential write loop for maximum dirty rate control
  • Configurable parameters:
    • dirty-rate-target: target MB/s of memory dirtying (default: unbounded — as fast as possible)
    • vm-workers: number of stressor workers (default: 2)
    • vm-bytes-percent: percentage of memory to map per worker (default: 75)
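    • vm-method: stress-ng vm method, in increasing order of expected dirty rate: write64 (fast), write1024v (faster), zero-one (fastest) (default: write64). Available method names depend on the stress-ng version shipped in the guest image.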
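
The custom write-loop alternative and the dirty-rate-target parameter listed above could be prototyped along these lines. This is a minimal sketch, not the proposed implementation: the function name dirty_pages and its arguments are hypothetical, the rate is treated as MiB/s, and a Python loop will dirty memory far more slowly than stress-ng or a C program would. It only illustrates the idea of touching every page of a kept anonymous mapping and sleeping to stay under a target rate.

```python
import mmap
import time


def dirty_pages(size_bytes, target_mb_per_s=None, duration_s=1.0,
                page_size=mmap.PAGESIZE):
    """Write one byte to every page of an anonymous private mapping, repeatedly.

    Returns the number of bytes dirtied, counted in whole pages.
    target_mb_per_s=None means unbounded (as fast as the loop can run).
    All names here are hypothetical; they are not part of any existing tool.
    """
    if size_bytes < page_size:
        raise ValueError("size_bytes must cover at least one page")

    # The mapping is created once and kept, so the same pages are re-dirtied
    # on every pass (the behaviour --vm-keep gives with stress-ng).
    buf = mmap.mmap(-1, size_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    pages = size_bytes // page_size
    dirtied = 0
    value = 0
    start = time.monotonic()
    deadline = start + duration_s
    try:
        while time.monotonic() < deadline:
            value = (value + 1) % 256  # change the value so each pass is a real write
            for i in range(pages):
                buf[i * page_size] = value
            dirtied += pages * page_size
            if target_mb_per_s is not None:
                # Sleep until wall-clock time catches up with the byte budget.
                budget_s = dirtied / (target_mb_per_s * 1024 * 1024)
                sleep_s = start + budget_s - time.monotonic()
                if sleep_s > 0:
                    time.sleep(sleep_s)
    finally:
        buf.close()
    return dirtied
```

Calling dirty_pages(64 * 1024 * 1024, target_mb_per_s=200, duration_s=10) would aim for roughly 200 MiB/s over a 64 MiB region for ten seconds; whether the loop can actually reach that rate depends on the guest.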

VM Count Model

Single VM (like cpu, memory, disk)

Required Resources

  • Persistent storage (DataVolume)
  • Kubernetes Service (for inter-VM communication)
  • Kubernetes Secret (for credentials or config)
  • Additional CPU/memory beyond defaults
  • GPU or special device passthrough

Memory should be sized larger than defaults (e.g., 4Gi+) to produce meaningful migration difficulty. With only 1Gi of memory, even a high dirty rate converges quickly. At 4-8Gi with aggressive dirtying, migration takes noticeably longer and may require multiple pre-copy rounds.
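
The relationship between memory size, dirty rate, and pre-copy rounds can be illustrated with a toy model. This is a deliberately simplified sketch and not the algorithm QEMU or KubeVirt actually uses: it assumes a constant dirty rate, a constant migration bandwidth, and a hypothetical downtime threshold of 0.3 seconds, all of which are illustration values rather than figures from this proposal.

```python
def precopy_rounds(mem_bytes, dirty_bytes_per_s, bandwidth_bytes_per_s,
                   working_set_bytes=None, max_downtime_s=0.3, max_rounds=30):
    """Estimate pre-copy rounds under a constant-rate toy model.

    Returns (rounds, total_seconds), or (None, elapsed_seconds) if the model
    does not converge within max_rounds.
    """
    if working_set_bytes is None:
        working_set_bytes = mem_bytes
    remaining = mem_bytes
    elapsed = 0.0
    for rounds in range(1, max_rounds + 1):
        # Time to send whatever is currently outstanding.
        t = remaining / bandwidth_bytes_per_s
        elapsed += t
        # Pages dirtied while that round was in flight, capped by the
        # amount of memory the workload actually touches.
        remaining = min(dirty_bytes_per_s * t, working_set_bytes)
        final_copy_s = remaining / bandwidth_bytes_per_s
        if final_copy_s <= max_downtime_s:
            # Small enough to send during the final stop-and-copy pause.
            return rounds, elapsed + final_copy_s
    return None, elapsed


GIB = 2**30
print(precopy_rounds(1 * GIB, 0.5 * GIB, 1 * GIB))  # (2, 1.75)
print(precopy_rounds(4 * GIB, 0.5 * GIB, 1 * GIB))  # (4, 7.75)
```

With the same dirty rate and bandwidth, the larger guest needs more rounds simply because the first full copy takes longer and leaves more dirtied memory behind. When the dirty rate meets or exceeds the bandwidth over a large working set, the model never converges, which corresponds to the "convergence-challenged" end of the spectrum.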

Cloud-Init Details

packages:
  - stress-ng
write_files:
  - path: /etc/systemd/system/virtwork-migration-stress.service
    content: |
      [Unit]
      Description=Virtwork migration stress workload (high memory dirty rate)
      After=multi-user.target
      [Service]
      Type=simple
      ExecStart=/usr/bin/stress-ng \
        --vm 2 \
        --vm-bytes 75%% \
        --vm-method write64 \
        --vm-keep \
        --aggressive \
        --metrics-brief
      Restart=always
      RestartSec=5
      [Install]
      WantedBy=multi-user.target
runcmd:
  - systemctl enable --now virtwork-migration-stress.service
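
The ExecStart line in the unit above could be generated from the configurable parameters rather than hard-coded. The sketch below assumes hypothetical helper names (stress_ng_args, systemd_exec_start) and covers only the three parameters that map directly to stress-ng flags; dirty-rate-target has no direct flag in this command and is left out. It also shows why the unit file contains 75%%: systemd expands % specifiers in unit files, so a literal percent sign is written as %%.

```python
def stress_ng_args(vm_workers=2, vm_bytes_percent=75, vm_method="write64"):
    """Build the stress-ng argument list used by the migration-stress unit.

    Hypothetical helper; defaults mirror the values proposed in this issue.
    """
    if not 1 <= vm_bytes_percent <= 100:
        raise ValueError("vm_bytes_percent must be between 1 and 100")
    if vm_workers < 1:
        raise ValueError("vm_workers must be at least 1")
    return [
        "/usr/bin/stress-ng",
        "--vm", str(vm_workers),
        "--vm-bytes", f"{vm_bytes_percent}%",
        "--vm-method", vm_method,
        "--vm-keep",
        "--aggressive",
        "--metrics-brief",
    ]


def systemd_exec_start(args):
    """Render an ExecStart= line, escaping % as %% for systemd."""
    return "ExecStart=" + " ".join(a.replace("%", "%%") for a in args)


print(systemd_exec_start(stress_ng_args()))
```

Keeping the argument list separate from the systemd rendering means the same list can be run directly in a shell for testing, where the percent sign must not be doubled.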

Use Case

  • Storage partners (CSI drivers): During live migration, the storage layer must handle the VM's disk being accessed from a new node. A VM that is actively dirtying memory at high rates creates longer migration windows and more complex storage handoff scenarios. Partners need to validate that their CSI driver maintains I/O consistency during extended migration — not just the quick migrations that idle VMs produce.
  • Network partners (CNI, SDN, OVN-Kubernetes): Live migration moves a VM's network identity (IP, MAC) to a new node. A VM with active memory writes produces a longer migration window, during which traffic must keep reaching the source and then switch cleanly to the destination at cutover. Partners need sustained migrations to validate network cutover correctness and to measure downtime.
  • Monitoring/Observability partners: Partners need to validate that their agents correctly handle metric continuity during migration — no gaps, no duplicate data points, correct host/node attribution before and after migration. A VM that migrates quickly (idle VM, <1 second) doesn't exercise this. A VM that takes 30-60 seconds to migrate because of dirty memory exposes real-world monitoring gaps.
  • Platform engineering / Node lifecycle: Validates that oc adm drain with live migration works correctly under realistic conditions. An idle cluster drains instantly; a cluster with migration-stress VMs reveals the real-world drain duration and any timeout issues.
  • Red Hat CNV engineering: Provides a reproducible, tunable migration difficulty benchmark. Adjusting vm-bytes-percent and vm-method produces a spectrum from "easy migration" to "convergence-challenged migration" — useful for regression testing migration improvements.

Additional Context

  • This workload is distinct from the existing memory workload in purpose and behavior:
    • memory: allocates and holds memory at target pressure — tests memory capacity and OOM behavior
    • migration-stress: continuously writes to memory pages — tests migration convergence and duration
    • A VM can run both simultaneously (memory pressure + high dirty rate), but they test different things.
  • The key metric to observe is migration duration: run virtctl migrate <vm-name>, then watch oc get vmim (VirtualMachineInstanceMigration) until the migration completes. With this workload active, migration should take measurably longer than it does for an idle VM.
  • Consider pairing this workload with a monitoring dashboard that shows: migration start time, dirty page rate during pre-copy iterations, convergence point, total migration duration, and post-migration workload resumption. This is the demo that sells the value to partners.
  • stress-ng's --vm-method options provide a tunable dirty rate spectrum:
    • write64: moderate dirty rate (~2-5 GB/s depending on hardware)
    • write1024v: higher dirty rate (1024-bit vector writes)
    • zero-one: maximum dirty rate (sets all memory bits to zero, then to one)
    • This lets users tune migration difficulty from "slightly harder than idle" to "convergence-challenged."
  • The --vm-keep flag is critical — without it, stress-ng remaps memory on each iteration, which tests page fault handling rather than page dirtying. --vm-keep ensures the same pages are repeatedly dirtied, which is what stresses the migration pre-copy algorithm.
  • Memory recommendation: 4Gi minimum, 8Gi for meaningful migration stress. At 2Gi with write64, migration typically converges in 1-2 pre-copy rounds. At 8Gi, it may take 5+ rounds or require post-copy migration fallback.
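
The migration-duration metric mentioned in the list above could be computed from the VMI object once a migration finishes. This sketch assumes the JSON comes from something like oc get vmi <vm-name> -o json and that the object carries status.migrationState.startTimestamp and status.migrationState.endTimestamp in RFC 3339 UTC form; those field names reflect my understanding of the KubeVirt API and should be verified against the cluster's version. The function name is hypothetical.

```python
import json
from datetime import datetime, timezone

# Assumed timestamp layout, e.g. "2024-01-01T00:00:42Z".
_TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def _parse_ts(ts):
    return datetime.strptime(ts, _TS_FORMAT).replace(tzinfo=timezone.utc)


def migration_duration_s(vmi_json):
    """Return migration duration in seconds, or None if timestamps are missing.

    vmi_json is the JSON text of a VirtualMachineInstance object.
    """
    state = json.loads(vmi_json).get("status", {}).get("migrationState") or {}
    start = state.get("startTimestamp")
    end = state.get("endTimestamp")
    if not (start and end):
        # Migration never ran, or is still in progress.
        return None
    return (_parse_ts(end) - _parse_ts(start)).total_seconds()
```

Comparing this value for an idle VM and for a VM running migration-stress on the same node pair gives a simple before/after number for the difficulty the workload adds.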

Metadata

Assignees

No one assigned

    Labels

    • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
    • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
    • workload-request: Request for a new workload type
    • workload/tier-2: High impact, introduces new patterns or requires domain knowledge.
