[Workload]: migration-stress #159

### Workload Name

migration-stress

### Workload Description

High memory-dirty-rate workload purpose-built to stress KubeVirt live migration. Continuously writes to memory pages at a configurable rate, forcing the live migration pre-copy phase to iterate more rounds before convergence. Produces a sustained, measurable memory dirtying pattern that makes the VM actively hard to migrate — the exact scenario that storage, networking, and monitoring partners need to validate their products against during migration events.

This is uniquely KubeVirt/CNV — no equivalent exists in generic VM workload tooling. Live migration is a first-class operation in OpenShift Virtualization (triggered by node drain, maintenance, rebalancing), and partner products must continue functioning correctly while VMs migrate. The existing `memory` workload uses stress-ng to allocate and hold memory at 80% pressure — it tests memory capacity, not memory mutation rate. `migration-stress` targets a completely different axis: how fast pages are being dirtied, which directly determines migration difficulty and duration.

### Tooling and Packages

- Tool: stress-ng with the `vm` stressor in aggressive write mode, or a custom mmap/write loop
- RPM packages: `stress-ng`
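- systemd service command: `stress-ng --vm 2 --vm-bytes 75% --vm-method write64 --vm-keep --aggressive` (inside a unit file the `%` must be written as `%%`, because systemd treats `%` as a specifier prefix)
  - `--vm 2`: two VM stressor workers
  - `--vm-bytes 75%`: size of the mapping as a percentage of available memory; check whether the installed stress-ng version applies the percentage per worker or splits it across workers, since two workers each mapping 75% would exceed available memory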
  - `--vm-method write64`: write 64-bit values (high dirty rate)
  - `--vm-keep`: keep mappings (re-dirty same pages rather than remap)
  - `--aggressive`: maximize throughput
- Alternative: custom C program (or a small script in a language with an `mmap` binding) using `mmap(MAP_PRIVATE|MAP_ANONYMOUS)` + sequential write loop for maximum dirty rate control
- Configurable parameters:
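  - `dirty-rate-target`: target MB/s of memory dirtying (default: unbounded, i.e. as fast as possible); a bounded target would likely need the custom write loop, since the stress-ng command above does not set a rate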
  - `vm-workers`: number of stressor workers (default: 2)
  - `vm-bytes-percent`: percentage of memory to map per worker (default: 75)
  - `vm-method`: stress-ng vm method, for example `write64`, `write1024v` (wide vector writes), or `zero-one` (default: `write64`); confirm the method names against the installed stress-ng man page
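
To illustrate what a bounded `dirty-rate-target` could look like in the custom-loop alternative, here is a minimal Python sketch. It is not a proposed implementation: the function names, the 10 ms pacing tick, and the one-byte-per-page write are all assumptions made for illustration, and a real workload aiming at GB/s rates would almost certainly be written in C.

```python
import mmap
import time

PAGE_SIZE = mmap.PAGESIZE


def dirty_pages(region: mmap.mmap, start_page: int, count: int, stamp: int) -> int:
    """Write one byte into `count` consecutive pages, wrapping at the end.

    Touching a single byte is enough for the whole page to count as dirty.
    Returns the index of the next page to dirty.
    """
    total_pages = len(region) // PAGE_SIZE
    page = start_page
    for _ in range(count):
        region[page * PAGE_SIZE] = stamp & 0xFF
        page = (page + 1) % total_pages
    return page


def run(region_bytes: int, target_mib_per_s: float, duration_s: float,
        tick_s: float = 0.01) -> int:
    """Dirty memory at roughly `target_mib_per_s` for `duration_s` seconds.

    Returns the number of pages dirtied. The names and the tick-based
    pacing are hypothetical; they are not part of stress-ng or virtwork.
    """
    if region_bytes < PAGE_SIZE:
        raise ValueError("region must hold at least one page")
    # Anonymous mapping; the C equivalent would pass MAP_PRIVATE|MAP_ANONYMOUS.
    region = mmap.mmap(-1, region_bytes)
    pages_per_tick = max(1, int(target_mib_per_s * 1024 * 1024 * tick_s / PAGE_SIZE))
    deadline = time.monotonic() + duration_s
    next_tick = time.monotonic()
    page = dirtied = stamp = 0
    try:
        while time.monotonic() < deadline:
            page = dirty_pages(region, page, pages_per_tick, stamp)
            dirtied += pages_per_tick
            stamp += 1  # change the value so every pass really modifies the page
            next_tick += tick_s
            delay = next_tick - time.monotonic()
            if delay > 0:
                time.sleep(delay)
    finally:
        region.close()
    return dirtied
```

For example, `run(2 * 1024**3, 500, 60)` would try to dirty a 2 GiB region at about 500 MiB/s for one minute. If the loop cannot keep up, it simply runs unpaced, which matches the "unbounded" default.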

### VM Count Model

Single VM (like cpu, memory, disk)

### Required Resources

- [ ] Persistent storage (DataVolume)
- [ ] Kubernetes Service (for inter-VM communication)
- [ ] Kubernetes Secret (for credentials or config)
- [x] Additional CPU/memory beyond defaults
- [ ] GPU or special device passthrough

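Memory should be sized larger than the defaults (e.g., 4Gi+) to produce meaningful migration difficulty. Pre-copy converges once the pages dirtied during a round can be transferred within the allowed downtime, so a small writable working set is cheap to resend: with only 1Gi of memory, even a high dirty rate tends to converge quickly. At 4-8Gi with aggressive dirtying, migration takes noticeably longer and may require multiple pre-copy rounds.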
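
The sizing argument can be made concrete with a deliberately simplified model of pre-copy. This sketch is an assumption-laden illustration, not how QEMU or KubeVirt actually estimate convergence: it assumes a constant dirty rate and bandwidth, caps the dirty set at the writable working set, and ignores auto-converge throttling, compression, and timeouts. All parameter names are hypothetical.

```python
from typing import Optional, Tuple


def precopy_rounds(mem: float, working_set: float, dirty_rate: float,
                   bandwidth: float, max_downtime_s: float,
                   max_rounds: int = 50) -> Optional[Tuple[int, float]]:
    """Estimate (pre-copy rounds, total seconds), or None if it never converges.

    `mem` and `working_set` share one size unit (e.g. GiB); `dirty_rate`
    and `bandwidth` are in that unit per second.
    """
    to_send = mem  # the first round copies all guest memory
    elapsed = 0.0
    for round_no in range(1, max_rounds + 1):
        send_time = to_send / bandwidth
        elapsed += send_time
        # Pages dirtied while this round was on the wire. Re-dirtying the
        # same pages does not grow the set beyond the working set.
        to_send = min(dirty_rate * send_time, working_set)
        final_copy_s = to_send / bandwidth
        if final_copy_s <= max_downtime_s:
            # Small enough to pause the VM and send the remainder.
            return round_no, elapsed + final_copy_s
    return None


# 4 GiB guest, 3 GiB working set, dirtying at half the link bandwidth.
print(precopy_rounds(4.0, 3.0, 0.5, 1.0, 0.1))
# Same guest dirtying twice as fast as the link can carry.
print(precopy_rounds(4.0, 3.0, 2.0, 1.0, 0.1))
```

In this model the deciding factors are the ratio of dirty rate to bandwidth and how the working set compares with what can be sent inside the downtime budget, which is why both `vm-bytes-percent` and `vm-method` act as difficulty knobs.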

### Cloud-Init Details

```yaml
packages:
  - stress-ng
write_files:
  - path: /etc/systemd/system/virtwork-migration-stress.service
    content: |
      [Unit]
      Description=Virtwork migration stress workload (high memory dirty rate)
      After=multi-user.target
      [Service]
      Type=simple
      ExecStart=/usr/bin/stress-ng \
        --vm 2 \
        --vm-bytes 75%% \
        --vm-method write64 \
        --vm-keep \
        --aggressive \
        --metrics-brief
      Restart=always
      RestartSec=5
      [Install]
      WantedBy=multi-user.target
runcmd:
  - systemctl enable --now virtwork-migration-stress.service
```
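
If the unit file is generated from the configurable parameters rather than hard-coded, the `%` in `--vm-bytes` needs escaping, because systemd expands `%`-specifiers in unit files and `%%` is the literal percent sign. A minimal sketch of such a generator follows; the `MigrationStressConfig` class and both function names are hypothetical and only mirror the parameters listed in this issue.

```python
from dataclasses import dataclass


@dataclass
class MigrationStressConfig:
    """Hypothetical container for the parameters proposed in this issue."""
    vm_workers: int = 2
    vm_bytes_percent: int = 75
    vm_method: str = "write64"


def stress_ng_args(cfg: MigrationStressConfig) -> list:
    """Build the stress-ng argument list used by the service above."""
    if cfg.vm_workers < 1:
        raise ValueError("vm_workers must be at least 1")
    if not 1 <= cfg.vm_bytes_percent <= 100:
        raise ValueError("vm_bytes_percent must be between 1 and 100")
    return [
        "/usr/bin/stress-ng",
        "--vm", str(cfg.vm_workers),
        "--vm-bytes", f"{cfg.vm_bytes_percent}%",
        "--vm-method", cfg.vm_method,
        "--vm-keep",
        "--aggressive",
        "--metrics-brief",
    ]


def systemd_exec_start(args: list) -> str:
    """Render an ExecStart= line, doubling % so systemd keeps it literal."""
    return "ExecStart=" + " ".join(arg.replace("%", "%%") for arg in args)


print(systemd_exec_start(stress_ng_args(MigrationStressConfig())))
```

With the defaults this renders the same command as the unit above on a single line; the method name is passed through unchecked, so validating it against the installed stress-ng is left to the caller.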

### Use Case

- **Storage partners (CSI drivers):** During live migration, the storage layer must handle the VM's disk being accessed from a new node. A VM that is actively dirtying memory at high rates creates longer migration windows and more complex storage handoff scenarios. Partners need to validate that their CSI driver maintains I/O consistency during extended migration — not just the quick migrations that idle VMs produce.
- **Network partners (CNI, SDN, OVN-Kubernetes):** Live migration moves a VM's network identity (IP, MAC) to a new node. A VM with active memory writes produces a longer migration window, during which traffic must keep reaching the source until cutover and then switch cleanly to the destination. Partners need sustained migrations to validate network cutover correctness and to measure downtime.
- **Monitoring/Observability partners:** Partners need to validate that their agents preserve metric continuity during migration: no gaps, no duplicate data points, and correct host/node attribution before and after migration. An idle VM that migrates within a second or so doesn't exercise this. A VM that takes 30-60 seconds to migrate because of dirty memory exposes real-world monitoring gaps.
- **Platform engineering / Node lifecycle:** Validates that `oc adm drain` with live migration works correctly under realistic conditions. A node running only idle VMs drains quickly; a cluster with migration-stress VMs reveals the real-world drain duration and any timeout issues.
- **Red Hat CNV engineering:** Provides a reproducible, tunable migration difficulty benchmark. Adjusting `vm-bytes-percent` and `vm-method` produces a spectrum from "easy migration" to "convergence-challenged migration" — useful for regression testing migration improvements.

### Additional Context

- This workload is distinct from the existing `memory` workload in purpose and behavior:
  - `memory`: allocates and *holds* memory at target pressure — tests memory capacity and OOM behavior
  - `migration-stress`: continuously *writes* to memory pages — tests migration convergence and duration
  - A VM can run both simultaneously (memory pressure + high dirty rate), but they test different things.
- The key metric to observe is migration duration: `virtctl migrate <vm-name>` followed by watching `oc get vmim` (VirtualMachineInstanceMigration) for the migration to complete. With this workload active, migration should take measurably longer than an idle VM.
- Consider pairing this workload with a monitoring dashboard that shows: migration start time, dirty page rate during pre-copy iterations, convergence point, total migration duration, and post-migration workload resumption. This is the demo that sells the value to partners.
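- stress-ng's `--vm-method` options provide a tunable dirty rate spectrum (method names should be confirmed against the installed stress-ng man page):
  - `write64`: moderate dirty rate (~2-5 GB/s depending on hardware)
  - `write1024v`: higher dirty rate, using wide vector writes where the CPU and compiler support them
  - `zero-one`: fills memory with zeros, then ones, checking each pass; the zero pass may be cheap to migrate because QEMU detects zero pages
  - This lets users tune migration difficulty from "slightly harder than idle" to "convergence-challenged."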
- The `--vm-keep` flag is critical — without it, stress-ng remaps memory on each iteration, which tests page fault handling rather than page dirtying. `--vm-keep` ensures the same pages are repeatedly dirtied, which is what stresses the migration pre-copy algorithm.
- Memory recommendation: 4Gi minimum, 8Gi for meaningful migration stress. At 2Gi with write64, migration typically converges in 1-2 pre-copy rounds. At 8Gi, it may take 5+ rounds, and it may not converge at all unless auto-converge or post-copy is allowed in the cluster's migration configuration.
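
For the migration-duration metric mentioned above, one option is to read the timestamps that KubeVirt records on the VMI after a migration. The sketch below assumes the VMI exposes `status.migrationState` with `startTimestamp`, `endTimestamp`, and `completed`; those field names should be checked against the KubeVirt version in use, and the helper names are hypothetical.

```python
import json
import subprocess
from datetime import datetime
from typing import Optional

# Kubernetes emits RFC 3339 timestamps in UTC, e.g. 2024-01-01T10:00:00Z.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def migration_duration_seconds(vmi: dict) -> Optional[float]:
    """Return the last migration's duration, or None if none has completed."""
    state = vmi.get("status", {}).get("migrationState") or {}
    start = state.get("startTimestamp")
    end = state.get("endTimestamp")
    if not (state.get("completed") and start and end):
        return None
    started = datetime.strptime(start, TIMESTAMP_FORMAT)
    ended = datetime.strptime(end, TIMESTAMP_FORMAT)
    return (ended - started).total_seconds()


def load_vmi(name: str, namespace: str) -> dict:
    """Fetch the VMI as JSON via `oc`; requires a logged-in oc client."""
    out = subprocess.check_output(
        ["oc", "get", "vmi", name, "-n", namespace, "-o", "json"]
    )
    return json.loads(out)
```

Running `migration_duration_seconds(load_vmi("<vm-name>", "<namespace>"))` once for an idle VM and once with this workload active gives the before/after comparison the workload is meant to expose.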
