katamaran

Zero-packet-drop live migration for Kata Containers.

TL;DR

make                # builds all katamaran binaries into bin/

# Destination node (run first)
sudo ./bin/katamaran --mode dest --qmp /run/vc/vm/<id>/extra-monitor.sock --tap tap0_kata

# Source node
sudo ./bin/katamaran --mode source --qmp /run/vc/vm/<id>/extra-monitor.sock \
  --dest-ip <dest-node-ip> --vm-ip <pod-ip>

Three-phase migration: storage (NBD drive-mirror) → compute (RAM pre-copy) → network (IPIP/GRE tunnel + sch_plug qdisc). Packets arriving during the VM pause are buffered and flushed on resume — zero drops. Add --shared-storage with Ceph/NFS to skip the storage phase entirely.


Supports both local storage (NBD drive-mirror) and shared storage (Ceph, NFS — skip mirroring with --shared-storage).

Traditional QEMU live migration assumes shared storage. In Kubernetes with Kata Containers, pods typically use local virtio-blk disks — meaning the entire block device must be migrated alongside RAM and network state. katamaran orchestrates all three phases in the correct order while guaranteeing zero in-flight packet drops during the cutover.

Like a catamaran glides between two hulls, katamaran glides your VM between two nodes — smoothly, with nothing lost overboard.


See also: Installation Guide · Usage Guide · Testing Guide · User Stories · Dashboard · Roadmap


Getting Started

This section walks you through building katamaran, setting up a two-node cluster with Kata Containers, and running your first live migration — step by step.

Tutorial Requirements

In addition to the runtime prerequisites (QEMU 6.2+, Kata 3.x, iproute2, Go 1.26+), the tutorial requires:

  • Linux host with KVM (/dev/kvm must exist)
  • minikube, kubectl, helm installed
  • podman (or Docker)
  • ~30 GB free disk, ~20 GB free RAM (for two KVM nodes)

Verify KVM and nested virtualization:

ls /dev/kvm                                         # must exist
cat /sys/module/kvm_intel/parameters/nested          # Y or 1 (Intel)
cat /sys/module/kvm_amd/parameters/nested            # 1 (AMD)

1. Build katamaran

git clone https://github.com/maci0/katamaran.git
cd katamaran
make
./bin/katamaran --help

Run the smoke tests (no VMs required):

make smoke    # validates compilation, CLI behavior, project structure

2. Create a Two-Node Minikube Cluster

minikube start -p katamaran-demo \
  --nodes 2 \
  --driver=kvm2 \
  --memory=8192 \
  --cpus=4 \
  --container-runtime=containerd \
  --cni=calico

# Wait for both nodes to be ready
kubectl wait --for=condition=Ready node --all --timeout=120s

3. Install Kata Containers

helm upgrade --install kata-deploy \
  oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy \
  --version 3.27.0 \
  --namespace kube-system \
  --create-namespace \
  --set shims.disableAll=true \
  --set shims.qemu.enabled=true \
  --wait=false

# Wait for kata-deploy to finish installing on both nodes
kubectl -n kube-system rollout status daemonset/kata-deploy --timeout=600s

# Verify the kata-qemu RuntimeClass exists
kubectl get runtimeclass kata-qemu

4. Deploy katamaran on Both Nodes

Build the container image and deploy via DaemonSet. With the Kata 3.27 layout shown above, this installs the katamaran binary, enables the Kata QMP extra-monitor socket, and loads the required kernel modules (ipip, ip6_tunnel, ip_gre, ip6_gre, sch_plug) on both nodes:

make image
minikube -p katamaran-demo image load katamaran.tar
kubectl apply -f deploy/daemonset.yaml
kubectl -n kube-system rollout status daemonset/katamaran-deploy --timeout=120s

5. Deploy a Test Workload

Instead of relying on an automated black-box script, let's deploy a Kata VM pod manually using the provided demo manifest:

kubectl apply -f demo/nginx-kata.yaml

Wait for the pod to become ready:

kubectl wait --for=condition=Ready pod/nginx-kata --timeout=60s

The manifest includes a NodePort service on port 30081. You can test reaching the NGINX container directly from your host machine:

# Get the IP of the node
NODE_IP=$(kubectl get node katamaran-demo -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
curl http://$NODE_IP:30081

(Note: If you are using Docker Desktop on macOS/Windows, or if your hypervisor NAT doesn't bridge NodePorts directly, you can use minikube service nginx-service --url to create a localhost tunnel instead).

6. Run the Migration via Kubernetes Jobs

To orchestrate the migration, katamaran uses two Kubernetes Jobs — one on the destination node and one on the source node. The canonical Job manifests live in internal/orchestrator/templates/ (embedded into the binaries). For ad-hoc shell-driven runs, deploy/migrate.sh renders those templates with envsubst, applies them via kubectl, and waits for completion.

The current shell path can run in pod-picker mode: the source job resolves the Kata sandbox at runtime, captures the source QEMU command line, and the destination job replays it with -incoming defer, so no placeholder destination Kata pod is required:

./deploy/migrate.sh \
  --source-node katamaran-demo \
  --dest-node   katamaran-demo-m02 \
  --pod-name    nginx-kata \
  --pod-namespace default \
  --dest-ip     "$(kubectl get node katamaran-demo-m02 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')" \
  --image       localhost/katamaran:dev \
  --shared-storage \
  --replay-cmdline
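
As a rough illustration of the replay idea, the sketch below rewrites a captured QEMU argument list so that it ends with -incoming defer. The helper name withIncomingDefer and the sample arguments are hypothetical and not taken from katamaran's source. Port 4444 matches the migration stream shown in the deployment flow diagram, while the 0.0.0.0 listen address is an assumption.

```go
package main

import "fmt"

// withIncomingDefer returns a copy of argv that ends with "-incoming defer",
// dropping any -incoming option the captured command line already carried.
// Hypothetical helper for illustration only.
func withIncomingDefer(argv []string) []string {
	out := make([]string, 0, len(argv)+2)
	for i := 0; i < len(argv); i++ {
		if argv[i] == "-incoming" {
			i++ // skip the option's value as well
			continue
		}
		out = append(out, argv[i])
	}
	return append(out, "-incoming", "defer")
}

// migrateIncoming is the QMP command that later tells a deferred QEMU where
// to listen for the migration stream (listen address assumed).
const migrateIncoming = `{"execute":"migrate-incoming","arguments":{"uri":"tcp:0.0.0.0:4444"}}`

func main() {
	captured := []string{"qemu-system-x86_64", "-m", "2048", "-incoming", "tcp:0:1"}
	fmt.Println(withIncomingDefer(captured))
	fmt.Println(migrateIncoming)
}
```

With -incoming defer the destination QEMU starts idle and only begins listening once it receives migrate-incoming over QMP, which is why no placeholder pod has to exist beforehand.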

Production paths submit the same templates through the in-cluster Native orchestrator embedded in the dashboard or in katamaran-mgr (CRD controller). For the CRD path, start from the sample resource: kubectl apply -f deploy/migration-example.yaml.

Tip: For a faster setup (~30s instead of ~5min), use Kind + Podman instead of minikube:

./scripts/e2e.sh --provider kind --ping-proof
./scripts/e2e.sh --teardown --provider kind

Architecture Overview

Migration proceeds in three sequential phases:

sequenceDiagram
    participant S as Source Node
    participant D as Destination Node

    rect rgb(59, 130, 246, 0.1)
    Note over S,D: Phase 1 — Storage Mirroring
    S->>D: NBD drive-mirror (background sync)
    D-->>S: Block device ready
    Note over S: VM keeps running
    end

    rect rgb(16, 185, 129, 0.1)
    Note over S,D: Phase 2 — Compute Migration
    S->>D: RAM pre-copy (TCP)
    Note over S: VM pauses (STOP)
    Note over D: VM resumes (RESUME)
    end

    rect rgb(245, 158, 11, 0.1)
    Note over S,D: Phase 3 — Network Cutover
    S->>D: IPIP/GRE tunnel (redirect traffic)
    Note over D: sch_plug unplug (flush buffered pkts)
    end

Phase 1 — Storage Mirroring (NBD + drive-mirror)

The destination QEMU starts an NBD server exporting the target block device. The source issues a drive-mirror QMP command that copies every block to the remote NBD target in the background while the VM keeps running. Dirty blocks are re-synced continuously until the mirror reports ready (fully synchronized).
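
To make the QMP traffic behind this phase more concrete, here is a minimal Go sketch that builds the commands involved. The command names and argument keys are standard QMP, but the device name drive0, port 10809, the address 192.0.2.20, and the use of nbd-server-add for the export are assumptions for illustration, not values read from katamaran.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// qmpCommand is the generic QMP request envelope.
type qmpCommand struct {
	Execute   string         `json:"execute"`
	Arguments map[string]any `json:"arguments,omitempty"`
}

// nbdServerStart asks the destination QEMU to listen for NBD clients.
func nbdServerStart(host, port string) qmpCommand {
	return qmpCommand{Execute: "nbd-server-start", Arguments: map[string]any{
		"addr": map[string]any{
			"type": "inet",
			"data": map[string]any{"host": host, "port": port},
		},
	}}
}

// nbdServerAdd exports a block device read-write so the source can mirror into it.
func nbdServerAdd(device string) qmpCommand {
	return qmpCommand{Execute: "nbd-server-add", Arguments: map[string]any{
		"device": device, "writable": true,
	}}
}

// driveMirror starts a full background copy from the source device to the
// pre-existing NBD export on the destination.
func driveMirror(device, destHost, port string) qmpCommand {
	return qmpCommand{Execute: "drive-mirror", Arguments: map[string]any{
		"device": device,
		"target": fmt.Sprintf("nbd:%s:%s:exportname=%s", destHost, port, device),
		"sync":   "full",
		"mode":   "existing",
		"job-id": "mirror-" + device,
	}}
}

// encode renders a command as the JSON line sent over the QMP socket.
func encode(c qmpCommand) string {
	b, err := json.Marshal(c)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(nbdServerStart("0.0.0.0", "10809"))) // destination
	fmt.Println(encode(nbdServerAdd("drive0")))             // destination
	fmt.Println(encode(driveMirror("drive0", "192.0.2.20", "10809"))) // source
}
```

The source would then wait for the mirror job to report ready before moving on to RAM migration.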

Phase 2 — Compute Migration (RAM Pre-Copy & Final Incremental Copy)

Once storage is synchronized, the source starts standard QEMU RAM pre-copy migration (migrate QMP command) with auto-converge enabled. QEMU iteratively copies dirty RAM pages while the VM continues to run.

To keep the pause short enough to be imperceptible to clients, katamaran configures QEMU with a 25 ms default downtime limit and raises the migration bandwidth cap to 10 GB/s. The downtime limit is configurable with --downtime or can be derived from RTT with --auto-downtime; QEMU keeps iterating until the remaining dirty RAM can be transferred within that budget.

Once the remaining dirty RAM set is small enough to transfer within the configured downtime budget, the VM pauses (emitting the STOP event). At that point, QEMU performs a final incremental copy of the remaining dirty RAM pages and device state. Only after this final copy completes does the destination VM resume (emitting the RESUME event).

Phase 3 — Zero-Drop Network Cutover (tc sch_plug + IP Tunnel)

The critical downtime window — between STOP on the source and RESUME on the destination — is where packets would normally be lost. katamaran eliminates this:

  1. Source side: Immediately after STOP, an IP tunnel is created pointing at the destination node. The tunnel encapsulation is selected by --tunnel-mode: with the default ipip, an IPIP tunnel is used for IPv4 (mode ipip) and an ip6tnl tunnel for IPv6 (mode ip6ip6); with gre, a GRE tunnel is used for IPv4 (mode gre) and an ip6gre tunnel for IPv6. GRE is recommended on cloud VPCs (AWS, GCP, Azure) where IPIP (IP protocol 4/41) is often blocked by security groups, while GRE (IP protocol 47) is widely permitted. A host route for the VM IP is added through the tunnel, forwarding any packets that arrive at the (now stale) source to the destination.
  2. Destination side: A tc sch_plug qdisc on the destination tap interface buffers all arriving packets (including those forwarded through the tunnel). The qdisc is installed in pass-through mode (release_indefinite) and switched to buffering (block) before waiting for RESUME. When the VM resumes, the queue is unplugged with release_indefinite, flushing all buffered packets into the now-running VM in order. QEMU's announce-self QMP command then broadcasts Gratuitous ARP using the guest's actual MAC address, ensuring switches learn the correct port binding immediately.

The result: packets that arrive during the switchover are queued, not dropped. After the CNI control plane converges (seconds later), new traffic flows directly to the destination and the tunnel is torn down.
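
The two sides of the cutover can be sketched as plain iproute2 command lines. The Go snippet below only builds the strings and does not execute anything. The tunnel name mig0, the 192.0.2.x node addresses, the pod IP, and the plug limit are hypothetical, and the exact commands katamaran issues may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// kernelTunnelMode maps the --tunnel-mode value and the address family to
// the mode keyword understood by `ip tunnel add`.
func kernelTunnelMode(tunnelMode string, ipv6 bool) (string, error) {
	switch tunnelMode {
	case "ipip":
		if ipv6 {
			return "ip6ip6", nil
		}
		return "ipip", nil
	case "gre":
		if ipv6 {
			return "ip6gre", nil
		}
		return "gre", nil
	}
	return "", fmt.Errorf("unsupported tunnel mode %q", tunnelMode)
}

// sourceCommands lists the commands run on the source right after STOP:
// create the tunnel, bring it up, and route the VM IP through it.
func sourceCommands(name, mode, local, remote, vmIP string, ipv6 bool) []string {
	ip, hostRoute := "ip", vmIP+"/32"
	if ipv6 {
		ip, hostRoute = "ip -6", vmIP+"/128"
	}
	return []string{
		fmt.Sprintf("%s tunnel add %s mode %s local %s remote %s", ip, name, mode, local, remote),
		fmt.Sprintf("ip link set %s up", name),
		fmt.Sprintf("%s route add %s dev %s", ip, hostRoute, name),
	}
}

// destCommands lists the sch_plug lifecycle on the destination tap device.
func destCommands(tap string, limitBytes int) []string {
	change := "tc qdisc change dev " + tap + " root plug "
	return []string{
		fmt.Sprintf("tc qdisc add dev %s root plug limit %d", tap, limitBytes),
		change + "release_indefinite", // pass-through until cutover
		change + "block",              // buffer while the VM is paused
		change + "release_indefinite", // flush in order after RESUME
	}
}

func main() {
	mode, err := kernelTunnelMode("gre", false)
	if err != nil {
		panic(err)
	}
	cmds := append(
		sourceCommands("mig0", mode, "192.0.2.10", "192.0.2.20", "10.244.0.7", false),
		destCommands("tap0_kata", 1<<25)...,
	)
	fmt.Println(strings.Join(cmds, "\n"))
}
```

Keeping the mode mapping in one function makes the IPv4/IPv6 and IPIP/GRE combinations easy to test without touching the network stack.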

Concurrency, Safety & State Handling (Phase 1/2)

To keep orchestration safe, katamaran enforces strict concurrency constraints designed to avoid race conditions and resource leaks:

  1. Context Cancellation Trade-offs: When the main context cancels (e.g. from SIGINT or a timeout), katamaran does not immediately close the QMP connection. Instead, it uses context.AfterFunc to shorten the socket deadline, cleanly interrupting any blocking reads without causing a data race. This keeps the QMP connection alive just long enough to execute deferred cleanup commands (like migrate-cancel or block-job-cancel) before exit.
  2. Cancellation-Detached Cleanup: Operations running in defer blocks use context.WithoutCancel. This detaches the cleanup step from the main cancellation tree (so it isn't instantly aborted) but preserves critical values like logging traces or metrics attached to the original context.
  3. Sequential Polling vs Concurrent Races: Instead of spawning background goroutines that listen for asynchronous QMP events while concurrently polling status endpoints, katamaran executes a unified sequential polling loop. This explicitly eliminates the risk of concurrent state access issues across all migration phases, avoiding missed STOP events or silent QEMU failures.
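
The first two points can be sketched with the standard library alone. The example below uses net.Pipe as a stand-in for the QMP socket. The names readUntilCancelled, cleanupContext, and traceKey are hypothetical, and katamaran's actual code may be organized differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

type traceKey struct{}

// readUntilCancelled blocks reading from conn until data arrives or ctx is
// cancelled. On cancellation it shortens the read deadline instead of closing
// the connection, so the connection stays usable for cleanup commands.
func readUntilCancelled(ctx context.Context, conn net.Conn) error {
	stop := context.AfterFunc(ctx, func() {
		conn.SetReadDeadline(time.Now()) // unblocks the pending Read
	})
	defer stop()

	buf := make([]byte, 64)
	_, err := conn.Read(buf)
	return err
}

// cleanupContext detaches from cancellation but keeps the context values,
// and bounds the cleanup with its own timeout.
func cleanupContext(ctx context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(ctx), 2*time.Second)
}

// demo cancels a blocked read and then derives a cleanup context.
func demo() (timedOut, cleanupAlive bool, trace string) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	ctx := context.WithValue(context.Background(), traceKey{}, "mig-1")
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	time.AfterFunc(10*time.Millisecond, cancel) // simulate SIGINT or a timeout

	err := readUntilCancelled(ctx, client)
	timedOut = errors.Is(err, os.ErrDeadlineExceeded)

	cctx, ccancel := cleanupContext(ctx)
	defer ccancel()
	cleanupAlive = cctx.Err() == nil // not cancelled although ctx is
	trace, _ = cctx.Value(traceKey{}).(string)
	return timedOut, cleanupAlive, trace
}

func main() {
	fmt.Println(demo())
}
```

Because the read ends with a deadline error rather than a closed connection, a deferred migrate-cancel or block-job-cancel could still be sent over the same socket using the detached context.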

Prerequisites

| Component | Minimum Version | Notes |
|---|---|---|
| QEMU | 6.2+ | Must support drive-mirror, nbd-server-start, announce-self, QMP |
| Kata Containers | 3.x | QMP socket must be accessible |
| iproute2 | any | tc (sch_plug qdisc) + ip tunnel (IPIP/GRE/ip6tnl/ip6gre) |
| Go | 1.26+ | Install system-wide |

For CNI compatibility details (OVN-Kubernetes, Cilium, Calico, Flannel, and others), see Networking: CNI Compatibility under Kubernetes Integration.


Project Structure

go.mod                          # Go module (github.com/maci0/katamaran)
Makefile                        # Build, test, fuzz, image targets
Dockerfile                      # Multi-stage container image build
Dockerfile.dashboard            # Dashboard container image build
Dockerfile.factory              # VM factory container image build
Dockerfile.mgr                  # Migration controller container image build
.dockerignore                   # Build context exclusions
.github/
  dependabot.yml                # Dependabot config (GitHub Actions + Go modules, weekly)
  workflows/
    ci.yml                      # GitHub Actions CI (lint, test, fuzz seeds, build, Docker)
cmd/
  katamaran/
    main.go                     # Thin wrapper around internal/katamaran.Run
    main_test.go                # CLI validation and flag behavior tests
  dashboard/
    main.go                     # Dashboard CLI wrapper
    README.md                   # Dashboard usage guide
  katamaran-orchestrator/
    main.go                     # Structured JSON-in / NDJSON-out orchestrator CLI
    request.go                  # JSON Request decoding helpers
    status.go                   # NDJSON status emission helpers
    main_test.go                # CLI validation tests
  katamaran-mgr/
    main.go                     # Migration CRD controller entrypoint
    debug.go                    # /healthz, /readyz, /metrics, /debug/vars handlers
    kubeconfig.go               # Out-of-cluster kubeconfig loader
    main_test.go                # Controller CLI helper tests
  katamaran-factory/
    main.go                     # Kata VM cache gRPC server entrypoint
    sandbox_config.go           # Reads VMConfig + AgentConfig from sandbox persist.json
    main_test.go                # Factory CLI tests
internal/
  buildinfo/
    buildinfo.go                # Build version variable (overridden via ldflags)
  controller/
    reconciler.go               # Migration CRD reconcile loop and status patching
    reconciler_test.go          # Controller reconciliation tests
  dashboard/
    index.html                  # Embedded dashboard frontend (dark theme, Chart.js)
    doc.go                      # Package-level overview
    server.go                   # Dashboard web server and route table
    server_test.go              # Dashboard endpoint, middleware, and security tests
    loadgen.go                  # Ping and HTTP load generator handlers
    metrics.go                  # expvar counters and duration buckets
    middleware.go               # HTTP middleware (logging, recovery, CSRF, security headers)
    migrate.go                  # Migration orchestration handler
    response.go                 # JSON response + form-POST parsing helpers
    types.go                    # Dashboard state and API response types
    validation.go               # Input validation and SSRF prevention
  factory/
    server.go                   # Kata VM cache gRPC server implementation
    watcher.go                  # migration-meta.json directory watcher
    cachepb/                    # Generated CacheService protobuf bindings
  katamaran/
    cli.go                      # Primary source/destination CLI implementation
  logging/
    logging.go                  # Logging setup helpers (SetupLogger)
    logging_test.go             # Logging tests
  migration/
    config.go                   # SourceConfig / DestConfig types, shared constants, and QEMU URI helpers
    config_test.go              # Config unit tests
    validation.go               # Tap-interface / netns / drive-id validators
    cmdlinefetch.go             # Pod-log apiserver fetcher for replayed QEMU cmdlines
    cmdlinefetch_test.go        # Pod-log fetcher unit tests
    dest.go                     # Destination-side migration logic
    dest_test.go                # Destination unit tests
    destspawn.go                # Spawns the dest QEMU + virtiofsd in --replay-cmdline mode
    destspawn_test.go           # Dest QEMU spawner unit tests
    exec.go                     # External command execution (runCmd, runCmdInNetns)
    exec_test.go                # Exec unit tests
    podresolve.go               # Resolves pod IP / sandbox UUID / QEMU PID via apiserver + procfs
    podresolve_test.go          # Pod-resolver unit tests
    qmp_recording_test.go       # QMP command recording helpers for migration tests
    source.go                   # Source-side migration logic and polling
    source_test.go              # Source unit tests
    tunnel.go                   # IP tunnel setup/teardown (IPIP/GRE/ip6ip6/ip6gre)
    tunnel_test.go              # Tunnel unit tests
  orchestrator/
    orchestrator.go             # Public orchestrator interface and shared helpers
    types.go                    # Request, status, and migration ID types
    id.go                       # MigrationID generator
    validation.go               # Request validation (Validate, ValidateSafeArgValue)
    cmdline.go                  # hostPath layout for the captured-cmdline replay flow
    discovery*.go               # Kubernetes pod/node discovery boundary
    native*.go                  # client-go implementation that submits migration Jobs
    templates/                  # Embedded source/destination Job manifests
  qmp/
    client.go                   # QMP client (connect, execute, wait for events)
    client_test.go              # QMP client unit tests
    fuzz_test.go                # Fuzz tests for QMP protocol parsing (6 targets)
    types.go                    # QMP protocol types and command argument structs
  qmptest/
    qmptest.go                  # Shared test helpers for faking a QMP server
deploy/
  dashboard.yaml                # Dashboard Kubernetes Deployment + ClusterIP Service
  daemonset.yaml                # DaemonSet for node setup (binary, kernel modules, QMP config when present)
  migration-example.yaml        # Sample Migration CR (kubectl apply -f to start a migration)
  migrate.sh                    # Manual-testing shell wrapper around the Job templates
                                #   under internal/orchestrator/templates/. Production paths
                                #   submit those templates through the Native orchestrator.
config/crd/
  migration.yaml                # Migration CRD definition + status subresource
  manager.yaml                  # katamaran-mgr ServiceAccount + ClusterRole + Deployment + PDB
docs/
  INSTALL.md                    # Installation guide (binary, container, DaemonSet)
  USAGE.md                      # Usage guide (CLI and Kubernetes Jobs)
  TESTING.md                    # Test environment guide
  STORIES.md                    # User stories
  ROADMAP.md                    # Project roadmap (short/medium/long term)
  logo.png                      # Project logo
demo/
  nginx-kata.yaml               # Example Kata Containers pod with NGINX + NodePort
scripts/                        # Test and operational scripts
  test.sh                       # Smoke tests (no VMs required)
  cleanup.sh                    # Cluster cleanup helper
  minikube-test.sh              # Single-node Kata QMP smoke test (requires KVM)
  e2e.sh                        # Unified E2E live migration test harness
  sweep.sh                      # Parameter sweep tool for migration tuning
  lib.sh                        # Shared utility functions for scripts
  build-minikube-iso.sh         # Custom minikube ISO builder
  build-minikube-modules.sh     # Kernel module builder for minikube
  manifests/                    # E2E test manifests and templates
    kata-pod.yaml               # Kata Containers pod template
    kind-config.yaml            # Kind cluster configuration
    kind-config-tcg.yaml        # Kind cluster configuration (TCG/software-emulation variant)
    kind-config-nocni.yaml      # Kind cluster configuration (CNI disabled for Cilium/Flannel)
    kind-config-nocni-tcg.yaml  # Kind cluster configuration (no CNI + TCG)
    nfs-pv.yaml                 # NFS PersistentVolume template
    nfs-server.yaml             # NFS server pod template
    pod-src.yaml                # Source pod manifest for E2E tests
    pod-dest.yaml               # Destination pod manifest for E2E tests

Usage

katamaran provides two modes (source and dest) to coordinate migration.

For full details on CLI flags, direct usage, shared storage mode, IPv6, Cloud VPC configuration, and Kubernetes Job-based orchestration, please see the Usage Guide.


Why Sequential Pre-Copy?

A natural question: why not mirror storage and RAM in parallel?

The drive-mirror operation generates substantial network I/O — it copies the entire block device (often tens of GB) over the wire. Running RAM pre-copy simultaneously would cause two problems:

  1. Buffer overflow on the network path. Both streams compete for bandwidth. RAM pre-copy is latency-sensitive (dirty pages must be re-sent each round). When storage mirroring saturates the link, RAM rounds take longer, more pages get re-dirtied, and convergence stalls — or the migration fails entirely.
  2. Wasted bandwidth from redundant RAM retransmission. While storage is still syncing, the VM keeps running and dirtying RAM. Each pre-copy round re-sends those dirty pages. If storage sync takes 5 minutes, that's 5 minutes of RAM rounds that will largely be invalidated. By waiting for storage to reach ready, we start RAM pre-copy on a quiet network with the shortest possible convergence path.

The sequential approach (storage first, then RAM) minimizes total migration time and keeps the final downtime window (the STOP → RESUME gap) as short as possible.
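
The convergence argument can be illustrated with a deliberately simplified model: each pre-copy round sends the currently dirty bytes, and during that time the guest dirties new pages at a constant rate. This is a toy model with made-up numbers (4 GB of RAM, 200 MB/s dirty rate, a 25 ms budget), not a measurement of katamaran, and it ignores auto-converge throttling.

```go
package main

import "fmt"

// precopyRounds simulates pre-copy: each round sends the currently dirty
// bytes at `bandwidth` bytes/s while the guest dirties `dirtyRate` bytes/s.
// It returns the number of rounds needed until the remainder fits into
// `budget` bytes, or false if that does not happen within maxRounds.
func precopyRounds(ram, bandwidth, dirtyRate, budget int64, maxRounds int) (int, bool) {
	remaining := ram
	for round := 0; round <= maxRounds; round++ {
		if remaining <= budget {
			return round, true
		}
		// Sending `remaining` bytes takes remaining/bandwidth seconds, during
		// which remaining*dirtyRate/bandwidth bytes are dirtied again.
		remaining = remaining * dirtyRate / bandwidth
	}
	return maxRounds, false
}

func main() {
	const ram, dirtyRate = 4_000_000_000, 200_000_000

	// Quiet link: RAM pre-copy has 1 GB/s to itself (25 ms budget = 25 MB).
	fmt.Println(precopyRounds(ram, 1_000_000_000, dirtyRate, 25_000_000, 30))

	// Shared link: storage mirroring takes half (25 ms budget = 12.5 MB).
	fmt.Println(precopyRounds(ram, 500_000_000, dirtyRate, 12_500_000, 30))

	// Saturated link: bandwidth equals the dirty rate, so it never converges.
	fmt.Println(precopyRounds(ram, 200_000_000, dirtyRate, 5_000_000, 30))
}
```

In this model, halving the bandwidth available to RAM pre-copy raises the round count from 4 to 7, and once the available bandwidth drops to the dirty rate the migration stops converging at all.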


Kubernetes Integration

katamaran is designed as a low-level migration primitive. In a production Kubernetes cluster, it would be invoked by a higher-level controller (e.g., a CRD operator) that orchestrates the full lifecycle: selecting a target node, preparing the destination VM, invoking katamaran on both sides, and updating Kubernetes state afterward.

This section explores which storage and networking stacks are compatible, what the ideal setup looks like, and the open integration points.

Current Deployment Flow

Production paths run through the in-cluster Native orchestrator (client-go) embedded in either the dashboard or the katamaran-mgr controller. Both consume the same orchestrator.Request type and submit identical Job manifests, so behaviour is identical between the operator-driven (CRD) and human-driven (UI) entry points. The legacy deploy/migrate.sh shell harness is kept only for ad-hoc CLI runs and CI smoke tests.

flowchart LR
    subgraph entry["Entry points"]
      U1[kubectl apply Migration CR]
      U2[Dashboard UI / curl /api/migrate]
    end
    subgraph control["Control plane"]
      MGR[katamaran-mgr<br/>2 replicas, leader-elected]
      DASH[Dashboard]
    end
    NATIVE[Native orchestrator<br/>internal/orchestrator/native.go]
    JOBS[(source + dest Jobs<br/>katamaran-source-&lt;id&gt;<br/>katamaran-dest-&lt;id&gt;)]
    SRC[Source pod<br/>--mode source]
    DEST[Dest pod<br/>--mode dest, --replay-cmdline-from-pod]
    QEMU_SRC[(Source QEMU<br/>kata sandbox)]
    QEMU_DST[(Dest QEMU<br/>spawned by katamaran)]

    U1 --> MGR --> NATIVE
    U2 --> DASH --> NATIVE
    NATIVE --> JOBS
    JOBS --> SRC
    JOBS --> DEST
    SRC -- migrate-set-parameters / migrate --> QEMU_SRC
    DEST -- spawns + migrate-incoming --> QEMU_DST
    QEMU_SRC -. tcp:&lt;dest&gt;:4444 multifd RAM stream .-> QEMU_DST

Status flows back the other way: source emits structured KATAMARAN_PROGRESS / KATAMARAN_RESULT / KATAMARAN_DOWNTIME_LIMIT markers via stdout; the orchestrator tails those on the source pod's log and turns them into StatusUpdate events that the dashboard renders as a progress bar and katamaran-mgr patches onto .status of the Migration CR.

Storage: CSI Driver Compatibility

The storage strategy depends on whether the cluster uses shared storage (both nodes see the same block device) or local storage (each node has its own disk).

| CSI Driver | Storage Type | katamaran Mode | Notes |
|---|---|---|---|
| Ceph RBD (rbd.csi.ceph.com) | Shared block | --shared-storage | Ideal. Both nodes mount the same RBD image. No data transfer needed: only RAM and network state migrate. Requires ReadWriteMany or controlled handoff (unmap on source, map on dest). |
| CephFS (cephfs.csi.ceph.com) | Shared filesystem | --shared-storage | Works if the VM's rootfs is on a CephFS-backed virtio-fs mount. Less common for block-level VM disks. |
| NFS (nfs.csi.k8s.io) | Shared filesystem | --shared-storage | Simple but slower. NFS latency can affect VM disk I/O during and after migration. Acceptable for low-IOPS workloads. |
| Longhorn (driver.longhorn.io) | Replicated local | NBD drive-mirror | Longhorn volumes are node-local with network replication. katamaran mirrors the block device via NBD, then the Longhorn controller can adopt the replica on the destination. |
| OpenEBS Mayastor (io.openebs.csi-mayastor) | Replicated local | NBD drive-mirror or --shared-storage | Mayastor NVMe-oF targets can be re-exported to the destination node, potentially allowing shared-storage mode. Otherwise, NBD drive-mirror works. |
| TopoLVM (topolvm.io) | Strict local | NBD drive-mirror | Purely local LVM. The entire block device must be mirrored. Best for small disks or infrequent migrations. |
| Local Path Provisioner | Strict local | NBD drive-mirror | No replication. Full block copy required. Suitable for dev/test. |

Shared Storage: The Fast Path

With Ceph RBD, migration skips the most time-consuming phase entirely. The flow becomes:

sequenceDiagram
    participant S as Source Node
    participant RBD as Ceph RBD
    participant D as Destination Node

    D->>RBD: Open same RBD image (read-only)
    Note over D: katamaran --mode dest --shared-storage
    D->>D: Install qdisc, wait for RESUME

    Note over S: katamaran --mode source --shared-storage
    S->>D: RAM pre-copy only (no storage mirror)
    Note over S: VM pauses (STOP)
    Note over D: VM resumes (RESUME)
    S-->>RBD: Unmap image
    D->>RBD: Promote to read-write
    S->>D: IPIP tunnel + sch_plug flush

Total migration time is dominated by RAM pre-copy convergence — typically seconds for a 4 GB VM with moderate dirty page rate.

Local Storage: The Full Pipeline

With Longhorn or local disks, all three phases run:

sequenceDiagram
    participant S as Source Node
    participant D as Destination Node

    rect rgb(59, 130, 246, 0.1)
    Note over S,D: Phase 1 — minutes to hours (scales with disk size)
    S->>D: NBD drive-mirror (entire block device)
    Note over S: VM keeps running
    end

    rect rgb(16, 185, 129, 0.1)
    Note over S,D: Phase 2 — seconds (after storage sync)
    S->>D: RAM pre-copy
    end

    rect rgb(245, 158, 11, 0.1)
    Note over S,D: Phase 3 — milliseconds
    S->>D: Network cutover (tunnel + qdisc flush)
    end

The NBD mirror runs in the background while the VM stays live, but total wall-clock time scales with disk size and write rate.

Networking: CNI Compatibility

The network cutover (Phase 3) must work with the cluster's CNI plugin. The key requirement is that the VM's pod IP must remain reachable during the gap between source STOP and destination RESUME, plus the time for the CNI to update its routing/forwarding tables.

| CNI | Compatibility | How It Works | Convergence Time |
|---|---|---|---|
| OVN-Kubernetes | ★★★ Excellent | OVN's southbound DB updates the port-chassis binding. The logical switch port moves to the destination node automatically. GARP + OVN's own MAC learning provide near-instant convergence. Tested via e2e.sh --cni ovn. | < 1s |
| Kube-OVN | ★★★ Excellent | Separate OVN-based CNI (by Alauda). Same port-chassis rebinding via OVN southbound DB. Additional features like subnets and VPCs. Not tested but expected to work identically. | < 1s |
| Cilium | ★★★ Excellent | eBPF datapath. After migration, the destination node's Cilium agent detects the new endpoint and installs eBPF maps. The IPIP tunnel covers the gap. Cilium's IPAM can be configured to preserve pod IPs across nodes with cluster-pool mode. | 1–3s |
| Calico | ★★☆ Good | BGP route propagation. The destination node advertises the pod IP via BGP. The IPIP tunnel bridges the gap until all peers converge. Calico IPAM must allow the pod IP to exist on the destination node (use --ipam=host-local with a shared pool, not per-node blocks). | 2–5s |
| Flannel | ★★☆ Good | VXLAN FDB entries. The destination node must update the VXLAN forwarding database. GARP handles L2, but Flannel's flanneld may take a few seconds to update FDB entries on all nodes. The IPIP tunnel covers the gap. | 2–5s |
| Antrea | ★★☆ Good | OVS-based. Similar to OVN-Kubernetes but with its own controller. Port migration requires the Antrea agent to update OVS flows on the destination. GARP + IP tunnel cover the gap. | 1–3s |
| Multus (meta-CNI) | Depends | Multus delegates to underlying CNIs. Compatibility depends on the primary and secondary CNI plugins. Each interface may need its own migration strategy. | Varies |

IP Preservation

Important

The most critical requirement for network routing is that the VM's pod IP must survive migration.

This means the IPAM must allow the same IP to be assigned on the destination node. Per-node IP blocks (the default for several CNIs) are problematic because the pod IP belongs to the source node's CIDR and is invalid on the destination.

| CNI | Default IPAM | Migration-Safe Config |
|---|---|---|
| OVN-Kubernetes | Cluster-wide | Works out of the box; IPs are not tied to nodes |
| Kube-OVN | Cluster-wide | Works out of the box |
| Cilium | cluster-pool | Works out of the box with default cluster-pool IPAM mode. Avoid kubernetes IPAM mode (per-node CIDRs) |
| Calico | Per-node blocks | Requires config change. Use ipipMode: Always with a shared IPPool and disable blockSize-based per-node allocation, or use Calico's IPAM with nat-outgoing: false and a flat pool |
| Flannel | Per-node /24 | Problematic. Flannel allocates a /24 per node from the pod CIDR. No built-in mechanism to preserve IPs across nodes. Requires external IPAM or static IP annotation |
| Antrea | Per-node | Requires config change. Configure NodeIPAM with a shared pool or use externalIPAM |
| kindnet | Per-node | Not suitable for live migration (no cross-node IP support) |
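
The per-node CIDR problem is easy to check mechanically. The sketch below tests whether a pod IP falls inside the range an IPAM would hand to a given node. The addresses and the validOnNode helper are hypothetical examples, not values from a real cluster.

```go
package main

import (
	"fmt"
	"net/netip"
)

// validOnNode reports whether podIP falls inside the CIDR that the IPAM
// assigned to a node (or to the whole cluster, for cluster-wide pools).
func validOnNode(podIP, cidr string) (bool, error) {
	addr, err := netip.ParseAddr(podIP)
	if err != nil {
		return false, err
	}
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false, err
	}
	return prefix.Contains(addr), nil
}

func main() {
	podIP := "10.244.0.7" // allocated from the source node's block

	// Per-node /24: the destination node owns a different block.
	fmt.Println(validOnNode(podIP, "10.244.1.0/24"))

	// Cluster-wide pool: the same IP is acceptable on any node.
	fmt.Println(validOnNode(podIP, "10.244.0.0/16"))
}
```

A pre-flight check along these lines could reject a migration early when the destination node's IPAM range cannot host the source pod's IP.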

The Ideal Setup

For production live migration with minimal downtime and operational complexity:

┌──────────────┬──────────────────────────────────────────────┐
│              │              Ideal Stack                     │
├──────────────┼──────────────────────────────────────────────┤
│ Runtime      │ Kata Containers 3.x + QEMU 8+               │
│ Storage CSI  │ Ceph RBD (rbd.csi.ceph.com)                 │
│ Storage Mode │ --shared-storage (skip NBD mirror)           │
│ CNI          │ OVN-Kubernetes or Cilium                     │
│ IPAM         │ Cluster-wide pool (not per-node)             │
│ Kernel       │ 5.15+ (sch_plug, IPIP, KVM)                 │
│ Network      │ 25 Gbps+ node-to-node (for RAM pre-copy)    │
│ Orchestrator │ CRD operator (manages lifecycle)             │
└──────────────┴──────────────────────────────────────────────┘

Why this stack:

  1. Ceph RBD eliminates the storage mirroring phase entirely. Migration becomes RAM-only, completing in seconds instead of minutes.
  2. OVN-Kubernetes or Cilium provide the fastest network convergence. OVN's centralized southbound DB updates port bindings atomically. Cilium's eBPF datapath reconverges without waiting for BGP propagation.
  3. Cluster-wide IPAM ensures the pod IP is valid on any node, avoiding the per-node CIDR problem.
  4. 25 Gbps+ networking helps the final dirty page flush (the actual downtime-causing transfer) complete within the configured downtime budget.

Integration Architecture (Operator-Driven)

The in-cluster controller manages migrations with the Migration CRD:

apiVersion: katamaran.io/v1alpha1
kind: Migration
metadata:
  name: migrate-nginx-pod
  namespace: default
spec:
  sourcePod:
    namespace: default
    name: nginx-kata
  destNode: worker-02
  image: localhost/katamaran:dev
  sharedStorage: true
  replayCmdline: true
  downtimeMS: 25

The controller's reconciliation loop:

flowchart TD
    A[Migration CR created] --> B[Resolve source node and destination IP]
    B --> C[Submit source and destination Jobs]
    C --> D["katamaran --mode source"]
    C --> E["katamaran --mode dest"]
    D --> F{Migration result}
    E --> F
    F -->|Success| G[Patch Migration status]
    G --> H[Optional source cleanup / VM adoption]
    F -->|Failure| I[migrate-cancel]
    I --> J[Patch error status and clean up Jobs]

Open Questions for Production

  • Pod checkpoint/restore: Should the operator snapshot the pod spec and container state for rollback?
  • Live migration scheduling: Which node to pick? Factors: resource headroom, storage locality, network topology, anti-affinity rules.
  • Preemption: Can a migration be preempted mid-flight if the destination node runs out of resources? This relies on the QMP migrate_cancel command, which QEMU already provides.
  • Encryption: NBD traffic and RAM migration traffic are currently unencrypted. For cross-rack or cross-AZ migration, WireGuard or IPsec tunnels should wrap the migration streams.
  • Observability: Storage sync percentage and dirty-page rate are not yet exported as controller metrics.
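
On the preemption question, the QMP side is small: QEMU's command for aborting an in-flight migration is `migrate_cancel`, and it takes no arguments. The sketch below only builds the request line; the `qmpCommand` type and `cancelRequest` helper are hypothetical names, and how katamaran actually talks to the QMP socket is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// qmpCommand is the generic QMP request envelope.
type qmpCommand struct {
	Execute   string         `json:"execute"`
	Arguments map[string]any `json:"arguments,omitempty"`
}

// cancelRequest builds the wire form of QMP's migrate_cancel command.
func cancelRequest() (string, error) {
	b, err := json.Marshal(qmpCommand{Execute: "migrate_cancel"})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	req, err := cancelRequest()
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	// A real client would write this line to the QMP socket after the
	// qmp_capabilities handshake and then wait for the reply.
	fmt.Println(req)
}
```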

Dashboard

A web UI for orchestrating migrations, visualizing ping latency (zero-drop proof), and running HTTP load generators during cutover. Includes a pod-picker that auto-discovers kata-qemu pods + nodes and a cmdline-replay mode that spawns the destination QEMU itself (no kata pod required on the dest node). See cmd/dashboard/README.md for the full UI flow + screenshots.

[Image: Dashboard pod picker]

Deploying the Dashboard

# 1. Build the image (from repository root)
make dashboard

# 2. Load it into your cluster (if using minikube/kind)
minikube image load dashboard.tar

# 3. Deploy the manifests
kubectl apply -f deploy/dashboard.yaml

Using the Dashboard

Once deployed, the dashboard is exposed via a ClusterIP service on port 8080.

  1. Access the UI: Run kubectl port-forward -n kube-system svc/katamaran-dashboard 8080:8080 and open http://localhost:8080.
  2. Pick a Source Pod and Dest Node from the dropdowns (auto-populated from /api/pods and /api/nodes). The hidden vm_ip and dest_ip form fields auto-fill from the selection. For full zero-config dest spawning, also enable replay_cmdline=true (recommended — see scripted example in the dashboard README).
  3. Start Load Generation: Click ICMP Ping or HTTP Load. A live Chart.js graph plots latency.
  4. Migrate: Click Start Migration. The real-time log viewer streams the orchestrator's progress.
  5. Observe Zero-Drop: As the migration crosses the configured downtime window (25 ms by default), you will see a latency spike on the chart (representing the buffered packets) but zero dropped packets.
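
The distinction in step 5 between a latency spike and a dropped packet can be checked mechanically from any ping series. The following sketch summarizes a probe series into a loss count and a worst-case round-trip time. The `sample` and `summary` types are hypothetical, and the values in `main` are illustrative, not measurements from katamaran.

```go
package main

import "fmt"

// sample is one ping probe. Lost probes carry no round-trip time.
type sample struct {
	seq   int
	rttMS float64
	lost  bool
}

// summary condenses a probe series into the numbers that matter for the
// zero-drop claim: how many probes were lost and the worst latency seen.
type summary struct {
	sent, lost int
	maxRTTMS   float64
}

// summarize counts lost probes and tracks the highest RTT among the rest.
func summarize(samples []sample) summary {
	var s summary
	for _, p := range samples {
		s.sent++
		if p.lost {
			s.lost++
			continue
		}
		if p.rttMS > s.maxRTTMS {
			s.maxRTTMS = p.rttMS
		}
	}
	return s
}

func main() {
	// Illustrative values only: probe 3 lands inside the pause window and
	// is delivered late instead of being dropped.
	series := []sample{
		{seq: 1, rttMS: 0.4},
		{seq: 2, rttMS: 0.5},
		{seq: 3, rttMS: 26.1},
		{seq: 4, rttMS: 0.4},
	}
	s := summarize(series)
	fmt.Printf("sent=%d lost=%d max_rtt=%.1fms\n", s.sent, s.lost, s.maxRTTMS)
}
```

A run that matches the zero-drop claim shows `lost=0` with a maximum RTT near the downtime window; any non-zero loss count means packets were dropped rather than buffered.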

For manual override of any auto-derived value, expand the Advanced (override auto-discovery) disclosure in the form.


Migration CRD (Operator Path)

For GitOps / Argo / declarative workflows that prefer kubectl apply over a UI, katamaran ships a Migration Custom Resource and a small in-cluster controller (katamaran-mgr) that reconciles it through the same Native orchestrator the dashboard uses. Behaviour is identical — the only difference is the entry point.

# Build + load the controller image
make mgr
minikube image load mgr.tar

# Install the CRD + controller (one-time)
kubectl apply -f config/crd/migration.yaml
kubectl apply -f config/crd/manager.yaml

# Submit a migration
kubectl apply -f deploy/migration-example.yaml

# Watch the phase column
kubectl get migration -w
# NAME     SOURCE      DEST           PHASE        AGE
# demo-1   kata-demo   kata-worker-b  transferring 12s
# demo-1   kata-demo   kata-worker-b  succeeded    38s

The CR's .status carries the same migrationID, phase, startedAt, completedAt, and error fields that the dashboard surfaces — so external systems can wait on a Migration the same way they wait on a Job.
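
Waiting on a Migration from Go might look like the sketch below, which polls a status reader until a terminal phase is reached. It is a minimal illustration with several assumptions: the struct field types are guesses based on the field names listed above, `failed` is an assumed name for the unsuccessful terminal phase (only `transferring` and `succeeded` appear in the sample output), and `fetch` stands in for whatever API client reads the resource.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// MigrationStatus carries the status fields this README lists.
// Types are assumptions; timestamps are kept as strings for simplicity.
type MigrationStatus struct {
	MigrationID string `json:"migrationID"`
	Phase       string `json:"phase"`
	StartedAt   string `json:"startedAt"`
	CompletedAt string `json:"completedAt"`
	Error       string `json:"error"`
}

// waitDone polls fetch until the migration reaches a terminal phase or
// maxPolls attempts have been made.
func waitDone(fetch func() (MigrationStatus, error), interval time.Duration, maxPolls int) (MigrationStatus, error) {
	for i := 0; i < maxPolls; i++ {
		st, err := fetch()
		if err != nil {
			return st, err
		}
		switch st.Phase {
		case "succeeded":
			return st, nil
		case "failed": // assumed phase name
			return st, errors.New("migration failed: " + st.Error)
		}
		time.Sleep(interval)
	}
	return MigrationStatus{}, errors.New("timed out waiting for migration")
}

func main() {
	// Stand-in for a real API read; it replays a fixed phase sequence.
	phases := []string{"transferring", "transferring", "succeeded"}
	i := 0
	fetch := func() (MigrationStatus, error) {
		st := MigrationStatus{MigrationID: "demo-1", Phase: phases[i]}
		if i < len(phases)-1 {
			i++
		}
		return st, nil
	}
	st, err := waitDone(fetch, 10*time.Millisecond, 10)
	fmt.Println(st.Phase, err)
}
```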


Testing

katamaran includes a comprehensive test suite ranging from native Go fuzzing to multi-node live migration tests proving zero packet drops across various CNIs (OVN-Kubernetes, Cilium, Calico, Flannel) and storage backends.

For instructions on running the test suite, verifying zero-drop behavior, and fuzzing the QMP protocol, please see the Testing Guide.


Future Ideas

Cross-Cluster Migration

Migrate a Kata VM pod from one Kubernetes cluster to another — not just between nodes within a single cluster. This would enable use cases like cluster upgrades, cloud-region failover, and hybrid-cloud burst.

  • Federation-aware orchestration: A higher-level controller that discovers destination clusters (via Cluster API, Admiralty, or manual config) and negotiates resource reservations before starting migration
  • Cross-cluster networking: Establish a migration data path between clusters — WireGuard mesh, Submariner, or Cilium ClusterMesh — to carry NBD, RAM pre-copy, and tunnel traffic across cluster boundaries
  • IP address handoff: Pod IP will change across clusters (different pod CIDRs). Requires a DNS-based or service-mesh-based identity layer (e.g. Istio, Linkerd) so clients reconnect transparently after migration
  • Storage replication across clusters: For non-shared-storage setups, NBD drive-mirror must traverse the inter-cluster link. For shared storage, both clusters need access to the same Ceph/NFS pool (stretched cluster or async replication with final sync)
  • Credential and secret migration: ServiceAccount tokens, mounted secrets, and ConfigMaps must be recreated or mirrored on the destination cluster before the VM resumes
  • RBAC and admission policy alignment: The destination cluster must accept the pod's SecurityContext, RuntimeClass, and resource requests — mismatches cause the destination pod to be rejected
  • Multi-phase cutover: Source cluster keeps serving traffic via the IPIP/GRE tunnel while the destination cluster's ingress, DNS, and service entries converge — then the tunnel is torn down

Multi-NIC Pod Migration (Multus)

Kata Containers supports Multus CNI for attaching multiple network interfaces to a pod — including SR-IOV passthrough via VFIO. Migrating multi-NIC pods adds complexity since each interface needs its own cutover handling.

  • Per-interface tunnel setup: Each network attachment needs its own IPIP/GRE tunnel and sch_plug qdisc during cutover — katamaran currently assumes a single tap interface
  • SR-IOV / VFIO passthrough: Passthrough devices cannot be live-migrated (hardware-bound). Requires detach-on-source, re-attach-on-destination with a brief connectivity gap on that interface, or fallback to virtio-net for migratable NICs
  • Mixed interface types: A pod might combine a primary virtio-net (migratable) with a secondary SR-IOV NIC (non-migratable). Migration logic must handle each interface type differently
  • NetworkAttachmentDefinition replay: Destination must have matching NetworkAttachmentDefinition CRs and available device resources (e.g. SR-IOV VFs) on the target node
  • IPAM coordination across interfaces: Each Multus interface may use a different IPAM — the primary CNI's cluster-wide pool plus per-interface static or DHCP assignments that must be preserved or re-acquired
  • QEMU device topology: Additional NICs appear as hotplugged PCI devices in the guest VM. The destination QEMU must reconstruct the same PCI topology (device IDs, bus addresses) for the guest to recognize its interfaces after resume
